Removing words from wordlist
The basics of this question have been asked before, but I'd like to add a few things.
I'm trying to remove words that exist in one wordlist from another; for example
list1:
1
2
3
4
5
list2:
1
2
3
4
5
red
blue
purple
green
(desired) list3:
red
blue
purple
green
The problem is in the size of the wordlists(~1billion and 50 million). Diff, fgrep and comm understandably run out of resources quite quickly. I considered writing a shell script that reads each line and uses sed or grep -v to remove lines individually but timing them shows over a minute to remove one line( fyi grep -v is faster than sed '/line/d') so its not feasible. Anyways, I'm just posting to ask if anyone has a solution to this problem. Tomorrow I'm going to see if writing a python solution is any faster but I don't have high hopes, writing it in C would probably produce a bit of a speed bump but I'm not particularly comfortable in those waters.
Re: Removing words from wordlist
Here's a perl script that I wrote up literally in less than a minute. It works, it's hard coded and poorly written, but it works.
I only tested on your sample input, not on a particularly large list. It should use as much memory as list1 uses disk space, and it will write duplicates from list2 into list3, so be careful not to have them in there (or sort -u at the end I suppose). The sorter works by loading list1 into memory (resolving duplicates) and then processing list2 line by line. If the item does not exist in list1, it is copied from list2 and added to list3 (the final list). If it does exist, it is not copied.
For any perl coders watching, I started using an array but thought a hash would be easier, I only knocked in the two lines before my head caught up with my fingers, but I left them there for examples :)
Code:
[bleh:~] gitsnik% cat sorter.pl
#!/usr/bin/env perl
open $file1, "<list1";
open $file2, "<list2";
# my @words = ();
my %words = ();
while (my $line = <$file1>) {
chomp $line;
# push (@words, $line);
$words{$line} = 1;
}
close $file1;
open $file3, ">list3";
while(my $line = <$file2>) {
chomp($line);
if ( !defined $words{$line} ) {
print $file3 "$line\n";
}
}
close $file3;
close $file2;
[bleh:~] gitsnik%
That is nasty code, feel free to clean it up or just use it as an example.
Re: Removing words from wordlist
I hate it when people double post, but this feels more like a good idea in a new one. CKing I blame you for putting the idea in my head.
This was tested locally and worked alright - it's not 100% secure or anything like that, but seems to be a good start. Basically, load the -d file into memory, check it against the -l file by traversing the linked list for each set of words, then write them to -o
Code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void help(void) {
printf(" \n"
" Dictionary Filter v0.01 \n"
" \n"
" Usage: ./filter -d <dictionary_source> -l <wordlist> -o <out_file> \n"
" \n"
" Warning: -o WILL overwrite whatever file you tell it to \n");
exit(-1);
}
typedef struct linked_list {
char word[8192];
struct linked_list *next;
} node;
int main(argc, argv)
int argc;
char * argv[];
{
FILE *fshort, *flong, *foutput;
int c;
char *dshort, *llong, *oout, line[8192];
node *head = NULL;
opterr = 0;
if ( argc != 7 ) {
help();
}
/*
* Get command line arguments
*/
while((c = getopt(argc, argv, "d:l:o:h")) != -1) {
switch (c) {
case 'd':
dshort = optarg;
break;
case 'l':
llong = optarg;
break;
case 'o':
oout = optarg;
break;
case 'h':
help();
break;
case '?':
if (optopt == 'd' || optopt == 'l' || optopt == 'o') {
printf("Option %c requires an argument.\n", optopt);
help();
}
else if ( isprint (optopt)) {
printf("Unknown option -%c\n", optopt);
help();
}
else {
help();
}
return 1;
default:
help();
}
}
/*
* Sanity checks
*/
if ( ! dshort || ! llong || ! oout ) {
help();
}
if ((fshort = fopen(dshort, "r")) == NULL ) {
fprintf(stderr, "Error: Can not read %s\n", dshort);
exit(-1);
}
if ((flong = fopen(llong, "r")) == NULL) {
fprintf(stderr, "Error: Can not read %s\n", llong);
exit(-1);
}
if ((foutput = fopen(oout, "w+")) == NULL) {
fprintf(stderr, "Error: Can not open %s for writing\n", oout);
exit(-1);
}
/*
* Now the tricky part
*
* TODO: Rework logic to remove duplicates from input file
*/
while ( fgets( line, sizeof(line), fshort) != NULL) {
if ( head == NULL ) {
head = (node *)malloc(sizeof(node));
if ( line == NULL ) {
for(c = 0; c < 8192; c++) {
line[c] = '\0';
}
}
strncpy(head->word, line, 8192);
head->next = NULL;
} else {
node *tmp = (node *)malloc(sizeof(node));
strncpy(tmp->word, line, 8192);
tmp->next = head;
head = tmp;
}
}
fclose(fshort);
/*
* No point if the input library is empty
*/
if ( head != NULL ) {
while ( fgets( line, sizeof(line), flong) != NULL) {
c = 0;
node *current = head;
do {
if ( ! strncmp(current->word, line, 8192) ) {
c = 1;
break;
}
current = current->next;
} while (current != head && current != NULL);
if ( c == 0 ) {
fprintf(foutput, "%s", line);
}
}
} else {
fprintf(stderr, "[-] Filter library is empty (-d), stopping processing\n");
fflush(stderr);
exit(1);
}
return 0;
}
Usage from your reference files above is:
./sorter -d list1 -l list2 -o list3
Re: Removing words from wordlist
Dude, two things: one; I take no responsibility for your lame double post and two; if your going to post something at least put some effort into it. Why the hell couldn't you be bothered to add CUDA support?:p
Honestly though I'm extremely impressed, I would have been happy with the first perl script but you took it a BIG step further. I'll be honest and say I'm barely treading water trying to understand it but theres a good chance this will prompt me to learn a bit more about C. I very rarely see someone put this much effort into helping a stranger and I just want to say I really appreciate it. I'll certainly have a lot of use for this program and I'd bet many other people searching the forums will find this helpful as well.
THANK-YOU
Re: Removing words from wordlist
Quote:
Originally Posted by
CKing
Dude, two things: one; I take no responsibility for your lame double post and two; if your going to post something at least put some effort into it. Why the hell couldn't you be bothered to add CUDA support?:p
I'm reading this post thinking "you ungrateful bas... ohhhh" haha.
I get kicks out of doing this sort of thing, after I wrote the perl I was thinking how to do it in C, so I re-familiarized myself with linked lists and threw it together. There are bits that I'm looking at today which look like they may be useless (specifically:
Code:
if ( line == NULL ) {
for(c = 0; c < 8192; c++) {
line[c] = '\0';
}
}
and I could do the getopt loop better, as well as the TODO). So be careful not to take my code as $DEITY-inspired gold. If it works and it is fast enough then cool, but you should almost definitely try to refine it. A b-tree rather than a linked list (or a b-tree OF linked lists) might be a better search-and-destroy method. On the whole though the C is written to follow the perl structure.
Gratitude is welcomed, but unnecessary. I'm always on the look out for 5 minute coding jobs to throw myself at, so it's pretty selfish on my part :D
Edit:I was right, that line is null check is irrelevant because we don't end up in that section of code when fgets reads no line, and an empty line gets added to the linked list anyway because we're not parsing off return characters. That actually produces an interesting output where we can have word\r\n in list1 and word\n in list2 and word will not be filtered out because the strings do not much on an invisible (or not printed for your eyes) level. Something to watch, or perhaps fix.
Re: Removing words from wordlist
There's a lot of things to take into account here:
1) Are both lists sorted?
2) Are both lists sorted according to the same sorting rules?
3) Do either lists contain duplicates?
Depending on the answers to these questions, there are quicker and slower ways of doing this. The algorithms inside the C++ <algorithm> header file might be very handy.
Even if you don't know the answers to these questions, there's a simple way of getting it done. But I'd imagine it's be a lot faster if you knew both lists were sorted and didn't contain duplicates.
Re: Removing words from wordlist
Viirchanza is, of course, correct. In perl terms a hash uses just under 70 or so bytes per entry - so if you have a hash word of "chicken", you are using 77 bytes of memory just to store that line (probably 78 if you include null termination, I'm not that certain of the internal data structures). perl has also not been set to sort the smaller hash when it begins its search, so unless it does b-tree lookups internally (I don't know) you're going to take a performance hit there.
If you instigated a b-tree read-and-load here in the TODO: Rework logic to remove duplicates from input file section rather than the crappy linked list option I've got you reduce a fairly heavy amount of processing from the list (standard search and destroy algorithms really).
Plus you should never trust code I write, it's usually utter crap.