
Thread: Removing words from wordlist

  1. #1
    Member CKing's Avatar
    Join Date
    Mar 2010
    Location
    downtown, riverfront
    Posts
    83

    Default Removing words from wordlist

    The basics of this question have been asked before, but I'd like to add a few things.

    I'm trying to remove words that exist in one wordlist from another; for example
    list1:
    1
    2
    3
    4
    5
    list2:
    1
    2
    3
    4
    5
    red
    blue
    purple
    green
    (desired) list3:
    red
    blue
    purple
    green

    The problem is the size of the wordlists (~1 billion and ~50 million lines). Diff, fgrep and comm understandably run out of resources quite quickly. I considered writing a shell script that reads each line and uses sed or grep -v to remove lines individually, but timing shows over a minute to remove one line (fyi, grep -v is faster than sed '/line/d'), so it's not feasible. Anyway, I'm just posting to ask if anyone has a solution to this problem. Tomorrow I'm going to see if writing a python solution is any faster, but I don't have high hopes; writing it in C would probably produce a bit of a speed bump, but I'm not particularly comfortable in those waters.
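    For what it's worth, the python attempt mentioned above can be sketched in a few lines: load the smaller list into a set and stream the big one past it. This is a minimal sketch under my own naming (filter_wordlist is a placeholder, not from any post in this thread), assuming plain newline-separated files:

```python
def filter_wordlist(dict_path, wordlist_path, out_path):
    """Write every line of wordlist_path that does not appear in dict_path to out_path."""
    # Load the filter list into a hash set: O(1) average-case membership tests.
    with open(dict_path) as f:
        words = {line.rstrip("\n") for line in f}
    # Stream the big list one line at a time so it never has to fit in memory.
    with open(wordlist_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.rstrip("\n") not in words:
                dst.write(line)
```

    With the sample files above, filter_wordlist('list1', 'list2', 'list3') leaves red, blue, purple and green in list3. Memory use is driven by the set, so pass the ~50 million-line list as dict_path and stream the ~1 billion-line one.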
    A true gentleman, a good hearty guy.

  2. #2
    Very good friend of the forum Gitsnik's Avatar
    Join Date
    Jan 2010
    Location
    The Crystal Wind
    Posts
    851

    Default Re: Removing words from wordlist

    Here's a perl script that I wrote up in literally less than a minute. It works; it's hard-coded and poorly written, but it works.

    I only tested it on your sample input, not on a particularly large list. It should use about as much memory as list1 takes up on disk, and it will copy duplicates from list2 into list3, so be careful not to have any in there (or run sort -u at the end, I suppose). The sorter works by loading list1 into memory (collapsing duplicates) and then processing list2 line by line. If an item does not exist in list1, it is copied from list2 into list3 (the final list). If it does exist, it is not copied.

    For any perl coders watching: I started with an array but decided a hash would be easier. I only got two lines in before my head caught up with my fingers, but I left them in as an example.

    Code:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    open(my $file1, "<", "list1") or die "Can't read list1: $!";
    open(my $file2, "<", "list2") or die "Can't read list2: $!";
    
    # my @words = ();
    my %words = ();
    while (my $line = <$file1>) {
    	chomp $line;
    	# push (@words, $line);
    	$words{$line} = 1;
    }
    
    close $file1;
    
    open(my $file3, ">", "list3") or die "Can't write list3: $!";
    while (my $line = <$file2>) {
    	chomp($line);
    	if ( !defined $words{$line} ) {
    		print $file3 "$line\n";
    	}
    }
    close $file3;
    close $file2;
    That is nasty code, feel free to clean it up or just use it as an example.
    Last edited by Gitsnik; 03-10-2011 at 07:54 AM. Reason: removed extraneous terminal data, cleaned up english, more explanation
    Still not underestimating the power...

    There is no such thing as bad information - There is truth in the data, so you sift it all, even the crap stuff.

  3. #3
    Very good friend of the forum Gitsnik's Avatar
    Join Date
    Jan 2010
    Location
    The Crystal Wind
    Posts
    851

    Default Re: Removing words from wordlist

    I hate it when people double post, but this one feels like it deserves a new post. CKing, I blame you for putting the idea in my head.

    This was tested locally and worked alright - it's not 100% secure or anything like that, but it seems to be a good start. Basically: load the -d file into memory, then check each line of the -l file against it by traversing the linked list, and write the survivors to -o.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>	/* strncpy, strncmp */
    #include <ctype.h>	/* isprint */
    #include <unistd.h>	/* getopt */
    
    void help(void) {
    	printf("									\n"
    		" Dictionary Filter v0.01						\n"
    		"									\n"
    		" Usage: ./filter -d <dictionary_source> -l <wordlist> -o <out_file>	\n"
    		"									\n"
    		"	Warning: -o WILL overwrite whatever file you tell it to		\n");
    	exit(-1);
    }
    
    typedef struct linked_list {
    	char word[8192];
    	struct linked_list *next;
    } node;
    
    int main(int argc, char *argv[])
    {
    	FILE *fshort, *flong, *foutput;
    	int c;
    	char *dshort = NULL, *llong = NULL, *oout = NULL, line[8192];
    	node *head = NULL;
    
    	opterr = 0;
    
    	if ( argc != 7 ) {
    		help();
    	}
    
    	/*
    	 * Get command line arguments
    	 */
    	while((c = getopt(argc, argv, "d:l:o:h")) != -1) {
    		switch (c) {
    			case 'd':
    				dshort = optarg;
    				break;
    			case 'l':
    				llong = optarg;
    				break;
    			case 'o':
    				oout = optarg;
    				break;
    			case 'h':
    				help();
    				break;
    			case '?':
    				if (optopt == 'd' || optopt == 'l' || optopt == 'o') {
    					printf("Option %c requires an argument.\n", optopt);
    					help();
    				}
    				else if ( isprint (optopt)) {
    					printf("Unknown option -%c\n", optopt);
    					help();
    				}
    				else {
    					help();
    				}
    				return 1;
    			default:
    				help();
    		}
    	}
    
    	/*
    	 * Sanity checks
    	 */
    
    	if ( ! dshort || ! llong || ! oout ) {
    		help();
    	}
    
    	if ((fshort = fopen(dshort, "r")) == NULL ) {
    		fprintf(stderr, "Error: Can not read %s\n", dshort);
    		exit(-1);
    	}
    
    	if ((flong = fopen(llong, "r")) == NULL) {
    		fprintf(stderr, "Error: Can not read %s\n", llong);
    		exit(-1);
    	}
    
    	if ((foutput = fopen(oout, "w+")) == NULL) {
    		fprintf(stderr, "Error: Can not open %s for writing\n", oout);
    		exit(-1);
    	}
    
    	/*
    	 * Now the tricky part
    	 *
    	 * TODO: Rework logic to remove duplicates from input file
    	 */
    
    	while ( fgets( line, sizeof(line), fshort) != NULL) {
    		if ( head == NULL ) {
    			head = (node *)malloc(sizeof(node));
    			if ( line == NULL ) {
    				for(c = 0; c < 8192; c++) {
    					line[c] = '\0';
    				}
    			}
    			strncpy(head->word, line, 8191);
    			head->word[8191] = '\0';	/* strncpy does not guarantee termination */
    			head->next = NULL;
    		} else {
    			node *tmp = (node *)malloc(sizeof(node));
    			strncpy(tmp->word, line, 8191);
    			tmp->word[8191] = '\0';	/* strncpy does not guarantee termination */
    			tmp->next = head;
    			head = tmp;
    		}
    	}
    	fclose(fshort);
    
    	/*
    	 * No point if the input library is empty
    	 */
    	if ( head != NULL ) {
    		while ( fgets( line, sizeof(line), flong) != NULL) {
    			c = 0;
    
    			node *current = head;
    			do {
    				if ( ! strncmp(current->word, line, 8192) ) {
    					c = 1;
    					break;
    				}
    				current = current->next;
    			} while (current != head && current != NULL);
    
    			if ( c == 0 ) {
    				fprintf(foutput, "%s", line);
    			}
    		}
    	} else {
    		fprintf(stderr, "[-] Filter library is empty (-d), stopping processing\n");
    		fflush(stderr);
    		exit(1);
    	}
    
    	return 0;
    }
    Usage with your reference files above:

    ./filter -d list1 -l list2 -o list3
    Still not underestimating the power...

    There is no such thing as bad information - There is truth in the data, so you sift it all, even the crap stuff.

  4. #4
    Member CKing's Avatar
    Join Date
    Mar 2010
    Location
    downtown, riverfront
    Posts
    83

    Thumbs up Re: Removing words from wordlist

    Dude, two things: one, I take no responsibility for your lame double post; and two, if you're going to post something, at least put some effort into it. Why the hell couldn't you be bothered to add CUDA support?

    Honestly, though, I'm extremely impressed. I would have been happy with the first perl script, but you took it a BIG step further. I'll be honest and say I'm barely treading water trying to understand it, but there's a good chance this will prompt me to learn a bit more about C. I very rarely see someone put this much effort into helping a stranger, and I just want to say I really appreciate it. I'll certainly have a lot of use for this program, and I'd bet many other people searching the forums will find it helpful as well.
    THANK-YOU
    A true gentleman, a good hearty guy.

  5. #5
    Very good friend of the forum Gitsnik's Avatar
    Join Date
    Jan 2010
    Location
    The Crystal Wind
    Posts
    851

    Default Re: Removing words from wordlist

    Quote Originally Posted by CKing View Post
    Dude, two things: one, I take no responsibility for your lame double post; and two, if you're going to post something, at least put some effort into it. Why the hell couldn't you be bothered to add CUDA support?
    I'm reading this post thinking "you ungrateful bas... ohhhh" haha.

    I get kicks out of doing this sort of thing. After I wrote the perl I was thinking about how to do it in C, so I re-familiarized myself with linked lists and threw it together. There are bits I'm looking at today which look like they may be useless (specifically:
    Code:
    			if ( line == NULL ) {
    				for(c = 0; c < 8192; c++) {
    					line[c] = '\0';
    				}
    			}
    and I could do the getopt loop better, as well as the TODO). So be careful not to take my code as $DEITY-inspired gold. If it works and is fast enough, then cool, but you should almost certainly try to refine it. A b-tree rather than a linked list (or a b-tree OF linked lists) might be a better search-and-destroy method. On the whole, though, the C is written to follow the perl structure.

    THANK-YOU
    Gratitude is welcomed, but unnecessary. I'm always on the lookout for 5-minute coding jobs to throw myself at, so it's pretty selfish on my part

    Edit: I was right, that line-is-null check is irrelevant, because we never reach that section of code when fgets reads no line, and an empty line gets added to the linked list anyway because we're not stripping return characters. That actually produces an interesting result: if list1 contains word\r\n and list2 contains word\n, then word will not be filtered out, because the strings do not match on an invisible (or at least not printed for your eyes) level. Something to watch, or perhaps fix.
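    A cheap guard for that in any of the implementations here is to strip the whole run of trailing CR/LF before storing or comparing, rather than only chomping the newline. A minimal python sketch (normalize is my name for it, not anything from the code above):

```python
def normalize(line):
    # Strip Windows (\r\n) and Unix (\n) line endings, plus a bare \r,
    # so "word\r\n" and "word\n" hash and compare as the same key.
    return line.rstrip("\r\n")
```

    With this applied on both sides, "word\r\n" in list1 and "word\n" in list2 reduce to the same key "word" and the line is filtered as expected.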
    Last edited by Gitsnik; 03-11-2011 at 07:37 AM. Reason: Obvious
    Still not underestimating the power...

    There is no such thing as bad information - There is truth in the data, so you sift it all, even the crap stuff.

  6. #6
    Very good friend of the forum Virchanza's Avatar
    Join Date
    Jan 2010
    Posts
    863

    Default Re: Removing words from wordlist

    There are a lot of things to take into account here:

    1) Are both lists sorted?
    2) Are both lists sorted according to the same sorting rules?
    3) Does either list contain duplicates?

    Depending on the answers to these questions, there are quicker and slower ways of doing this. The algorithms inside the C++ <algorithm> header file might be very handy.

    Even if you don't know the answers to these questions, there's a simple way of getting it done. But I'd imagine it'd be a lot faster if you knew both lists were sorted and didn't contain duplicates.
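    If both lists are known to be sorted under the same rules and duplicate-free, the whole job becomes a single streaming pass with constant memory - the same idea as comm -13 list1 list2. A hedged python sketch (the function and file names are placeholders, and it assumes identical collation, e.g. both files sorted with LC_ALL=C sort):

```python
def sorted_difference(small_path, big_path, out_path):
    """Write lines of big_path that are not in small_path to out_path.
    Both files must be sorted identically and contain no duplicates."""
    with open(small_path) as small, open(big_path) as big, open(out_path, "w") as out:
        s = small.readline()
        for b in big:
            # Advance the small list until it catches up with the big one.
            while s and s < b:
                s = small.readline()
            if b != s:  # b never appeared in the small list
                out.write(b)
```

    Because neither file is ever held in memory, this handles billion-line inputs; the cost moves into the up-front external sort if the lists aren't sorted already.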
    Last edited by Virchanza; 04-18-2011 at 01:03 AM.
    Ask questions on the open forums, that way everybody benefits from the solution, and everybody can be corrected when they make mistakes. Don't send me private messages asking questions that should be asked on the open forums, I won't respond. I decline all "Friend Requests".

  7. #7
    Very good friend of the forum Gitsnik's Avatar
    Join Date
    Jan 2010
    Location
    The Crystal Wind
    Posts
    851

    Default Re: Removing words from wordlist

    Virchanza is, of course, correct. In perl terms, a hash uses just under 70 or so bytes per entry - so if you have a hash key of "chicken", you are using 77 bytes of memory just to store that line (probably 78 if you include null termination; I'm not that certain of the internal data structures). perl also hasn't been told to sort the smaller hash when it begins its search, so unless it does b-tree lookups internally (I don't know) you're going to take a performance hit there.

    If you implemented a b-tree read-and-load at the "TODO: Rework logic to remove duplicates from input file" section, rather than the crappy linked list option I've got, you'd cut out a fairly heavy amount of processing (standard search-and-destroy algorithms, really).
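    python's standard library has no b-tree, but the O(log n) lookups the b-tree suggestion is after can be illustrated with bisect over a sorted array; treat this as a sketch of the search cost (make_lookup is my own name), not a drop-in replacement for the C code:

```python
import bisect

def make_lookup(words):
    """Build an O(log n) membership tester over an iterable of words."""
    # Sort once up front; set() collapses duplicates, which also covers the TODO.
    table = sorted(set(words))
    def contains(word):
        # Binary search: O(log n) per probe versus O(n) for a linked-list walk.
        i = bisect.bisect_left(table, word)
        return i < len(table) and table[i] == word
    return contains
```

    For a 50 million-entry filter list that's ~26 comparisons per lookup instead of up to 50 million, which is the same win the b-tree would buy.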

    Plus you should never trust code I write, it's usually utter crap.
    Still not underestimating the power...

    There is no such thing as bad information - There is truth in the data, so you sift it all, even the crap stuff.

