
Thread: C++ large text file dedupe/sort

  1. #21
    Junior Member SBerry's Avatar
    Join Date
    Dec 2007
    Posts
    94

    That's exactly how I would do it. Use the STL's map container! It will tell you the number of occurrences of each word and, more importantly, from this you can get the words without duplicates. You could then use a priority queue to sort the data by priority. I did something similar for a project in college. It involved Huffman encoding, which compresses a text file into a smaller file with a lot fewer bytes.
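
    Here is a rough sketch of the map-plus-priority-queue idea, not a definitive implementation. It assumes "priority" means the occurrence count, which the post does not spell out. The function names count_words and by_frequency are made up for illustration. Note that std::map keeps its keys in sorted order, so walking the map already yields the distinct words alphabetically (by character value).

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Count how often each word occurs. The map's keys are the de-duplicated words.
std::map<std::string, long> count_words(const std::vector<std::string>& words)
{
    std::map<std::string, long> counts;
    for (const std::string& w : words)
        ++counts[w];              // operator[] creates a zero entry on first sight
    return counts;
}

// Assumption: "priority" = occurrence count. Returns the distinct words,
// most frequent first.
std::vector<std::string> by_frequency(const std::map<std::string, long>& counts)
{
    // Pairs compare by count first, so the largest count is always on top.
    std::priority_queue<std::pair<long, std::string>> pq;
    for (const auto& entry : counts)
        pq.push(std::make_pair(entry.second, entry.first));

    std::vector<std::string> ordered;
    while (!pq.empty()) {
        ordered.push_back(pq.top().second);
        pq.pop();
    }
    return ordered;
}
```

    If only de-duplication is needed, the counts can be ignored and the keys of the map written out directly.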

  2. #22
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Quote Originally Posted by SBerry View Post
    That's exactly how I would do it. Use the STL's map container! It will tell you the number of occurrences of each word and, more importantly, from this you can get the words without duplicates. You could then use a priority queue to sort the data by priority. I did something similar for a project in college. It involved Huffman encoding, which compresses a text file into a smaller file with a lot fewer bytes.
    Thanks. Once I saw maps and how they worked, I knew I had my answer for how to dedupe a wordlist. I'm curious what you mean by priority sorting.

    This is where I'm at so far with both steps. The code works but is probably rather inefficient. To learn vectors a little better and to get something working for the time being, I used the standard sort function on a vector to take care of the sorting. Unfortunately it doesn't sort the way I want: it compares raw character values, so every word that starts with a capital letter sorts ahead of every word that starts with a lowercase letter.
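
    One possible way around the capitals-first ordering is to hand std::sort a comparison function that ignores case. This is only a sketch: less_ignore_case and sort_ignore_case are names invented here, and it assumes plain ASCII words (std::tolower will not fold accented or multi-byte characters correctly).

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: true if a sorts before b when ASCII case is ignored.
bool less_ignore_case(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y) {
            // unsigned char keeps std::tolower well-defined for bytes >= 0x80
            return std::tolower(x) < std::tolower(y);
        });
}

// Sort a word list so that "apple" and "Apple" land next to each other.
void sort_ignore_case(std::vector<std::string>& words)
{
    std::sort(words.begin(), words.end(), less_ignore_case);
}
```

    Words that differ only in case compare as equal here, so their relative order is unspecified. std::stable_sort would keep them in input order.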

    I've been having problems writing my own algorithm for string sorting. I've mainly been trying quicksort, but for some reason I can't get the code to work. I've also tried insertion sort, but it doesn't seem like the best choice and I can't get that code to work either. For now I'm setting it aside until I understand C++ better; then I can get into writing better algorithms for this sort of thing.
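
    For reference, a hand-written quicksort over a vector of strings might look roughly like the sketch below. It is not the code from my attempts; it assumes a middle-element pivot and a simple single-pass partition, and in practice std::sort is the safer choice.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sort the half-open range v[lo, hi) in place. Requires lo <= hi <= v.size().
void quicksort(std::vector<std::string>& v, std::size_t lo, std::size_t hi)
{
    if (hi - lo < 2)
        return;                                   // 0 or 1 elements: nothing to do

    // Use the middle element as pivot and park it at the end of the range.
    std::swap(v[lo + (hi - lo) / 2], v[hi - 1]);
    const std::string& pivot = v[hi - 1];         // not touched by the loop below

    // Move everything smaller than the pivot to the front of the range.
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        if (v[i] < pivot) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[store], v[hi - 1]);               // pivot lands in its final slot

    quicksort(v, lo, store);                      // elements smaller than the pivot
    quicksort(v, store + 1, hi);                  // elements not smaller than the pivot
}

// Convenience overload that sorts the whole vector.
void quicksort(std::vector<std::string>& v)
{
    quicksort(v, 0, v.size());
}
```

    This version slows down badly on input with many repeated values, so it fits best after the duplicates have been removed.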

    Code:
    #include <algorithm>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    int main()
    {
    	ios_base::sync_with_stdio(false);
    	string text_input;
    
    	// Pass 1: write each line the first time it is seen.
    	ifstream textfile("list.txt");
    	if (textfile.is_open())
    	{
    		ofstream filtered("filtered_list.txt");
    		map<string, long int> map_data;
    
    		// Test getline() itself; looping on !eof() runs one extra
    		// iteration after the last line has been read.
    		while (getline(textfile, text_input))
    		{
    			// The count is 1 only on the first occurrence of a line.
    			if (++map_data[text_input] == 1)
    				filtered << text_input << '\n';
    		}
    		cout << "Filter Process Complete!" << endl;
    	}	// filtered is flushed and closed here
    	else
    		cout << "Unable to open file: list.txt" << endl;
    
    	// Pass 2: load the de-duplicated list, sort it, and write it out.
    	ifstream textfile1("filtered_list.txt");
    	if (textfile1.is_open())
    	{
    		ofstream filtered1("Filtered_Sorted.txt");
    		vector<string> sort_vec;
    
    		while (getline(textfile1, text_input))
    			sort_vec.push_back(text_input);
    
    		sort(sort_vec.begin(), sort_vec.end());
    
    		// size_type matches what size() returns (avoids a signed/unsigned compare)
    		for (vector<string>::size_type i = 0; i < sort_vec.size(); i++)
    			filtered1 << sort_vec[i] << '\n';
    		cout << "Sorting Process Complete!" << endl;
    	}
    	else
    		cout << "Unable to open file: filtered_list.txt" << endl;
    
    	system("pause");	// Windows-only: keeps the console window open
    	return 0;
    }
    This will not handle wordlists of 2GB or more; I already tried it on pureh@te's wordlist. As a matter of fact, Notepad won't even open it. I don't understand the reason for having such large wordlists anyway, since they will do nothing but slow your PC down when they're used. I know a lot of programs have trouble handling wordlists that are too large. Split them up into reasonable sizes so they are easier to work with.

    I'm also unsure how large a list a vector can hold before it overflows and crashes, which is another reason why I want to write my own algorithm. But this code is small and I've learned a lot from it. Even the long int counter in my map will only count up to about 2.1 billion (2,147,483,647 if long is 32 bits).
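
    Both limits can be asked for at run time rather than guessed. A small sketch (max_count and max_lines are names invented here): note that max_size() is only a theoretical upper bound, and in practice push_back throws std::bad_alloc when memory runs out, usually long before that bound is reached.

```cpp
#include <limits>
#include <string>
#include <vector>

// Largest value a long counter can hold on this platform.
long max_count()
{
    return std::numeric_limits<long>::max();
}

// Theoretical maximum number of strings a vector<string> could hold here.
// Available memory is normally the real limit.
std::vector<std::string>::size_type max_lines()
{
    return std::vector<std::string>().max_size();
}
```

    Printing both values with cout shows what a particular compiler and platform actually provide.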

    Quote Originally Posted by level
    No cat.....sounds familiar. I think he just wants some C++ help, maybe for a school project.
    I'm not attending any school; this is just for me to learn C++.

  3. #23
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    P.S. If a mod happens to stroll across this thread, could you please delete this post and change the title to "C++ large text file dedupe/sort"?

    Thanks

  4. #24
    Member
    Join Date
    Jun 2007
    Posts
    218

    Originally Posted by hhmatt81

    This will not handle wordlists of 2GB or more; I already tried it on pureh@te's wordlist. As a matter of fact, Notepad won't even open it.
    I couldn't open it with Notepad either. Try using Xploitz's Large_File_Viewer.

  5. #25
    Senior Member
    Join Date
    Apr 2007
    Posts
    3,385

    Quote Originally Posted by hhmatt81 View Post
    P.S. If a mod happens to stroll across this thread, could you please delete this post and change the title to "C++ large text file dedupe/sort"?

    Thanks
    Your wish is my command.

    Quote Originally Posted by level View Post
    Originally Posted by hhmatt81

    I couldn't open it with Notepad either. Try using Xploitz's Large_File_Viewer.
    Just to clarify: it's not actually "mine". I just included it in my Masters Wordlist Collections. It's a really cool freeware program that runs under both Windows and Linux.

    @ anybody reading this>>>>Get it below.


  6. #26
    Member
    Join Date
    Jun 2007
    Posts
    218

    Here's a series of C tutorials that may be of help:

    http://www.tazforum.thetazzone.com/viewtopic.php?t=6453

  7. #27
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Quote Originally Posted by level View Post
    Here's a series of C tutorials that may be of help:

    http://www.tazforum.thetazzone.com/viewtopic.php?t=6453
    The tutorials there are aimed more at the absolute beginner in C: program I/O, mathematical operators, if/else statements, things like that. Thanks for the offer though.

    I used a file splitter to make the lists more manageable. It isn't perfect but it works.
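
    I didn't write the splitter myself, but the idea could be sketched in C++ roughly as below. Everything here is an assumption for illustration: the names read_chunk and split_sorted, the <prefix>N.txt file naming, and the choice to sort each piece before writing it (a plain splitter would not do that; it is added so the pieces are ready to be merged later).

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read up to max_lines lines from in. An empty result means the input is
// exhausted (or max_lines was 0).
std::vector<std::string> read_chunk(std::istream& in, std::size_t max_lines)
{
    std::vector<std::string> chunk;
    std::string line;
    // The size test comes first so no line is read and then thrown away.
    while (chunk.size() < max_lines && std::getline(in, line))
        chunk.push_back(line);
    return chunk;
}

// Split in into sorted pieces named <prefix>0.txt, <prefix>1.txt, ...
// (hypothetical naming). Returns the number of files written.
std::size_t split_sorted(std::istream& in, std::size_t max_lines,
                         const std::string& prefix)
{
    std::size_t count = 0;
    for (;;) {
        std::vector<std::string> chunk = read_chunk(in, max_lines);
        if (chunk.empty())
            break;
        std::sort(chunk.begin(), chunk.end());

        std::ostringstream name;
        name << prefix << count << ".txt";
        std::ofstream out(name.str().c_str());
        for (std::size_t i = 0; i < chunk.size(); ++i)
            out << chunk[i] << '\n';
        ++count;
    }
    return count;
}
```

    The max_lines value would be picked so that one piece fits comfortably in memory.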

  8. #28
    Junior Member SBerry's Avatar
    Join Date
    Dec 2007
    Posts
    94

    You should post on some of the C++ newsgroups for detailed programming help. These guys do a lot of work with the Standard Template Library and are very helpful if you're stuck. http://groups.google.com/group/comp.lang.c++/topics

  9. #29
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Quote Originally Posted by SBerry View Post
    You should post on some of the C++ newsgroups for detailed programming help. These guys do a lot of work with the Standard Template Library and are very helpful if you're stuck. http://groups.google.com/group/comp.lang.c++/topics
    I'll consider it, thank you.
