Page 1 of 3 123 LastLast
Results 1 to 10 of 29

Thread: C++ large text file dedupe/sort

  1. #1
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default C++ large text file dedupe/sort

    I've been looking for a solution to this all over the place and I cant seem to find a solid good solution.

    What I want to do is take a large text file and remove the duplicates. (sorting is unimportant but I wouldn't mind learning how to do it both ways) Everytime I google for some sort of answer I keep getting other's crappy programs from sites I don't know and can't trust. So I went to a few of the c++ forums that I know and searched up and down for something similar to this problem but came up empty handed in every forum I came across. Am I overlooking something soo obvious that not even a programming noob has a problem with?

    Just a push in the right direction would help immensely. Right now I'm confused as to whether I should be putting the input in strings or char array's with such a big list. vectors also come to mind here but that would be one huge vector.

    I'm actually writing this is visual c++. I'm fairly new to c++ itself. Making the transition from vb to C++ right now. I'm fairly proficient in vb.

  2. #2
    Moderator KMDave's Avatar
    Join Date
    Jan 2010
    Posts
    2,281

    Default

    Do you need to write it in C++?

    What do you want to use it for?

    A basic approach would be to split the text up, delimited by spaces, put it into an array. Maybe everything in lowercase letters. Then you just take the next word, see if it is already in the array and if not append it.

    Might not be the fastest way but I don't have time to think about time optimized algorithms right now.
    Tiocfaidh ár lá

  3. #3
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    Do you need to write it in C++?
    Its not that I "need" it in c++. I'm trying to learn it instead. I have this in vb already but I think c++ would be much quicker.

    What do you want to use it for?
    Right now I want to use it for removing duplicates out of large dictionary lists. But I'm sure I can find many uses for this type of algorithm in the future.

    A basic approach would be to split the text up, delimited by spaces, put it into an array. Maybe everything in lowercase letters. Then you just take the next word, see if it is already in the array and if not append it.
    Thank You. I'll work on creating something that does just that and post if I run into problems.

    Might not be the fastest way but I don't have time to think about time optimized algorithms right now
    I understand.

  4. #4
    Senior Member streaker69's Avatar
    Join Date
    Jan 2010
    Location
    Virginville, BlueBall, Bird In Hand, Intercourse, Paradise, PA
    Posts
    3,535

    Default

    The easiest way I can think to do it would be to dump your 'text files' into an SQL database and then do a simple SQL query from whatever C++ program you're writing.

    Code:
    Select unique 'word' from list order by 'word';
    That would select only unique words and sort them at the same time. That list could then be rewritten to a text file or back into a database. It would be the quickest way to get lots of data.
    A third party security audit is the IT equivalent of a colonoscopy. It's long, intrusive, very uncomfortable, and when it's done, you'll have seen things you really didn't want to see, and you'll never forget that you've had one.

  5. #5
    My life is this forum thorin's Avatar
    Join Date
    Jan 2010
    Posts
    2,629

    Default

    Quote Originally Posted by streaker69 View Post
    The easiest way I can think to do it would be to dump your 'text files' into an SQL database and then do a simple SQL query from whatever C++ program you're writing.

    Code:
    Select unique 'word' from list order by 'word';
    That would select only unique words and sort them at the same time. That list could then be rewritten to a text file or back into a database. It would be the quickest way to get lots of data.
    I like this solution. No point re-inventing the wheel.
    I'm a compulsive post editor, you might wanna wait until my post has been online for 5-10 mins before quoting it as it will likely change.

    I know I seem harsh in some of my replies. SORRY! But if you're doing something illegal or posting something that seems to be obvious BS I'm going to call you on it.

  6. #6
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    I'm just realizing that I've mislead you guys. The title is meant to say filtering not sort. Was late last night when I wrote this.

    That is an interesting idea streaker but I don't have anything sql on my pc as far as I know. Is mysql what you had in mind or something else?

    Im not positive but I seem to remember reading somewhere that c++ has problems reading to and from a sql database, can anyone with more experience here confirm this?

  7. #7
    Senior Member streaker69's Avatar
    Join Date
    Jan 2010
    Location
    Virginville, BlueBall, Bird In Hand, Intercourse, Paradise, PA
    Posts
    3,535

    Default

    Quote Originally Posted by hhmatt81 View Post
    I'm just realizing that I've mislead you guys. The title is meant to say filtering not sort. Was late last night when I wrote this.

    That is an interesting idea streaker but I don't have anything sql on my pc as far as I know. Is mysql what you had in mind or something else?

    Im not positive but I seem to remember reading somewhere that c++ has problems reading to and from a sql database, can anyone with more experience here confirm this?
    If you're running winders, then you can download and run SQLExpress 2005, it has a max DB size of I believe 4Gb. You could also use MySQL on a *nix or even a Winders boxen. Your Database does not need to be installed on the same machine that you're running your program. As long as your program knows where the database is located. I've never heard of any problems of C++ having issues connecting to an SQL database. I'm fairly certain that I have several programs here written in C++ that are connecting to a variety of SQL based databases.
    A third party security audit is the IT equivalent of a colonoscopy. It's long, intrusive, very uncomfortable, and when it's done, you'll have seen things you really didn't want to see, and you'll never forget that you've had one.

  8. #8
    Member
    Join Date
    Jun 2007
    Posts
    218

    Default

    You can try this:

    cat wordlist.txt | sort | uniq > new_wordlist.txt

  9. #9
    Junior Member
    Join Date
    Aug 2007
    Posts
    40

    Default

    I'm curious how large of a file you are interested in parsing.

    The reason I ask is because I thought generally when you are declaring an array data structure in most programming languages it is allocated from heap memory or physical ram. So if I'm accurate in thinking this then if you were to try and parse a 10gb password file on a host that only has 2gb of physical ram you could run into issues running out of memory. I've never really needed to open a file that large into an array so I'm not really 100% certain though. Maybe someone a little deeper into programming could answer this with a little more confidence then myself.

    Second, I'd personally recommend MySQL over SQLExpress if you are interested in storing it in a database. Your solution will be easier to port to both windows and *nix operating systems and you shouldn't have to worry about any database size limits. Additionally, keep in mind that the 4gb database size limit in SQLExpress most likely includes the amount of space taken by any indexes you create as well. So you might only be able to store 3.5gb of data if you have .5 gb in indexes.

    Finally, I wouldn't expect you to have many issues trying to connect to any modern RDBMS via C++. The difference is that instead of using some type of ODBC connection like you might be use to doing with VB you will be using library files / API's provided by the RDBMS vendor. For example, Sybase has Open Server, Oracle has OCCI (Oracle C++ Call Interface), and MySQL has mysql++ which is their API for C++.

    Anyway, good luck working through this. Sounds like a fun project.

  10. #10
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    I would like to be able to use this with very large files 1GB+ if possible.

    So I started a little more research as to how I would accomplish this and came across maps. Maps give me the ability to provide an int value with the actual value. Anyways this is what I have so far.

    Code:
    #include <iostream>
    #include <map>
    #include <string>
    #include <fstream>
    
    using namespace std;
    
    int main()
    {
    
    	ofstream filtered;
    	ifstream textfile ("list.txt");
    	string text_input;
    	map<string, long int> map_data;
    
    	if (textfile.is_open())
    	{
    		filtered.open("filtered_list.txt");
    		while( ! textfile.eof() )
    		{
    			getline (textfile, text_input);
    			map_data[text_input]++;
    
    				if (map_data[text_input] == 1)
    				{
    					filtered << text_input << '\n';
    				}
    			}
    		filtered.close();
    		textfile.close();
    	}
    	else
    		cout << "Unable to Open file: ";
    
    	system("pause");
    	return 0;
    }
    I've tested this with a 16 MB file and it took very little time to parse out any duplicate values and rewrite them to a new text file. This code keeps the integrity of the file itself which means it doesn't sort the file. I will continue to work on this until I can provide options for sorting and allowing the user to declare what file to read from and write to. I'm very interested to hear everyone's input as to how I can improve this.

    Because this is a dynamic array so to speak it writes everything to the heap of the memory. But once the memory is full the computer will start writing it to hard disk which is much slower but will still accomplish what it needs to. I'm pretty sure if the memory is full and the hard disk is full the program will error out and will most likely exit in an erraneous fashion leaving the memory and parts of the HD full until it gets dumped. Or at least thats how I think its supposed to happen.

    I will be looking into how to read and write from a SQL database to decrease memory allocation after I accomplish a few things with this code first.

Page 1 of 3 123 LastLast

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •