
Thread: C++ large text file dedupe/sort

  1. #1
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default C++ large text file dedupe/sort

    I've been looking for a solution to this all over the place and I can't seem to find a good, solid one.

    What I want to do is take a large text file and remove the duplicates. (Sorting is unimportant, but I wouldn't mind learning how to do it both ways.) Every time I google for some sort of answer I keep getting other people's crappy programs from sites I don't know and can't trust. So I went to a few of the C++ forums that I know and searched up and down for something similar to this problem, but came up empty-handed in every forum I came across. Am I overlooking something so obvious that not even a programming noob has a problem with it?

    Just a push in the right direction would help immensely. Right now I'm confused as to whether I should be putting the input into strings or char arrays with such a big list. Vectors also come to mind here, but that would be one huge vector.

    I'm actually writing this in Visual C++. I'm fairly new to C++ itself; I'm making the transition from VB to C++ right now, and I'm fairly proficient in VB.

  2. #2
    Moderator KMDave's Avatar
    Join Date
    Jan 2010
    Posts
    2,281

    Default

    Do you need to write it in C++?

    What do you want to use it for?

    A basic approach would be to split the text up, delimited by spaces, and build up an array: take each word (maybe lowercased first), check whether it is already in the array, and append it only if it isn't.

    Might not be the fastest way, but I don't have time to think about time-optimized algorithms right now.
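
    A rough sketch of that approach in C++ could look like the one below. It's just an illustration: the file names are made up and error handling is left out.

    Code:
    #include <algorithm>
    #include <cctype>
    #include <fstream>
    #include <string>
    #include <vector>

    int main()
    {
        // Sketch only: the file names here are placeholders.
        std::ifstream in("wordlist.txt");
        std::ofstream out("deduped.txt");

        std::vector<std::string> seen;
        std::string word;

        // Read one whitespace-delimited word at a time.
        while (in >> word)
        {
            // Lowercase it so "Password" and "password" count as the same word.
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) { return std::tolower(c); });

            // Keep the word only if it hasn't been seen before.
            if (std::find(seen.begin(), seen.end(), word) == seen.end())
            {
                seen.push_back(word);
                out << word << '\n';
            }
        }
    }
    For a really big list the linear std::find lookup gets slow; the same idea with a std::set<std::string> instead of the vector scales much better, and writing the set out at the end gives you the sorted version as a bonus.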
    Tiocfaidh ár lá

  3. #3
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    Quote Originally Posted by KMDave View Post
    Do you need to write it in C++?
    It's not that I "need" it in C++. I'm trying to learn it instead. I have this in VB already, but I think C++ would be much quicker.

    Quote Originally Posted by KMDave View Post
    What do you want to use it for?
    Right now I want to use it for removing duplicates from large dictionary lists, but I'm sure I can find many uses for this type of algorithm in the future.

    Quote Originally Posted by KMDave View Post
    A basic approach would be to split the text up, delimited by spaces, and build up an array: take each word (maybe lowercased first), check whether it is already in the array, and append it only if it isn't.
    Thank you. I'll work on creating something that does just that and post if I run into problems.

    Quote Originally Posted by KMDave View Post
    Might not be the fastest way, but I don't have time to think about time-optimized algorithms right now.
    I understand.

  4. #4
    Senior Member streaker69's Avatar
    Join Date
    Jan 2010
    Location
    Virginville, BlueBall, Bird In Hand, Intercourse, Paradise, PA
    Posts
    3,535

    Default

    The easiest way I can think of to do it would be to dump your 'text files' into an SQL database and then do a simple SQL query from whatever C++ program you're writing.

    Code:
    SELECT DISTINCT word FROM list ORDER BY word;
    That would select only the unique words and sort them at the same time. That list could then be written back out to a text file or into a database. It would be the quickest way to work through a lot of data.
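
    From C++, a rough sketch of that round trip could look like the one below. SQLite is assumed here purely for illustration (streaker didn't name a database, and SQLite is file-based and needs no server); the file and table names are made up and most error checking is skipped.

    Code:
    // Sketch only: SQLite, the file names, and the table name are all assumptions.
    #include <sqlite3.h>
    #include <fstream>
    #include <string>

    int main()
    {
        sqlite3 *db = nullptr;
        if (sqlite3_open("words.db", &db) != SQLITE_OK)
            return 1;

        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS list (word TEXT);",
                     nullptr, nullptr, nullptr);

        // Load the wordlist into the table, one row per line.
        std::ifstream in("wordlist.txt");
        std::string word;
        sqlite3_stmt *ins = nullptr;
        sqlite3_prepare_v2(db, "INSERT INTO list (word) VALUES (?);", -1, &ins, nullptr);
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        while (std::getline(in, word))
        {
            sqlite3_bind_text(ins, 1, word.c_str(), -1, SQLITE_TRANSIENT);
            sqlite3_step(ins);
            sqlite3_reset(ins);
        }
        sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
        sqlite3_finalize(ins);

        // Pull the words back out, deduplicated and sorted, into a new file.
        std::ofstream out("new_wordlist.txt");
        sqlite3_stmt *sel = nullptr;
        sqlite3_prepare_v2(db, "SELECT DISTINCT word FROM list ORDER BY word;", -1, &sel, nullptr);
        while (sqlite3_step(sel) == SQLITE_ROW)
            out << reinterpret_cast<const char *>(sqlite3_column_text(sel, 0)) << '\n';
        sqlite3_finalize(sel);
        sqlite3_close(db);
    }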
    A third party security audit is the IT equivalent of a colonoscopy. It's long, intrusive, very uncomfortable, and when it's done, you'll have seen things you really didn't want to see, and you'll never forget that you've had one.

  5. #5
    My life is this forum thorin's Avatar
    Join Date
    Jan 2010
    Posts
    2,629

    Default

    Quote Originally Posted by streaker69 View Post
    The easiest way I can think of to do it would be to dump your 'text files' into an SQL database and then do a simple SQL query from whatever C++ program you're writing.

    Code:
    SELECT DISTINCT word FROM list ORDER BY word;
    That would select only the unique words and sort them at the same time. That list could then be written back out to a text file or into a database. It would be the quickest way to work through a lot of data.
    I like this solution. No point re-inventing the wheel.
    I'm a compulsive post editor, you might wanna wait until my post has been online for 5-10 mins before quoting it as it will likely change.

    I know I seem harsh in some of my replies. SORRY! But if you're doing something illegal or posting something that seems to be obvious BS I'm going to call you on it.

  6. #6
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    I'm just realizing that I've misled you guys. The title is meant to say filtering, not sorting. It was late last night when I wrote this.

    That is an interesting idea, streaker, but I don't have anything SQL on my PC as far as I know. Is MySQL what you had in mind, or something else?

    I'm not positive, but I seem to remember reading somewhere that C++ has problems reading from and writing to a SQL database. Can anyone with more experience here confirm this?

  7. #7
    Member
    Join Date
    Jun 2007
    Posts
    218

    Default

    You can try this:

    cat wordlist.txt | sort | uniq > new_wordlist.txt

  8. #8
    Junior Member
    Join Date
    Aug 2007
    Posts
    55

    Default mmmm

    Quote Originally Posted by level View Post
    You can try this:

    cat wordlist.txt | sort | uniq > new_wordlist.txt
    You don't have to cat:

    sort words.txt | uniq > newwords.txt

    will also do the trick.
