
Thread: C++ large text file dedupe/sort

  1. #11
    My life is this forum thorin's Avatar
    Join Date
    Jan 2010
    Posts
    2,629

    Default

    Quote Originally Posted by hhmatt81 View Post
    I'm just realizing that I've misled you guys. The title is meant to say filtering, not sort. It was late last night when I wrote this.

    That is an interesting idea, streaker, but I don't have anything SQL on my PC as far as I know. Is MySQL what you had in mind, or something else?

    I'm not positive, but I seem to remember reading somewhere that C++ has problems reading from and writing to a SQL database. Can anyone with more experience here confirm this?
    Filter = WHERE clause in SQL, sort = ORDER BY in SQL; it doesn't really matter.

    MySQL can be installed on a Windows box.

    The only thing I can see using C++ for is changing all the spaces in the text files to line feeds (so each row is a word), but there are probably easier ways to do that too.
    Edit - MySQL has tools for this, as does Oracle, and I'm assuming SQL Express likely does too:
    http://dev.mysql.com/doc/refman/5.0/en/load-data.html
    http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html

    I'm sure you're interested in learning C++ or something, but this really seems like re-inventing the wheel. Perhaps you're some kind of Uber Programming Personage, but it seems doubtful that you're going to come up with something which is more optimized than a database w/ query parser/execution.

    Also in your posted code you tested that the input file opened successfully but you didn't check that your output file did.
    I'm a compulsive post editor, you might wanna wait until my post has been online for 5-10 mins before quoting it as it will likely change.

    I know I seem harsh in some of my replies. SORRY! But if you're doing something illegal or posting something that seems to be obvious BS I'm going to call you on it.

  2. #12
    Junior Member
    Join Date
    Aug 2007
    Posts
    40

    Default

    Quote Originally Posted by thorin View Post
    Edit - MySQL has tools for this, as does Oracle and I'm assuming SQL Express likely does too
    MS SQL and Sybase both have bcp (the bulk copy program utility) for data imports and extracts.

  3. #13
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    Quote Originally Posted by thorin View Post
    Filter = WHERE clause in SQL, sort = ORDER BY in SQL; it doesn't really matter.
    Which is why the title says C++, not SQL. I don't know how to edit it. I'm also trying to keep the title short while explaining in more detail in the post itself.

    Quote Originally Posted by thorin View Post
    MySQL can be installed on a Windows box.

    The only thing I can see using C++ for is changing all the spaces in the text files to line feeds (so each row is a word), but there are probably easier ways to do that too.
    Edit - MySQL has tools for this, as does Oracle, and I'm assuming SQL Express likely does too:
    http://dev.mysql.com/doc/refman/5.0/en/load-data.html
    http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html
    I will check these out when I get the chance.

    Quote Originally Posted by thorin View Post
    I'm sure you're interested in learning C++ or something but this really seems like re-inventing the wheel.
    This is one of the ways I learn. It's not reinventing the wheel, it's programming it in a different language. It's more like creating a rubber wheel, or a wooden wheel, or a steel wheel. It's not reinventing it... it's just made with different components.

    Quote Originally Posted by thorin View Post
    Perhaps you're some kind of Uber Programming Personage, but it seems doubtful that you're going to come up with something which is more optimized than a database w/ query parser/execution.
    This is not my intent.

    Quote Originally Posted by thorin View Post
    Also in your posted code you tested that the input file opened successfully but you didn't check that your output file did.
    It's not necessary. If the file exists, it opens it and writes to it; if the file doesn't exist, it creates it and then opens it for writing.

  4. #14
    My life is this forum thorin's Avatar
    Join Date
    Jan 2010
    Posts
    2,629

    Default

    Quote Originally Posted by hhmatt81 View Post
    Which is why the title says C++, not SQL. I don't know how to edit it. I'm also trying to keep the title short while explaining in more detail in the post itself.
    Gotcha...I was just pointing out that with the SQL solution (whether directly via DB/SQL or via C++ using SQL) it's irrelevant whether you're sorting or filtering.

    Quote Originally Posted by hhmatt81 View Post
    It's not necessary. If the file exists, it opens it and writes to it; if the file doesn't exist, it creates it and then opens it for writing.
    Permissions? Disk full? Disk write protect? File created on disk but not opened (RAM issue or file in use)?
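To make that concrete, here is a rough sketch of checking the output side as well. The helper name `writeChecked` is invented for this example; the point is only that an ofstream can fail to open (permissions, bad path, file in use) and that writes can fail later (disk full), so both are worth testing.

```cpp
#include <fstream>
#include <string>

// Tries to write `text` to the file at `path`.
// Returns false if the file could not be opened or if writing/flushing failed.
// (Hypothetical helper for illustration.)
bool writeChecked(const std::string& path, const std::string& text) {
    std::ofstream out(path.c_str());
    if (!out) {             // open failed: permissions, missing directory, file locked...
        return false;
    }
    out << text;
    out.flush();            // push buffered data out so late errors (e.g. disk full) show up
    return static_cast<bool>(out);
}
```

Note that an ofstream will create a missing file but not a missing directory, so a path into a folder that doesn't exist is an easy way to see the failure case.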

  5. #15
    Just burned his ISO
    Join Date
    Jun 2007
    Posts
    15

    Default

    hello,
    First, if you want fast I/O with the C++ stream classes, it helps to turn off the synchronization between C stdio and the C++ standard streams. Just call "std::ios_base::sync_with_stdio(false);" before doing any I/O.
    Second, if you want to sort a huge file (one that doesn't fit in memory), read it in chunks that do fit in memory, quick sort each chunk, then merge the sorted chunks into one huge sorted file.
    Third, using a std::string as the key in a std::map is slow. If you can, use a key type with a fast "< operator".
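The chunk-then-merge idea in the second point could look roughly like the sketch below. Treat it as an illustration under assumptions, not a tuned implementation: `externalSort`, `writeSortedChunk`, the `chunkLines` limit and the `tmpPrefix` naming scheme are all made up here, std::sort stands in for the quick sort, and the merge is done with a min-heap over the chunk files.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sorts `lines`, writes them to a new temporary chunk file and empties the vector.
static std::string writeSortedChunk(std::vector<std::string>& lines,
                                    const std::string& tmpPrefix,
                                    std::size_t index) {
    std::sort(lines.begin(), lines.end());
    const std::string name = tmpPrefix + std::to_string(index);
    std::ofstream out(name);
    for (const std::string& s : lines) out << s << '\n';
    lines.clear();
    return name;
}

// External merge sort of the lines in `inPath` into `outPath`.
// At most `chunkLines` lines are held in memory during the chunking phase.
// Returns false if a file cannot be opened or the final write fails.
bool externalSort(const std::string& inPath, const std::string& outPath,
                  std::size_t chunkLines, const std::string& tmpPrefix) {
    std::ifstream in(inPath);
    if (!in) return false;

    // Phase 1: split the input into sorted chunk files.
    std::vector<std::string> chunkNames;
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
        if (lines.size() >= chunkLines)
            chunkNames.push_back(writeSortedChunk(lines, tmpPrefix, chunkNames.size()));
    }
    if (!lines.empty())
        chunkNames.push_back(writeSortedChunk(lines, tmpPrefix, chunkNames.size()));

    // Phase 2: k-way merge. The min-heap holds the next unread line of each chunk.
    std::ofstream out(outPath);
    if (!out) return false;   // note: chunk files are left behind in this case
    std::vector<std::ifstream> chunks;
    for (const std::string& name : chunkNames) chunks.emplace_back(name);

    typedef std::pair<std::string, std::size_t> Entry;   // (line, chunk index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        if (std::getline(chunks[i], line)) heap.push(Entry(line, i));

    while (!heap.empty()) {
        Entry smallest = heap.top();
        heap.pop();
        out << smallest.first << '\n';
        if (std::getline(chunks[smallest.second], line))
            heap.push(Entry(line, smallest.second));
    }
    out.flush();              // flush so that write errors are visible in the result

    // Close and delete the temporary chunk files.
    chunks.clear();
    for (const std::string& name : chunkNames) std::remove(name.c_str());
    return static_cast<bool>(out);
}
```

For the dedupe part of the thread, the merge loop is a natural place to drop duplicates: remember the last line written and skip any line equal to it. With a very large number of chunks you could also hit the limit on open files, in which case the merge has to be done in several passes.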

  6. #16
    Very good friend of the forum hhmatt's Avatar
    Join Date
    Jan 2010
    Posts
    660

    Default

    Quote Originally Posted by conte0 View Post
    hello,
    First, if you want fast I/O with the C++ stream classes, it helps to turn off the synchronization between C stdio and the C++ standard streams. Just call "std::ios_base::sync_with_stdio(false);" before doing any I/O.
    Second, if you want to sort a huge file (one that doesn't fit in memory), read it in chunks that do fit in memory, quick sort each chunk, then merge the sorted chunks into one huge sorted file.
    Third, using a std::string as the key in a std::map is slow. If you can, use a key type with a fast "< operator".
    Since I prefer "using namespace std;", should I write it like this?

    Code:
    ios_base::sync_with_stdio(false);
    Does it matter if I declare this in the global or local scope?

    I'm reading as much as I can about how to code the sorting into this, which is why I haven't posted here for a while. There are just so many ways to do it.

    The only types I know of for holding words in this case are string and char.
    I came to the conclusion that, since the lengths of the strings and the length of the file are unknown, it would be really difficult to use an array of chars. I think I would need 3-dimensional arrays if I went the char route.

  7. #17
    Junior Member
    Join Date
    Aug 2007
    Posts
    55

    Default mmmm

    Quote Originally Posted by level View Post
    You can try this:

    cat wordlist.txt | sort | uniq > new_wordlist.txt
    you don't have to cat:

    sort words.txt | uniq > newwords.txt

    will also do the trick

  8. #18
    Just burned his ISO
    Join Date
    Jun 2007
    Posts
    15

    Default

    Quote Originally Posted by hhmatt81 View Post
    Since I prefer "using namespace std;", should I write it like this?

    Code:
    ios_base::sync_with_stdio(false);
    Yes, with "using namespace std;" you can drop the "std::", but opening a whole namespace with "using namespace" is usually not a good choice. It's fine in a little program, but not in a big project.

    Quote Originally Posted by hhmatt81 View Post
    Does it matter if I declare this in the global or local scope?
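    No, you can call sync_with_stdio(true) or sync_with_stdio(false) anywhere, as long as it is inside a function (it's a function call, not a declaration) and ideally before any I/O has been done.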

    Quote Originally Posted by hhmatt81 View Post
    The only types I know of for holding words in this case are string and char.
    I came to the conclusion that, since the lengths of the strings and the length of the file are unknown, it would be really difficult to use an array of chars. I think I would need 3-dimensional arrays if I went the char route.
    Sure, I was just giving you a hint about the general use of std::map: when you use maps, keep in mind to pick a fast type for the key.

  9. #19
    Member
    Join Date
    Jun 2007
    Posts
    218

    Default

    No cat.....sounds familiar. I think he just wants some C++ help, maybe for a school project.

  10. #20
    Senior Member
    Join Date
    Feb 2008
    Posts
    681

    Default

    That's true, sounds like a project he's doing.
    All in all some interesting solutions.
    hehe...
