Thread: WordList Sorting Question

  1. #1
    Just burned his ISO
    Join Date: Oct 2008
    Posts: 2

    WordList Sorting Question

    I have some word lists I got from here, and they're rather large. I want to combine them, sort them, and remove duplicates. I have them combined (I might have to do it again), but I'm wondering which way is best to sort and deduplicate them. Most guides I see, here and elsewhere, say to use "sort | uniq". This works, but I also discovered that sort has a unique option ("sort -u"). Does anyone here know which way is faster? Since this could run for a long time, I'd like to use whichever one is faster.

    If it helps, the computer I'd be running this on is a Core 2 Duo with 2 GB of RAM, running Ubuntu 8.10 x64. I'm not sure whether that affects how each one would run, but more info never hurts.

  2. #2

    It isn't going to take that long no matter which way you do it. Just pick one method or the other (I've never used sort -u myself). You would have been finished already!

  3. #3
    Senior Member
    Join Date: Apr 2008
    Posts: 2,008

    Quote: Originally posted by mleo2003
    I have some word lists I got from here, and they're rather large. I want to combine them, sort them, and remove duplicates. I have them combined (I might have to do it again), but I'm wondering which way is best to sort and deduplicate them. Most guides I see, here and elsewhere, say to use "sort | uniq". This works, but I also discovered that sort has a unique option ("sort -u"). Does anyone here know which way is faster? Since this could run for a long time, I'd like to use whichever one is faster.

    If it helps, the computer I'd be running this on is a Core 2 Duo with 2 GB of RAM, running Ubuntu 8.10 x64. I'm not sure whether that affects how each one would run, but more info never hurts.

    The part that will take the most time is sorting the words. Since both options use the same sort command for that, I don't believe either one will significantly outperform the other. However, I do believe the built-in option in sort could be a bit faster, as duplicate words are dropped during the sort itself. Using sort | uniq first sorts the whole dictionary, then pipes every line through a second process that makes one more pass to drop adjacent duplicates.
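
    To make the two options concrete, here is a minimal sketch on a tiny made-up list, assuming GNU sort and uniq. The file names (sample.txt, out_pipe.txt, out_u.txt) are hypothetical, and LC_ALL=C is set only so the ordering is the same on every machine.

    ```shell
    #!/bin/sh
    # Use byte-wise ordering so the result does not depend on the locale.
    LC_ALL=C; export LC_ALL

    # Made-up sample list containing duplicates.
    printf '%s\n' password admin letmein admin password qwerty > sample.txt

    # Option 1: sort, then drop adjacent duplicates in a second process.
    sort sample.txt | uniq > out_pipe.txt

    # Option 2: let sort drop the duplicates itself.
    sort -u sample.txt > out_u.txt

    # Both outputs should be byte-for-byte identical.
    cmp out_pipe.txt out_u.txt && echo "same output"
    ```

    To compare speed, running each command under the shell's time on a slice of the real list should give a rough idea before committing to the full run.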
    -Monkeys are like nature's humans.

  4. #4
    Developer
    Join Date: Mar 2007
    Posts: 6,124

    I believe the -u option is just a symlink to uniq anyway, kind of like how the -E flag in grep is just a symlink to egrep.

  5. #5
    Just burned his ISO
    Join Date: Oct 2008
    Posts: 2

    I was thinking the same thing, pureh@te, so I went looking for the source. In the coreutils source on gnu.org, I found no symlink or any other kind of link between sort and uniq, just code in sort to skip duplicate lines. I think it works the same way internally as uniq would (I didn't look at uniq), but since it all happens inside one process instead of being piped from one process to the next, I'm thinking it would be at least a little bit faster.
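
    As a small illustration of the line-skipping behaviour: uniq on its own only compares neighbouring lines, which is why it needs sorted input, while sort -u can skip equal lines as part of the sort. This is a sketch on made-up data with hypothetical file names, not a description of the coreutils internals.

    ```shell
    #!/bin/sh
    # Use byte-wise ordering so the result does not depend on the locale.
    LC_ALL=C; export LC_ALL

    # Unsorted input with a duplicate that is not adjacent (made-up data).
    printf '%s\n' beta alpha beta > unsorted.txt

    # uniq alone compares neighbouring lines only, so both "beta" lines survive.
    uniq unsorted.txt > uniq_only.txt

    # sort -u sorts first and drops equal lines in the same process.
    sort -u unsorted.txt > sort_u.txt

    # Show how many lines each approach kept.
    wc -l uniq_only.txt sort_u.txt
    ```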

    As for the size: last time I checked, the combined lists I acquired were 20 GB or more. If I had smaller files, I wouldn't have cared, but this is going to be running for a while. I didn't want to do any extra work, or else I wouldn't be doing word lists at all.
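
    For lists of that size, one possible invocation is sketched below, assuming GNU sort. The file names, the 512M buffer, and the sort_tmp directory are all assumptions for illustration; the temporary directory would need enough free disk space for the real data. Since sort accepts several input files, the separate combine step may not be needed at all.

    ```shell
    #!/bin/sh
    # Made-up stand-ins for the real word lists.
    printf '%s\n' zebra apple > list1.txt
    printf '%s\n' apple mango > list2.txt

    # Scratch directory for sort's temporary files (hypothetical location).
    mkdir -p sort_tmp

    # LC_ALL=C    : compare raw bytes instead of applying locale collation rules
    # -u          : drop duplicate lines while sorting
    # -S 512M     : size of the in-memory buffer (assumed value, tune to the machine)
    # -T sort_tmp : where temporary files go when the data does not fit in memory
    # -o          : write the result to a file instead of standard output
    LC_ALL=C sort -u -S 512M -T sort_tmp -o combined.txt list1.txt list2.txt
    ```

    Note that LC_ALL=C orders lines by byte value, which is fine for a word list but differs from dictionary order in most locales.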