I know this isn't really a BackTrack-specific thread, but I don't know where else to post.

I have a 25 GB folder full of wordlists.

I wanted to combine and clean these, so I began looking into Unix commands. Obviously there are the simple ones:

Code:
Combine:

cat file1.txt file2.txt > outputfile3.txt
----------------------------------------------
Sort:

sort filename | uniq
----------------------------------------------
Remove Duplicates:

sort -u -o new_file old_file
----------------------------------------------
But that still leaves a lot of junk in the final file, and it takes ages to complete. It also means I have to keep checking on it and start the next command by hand each time.
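
For reference, the three steps above can probably be folded into a single `sort` call, since `sort` accepts several input files and `-u` drops duplicates as it goes. The sketch below is only an illustration: the function name `merge_wordlists` is made up, and forcing `LC_ALL=C` (plain byte ordering, usually faster on large files) is an assumption about what ordering is acceptable.

```shell
#!/usr/bin/env bash
# merge_wordlists OUTPUT INPUT...   (hypothetical helper name)
# Concatenates, sorts and de-duplicates all INPUT files in one pass.
merge_wordlists() {
    local out=$1
    shift
    # LC_ALL=C compares raw bytes instead of using locale rules.
    # -u keeps one copy of each line; -o names the output file.
    LC_ALL=C sort -u -o "$out" -- "$@"
}
```

It would be called as `merge_wordlists combined.txt file1.txt file2.txt`.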

In my travels I found a page all about sorting and cleaning up wordlists: removing HTML tags, email addresses and so on. It walks through each command in turn, but doing it that way means too much faffing about. It also gives an all-in-one set of instructions, but even that needs checking up on after each command.

Code:
AIO + Sort

    cat * > /tmp/aio-"${PWD##*/}".lst && rm * && mv /tmp/aio-"${PWD##*/}".lst ./

    tr '\r' '\n' < aio-"${PWD##*/}".lst > stage1-tmp && tr '\0' ' ' < stage1-tmp > stage1-tmp1 && tr -cd '\11\12\15\40-\176' < stage1-tmp1 > stage1-tmp && mv stage1-tmp stage1 && rm stage1-*

    htmlTags="a|b|big|blockquote|body|br|center|code|del|div|em|font|h[1-9]|head|hr|html|i|img|ins|item|li|ol|option|p|pre|s|small|span|strong|sub|sup|table|td|th|title|tr|tt|u|ul"
    cat stage1 | sed -r "s/  */ /gI;s/^[ \t]*//;s/[ \t]*$//;s/<[^>]*>//g;s/^\w.*=\"\w.*\">//;s/^($htmlTags)>//I;s/<\/*($htmlTags)$//I;s/&amp;/\&/gI;s/&quot;/\"/gI;s/&apos;/'/gI;s/&#0*39;/'/gI;s/&lt;/</gI;s/&gt;/>/gI" > stage2 && rm stage1

    sort -b -f -i -T "$(pwd)/" stage2 > stage3 && rm stage2
    grep -v " * .* " stage3 > stage3.1
    grep " * .* " stage3 > stage3.4
    rm stage3
    for fileIn in stage3.*; do
       cat "$fileIn" | uniq -c -d > stage3.0
       sort -b -f -i -T "$(pwd)/" -k1,1nr -k2 stage3.0 > stage3 && rm stage3.0
       sed 's/^ *//;s/^[0-9]* //' stage3 >> "${PWD##*/}"-clean.lst && rm stage3
       cat "$fileIn" | uniq -u >> "${PWD##*/}"-clean.lst
       rm "$fileIn"
    done
    rm -f stage* #aio-"${PWD##*/}".lst

    wc -l "${PWD##*/}"-clean.lst
    md5sum "${PWD##*/}"-clean.lst
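
As far as I can tell, the loop near the end orders the list by popularity: `uniq -c -d` counts the lines that occur more than once, the second `sort` puts the highest counts first, `sed` strips the counts off again, and `uniq -u` appends the lines that only occur once. A stripped-down sketch of that idea, reading from standard input, might look like this. The function name `rank_by_frequency` is made up, and it uses plain byte ordering rather than the `-b -f -i` options above.

```shell
#!/usr/bin/env bash
# rank_by_frequency   (hypothetical helper name)
# Reads words on stdin and prints each distinct word once,
# most frequent first, ties broken alphabetically.
rank_by_frequency() {
    LC_ALL=C sort |                  # group identical lines together
        uniq -c |                    # prefix each distinct line with its count
        LC_ALL=C sort -k1,1nr -k2 |  # numeric count descending, then the word
        sed 's/^ *[0-9]* //'         # strip the count column again
}
```

For example, `rank_by_frequency < stage2 > ranked.lst` would write the ranked list to `ranked.lst`.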
What I want to do is turn this into a full script that I can just run and have it do all the commands one after another and give me a final result. Preferably it would be a script that I can just point at the folder and run. But I have no idea about scripting, so I wondered if there is anyone out there who could help me?

The source for this is here
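
To make the request concrete, here is a rough sketch of the kind of wrapper being asked for. It is not the source page's method: it only covers the merge, the character clean-up, a simple tag strip and the de-duplication, and it leaves the original files alone instead of deleting them. The name `clean_wordlists` is made up, and it assumes the output file lives outside the input folder and that `find` supports `-maxdepth` (GNU and BSD versions do).

```shell
#!/usr/bin/env bash
set -eu

# clean_wordlists DIR OUT   (hypothetical helper name)
# Merges every regular file directly inside DIR, cleans the lines,
# and writes a sorted, de-duplicated list to OUT.
# Assumption: OUT is not inside DIR.
clean_wordlists() {
    local dir=$1 out=$2
    find "$dir" -maxdepth 1 -type f -exec cat {} + |
        tr '\r' '\n' |                # carriage returns become line breaks
        tr '\0' ' ' |                 # NUL bytes become spaces
        tr -cd '\11\12\40-\176' |     # keep tab, newline and printable ASCII only
        sed -e 's/<[^>]*>//g' \
            -e 's/^[[:blank:]]*//' \
            -e 's/[[:blank:]]*$//' \
            -e 's/  */ /g' \
            -e '/^$/d' |              # strip tags, trim, squeeze spaces, drop empty lines
        LC_ALL=C sort -u -o "$out"
}

# Run only when called with exactly two arguments: folder and output file.
if [ "$#" -eq 2 ]; then
    clean_wordlists "$1" "$2"
    wc -l "$2"
    md5sum "$2"
fi
```

Saved as, say, `clean.sh`, it would be run as `bash clean.sh /path/to/wordlists /tmp/clean.lst`. The HTML entity handling and the frequency ordering from the commands above are left out to keep the sketch short.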