The basics of this question have been asked before, but I'd like to add a few things.

I'm trying to remove every word that appears in one wordlist from another. For example:
list1:
1
2
3
4
5
list2:
1
2
3
4
5
red
blue
purple
green
(desired) list3:
red
blue
purple
green

The problem is the size of the wordlists (~1 billion and ~50 million words). diff, fgrep and comm understandably run out of resources quite quickly. I considered writing a shell script that reads each line and uses sed or grep -v to remove lines individually, but timing shows it takes over a minute to remove a single line (FYI, grep -v is faster than sed '/line/d'), so it's not feasible.

Anyway, I'm posting to ask if anyone has a solution to this problem. Tomorrow I'm going to see if a Python solution is any faster, but I don't have high hopes. Writing it in C would probably give a bit of a speed boost, but I'm not particularly comfortable in those waters.
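
For reference, here is a minimal sketch of the kind of Python approach I have in mind. It is untested at this scale and rests on one big assumption: the list of words to remove is the smaller (~50 million) one and fits in memory as a set, which will likely take several gigabytes of RAM. The larger list is then streamed once, line by line, instead of being rewritten for every removed word. The function name and file names are just placeholders.

```python
def filter_wordlist(remove_path, source_path, out_path):
    """Copy source_path to out_path, skipping every line found in remove_path.

    Returns the number of lines written.
    """
    # Assumption: the removal list fits in RAM. Lines are kept as bytes
    # to avoid decoding overhead and encoding errors.
    with open(remove_path, "rb") as f:
        remove = {line.rstrip(b"\r\n") for line in f}

    kept = 0
    # Stream the big list once; each set lookup is O(1) on average.
    with open(source_path, "rb") as src, open(out_path, "wb") as out:
        for line in src:
            word = line.rstrip(b"\r\n")
            if word not in remove:
                out.write(word + b"\n")
                kept += 1
    return kept
```

With the example above, filter_wordlist("list1", "list2", "list3") would write red, blue, purple and green to list3. If the removal list turns out to be the 1 billion one, this sketch won't fit in memory and something sort-based would be needed instead.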