I have a 78k-line .txt file of British words and a 5k-line .txt file of the most common British words. I want to filter the common words out of the big list, so that I end up with a new list containing only the less common words.
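To illustrate what I am after, here is a toy example (big.txt and common.txt are made-up stand-ins for my real files):
# made-up word lists, just to show the result I want
printf 'aardvark\ncolour\nthe\nzebra\n' > big.txt
printf 'the\ncolour\n' > common.txt
# desired result: every word from big.txt that is NOT in common.txt,
# i.e. a new file containing
#   aardvark
#   zebra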
I managed to solve my problem in another way, but I would really like to know what I am doing wrong, since this does not work.
I have tried the following:
# to make sure the lines contain only the word itself (nothing after the first space)
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
But this procedure apparently gives me two empty files.
If I run just the grep, without the cut step first, I get words in the output that I know are in both files.
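For reference, this is what I understand the grep options to do, and what I would expect on the toy files above:
# -f common.txt : read the patterns (one per line) from common.txt
# -x            : a pattern only matches if it covers the whole line
# -i            : ignore case
# -v            : invert, i.e. print the lines that do NOT match any pattern
grep -xivf common.txt big.txt
# expected output:
#   aardvark
#   zebra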
I have also tried this:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
No luck with that either.
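My understanding of comm is that -3 suppresses the third column (lines appearing in both files), so on the toy files above I would expect something like this:
sort big.txt > big-sorted.txt
sort common.txt > common-sorted.txt
comm -3 big-sorted.txt common-sorted.txt
# expected output (words unique to the first file in the first column,
# words unique to the second file indented in a second column):
#   aardvark
#   zebra
# adding -2 as well (comm -23) should keep only the first column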
The two text files, in case anyone wants to try for themselves: https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt and https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt