
I have 1M word vectors in fastText format (ignoring the first line containing vocab size and dim). Every line is a word followed by 300 numbers, all space separated, e.g.

Word 1.00 0.50 -2.30
WORD 0.90 0.40 -2.20

How can I keep the first line a word appears in, ignoring case, and remove all further lines? For example, since Word appeared first, the line with WORD is deleted and the output is

Word 1.00 0.50 -2.30

I can use tr '[:upper:]' '[:lower:]' < wiki-news-300d-1M.vec to convert all words to lowercase, but that destroys the original casing of the words. I know how to remove duplicate lines when the entire line, numbers included, matches, but that doesn't help here. My Python solution would be to keep a dict storing the lowercase form of each word and check each line's word against it, but I am curious about an awk/sed (or even grep) solution.
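For reference, the dict idea described above can be sketched directly in bash instead of Python (a sketch assuming bash 4+ associative arrays and a hypothetical 3-line test file sample_q.vec; this will be far slower than awk on 1M lines):

```shell
# Hypothetical 3-line sample standing in for wiki-news-300d-1M.vec
printf 'Word 1.00 0.50 -2.30\nWORD 0.90 0.40 -2.20\nother 0.10 0.20 0.30\n' > sample_q.vec

declare -A seen                        # maps lowercased word -> 1 (requires bash 4+)
while read -r word rest; do
    lower=${word,,}                    # lowercase only the word, not the numbers
    if [[ -z ${seen[$lower]} ]]; then  # first time this word is seen, case-insensitively
        seen[$lower]=1
        printf '%s %s\n' "$word" "$rest"
    fi
done < sample_q.vec
```

This prints the Word and other lines and drops the WORD line, mirroring the dict-based approach.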

RavinderSingh13
qwr

2 Answers


Use tolower($1) as the key in an array in awk.

awk '!a[tolower($1)]++' wiki-news-300d-1M.vec
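A quick check on the question's sample (sample_a.vec is a hypothetical test file, not the real dataset):

```shell
# Recreate the question's sample input
printf 'Word 1.00 0.50 -2.30\nWORD 0.90 0.40 -2.20\n' > sample_a.vec

# a[tolower($1)]++ is 0 (false) the first time a lowercased word is seen,
# so !... is true and the line prints; later duplicates are suppressed
awk '!a[tolower($1)]++' sample_a.vec
# Word 1.00 0.50 -2.30
```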
Barmar

With GNU sort, using -s for a "stable sort", and assuming the original line order doesn't need to be retained:

$ sort -k1,1 -fsu file
Word 1.00 0.50 -2.30

The differences between this and @Barmar's awk solution are:

  1. The awk solution will work using any awk while the sort one requires GNU sort to ensure the first duplicate is printed.
  2. The awk solution will retain the input line order while the sort one will produce output in alphabetical order.
  3. The awk solution will be slower than the sort one.
  4. The awk solution will run out of memory for smaller (but still huge) input files than the sort one.
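The ordering difference is easy to see on a small hypothetical sample (sample_b.vec) whose first word sorts after its second:

```shell
# Two distinct words out of alphabetical order, plus a case-insensitive duplicate
printf 'Zebra 0.10 0.20 0.30\nWord 1.00 0.50 -2.30\nWORD 0.90 0.40 -2.20\n' > sample_b.vec

# awk keeps the input order:
awk '!a[tolower($1)]++' sample_b.vec
# Zebra 0.10 0.20 0.30
# Word 1.00 0.50 -2.30

# GNU sort emits alphabetical order; -s ensures the first duplicate survives:
sort -k1,1 -fsu sample_b.vec
# Word 1.00 0.50 -2.30
# Zebra 0.10 0.20 0.30
```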
Ed Morton