
I have 1M word vectors in fastText format (ignoring the first line containing vocab size and dim). Every line is a word followed by 300 numbers, all space separated, e.g.

Word 1.00 0.50 -2.30
WORD 0.90 0.40 -2.20

How can I keep the first line a word appears in, ignoring case, and remove all further lines? For example, since Word appeared first, the line with WORD is deleted and the output is

Word 1.00 0.50 -2.30

I can use tr '[:upper:]' '[:lower:]' < wiki-news-300d-1M.vec to convert all words to lowercase, but that destroys the original casing of the words. I know how to remove duplicate lines when the entire line, numbers included, matches, but that doesn't help here. My Python solution would be to keep a dict storing the lowercase form of each word and check each line's word against it, but I am curious about an awk/sed (or even grep) solution.
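For reference, the dict idea described above can be sketched directly in bash instead of Python (a sketch assuming bash 4+ associative arrays and a hypothetical 3-line test file sample_q.vec; this will be far slower than awk on 1M lines):

```shell
# Hypothetical 3-line sample standing in for wiki-news-300d-1M.vec
printf 'Word 1.00 0.50 -2.30\nWORD 0.90 0.40 -2.20\nother 0.10 0.20 0.30\n' > sample_q.vec

declare -A seen                        # maps lowercased word -> 1 (requires bash 4+)
while read -r word rest; do
    lower=${word,,}                    # lowercase only the word, not the numbers
    if [[ -z ${seen[$lower]} ]]; then  # first time this word is seen, case-insensitively
        seen[$lower]=1
        printf '%s %s\n' "$word" "$rest"
    fi
done < sample_q.vec
```

This prints the Word and other lines and drops the WORD line, mirroring the dict-based approach.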

RavinderSingh13
qwr

2 Answers


Use tolower($1) as the key in an array in awk.

awk '!a[tolower($1)]++' wiki-news-300d-1M.vec
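A quick check on the question's sample (sample_a.vec is a hypothetical test file, not the real dataset):

```shell
# Recreate the question's sample input
printf 'Word 1.00 0.50 -2.30\nWORD 0.90 0.40 -2.20\n' > sample_a.vec

# a[tolower($1)]++ is 0 (false) the first time a lowercased word is seen,
# so !... is true and the line prints; later duplicates are suppressed
awk '!a[tolower($1)]++' sample_a.vec
# Word 1.00 0.50 -2.30
```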
Barmar

With GNU sort, using -s for a "stable sort", and assuming the original line order doesn't need to be retained:

$ sort -k1,1 -fsu file
Word 1.00 0.50 -2.30

The differences between this and @Barmar's awk solution are:

  1. The awk solution will work using any awk while the sort one requires GNU sort to ensure the first duplicate is printed.
  2. The awk solution will retain the input line order while the sort one will produce output in alphabetical order.
  3. The awk solution will be slower than the sort one.
  4. The awk solution will run out of memory for smaller (but still huge) input files than the sort one.
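The ordering difference is easy to see on a small hypothetical sample (sample_b.vec) whose first word sorts after its second:

```shell
# Two distinct words out of alphabetical order, plus a case-insensitive duplicate
printf 'Zebra 0.10 0.20 0.30\nWord 1.00 0.50 -2.30\nWORD 0.90 0.40 -2.20\n' > sample_b.vec

# awk keeps the input order:
awk '!a[tolower($1)]++' sample_b.vec
# Zebra 0.10 0.20 0.30
# Word 1.00 0.50 -2.30

# GNU sort emits alphabetical order; -s ensures the first duplicate survives:
sort -k1,1 -fsu sample_b.vec
# Word 1.00 0.50 -2.30
# Zebra 0.10 0.20 0.30
```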
Ed Morton