How to count occurrences of each word from a list (thousands of words) in each of hundreds of TSV files on Linux?

Question

I have hundreds of TSV files with the following structure (example):

GH1 123 family1
GH2 23 family2
.
.
.
GH4 45 family4
GH6 34 family6

And I have a text file with a list of words (thousands):

GH1
GH2
GH3
.
.
.
GH1000

I want output that contains the number of times each word occurs in each file, like this:

 GH1 GH2 GH3 ... GH1000
filename1 1 1 0... 4
.
.
.
filename2 2 3 1... 0

I tried this code, but it only gives me zeros:

for file in *.tsv; do
    echo $file >> output.tsv
    cat fore.txt | while read line; do
        awk -F "\\t" '{print $1}' $file | grep -wc $line >>output.tsv
        echo "\\t">>output.tsv;
    done ;
done

score 0 · Answer 1 · answered Dec 27 '19 at 12:05


Use the following script.

Just redirect stdout to an output.txt file.

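#!/bin/bash

# Header row: one column per word.
while read -r p; do
    echo -n "$p "
done <words.txt

echo ""
# One row per file: whole-word occurrences of each word in that file.
for file in *.tsv; do
    echo -n "$file = "
    while read -r p; do
        # -o prints each match on its own line, -w matches whole words,
        # -F treats the word as a fixed string; wc -l counts the matches.
        COUNT=$(grep -owF -- "$p" "$file" | wc -l)
        echo -n "$COUNT     "
    done <words.txt
    echo ""
done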


Hamza Bilal


That's horribly inefficient. – tripleee Dec 27 '19 at 13:05
What's the complexity you are offering? – Hamza Bilal Dec 27 '19 at 13:19
This code also produces tabular output, but all the counts are zero, meaning no match for the words in the tsv files. – Hitesh Tikariha Dec 28 '19 at 03:36

tripleee · Answer 2 · 2019-12-27T20:30:24.130

Here is a simple Awk script which collects a list like the one you describe.

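awk 'NR==FNR { a[$1] = n = FNR;    # word list: map each word to its column number
        printf "\t%s", $1; next }  # and print it as a header cell
    FNR==1 {                       # first line of each subsequent file
        if(f) { printf "%s", f;    # report the previous file, if there was one
            for (i=1; i<=n; i++)
                printf "\t%s", 0+b[i] }
        printf "\n"
        delete b
        f = FILENAME }
    $1 in a { b[a[$1]]++ }         # count by column number, matching the loop above
    ' fore.txt *.tsv /etc/motd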

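The counts for each file are printed when the first line of the next file is read. To avoid repeating the reporting block in END, we add a short sentinel file (here /etc/motd) after the last real file; its only purpose is to trigger the report for the last .tsv file, and its own counts are never reported. The sentinel must exist and be non-empty, because an empty file never triggers FNR==1.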
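For comparison, here is a minimal sketch of the alternative without a sentinel: the reporting code moves into an Awk function, called both at each file boundary and in END. The function name report and the sample files a.tsv and b.tsv are made up for this illustration; it has only been reasoned through against this tiny sample, not against real data.

```shell
#!/bin/sh
# Sketch only: made-up sample data in a scratch directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'GH1\nGH2\nGH3\n' > fore.txt
printf 'GH1\t123\tfamily1\nGH2\t23\tfamily2\nGH1\t5\tfamily1\n' > a.tsv
printf 'GH3\t45\tfamily3\n' > b.tsv

result=$(awk '
    # report() prints one row for the file just finished, then resets the counts
    function report(   i) {
        printf "%s", f
        for (i = 1; i <= n; i++) printf "\t%d", b[i]
        printf "\n"
        delete b
    }
    # word list: remember the column number of each word, print the header cell
    NR == FNR { a[$1] = n = FNR; printf "\t%s", $1; next }
    # new data file: finish the header line or report the previous file
    FNR == 1  { if (f) report(); else printf "\n"; f = FILENAME }
    $1 in a   { b[a[$1]]++ }
    # last file: no sentinel needed, END reports it
    END       { if (f) report(); else printf "\n" }' fore.txt a.tsv b.tsv)

printf '%s\n' "$result"
cd / && rm -rf "$tmp"
```

The trade-off is a few extra lines for the function in exchange for not depending on an unrelated file such as /etc/motd being present and non-empty.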

The shell's while read loop is slow and inefficient and somewhat error-prone (you basically always want read -r and handling incomplete text files is hairy); in addition, the brute-force method will require reading the word file once per iteration, which incurs a heavy I/O penalty.
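To illustrate the two pitfalls: without -r, read treats backslashes as escape characters, and a plain while read loop silently skips a final line that lacks a trailing newline. A common idiom guards against both; this is a sketch with made-up sample input, not a recommendation to use a shell loop for this task.

```shell
#!/bin/sh
# Sample input: a literal backslash in the first line, no newline after the last.
tmp=$(mktemp)
printf 'a\\tb\nlast' > "$tmp"

count=0
first=
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
# and the || test picks up a final line that has no trailing newline.
while IFS= read -r line || [ -n "$line" ]; do
    count=$((count + 1))
    if [ "$count" -eq 1 ]; then first=$line; fi
done < "$tmp"
rm -f "$tmp"

printf '%s lines, first: %s\n' "$count" "$first"
```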

Hi, I tried this code. It gives the result in tabular format, but the counts are zero for all. — Hitesh Tikariha, Dec 28 '19 at 03:28
Does your input file have DOS carriage returns? Take them out and try again. See also https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings — tripleee, Dec 28 '19 at 08:15
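A quick way to check for carriage returns, sketched with made-up file names: grep -c with a pattern for a carriage return at end of line counts the affected lines, and tr -d '\r' writes a cleaned copy.

```shell
#!/bin/sh
# Sample word list with DOS (CRLF) line endings; the file names are made up.
tmp=$(mktemp -d)
printf 'GH1\r\nGH2\r\n' > "$tmp/fore_dos.txt"

cr=$(printf '\r')
# Count lines ending in a carriage return (non-zero means DOS line endings).
# grep exits non-zero when nothing matches, hence the || true.
dos_lines=$(grep -c "$cr\$" "$tmp/fore_dos.txt" || true)

# Write a cleaned copy with every carriage return removed.
tr -d '\r' < "$tmp/fore_dos.txt" > "$tmp/fore.txt"
clean_lines=$(grep -c "$cr\$" "$tmp/fore.txt" || true)

printf 'before: %s, after: %s\n' "$dos_lines" "$clean_lines"
rm -rf "$tmp"
```

With a carriage return left in place, Awk reads the first field of the word list as the word plus a trailing carriage return, which never equals the first field of a clean .tsv line, so every count stays at zero.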
