0

I have a list of words that I need to check for in more than one hundred text files.

My word list is in a file named word2search.txt.

This text file contains N words:

Word1
Word2
Word3
Word4
Word5
Word6
Wordn

So far I have written this bash script:

#!/bin/bash

listOfWord2Find=/home/mobaxterm/MyDocuments/word2search.txt

while IFS= read -r listOfWord2Find
do
    echo "$listOfWord2Find"
    grep -l -R "$listOfWord2Find" /home/mobaxterm/MyDocuments/txt/*.txt
    echo "================================================================="
done <"$listOfWord2Find" 

The result does not satisfy me; it is hard to make use of:

Word1
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
/home/mobaxterm/MyDocuments/txt/file2.txt
/home/mobaxterm/MyDocuments/txt/file3.txt
=================================================================
Word2
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
=================================================================
Word3
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file4.txt
/home/mobaxterm/MyDocuments/txt/file5.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
=================================================================
Word4
/home/mobaxterm/MyDocuments/txt/new 6.txt
/home/mobaxterm/MyDocuments/txt/file1.txt
=================================================================
Word5
/home/mobaxterm/MyDocuments/txt/new 6.txt
=================================================================

This is what I want to see:

/home/mobaxterm/MyDocuments/txt/file1.txt : Word1, Word2, Word3, Word4
/home/mobaxterm/MyDocuments/txt/file2.txt : Word1
/home/mobaxterm/MyDocuments/txt/file3.txt : Word1
/home/mobaxterm/MyDocuments/txt/file4.txt : Word3
/home/mobaxterm/MyDocuments/txt/file5.txt : Word3
/home/mobaxterm/MyDocuments/txt/new 6.txt : Word1, Word2, Word3, Word4, Word5, Word6

I do not understand why my script does not show Word6 (there are files which contain Word6). It stops at Word5. To work around this, I added a dummy last line, blablabla, to the word list (a string I am sure will not occur in any file).

If you can help me on this subject :) Thank you.

Dudi Boy
raoh
  • I suggest postprocessing the output of GNU `grep`: `grep -Hf /home/mobaxterm/MyDocuments/word2search.txt /home/mobaxterm/MyDocuments/txt/*.txt` – Cyrus Jan 28 '22 at 16:42
  • You might be interested in grep's -f option. – Shawn Jan 28 '22 at 16:43
  • @Cyrus: I tried it and it doesn't work as expected. Nothing is displayed on the console when I run: `grep -Hf /home/mobaxterm/MyDocuments/word2search.txt /home/mobaxterm/MyDocuments/txt/*.txt` – raoh Jan 28 '22 at 16:59
  • re: *stops at word5* ... depending on how `word2search.txt.` was populated and/or edited, I'm wondering if there could be some unwanted non-printing characters in the file? I'd want to review the output from `od -c word2search.txt` and see if there are any characters other than `a-z`, `0-9`, `` and `\n` – markp-fuso Jan 28 '22 at 17:13
  • If your files have DOS line endings, fix that first. See https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings – tripleee Jan 28 '22 at 20:28
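
One likely cause of the loop stopping at Word5, assuming word2search.txt has no newline after its last line: `read` returns a non-zero status when it reaches end of file before a newline, so the `while` loop exits without running its body for that last line, even though the variable has been filled. A minimal sketch of the usual workaround follows; `read_words` is a hypothetical helper name and the temporary file is only for the demonstration.

```bash
#!/bin/bash
# Sketch: process every line of a word list, including a last line
# that is not terminated by a newline.
# read_words is a hypothetical helper; $1 is the path of the word list.
read_words() {
  local word
  # read fails at end of file, but still stores an unterminated last line
  # in "word"; the -n test lets the loop body run for that line too.
  while IFS= read -r word || [ -n "$word" ]; do
    printf '%s\n' "$word"
  done < "$1"
}

# Demo: the last line (Word6) has no trailing newline
tmp=$(mktemp)
printf 'Word5\nWord6' > "$tmp"
read_words "$tmp"
rm -f "$tmp"
```

With this loop condition the dummy last line in the word list should no longer be needed.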

3 Answers

2

A more elegant approach is to search each file for all the words at once, one file at a time.

Use grep's multi-pattern option -f, --file=FILE, and print only the matched part of each line with -o, --only-matching.

Then pipe the resulting words through a few filters to turn them into a comma-separated list.

Like this:

script.sh

#!/bin/bash

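# WORDS_LIST (environment variable): path to the patterns file, one pattern per line
for currFile in "$@"; do  # "$@" keeps file names with spaces intact
  # collect the unique matched words and join them with ", "
  matched_words_list=$(grep --only-matching --file="$WORDS_LIST" "$currFile" | sort | uniq | awk -v ORS=', ' 1 | sed 's/, $//')
  printf "%s : %s\n" "$currFile" "$matched_words_list"
done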

script.sh output

The words list file is passed in the environment variable WORDS_LIST.

The files to inspect are passed as arguments: input.*.txt

export WORDS_LIST=./words.txt; ./script.sh input.*.txt
input.1.txt : word1, word2
input.2.txt : word4
input.3.txt :
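
One caveat: the lines of the words file are used as regular expressions and match anywhere in a line, so a pattern such as word1 is also found inside word10. A minimal sketch, assuming the list holds literal whole words, adds grep's -F (--fixed-strings) and -w (--word-regexp) options. The helper name `matched_whole_words` and the demo files are hypothetical.

```bash
#!/bin/bash
# Sketch: list the unique words from a word list that occur as whole words in a file.
# matched_whole_words is a hypothetical helper.
# $1: words file (one literal word per line), $2: file to inspect
matched_whole_words() {
  # -F: treat each pattern as a fixed string, not a regular expression
  # -w: only accept matches that form a whole word
  grep --only-matching --fixed-strings --word-regexp --file="$1" "$2" | sort -u
}

# Demo: "word10" must not be reported as a match for "word1"
workdir=$(mktemp -d)
printf 'word1\n' > "$workdir/words.txt"
printf 'word10\nthis line has word1 in it\n' > "$workdir/input.txt"
matched_whole_words "$workdir/words.txt" "$workdir/input.txt"
rm -rf "$workdir"
```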

Explanation:

using words.txt:

word2
word1
word5
word4

using input.1.txt:

word1
word2
word3
word3
word1
word3

And the full grep pipeline:

grep --file=words.txt -o input.1.txt |sort|uniq|awk -vORS=, 1|sed s/,$//
word1,word2

output 1

List all matched words from words.txt in inspected file input.1.txt

grep --file=words.txt -o input.1.txt
word1
word2
word1

 

output 2

List all matched words from words.txt in inspected file input.1.txt

Then sort the output words list

grep --file=words.txt -o input.1.txt|sort
word1
word1
word2

output 3

List all matched words from words.txt in inspected file input.1.txt

Then sort the output words list

Then remove duplicate words

grep --file=words.txt -o input.1.txt|sort|uniq
word1
word2

output 4

List all matched words from words.txt in inspected file input.1.txt

Then sort the output words list

Then remove duplicate words

Then create a csv list from the unique words

grep --file=words.txt -o input.1.txt|sort|uniq|awk -vORS=, 1
word1,word2,

output 5

List all matched words from words.txt in inspected file input.1.txt

Then sort the output words list

Then remove duplicate words

Then create a csv list from the unique words

Then remove the trailing , from the csv list

grep --file=words.txt -o input.1.txt|sort|uniq|awk -vORS=, 1|sed s/,$//
word1,word2
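
The same steps can be written more compactly. The sketch below is an alternative, not the pipeline shown above: it assumes `sort -u` in place of `sort|uniq` and `paste -s -d,` in place of the awk and sed steps, since paste joins the lines with commas without leaving a trailing one. `matched_csv` is a hypothetical helper name, and the demo files repeat the sample data from this answer.

```bash
#!/bin/bash
# Sketch: print the unique matched words of a file as a comma-separated list.
# matched_csv is a hypothetical helper.
# $1: words file (one pattern per line), $2: file to inspect
matched_csv() {
  # sort -u sorts and removes duplicates in one step;
  # paste -s -d, joins all input lines into one line separated by commas
  grep --only-matching --file="$1" "$2" | sort -u | paste -s -d, -
}

# Demo with the sample data used in this answer
workdir=$(mktemp -d)
printf 'word2\nword1\nword5\nword4\n' > "$workdir/words.txt"
printf 'word1\nword2\nword3\nword3\nword1\nword3\n' > "$workdir/input.1.txt"
matched_csv "$workdir/words.txt" "$workdir/input.1.txt"   # prints word1,word2
rm -rf "$workdir"
```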
Dudi Boy
  • Hello Dudi Boy, thank you for your help and explanation. I have a small issue: no CSV file is created, and in my console I only get the file names without the words found: `input1.txt : "blank"` then `input2.txt : "blank"`. Any idea? – raoh Jan 31 '22 at 09:05
  • Ok Nevermind, I found where the problem was with my "blank": ```dos2unix words.txt``` :) – raoh Jan 31 '22 at 09:19
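
The blank results described in the comments above are what DOS (CRLF) line endings in the words file would produce: each pattern then ends in a carriage return, so it only matches text that is itself followed by a carriage return. Where dos2unix is not installed, a minimal sketch with `tr` does the same cleanup. `strip_cr` is a hypothetical helper name and the demo file names are made up for the example.

```bash
#!/bin/bash
# Sketch: write a copy of a file with all carriage returns removed.
# strip_cr is a hypothetical helper.
# $1: input file (possibly with CRLF line endings), $2: output file
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Demo: a word list saved with DOS line endings
workdir=$(mktemp -d)
printf 'word1\r\nword2\r\n' > "$workdir/words.dos.txt"
strip_cr "$workdir/words.dos.txt" "$workdir/words.txt"
od -c "$workdir/words.txt"   # the dump should contain \n but no \r
rm -rf "$workdir"
```

Running `od -c` on the original file first, as suggested in the comments on the question, shows whether the cleanup is needed at all.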
0

The suggested strategy is to scan each line once against all the words.

I suggest writing a gawk script. gawk is the standard awk on most Linux distributions, and the ENDFILE pattern used below is a gawk extension.

script.awk

FNR == NR { # first file only: the list of words to match
  matchWordsArr[++wordsCount] = $0; # store the words in an ordered array
  matchedWordInFile[wordsCount] = 0; # initialize the per-file match counter
}

FNR != NR { # every line of an inspected file
  for (i = 1; i <= wordsCount; i++) { # test the line against each word, in list order
    if ($0 ~ matchWordsArr[i]) matchedWordInFile[i]++; # the word is used as a regex; count the match
  }
}

ENDFILE { # gawk extension: runs after the last line of each file
  if (FNR != NR) { # skip the words list itself
    outputLine = sprintf("%s: ", FILENAME); # start the output line with the file name
    for (i = 1; i <= wordsCount; i++) { # iterate over the words in list order
      if (matchedWordInFile[i] == 0) continue; # skip unmatched words
      outputLine = sprintf("%s%s%s", outputLine, separator, matchWordsArr[i]); # append the matched word
      matchedWordInFile[i] = 0; # reset the counter for the next file
      separator = ","; # every later word is preceded by ","
    }
    print outputLine;
  }
  outputLine = separator = ""; # reset the output line and separator for the next file
}

input.1.txt:

word1
word2
word3

input.2.txt:

word3
word4
word5

input.3.txt:

word3
word7
word8

words.txt

word2
word1
word5
word4

running:

$ awk -f script.awk words.txt input.*.txt
input.1.txt: word2,word1
input.2.txt: word5,word4
input.3.txt:
Dudi Boy
0

Just grep:

grep -f list.txt input.*.txt

-f FILENAME tells grep to read its search patterns from a file, one pattern per line.

If you want to display the filename along with the match, pass -H in addition to that:

grep -Hf list.txt input.*.txt
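
To get from this output to the "file : word, word" layout asked for in the question, the grep output can be postprocessed, as a comment on the question suggests. The following is a minimal sketch under stated assumptions: `-o` is added so that only the matched words are printed, neither the file names nor the words contain a colon, and `words_per_file` is a hypothetical helper name. Files without any match are not listed.

```bash
#!/bin/bash
# Sketch: print one line per file in the form "file : word, word, ...".
# words_per_file is a hypothetical helper.
# Usage: words_per_file WORDLIST FILE...
words_per_file() {
  local list=$1
  shift
  # -H prefixes every match with "filename:", -o prints only the matched text.
  # LC_ALL=C sort -u removes duplicates and keeps each file's lines adjacent.
  grep -H -o -f "$list" "$@" | LC_ALL=C sort -u |
    awk -F: '
      $1 != file { if (NR > 1) print line; file = $1; line = $1 " : " $2; next }
      { line = line ", " $2 }   # same file as before: append the word
      END { if (NR > 0) print line }'
}

# Demo with two small files
workdir=$(mktemp -d)
printf 'word2\nword1\nword5\nword4\n' > "$workdir/list.txt"
printf 'word1\nword2\nword3\n' > "$workdir/input.1.txt"
printf 'word3\nword4\nword5\n' > "$workdir/input.2.txt"
(cd "$workdir" && words_per_file list.txt input.*.txt)
rm -rf "$workdir"
```

Because the file name is taken as everything before the first colon, names containing spaces such as "new 6.txt" are handled, while names containing a colon are not.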
hek2mgl