In one text file, I have 150 words. I have another text file, which has about 100,000 lines.

How can I check, for each word in the first file, whether it appears in the second file or not?

I thought about using grep, but I could not work out how to make it read each of the words from the first file.

Is there any way to do this using awk? Or another solution?

I tried with this shell script, but it matches almost every line:

#!/usr/bin/env sh
cat words.txt | while read line; do  
    if grep -F "$FILENAME" text.txt
    then
        echo "Se encontró $line"
    fi
done

Another way I found is:

fgrep -w -o -f "words.txt" "text.txt"
Benjamin W.
ocslegna
  • I'm flagging this as off topic - it really sounds like a question that belongs on Super User, not here, since this is more a question of how to use built in command line tools than how to program. – ArtOfWarfare Jan 22 '14 at 15:48
  • @ArtOfWarfare This post is not off topic. You misunderstood this. – hek2mgl Jan 22 '14 at 15:51
  • @hek2mgl - Take a look at the answers. Notice how none of them include a single line of programming. Why? Because it's not a programming question - it's a question about how to use the built in command line tools on Linux. Thus it's a question for SuperUser, not StackOverflow. – ArtOfWarfare Jan 22 '14 at 15:55
  • My apologies, I did not think of putting it on SuperUser. I hope you now understand what happened! – ocslegna Jan 22 '14 at 15:56
  • @ArtOfWarfare I do shell coding and `awk` all day (currently). Would you say that I'm not doing programming? Would you say bash and awk aren't programming languages? That's nonsense. – hek2mgl Jan 22 '14 at 15:57
  • @ocslegna This question is perfectly on-topic, and was therefore up-voted :) – hek2mgl Jan 22 '14 at 15:58
  • I agree it's as on-topic as any of the hundreds of similar questions we see, but we do need to see some sample input and expected output. @ocslegna - be careful with the answer you select to make sure it operates on WORDs and not STRINGs or, even worse, REGEXPs, or you'll find `the` or `a.r` in the first file matching `theatre` in the second. – Ed Morton Jan 22 '14 at 17:19
  • @hek2mgl - No, they are not programming languages. They are scripting languages. – ArtOfWarfare Jan 22 '14 at 18:47
  • @ArtOfWarfare There are even (simple) C compilers in `awk`. Wouldn't you call a C compiler a program, even if it is simple and not optimized? I don't want to be right in this discussion, but you shouldn't underestimate the amount of "programs" written in such languages – hek2mgl Jan 22 '14 at 19:29
  • @ArtOfWarfare you are very confused if you think interpreting vs compiling a program makes a difference to whether or not the language that program is written in is a programming language. If you have some other distinction in mind, do tell. You'd still be wrong though :-). – Ed Morton Jan 22 '14 at 19:58
  • @hek2mgl - I could write a C compiler in Excel, too. That doesn't make questions about how to misuse Excel in this way a valid topic for StackOverflow. Shell SCRIPTS (notice the word script? Nobody ever refers to C code as being a script) aren't a valid topic for StackOverflow. Just because others have snuck by without being flagged doesn't make them any more valid. Just out of curiosity, if this isn't the difference between SuperUser and StackOverflow, then what is? – ArtOfWarfare Jan 22 '14 at 20:37
  • @ArtOfWarfare SuperUser is for questions about installing, configuring or maintaining existing software, like a web server, a video player, or even an operating system. Stackoverflow is for questions related to something the OP will develop on their own. However, shell scripts in general are on-topic on both sites, Stackoverflow and SuperUser, because they can be used for both purposes. *This* question is about general shell "scripting", or let's say shell "programming", which is why it is on topic here. – hek2mgl Jan 22 '14 at 20:44
  • @hek2mgl - What you're describing is ServerFault, not SuperUser. – ArtOfWarfare Jan 23 '14 at 00:40

2 Answers

You can use grep -f, with -F so that each line of the first file is treated as a fixed string:

grep -Ff "first-file" "second-file"

Or, to match whole words only:

grep -w -Ff "first-file" "second-file"
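
To see the difference between the two commands, here is a small sketch. The sample files and their contents are made up for illustration; only the grep options come from the commands above.

```shell
#!/bin/sh
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf 'the\ncat\n' > "$tmp/first-file"
printf 'a theatre\nthe cat sat\nnothing here\n' > "$tmp/second-file"

# -F: fixed strings, so "the" also matches inside "theatre".
substring_hits=$(grep -Ff "$tmp/first-file" "$tmp/second-file")

# -w: only whole-word matches count, so the "a theatre" line is dropped.
word_hits=$(grep -w -Ff "$tmp/first-file" "$tmp/second-file")

printf 'substring matches:\n%s\n\nwhole-word matches:\n%s\n' \
    "$substring_hits" "$word_hits"

rm -rf "$tmp"
```

With this data the first command prints two lines and the second prints only the line containing the standalone words.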

UPDATE: As per the comments, to print each word from file1 once if it appears as the first field of a line in file2:

awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2
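
Note that the awk command above compares only the first field ($1) of each line in file2. If the words can appear anywhere in a line, the following sketch may be closer to what the comments ask for. It is one possible approach, not the only one, and the sample file contents are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf 'select\nfrom\ntable\n' > "$tmp/file1"
printf 'select a from b;\nselect c from d;\n' > "$tmp/file2"

# grep: -o prints only the matched words, one per line;
# sort -u then keeps each word once.
found=$(grep -owFf "$tmp/file1" "$tmp/file2" | sort -u)

# awk: check every whitespace-separated field, not just $1.
# A field with punctuation attached (such as "b;") will not match.
found_awk=$(awk 'FNR==NR { a[$1]; next }
                 { for (i = 1; i <= NF; i++)
                       if ($i in a) { print $i; delete a[$i] } }' \
                "$tmp/file1" "$tmp/file2")

printf 'grep:\n%s\n\nawk:\n%s\n' "$found" "$found_awk"

rm -rf "$tmp"
```

The grep variant lists the words in sorted order, while the awk variant lists them in order of first appearance in file2.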
anubhava
  • Cool! Didn't know that! I was about to suggest something like: `grep -E $(cat search | tr '\n' '|') text ` :) – hek2mgl Jan 22 '14 at 15:50
  • Thank you @anubhava! Your answer was helpful. – ocslegna Jan 22 '14 at 16:02
  • This is looking for strings so that's good but will match `the` to `theatre` - is that desirable? – Ed Morton Jan 22 '14 at 17:24
  • Yes, the `-w` option can be added to make sure a complete word is matched (if so desired). – anubhava Jan 22 '14 at 17:52
  • With `fgrep -w -o -f "first-file" "second-file"`, all the words that were found are returned, but they are repeated. How do I show each of them only once? – ocslegna Jan 22 '14 at 19:06
  • So you want to show a matching line from the second file only the first time it matches? – anubhava Jan 22 '14 at 19:15
  • I want to see if the words of text1 are present in the second. – ocslegna Jan 22 '14 at 19:22
  • Right, but I just want to understand the output you need. So just a list of the words from `text1` that are present in the second, right? – anubhava Jan 22 '14 at 19:30
  • @anubhava Exactly. In `text1` I have 150 reserved words (red hat i.e) and in the second file, `****.sql`, I have 100,000 lines. I only want to know whether the words from file1 are present in the second. – ocslegna Jan 22 '14 at 19:33
  • @anubhava Do you know why this works on a Fedora server but not on a Red Hat server? – ocslegna Jan 22 '14 at 20:13
  • Are the files exactly the same on both servers? (check with the `cat -vte file` command) – anubhava Jan 22 '14 at 20:15
  • Yes, it's the same: `$` on both servers – ocslegna Jan 22 '14 at 20:22
  • It could be due to different awk versions, I guess. Is it not showing any output on red hat? – anubhava Jan 22 '14 at 20:27
  • @anubhava In the beginning, no. I copied and edited the same .sql file and wrote a new one. I think it didn't work at the beginning because the .sql file had been stored on an FTP server and another server before Red Hat. Now, for the moment, it is working. – ocslegna Jan 22 '14 at 20:32
  • Be aware that a direct invocation as either `egrep` or `fgrep` is deprecated, but is provided to allow historical applications that rely on them to run unmodified. (source `man grep`) – kvantour May 20 '19 at 08:11
  • Good point, @kvantour. I updated the answer to use `grep -Ff` instead of `fgrep` in this 5 year old answer. – anubhava May 20 '19 at 09:12

Use grep like this:

grep -f firstfile secondfile
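
Without -F, grep treats each line of firstfile as a regular expression. A minimal sketch of what that means in practice, using invented sample files (the names and contents are assumptions for the example):

```shell
#!/bin/sh
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf 'a.r\n' > "$tmp/firstfile"
printf 'theatre\nliteral a.r here\n' > "$tmp/secondfile"

# Without -F, "." matches any character, so "a.r" matches "atr" in "theatre".
regex_hits=$(grep -f "$tmp/firstfile" "$tmp/secondfile")

# With -F, only the literal text "a.r" matches.
fixed_hits=$(grep -Ff "$tmp/firstfile" "$tmp/secondfile")

printf 'as patterns:\n%s\n\nas fixed strings:\n%s\n' "$regex_hits" "$fixed_hits"

rm -rf "$tmp"
```

If the words contain no regular expression metacharacters, both forms give the same result.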

SECOND OPTION

Thank you to Ed Morton for pointing out that grep -f treats the words in the file "reserved" as patterns (regular expressions). If that is an issue (it may or may not be), the OP could use something like the following, which does plain substring matching with awk's index() instead of patterns:

File "reserved"

cat
dog
fox

and file "text"

The cat jumped over the lazy
fox but didn't land on the
moon at all.
However it did land on the dog!!!

The awk script is:

awk 'BEGIN{i=0}FNR==NR{res[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,res[j]))print $0}' reserved text

with output:

The cat jumped over the lazy
fox but didn't land on the
However it did land on the dog!!!

THIRD OPTION

Alternatively, it can be done quite simply, but more slowly in bash:

while IFS= read -r r; do grep -- "$r" secondfile; done < firstfile
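
A variation on the same loop can report, for each word, whether it was found at all, which is closer to what the question literally asks. This is only a sketch: the sample files are invented, and the choice of -w and -F (whole words, fixed strings) is an assumption about what the OP wants.

```shell
#!/bin/sh
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf 'cat\ndog\nbird\n' > "$tmp/firstfile"
printf 'The cat jumped\nover the dog!!!\n' > "$tmp/secondfile"

report=$(
    while IFS= read -r word; do
        # -q: no output, exit status only; -w: whole words;
        # -F: fixed string; --: end of options, in case a word starts with "-".
        if grep -qwF -- "$word" "$tmp/secondfile"; then
            echo "found: $word"
        else
            echo "missing: $word"
        fi
    done < "$tmp/firstfile"
)
printf '%s\n' "$report"

rm -rf "$tmp"
```

Like the loop above, this runs grep once per word, so it reads the second file 150 times for 150 words. With -q, grep can stop at the first match.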
Mark Setchell