How to find words from one file in another file?

Question

In one text file, I have 150 words. I have another text file, which has about 100,000 lines.

How can I check for each of the words belonging to the first file whether it is in the second or not?

I thought about using grep, but I could not find out how to use it to read each of the words in the original text.

Is there any way to do this using awk? Or another solution?

I tried with this shell script, but it matches almost every line:

#!/usr/bin/env sh
cat words.txt | while read line; do  
    if grep -F "$FILENAME" text.txt
    then
        echo "Found $line"
    fi
done

Another way I found is:

fgrep -w -o -f "words.txt" "text.txt"

I'm flagging this as off topic - it really sounds like a question that belongs on Super User, not here, since this is more a question of how to use built-in command-line tools than how to program. – ArtOfWarfare, Jan 22 '14 at 15:48
@ArtOfWarfare This post is not off topic. You misunderstood it. – hek2mgl, Jan 22 '14 at 15:51
@hek2mgl - Take a look at the answers. Notice how none of them include a single line of programming. Why? Because it's not a programming question - it's a question about how to use the built-in command-line tools on Linux. Thus it's a question for SuperUser, not StackOverflow. – ArtOfWarfare, Jan 22 '14 at 15:55
My apologies, I did not think of putting it on SuperUser. I hope you now understand what happened! – ocslegna, Jan 22 '14 at 15:56
@ArtOfWarfare I do shell coding, and `awk`, all day long (currently). Would you say that I'm not doing programming? Would you say bash and awk aren't programming languages? That's nonsense. – hek2mgl, Jan 22 '14 at 15:57
@ocslegna This question is perfectly on-topic - and was therefore up-voted :) – hek2mgl, Jan 22 '14 at 15:58
I agree it's as on-topic as any of the hundreds of similar questions we see, but we do need to see some sample input and expected output. @ocslegna - be careful with the answer you select to make sure it operates on WORDs and not STRINGs or, even worse, REGEXPs, or you'll find `the` or `a.r` in the first file matching `theatre` in the second. – Ed Morton, Jan 22 '14 at 17:19
@hek2mgl - No, they are not programming languages. They are scripting languages. – ArtOfWarfare, Jan 22 '14 at 18:47
@ArtOfWarfare There are even (simple) C compilers in `awk`. Wouldn't you call a C compiler a program, even if it is simple and not optimized? I don't want to be right in this discussion, but you shouldn't underestimate the number of "programs" written in such languages. – hek2mgl, Jan 22 '14 at 19:29
@ArtOfWarfare you are very confused if you think interpreting vs compiling a program makes a difference to whether or not the language that program is written in is a programming language. If you have some other distinction in mind, do tell. You'd still be wrong though :-). – Ed Morton, Jan 22 '14 at 19:58
@hek2mgl - I could write a C compiler in Excel, too. That doesn't make questions about how to misuse Excel in this way a valid topic for StackOverflow. Shell SCRIPTS (notice the word script? Nobody ever refers to C code as being a script) aren't a valid topic for StackOverflow. Just because others have snuck by without being flagged doesn't make them any more valid. Just out of curiosity, if this isn't the difference between SuperUser and StackOverflow, then what is? – ArtOfWarfare, Jan 22 '14 at 20:37
@ArtOfWarfare SuperUser is for questions about installing, configuring or maintaining existing software, like a web server, a video player, or even an operating system. Stackoverflow is for questions related to something the OP will develop on their own. However, shell scripts in general are on-topic on both sites, Stackoverflow and SuperUser, because they can be used for both purposes. *This* question is about general shell "scripting", or let's say shell "programming"; that's why it is on topic here. – hek2mgl, Jan 22 '14 at 20:44
@hek2mgl - What you're describing is ServerFault, not SuperUser. – ArtOfWarfare, Jan 23 '14 at 00:40

anubhava · Accepted Answer · 2019-05-20T09:11:35.043

10

You can use grep -f:

grep -Ff "first-file" "second-file"

Or, to match whole words only:

grep -w -Ff "first-file" "second-file"
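
To illustrate the difference between the two commands, here is a minimal sketch. The file names and their contents are made up for the example and are created in a temporary directory; `-c` is used only to count matching lines.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' the cat > "$tmp/words.txt"
printf '%s\n' 'a theatre in town' 'the cat sat' > "$tmp/text.txt"

# Fixed-string substring matching: "the" also matches inside "theatre".
substring_hits=$(grep -c -Ff "$tmp/words.txt" "$tmp/text.txt")

# Whole-word matching: only the second line contains "the" or "cat" as a word.
word_hits=$(grep -c -w -Ff "$tmp/words.txt" "$tmp/text.txt")

echo "substring: $substring_hits, whole word: $word_hits"
rm -rf "$tmp"
```

With this sample data the first count is 2 and the second is 1, because `theatre` no longer counts as a match for `the`.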

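UPDATE: As per the comments, to print each matching word only once (this compares the first field of each line in file2 against the words in file1):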

awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2
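
If the words can appear anywhere on a line rather than in the first field, one possible alternative is to let grep print only the matched words and then remove duplicates with `sort -u`. This is a sketch, not part of the original answer; the sample file contents are invented for illustration.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' select from where > "$tmp/words.txt"
printf '%s\n' 'select id from users' 'select name from users' > "$tmp/text.txt"

# -o prints only the matched text, -w matches whole words, -F treats the
# words as fixed strings; sort -u keeps one copy of each word found.
unique_words=$(grep -o -w -Ff "$tmp/words.txt" "$tmp/text.txt" | sort -u)

echo "$unique_words"
rm -rf "$tmp"
```

Here `from` and `select` are each listed once, and `where` is not listed because it never occurs in the text.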

edited May 20 '19 at 09:11

answered Jan 22 '14 at 15:46

anubhava


Cool! Didn't know that! I was about to suggest something like: `grep -E $(cat search | tr '\n' '|') text` :) – hek2mgl Jan 22 '14 at 15:50
Thank you @anubhava! Your answer was helpful. – ocslegna Jan 22 '14 at 16:02
This is looking for strings so that's good but will match `the` to `theatre` - is that desirable? – Ed Morton Jan 22 '14 at 17:24
Yes, the `-w` option can be added to make sure a complete word is matched (if so desired). – anubhava Jan 22 '14 at 17:52
With `fgrep -w -o -f "first-file" "second-file"`, all the words that were found are returned, but they are repeated. How do I show them only once? – ocslegna Jan 22 '14 at 19:06
So you want to show a matching line from the second file only the first time? – anubhava Jan 22 '14 at 19:15
I want to see if the words of text1 are present in the second. – ocslegna Jan 22 '14 at 19:22
Right, but I just want to understand the output you need. So just a list of the words from `text1` that are present in the second, right? – anubhava Jan 22 '14 at 19:30
@anubhava Exactly. In `text1` I have 150 reserved words (e.g. red hat) and in the second file, `****.sql`, I have 100,000 lines; all I want to know is whether the words from file1 are present in the second. – ocslegna Jan 22 '14 at 19:33
@anubhava Do you know why this works on a Fedora server but not on a Red Hat server? – ocslegna Jan 22 '14 at 20:13
Are the files exactly the same on both servers? (check with the `cat -vte file` command) – anubhava Jan 22 '14 at 20:15
Yes, it's the same: `$` on both servers – ocslegna Jan 22 '14 at 20:22
It could be due to different awk versions, I guess. Is it not showing any output on red hat? – anubhava Jan 22 '14 at 20:27
@anubhava At the beginning, no. I copied and edited the same .sql file and wrote a new one. I think it didn't work at the beginning because the .sql file had been stored on an FTP server and another server before Red Hat. For the moment, it is working. – ocslegna Jan 22 '14 at 20:32
Be aware that a direct invocation as either `egrep` or `fgrep` is deprecated, but is provided to allow historical applications that rely on them to run unmodified. (source `man grep`) – kvantour May 20 '19 at 08:11
Good point @kvantour, I updated the answer to use `grep -Ff` instead of `fgrep` in this 5-year-old answer. – anubhava May 20 '19 at 09:12

Mark Setchell · Answer 2 · 2014-01-22T18:17:23.580

Use grep like this:

grep -f firstfile secondfile

SECOND OPTION

Thanks to Ed Morton for pointing out that `grep -f` treats the words in the file "reserved" as patterns. If that is an issue (it may or may not be), the OP could use something like this, which does plain substring matching instead of pattern matching:

File "reserved"

cat
dog
fox

and file "text"

The cat jumped over the lazy
fox but didn't land on the
moon at all.
However it did land on the dog!!!

The awk script looks like this:

awk 'BEGIN{i=0}FNR==NR{res[i++]=$1;next}{for(j=0;j<i;j++)if(index($0,res[j]))print $0}' reserved text

with output:

The cat jumped over the lazy
fox but didn't land on the
However it did land on the dog!!!
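
Note that the script above prints a line once for every reserved word it contains, so a line with two reserved words appears twice. A possible variant, shown here only as a sketch with made-up sample data, stops at the first hit with `next` so that each matching line is printed at most once:

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' cat dog fox > "$tmp/reserved"
printf '%s\n' 'The cat chased the dog' 'moon at all.' > "$tmp/text"

# First file: store each word. Second file: print the line on the first
# substring hit, then skip to the next line so it is not printed again.
matches=$(awk 'FNR==NR { res[n++] = $1; next }
    { for (j = 0; j < n; j++) if (index($0, res[j])) { print; next } }' \
    "$tmp/reserved" "$tmp/text")

echo "$matches"
rm -rf "$tmp"
```

The first sample line contains both `cat` and `dog` but is printed only once. Like the original, this is substring matching, so `cat` would also match `category`.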

THIRD OPTION

Alternatively, it can be done quite simply, but more slowly, in bash:

while read -r r; do grep "$r" secondfile; done < firstfile
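
The loop above prints the matching lines and treats each word as a regular expression. If the goal is only to report which words occur, a variant along these lines might be closer to what the question asks for. It is a sketch with invented sample data, and the choice of `-q`, `-w` and `-F` is an assumption about the desired behaviour.

```shell
# Hypothetical sample data, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' cat 'a.r' unicorn > "$tmp/firstfile"
printf '%s\n' 'The cat sat' 'an air show' > "$tmp/secondfile"

found=$(while IFS= read -r word; do
    # -q: no output, only an exit status; -w: whole words only;
    # -F: fixed string, not a regex; --: end of options.
    if grep -qwF -- "$word" "$tmp/secondfile"; then
        printf '%s\n' "$word"
    fi
done < "$tmp/firstfile")

echo "$found"
rm -rf "$tmp"
```

Only `cat` is reported here: with `-F`, the word `a.r` is taken literally and therefore does not match `air`. The loop still runs grep once per word, so it remains slower than a single `grep -f` call.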

This is looking for regexps and so will match both `the` and `a.r` to `theatre` - is that desirable? — Ed Morton, Jan 22 '14 at 17:25
