
I have two CSV files: File A, with multiple columns, and File B, with one column. For example:

File A:

chr1 100000 100022 A C GeneX
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

File B:

GeneY
GeneZ

I want my output to be:

chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

I have tried using grep (which crashes) and others.
I am certain there must be a very simple answer to this that I just can't see!

Chris Dias
  • Which platform are you on if `grep` crashes? How big are the files that you're working with? You said that you got an 'out of memory' error when you tried `grep -f FileB FileA`. Your best bet in that case is probably to split `FileB` into sections small enough to be processed without `grep` crashing. The obvious disadvantage of this is that you will end up with rows in the result set that are out of order compared with the original `FileA`. If two words from `FileB` can appear in a single line, then you could also end up with repeats. – Jonathan Leffler Jan 20 '15 at 04:58
  • Does `sed` work any better? What about Perl? If neither `sed` nor `grep` nor Perl works, then you may be able to find a better way to encode the information and write your own processing. But that's something of a last resort, depending on a lot of factors not yet described in the question. – Jonathan Leffler Jan 20 '15 at 04:59
  • Thanks. I haven't been able to get sed to work. – Chris Dias Jan 20 '15 at 05:33
  • Bad luck. Please identify the platform you're working on, and the sizes of the two files (line count and size in bytes for both files would be useful). – Jonathan Leffler Jan 20 '15 at 05:34
  • I've been trying to use Unix in a bash terminal. File A is just 1 column of 1500 lines. File B is 1.2 MB, with 5800 lines. – Chris Dias Jan 20 '15 at 05:53
  • Which version of Unix? Those are tiny files! I was assuming you meant millions of records in the list of names, and gigabytes of data in the main file. OK; so maybe they aren't tiny, but they are not, by any stretch of the imagination, big. Maybe you need to get GNU `grep` installed? It will be quicker and simpler than most of the alternatives. (I just tried doing `grep -f FileA` with a file containing 1500 generated lines such as `GZX6274256PQA` (a seven-digit random number sandwiched between two constant strings) and it started up without a problem on my Mac, using BSD `grep` rather than GNU.) – Jonathan Leffler Jan 20 '15 at 05:55
  • Yes, they are not that big, which is why I am struggling. I'm on Darwin Kernel Version 13.4.0. – Chris Dias Jan 20 '15 at 21:07
  • So that's Mac OS X Mavericks 10.9.5, I guess. I was able to run `grep -f FileA` with a similar file (new set of random numbers, different sandwiching letters) without problems. That's got 16 GiB main memory; I don't know if you're memory constrained -- the memory pressure on my machine is non-existent (11 GiB used, so 5 GiB available) -- see Activity Monitor / Memory tab. Have you rebooted since you ran into trouble? (I hate suggesting that, but it can help surprisingly/depressingly often.) – Jonathan Leffler Jan 20 '15 at 21:35
  • See this post: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:38
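
Regarding the suggestion in the comments to split `FileB` into smaller chunks when `grep` runs out of memory, here is a minimal sketch of that approach. The 500-line chunk size, the `FileB_chunk_` prefix and the `matches.txt` output name are arbitrary assumptions, and, as noted above, the matches may not come out in the same order as in `FileA`:

# Split the gene list into 500-line pieces, run grep against FileA for each piece,
# collect all matches, then remove the temporary chunk files.
split -l 500 FileB FileB_chunk_
for chunk in FileB_chunk_*; do
    grep -F -f "$chunk" FileA
done > matches.txt
rm FileB_chunk_*

`-F` treats each gene name as a fixed string rather than a regular expression, which is usually cheaper than regex matching.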

2 Answers


Use `grep -f`:

grep -f FileB FileA
Amit
  • Thanks Amit. Unfortunately I get an out of memory error message with grep when I try to run this on large datasets. – Chris Dias Jan 20 '15 at 03:42
  • @ChrisDias - You can try after setting locale with `export LC_ALL=C`. Source - http://stackoverflow.com/a/11777835 – Amit Jan 20 '15 at 03:56
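
As a follow-up to the locale suggestion above, a sketch that is often lighter on memory, assuming the gene names in `FileB` should match as whole words rather than as regular expressions:

# -F: treat each line of FileB as a fixed string, not a regex
# -w: match whole words only, so GeneY does not also match GeneYZ
# LC_ALL=C: byte-wise matching, as suggested in the comment above
LC_ALL=C grep -F -w -f FileB FileA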

Here is how to do it with `awk`:

awk 'FNR==NR {a[$0];next} {for (i in a) if (i~$1) print i}' FileA FileB
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

Or like this:

awk 'FNR==NR {a[$0];next} ($NF in a)' FileB FileA
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ
Jotne
  • Thanks Jotne. I tried this, but got an empty output. I have the files as both .csv and tab-delimited .txt; neither worked. – Chris Dias Jan 20 '15 at 21:08
  • @ChrisDias It does work fine with your data above, so your data must differ in format. – Jotne Jan 20 '15 at 21:09
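
Since the comment above points to a format mismatch, a commented sketch of the second `awk` one-liner with an explicit field separator may help; the comma in `-F','` is an assumption for a true CSV and would be `-F'\t'` for a tab-delimited file:

# FNR==NR is only true while reading the first file (FileB):
# store each gene name as a key in array a, then skip to the next record.
# For FileA, print the line when its last field ($NF) is one of the stored names.
awk -F',' 'FNR==NR {a[$0]; next} ($NF in a)' FileB FileA

If the gene name is not always the last column of FileA, `$NF` would need to be replaced with the correct field number.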