
I have two CSV files: File A, with multiple columns, and File B, with one column. For example:

File A:

chr1 100000 100022 A C GeneX
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

File B:

GeneY
GeneZ

I want my output to be:

chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

I have tried using grep (which crashes) and others.
I am certain there must be a very simple answer to this that I just can't see!

Chris Dias
  • Which platform are you on if `grep` crashes? How big are the files that you're working with? You said that you got an 'out of memory' error when you tried `grep -f FileB FileA`. Your best bet in that case is probably to split `FileB` into sections small enough to be processed without `grep` crashing. The obvious disadvantage of this is that you will end up with rows in the result set that are out of order compared with the original `FileA`. If two words from `FileB` can appear in a single line, then you could also end up with repeats. – Jonathan Leffler Jan 20 '15 at 04:58
  • Does `sed` work any better? What about Perl? If neither `sed` nor `grep` nor Perl works, then you may be able to find a better way to encode the information and write your own processing. But that's something of a last resort, depending on a lot of factors not yet described in the question. – Jonathan Leffler Jan 20 '15 at 04:59
  • Thanks. I haven't been able to get sed to work. – Chris Dias Jan 20 '15 at 05:33
  • Bad luck. Please identify the platform you're working on, and the sizes of the two files (line count and size in bytes for both files would be useful). – Jonathan Leffler Jan 20 '15 at 05:34
  • I've been trying to use Unix in a bash terminal. File A is just 1 column of 1500 lines. File B is 1.2 MB, with 5800 lines. – Chris Dias Jan 20 '15 at 05:53
  • Which version of Unix? Those are tiny files! I was assuming you meant millions of records in the list of names, and gigabytes of data in the main file. OK; so maybe they aren't tiny, but they are not, by any stretch of the imagination, big. Maybe you need to get GNU `grep` installed? It will be quicker and simpler than most of the alternatives. (I just tried doing `grep -f FileA` with a file containing 1500 generated lines such as `GZX6274256PQA` (a seven-digit random number sandwiched between two constant strings) and it started up without a problem on my Mac, using BSD `grep` rather than GNU.) – Jonathan Leffler Jan 20 '15 at 05:55
  • Yes, they are not that big, which is why I am struggling. I'm on Darwin Kernel Version 13.4.0. – Chris Dias Jan 20 '15 at 21:07
  • So that's Mac OS X Mavericks 10.9.5, I guess. I was able to run `grep -f FileA` with a similar file (new set of random numbers, different sandwiching letters) without problems. That's got 16 GiB main memory; I don't know if you're memory constrained -- the memory pressure on my machine is non-existent (11 GiB used, so 5 GiB available) -- see Activity Monitor / Memory tab. Have you rebooted since you ran into trouble? (I hate suggesting that, but it can help surprisingly/depressingly often.) – Jonathan Leffler Jan 20 '15 at 21:35
  • See this post: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:38
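
Regarding the suggestion in the comments to split `FileB` into smaller chunks when `grep` runs out of memory, here is a minimal sketch of that approach. The 500-line chunk size, the `FileB_chunk_` prefix and the `matches.txt` output name are arbitrary assumptions, and, as noted above, the matches may not come out in the same order as in `FileA`:

# Split the gene list into 500-line pieces, run grep against FileA for each piece,
# collect all matches, then remove the temporary chunk files.
split -l 500 FileB FileB_chunk_
for chunk in FileB_chunk_*; do
    grep -F -f "$chunk" FileA
done > matches.txt
rm FileB_chunk_*

`-F` treats each gene name as a fixed string rather than a regular expression, which is usually cheaper than regex matching.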

2 Answers


Use `grep -f`:

grep -f FileB FileA
Amit
  • Thanks Amit. Unfortunately I get an out of memory error message with grep when I try to run this on large datasets. – Chris Dias Jan 20 '15 at 03:42
  • @ChrisDias - You can try after setting locale with `export LC_ALL=C`. Source - http://stackoverflow.com/a/11777835 – Amit Jan 20 '15 at 03:56
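
As a follow-up to the locale suggestion above, a sketch that is often lighter on memory, assuming the gene names in `FileB` should match as whole words rather than as regular expressions:

# -F: treat each line of FileB as a fixed string, not a regex
# -w: match whole words only, so GeneY does not also match GeneYZ
# LC_ALL=C: byte-wise matching, as suggested in the comment above
LC_ALL=C grep -F -w -f FileB FileA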

Here is how to do it with `awk`:

awk 'FNR==NR {a[$0];next} {for (i in a) if (i~$1) print i}' FileA FileB
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

Or like this:

awk 'FNR==NR {a[$0];next} ($NF in a)' FileB FileA
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ
Jotne
  • Thanks Jotne. I tried this, but got an empty output. I have the files as both .csv and tab-delimited .txt; neither worked. – Chris Dias Jan 20 '15 at 21:08
  • @ChrisDias It does work fine with your data above, so your data must differ in format. – Jotne Jan 20 '15 at 21:09
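
Since the comment above points to a format mismatch, a commented sketch of the second `awk` one-liner with an explicit field separator may help; the comma in `-F','` is an assumption for a true CSV and would be `-F'\t'` for a tab-delimited file:

# FNR==NR is only true while reading the first file (FileB):
# store each gene name as a key in array a, then skip to the next record.
# For FileA, print the line when its last field ($NF) is one of the stored names.
awk -F',' 'FNR==NR {a[$0]; next} ($NF in a)' FileB FileA

If the gene name is not always the last column of FileA, `$NF` would need to be replaced with the correct field number.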