print columns matching a header value

Question

I have two csv files.

File 1:

id,site,longitude,latitude             
**9936**,north,18.2,62.8              
5856,north,17.4914,63.0167             
**1298**,north,18.177,62.877

File 2:

chr,loc,4678,**1298**,2295,**9936**,7354             
chr1,849,0,0,0,0,0,             
chr1,3481,1,1,0,1,1                             
chr1,3491,0,2,0,2,0,

I would like to match ids from column1 in file1 to rows in file2 (as highlighted with **) and if the match print the row and the corresponding lines

Output:

chr,loc,**1298**,**9936**            
chr1,849,0,0             
chr1,3481,1,1                             
chr1,3491,0,2

I've been trying this in python

import csv

f1 = file('inFile.csv', 'rb')                 
f2 = file('inFile2.csv', 'rb')               
f3 = file('outFile.csv', 'wb')                           
c1 = csv.reader(f1)            
c2 = csv.reader(f2)                 
c3 = csv.writer(f3)              

matched_rows = [ row for row in c2 if row[2:6] in c1]           
for row in matched_rows:                                                  
    c3writerow[matched_rows]

but unfortunately it doesn't work.

*"it doesn't work"* isn't a terribly helpful problem description. What precisely is the problem? I'm guessing a `SyntaxError`. — jonrsharpe, Jul 15 '14 at 13:06
It is not clear to me how the output is to be produced. What are the rules governing the output of lines 2, 3 and 4, for example? — Martijn Pieters, Jul 15 '14 at 13:07
no error message comes up. I was trying to get it to look at rows 2:6 in file2 and see if it matches a value in column 1 of file 1 and then print the matched row. — user3816990, Jul 15 '14 at 13:14
@user3816990: are you perhaps talking about **columns** here? The headers for columns 3 and 5 match, so you include columns 0, 1 3 and 5 in the output? — Martijn Pieters, Jul 15 '14 at 13:31
@user3816990: if that's the case, then your output doesn't match your sample input; the last line should be `chr1,3491,2,2` in that case. — Martijn Pieters, Jul 15 '14 at 13:42

Martijn Pieters · Accepted Answer · 2014-07-15T13:42:10.590

You'll need to load the column from file 1 first, and store it in a format that makes looking up values efficient. A set will do here:

with open('inFile.csv', 'rb') as ids_file:
    reader = csv.reader(ids_file)
    next(reader, None)  # skip the first row
    ids = {r[0] for r in reader}

Now you can test for your matching columns:

from operator import itemgetter

with open('inFile2.csv', 'rb') as f2, file('outFile.csv', 'wb') as outf:
    reader = csv.reader(f2)
    writer = csv.writer(outf)

    headers = next(reader, [])
    # produce indices for what headers are present in the ids set
    matching_indices = [i for i, header in enumerate(headers[2:], 2) if header in ids]
    selector = itemgetter(0, 1, *matching_indices)
    # write selected columns to output file
    writer.writerow(selector(headers))
    writer.writerows(selector(row) for row in reader)

Demo on your sample data:

First, producing a set of the first column:

>>> ids_file = '''\
... id,site,longitude,latitude
... 9936,north,18.2,62.8
... 5856,north,17.4914,63.0167
... 1298,north,18.177,62.877
... '''.splitlines()
>>> reader = csv.reader(ids_file)
>>> next(reader, None)
['id', 'site', 'longitude', 'latitude']
>>> ids = {r[0] for r in reader}
>>> ids
set(['5856', '9936', '1298'])

Then using that data to produce a selector using operator.itemgetter():

>>> from operator import itemgetter
>>> f2 = '''\
... chr,loc,4678,1298,2295,9936,7354
... chr1,849,0,0,0,0,0,
... chr1,3481,1,1,0,1,1
... chr1,3491,0,2,0,2,0,
... '''.splitlines()
>>> reader = csv.reader(f2)
>>> headers = next(reader, [])
>>> matching_indices = [i for i, header in enumerate(headers[2:], 2) if header in ids]
>>> matching_indices
[3, 5]
>>> selector = itemgetter(0, 1, *matching_indices)

Now you can use that object to select just the columns you want, to write to the output CSV file:

>>> selector(headers)
('chr', 'loc', '1298', '9936')
>>> selector(next(reader))
('chr1', '849', '0', '0')
>>> selector(next(reader))
('chr1', '3481', '1', '1')
>>> selector(next(reader))
('chr1', '3491', '2', '2')

Thank you very much. What exactly do you mean by a set, I tried following the demo but got a syntax error near unexpected token `(' for producing the set. also a syntax error for 'ids = {r[0] for r in reader}' and for 'with open('inFile2.csv', 'rb') as f2, file('outFile.csv', 'wb') as outf:' — user3816990, Jul 15 '14 at 14:24
@user3816990: are you using Python 2.6? Then use `ids = set(r[0] for r in reader)` instead. See [Why doesn't this set comprehension work?](http://stackoverflow.com/q/11488459) — Martijn Pieters, Jul 15 '14 at 14:25
@user3816990: and you need to nest the `with` statements in that case, so put them on separate lines nested inside one another. See [Does, With open() not works with python 2.6](http://stackoverflow.com/a/12138346) — Martijn Pieters, Jul 15 '14 at 14:26
Thank you Martijn, it worked perfectly and you were superbly clear in explaining how the code worked so that I could follow and learn! — user3816990, Jul 15 '14 at 14:44

print columns matching a header value

1 Answers1