Reading selected column only from CSV file, when all other columns are guaranteed to be identical

Question

I have a bunch of CSV files that Im trying to concatenate into one single csv file . The CSV files are separated by a single space and look like this:

'initial', 'pos', 'orientation', 'ratio'
'chr', '106681', '+', '0.06'
'chr', '106681', '+', '0.88'
'chr', '106681', '+', '0.01'
'chr', '106681', '+', '0.02'

As you can see, all the values are the same except for the ratio. The concatenated file I am creating will look like this:

'filename','initial', 'pos', 'orientation', 'ratio1','ratio2','ratio3'
'jon' , 'chr', '106681', '+', '0.06' , '0.88' ,'0.01'

So basically, ill be iterating through each file, storing only one value of the initial , pos, orientation but all the values of the ratio and updating the table in the concatenated file. This is proving much more confusing than i though it would be. I have the following piece of code to read the csv files:

concatenated_file  = open('josh.csv', "rb")
reader = csv.reader(concatenated_file)

for row in reader:
    print row

which gives:

['chrom', 'pos', 'strand', 'meth_ratio']
['chr2', '106681786', '+', '0.06']
['chr2', '106681796', '+', '0.88']
['chr2', '106681830', '+', '0.01']
['chr2', '106681842', '+', '0.02']

It would be really helpful if some one can show me how to store only one value of the initial , pos, orientation (because they remain same) but all the values of the ratio

How are you going to fit four different ratio values into three columns named `ratio1`, `ratio2`, and `ratio3`? — abarnert, Sep 18 '14 at 00:02
I want to extract only one value of `initial`,`pos`,`orientation` from each file and all the values of `ratio` and update the table in the concatenated file as shown above — letsc, Sep 18 '14 at 00:02
@keimina: That isn't even necessary here if the entire file is a group, which is what it sounds like… — abarnert, Sep 18 '14 at 00:03
@abarnert - Sorry about that. I may have overlooked that when i was formatting the question. Every value of `ratio` will go into a diff column — letsc, Sep 18 '14 at 00:05
@letsc: So, are you sure there are exactly 3 rows per file (even though your examples show 4)? Or do you want to just write more columns than there are headers? — abarnert, Sep 18 '14 at 00:11
*"how to store only one value of initial, pos, orientation (because they remain same) but all the values of the ratio"?* Why? Unless you're trying to save memory. This format is kind of an abuse of CSV format. Anyway, `pandas.read_csv()` and some logical indexing are better (and faster). — smci, Sep 18 '14 at 00:14

abarnert · Answer 1 · 2014-09-18T19:02:55.987

First put it in English terms.

You have to read all those other fields from somewhere, so it might as well be from the first row.

Then, having done that, you need to read the last column from each subsequent row and pack it onto the end of the new row, while ignoring the rest.

So, to turn that into Python:

with open(outpath, 'wb') as outfile:
    writer = csv.writer(outfile)
    for inpath in paths:
        with open(inpath, 'rb') as infile:
            reader = csv.reader(infile)

            # Read all values (including the ratio) from first row
            new_row = next(reader)

            # For every subsequent row...
            for row in reader:
                # ... read the ratio, pack it on, ignore the rest
                new_row.append(row[-1])

            writer.writerow(new_row)

I'm not sure the comments actually add anything; I think my Python is easier to follow than my English. :)

It's worth knowing that what you're trying to do here is called "denormalization". From what I can tell, your data will end up with an arbitrary number of ratio columns per row, all of which have the same "meaning", so each row isn't really a value anymore, but a collection of values.

Denormalization is generally considered bad, for a variety of reasons. There are cases where denormalized data is easier or faster to work with—as long as you know that you're doing it, and why, it can be a useful thing to do. Wikipedia has a nice article on database normalization that explains the issues; you might want to read it so you understand what you're doing here, and can make sure that it's the right thing to do.

This doesnt work. as I get the error `_csv.reader' object has no attribute '__getitem__'` at the line `new_row.append(reader[-1])` — letsc, Sep 18 '14 at 00:23
This is a one-liner with [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html), people... — smci, Sep 18 '14 at 00:31
@smci: No it isn't. `read_csv` can't treat the first row specially, throw away all columns but the last from every other row, and append them onto the end of the first row, then write the resulting row out to another CSV. It can turn the `csv.reader(open(…))` into a one-liner, but that already _is_ a one-liner if you want it to be—and, unlike the pandas version, it's still lazy rather than reading the whole thing into memory. — abarnert, Sep 18 '14 at 18:55

score 1 · Accepted Answer · edited May 23 '17 at 12:21

1

This is a one-liner with pandas.read_csv(). And we can even drop the quoting too:

import pandas as pd

csva = pd.read_csv('a.csv', header=0, quotechar="'", delim_whitespace=True)

csva['ratio']
0    0.06
1    0.88
2    0.01
3    0.02
Name: ratio, dtype: float64

A couple of points:

actually your separator is comma + whitespace. In that sense it's not plain-vanilla CSV. See "How to make separator in read_csv more flexible?"
note we dropped the quoting on numeric fields, by setting quotechar="'"
if you really insist on saving memory (don't), you can drop all other columns of csva than 'ratio', after you do the read_csv. See the pandas doc.

edited May 23 '17 at 12:21

Community

1
1

answered Sep 18 '14 at 00:22

smci

32,567
20
113
146

I gave a a space for formatting purposes. I dint know it would have an effect on the CSV file. The actual files only have the comma and no space – letsc Sep 18 '14 at 00:26
It worked fine for me. If you don't need `quotechar="'"` or `delim_whitespace=True` then omit them. Print out `csva` and `csva.columns` and see what you get, to make sure the header row is beign parsed correctly. – smci Sep 18 '14 at 00:39
Ok. What do you mean "Got it to work"? What did you change? It works as-is for me. Unless you changed the file-format somehow. – smci Sep 18 '14 at 00:53
got rid of `quotechar` and `delim_whitespace` – letsc Sep 18 '14 at 02:19
But you *want* quotechar. You want to treat numbers as numbers, not strings. You can always quote them again when you write them back out with [to_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) – smci Sep 18 '14 at 03:15

Reading selected column only from CSV file, when all other columns are guaranteed to be identical

2 Answers2

Linked