4

I have a data file (two header rows, index column, tab-separated) like this:

Hybridization REF   TCGA-2V-A95S-01A-11R-A37G-13    TCGA-2V-A95S-01A-11R-A37G-13    TCGA-2V-A95S-01A-11R-A37G-13    TCGA-2Y-A9GS-01A-12R-A38M-13    TCGA-2Y-A9GS-01A-12R-A38M-13    TCGA-2Y-A9GS-01A-12R-A38M-13    TCGA-2Y-A9GT-01A-11R-A38M-13    TCGA-2Y-A9GT-01A-11R-A38M-13    TCGA-2Y-A9GT-01A-11R-A38M-13
miRNA_ID    read_count  reads_per_million_miRNA_mapped  cross-mapped    read_count  reads_per_million_miRNA_mapped  cross-mapped    read_count  reads_per_million_miRNA_mapped  cross-mapped
hsa-let-7a-1    17377   4045.749542 N   47187   7077.368096 N   31765   8956.551210 N
hsa-let-7a-2    34913   8128.517796 N   94766   14213.530526    Y   64148   18087.355487    N
hsa-let-7a-3    17496   4073.455371 N   47683   7151.760928 N   31782   8961.344580 N
hsa-let-7b  33546   7810.249993 N   46089   6912.683963 N   64948   18312.925799    N
hsa-let-7c  1349    314.077006  N   12185   1827.573913 Y   14075   3968.627681 N
hsa-let-7d  1735    403.946335  N   1763    264.424523  N   1176    331.588359  N

or simplified:

Sample A A A B B B C C C D ...
Gene   x y z x y z x y z x ...
gene1
gene2    [data matrix]
...

I'd like to read the content of the read_count columns into a DataFrame. Using df = pd.read_csv("file.csv", sep='\t', header=[0, 1], index_col=0, usecols=["read_count"]) raises a ValueError: cannot specify usecols when specifying a multi-index header.

Is there a work-around/a proper solution? I'd like to avoid reading the entire file when actually only a portion of it is useful.

Joe
  • 12,057
  • 5
  • 39
  • 55
pylipp
  • 181
  • 3
  • 15
  • What's the relationship between the first row header and the second row header? I'm not sure that you're creating the dataframe in the best way... how big is the file? – Evan Jan 23 '18 at 00:40
  • The first row contains sample IDs (three times the same, repeatedly), the second row repeatedly contains three variables. The file size is several GB... See the edited question for a simplified file structure. – pylipp Jan 23 '18 at 01:08

1 Answers1

0

Try this:

import pandas as pd
df = pd.read_csv("file.csv", sep='\t', header=[0, 1], index_col=0)
df2 = df.xs('read_count', axis = 1, level = 1)

Copied from here: pandas multiindex - how to select second level when using columns?

I also tried using pd.melt, but didn't get a good solution.

Note that if the file is an appreciable fraction of the machine's memory, reading it in may cause issues. You can try setting chunksize if performance is an issue. I believe, but am not sure, that pandas reads the entire file into memory when read_csv is called, even if only a portion of the file is then stored in the df.

Evan
  • 2,121
  • 14
  • 27
  • Ok, thank you. I have to look into the pandas docs how memory-costly `read_csv` is. I know about the `chunksize` parameter, however I was hoping to be even cleverer during read-in. – pylipp Jan 23 '18 at 02:29
  • Does the above transformation work for your data? Accessing the columns and dealing with the file size are (I think) two separate issues. You may be able to read from the CSV into an HDFStore, although I haven't personally done that (yet). – Evan Jan 23 '18 at 03:43
  • It does work, thanks again. I'm using `chunksize`, too. – pylipp Jan 23 '18 at 12:04