0

Python newbie and stackoverflow posting newbie.

My goal is to create a python script which will take two files entered at the command line and drop columns from the first file if column headers are present within the second file, and write the output to a new file.

I've tried several approaches to this, and currently I am attempting to use Pandas DataFrame.drop

On a very small test set, I can achieve the removal of columns by manually specifying headers in a string (thanks to Delete column from pandas DataFrame), but can't figure out how to import a list of column headers from a file and format them correctly for DataFrame.drop.

I have two files One large: quarter of a million rows and up to 21,000 columns The columns relate to samples, and the rows relate to genetic markers.

I also have a smaller file containing up to a 1000 sample IDs, which correspond to the column headers in the large file. These relate to columns that I wish to drop from the large file.

I have attempted many things (creating lists, creating labels), one example below, but failed.

I would be grateful if anyone could point me in the right direction.

large file

    Name     Chr     Position     8077686010_R04C02.GType     8077686010_R04C02.X     8077686010_R04C02.Y     8131566005_R01C02.GType     8131566005_R01C02.X     8131566005_R01C02.Y
exm-rs1000026     21     38934599     NC     0.0144234     1.112413     NC     0.01250324     1.084685
exm-rs1000053     2     12790328     NC     0.04906762     1.495594     NC     0.07344548     1.552252
exm-rs1000110     9     117908721     NC     0.02433169     1.314785     NC     0.05954991     1.356415
exm-rs1000113     5     150240076     NC     0.015468     0.793373     NC     0.02498361     0.8621324
exm-rs1000158     20     36599904     NC     0.01016421     0.7593179     NC     0.4537758     0.5095596
exm-rs1000192     16     6747139     NC     0.01774782     0.8661015     NC     0.01103768     0.9004255
exm-rs1000203     14     40896108     NC     0.7707067     0.006222768     NC     0.7400684     0.003768863

smallerfile

8077686010_R04C02.GType
8077686010_R04C02.X
8077686010_R04C02.Y

outfile

   Name     Chr     Position     8131566005_R01C02.GType     8131566005_R01C02.X     8131566005_R01C02.Y
exm-rs1000026     21     38934599     NC     0.01250324     1.084685
exm-rs1000053     2     12790328     NC     0.07344548     1.552252
exm-rs1000110     9     117908721     NC     0.05954991     1.356415
exm-rs1000113     5     150240076     NC     0.02498361     0.8621324
exm-rs1000158     20     36599904     NC     0.4537758     0.5095596
exm-rs1000192     16     6747139     NC     0.01103768     0.9004255
exm-rs1000203     14     40896108     NC     0.7400684     0.003768863

Working code

import pandas as pd
import numpy as np

outfile = open("myout.txt", "w")

largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)

new_data = largefile.drop(['8077686010_R04C02.GType','8077686010_R04C02.X','8077686010_R04C02.Y',], axis=1)

new_data.to_csv(outfile,sep="\t")

Failing code - one of many

import pandas as pd
import numpy as np

outfile = open("myout.txt", "w")

largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)

dropcols = open("smallerfile",'r').read().split('\n')
new_data = largefile.drop(dropcols, axis=1)

new_data.to_csv(outfile,sep="\t")

List generated

['8131566005_R01C02.GType', '8131566005_R01C02.X', '8131566005_R01C02.Y', '8131566013_R02C01.GType', '8131566013_R02C01.X', '8131566013_R02C01.Y', '']

Output

Traceback (most recent call last):
File "my.py", line 59, in <module>
new_data = largefile.drop(dropcolslst, axis=1)
File "/usr/lib/pymodules/python2.7/pandas/core/generic.py", line 174, in drop
new_axis = axis.drop(labels)
File "/usr/lib/pymodules/python2.7/pandas/core/index.py", line 881, in drop
raise ValueError('labels %s not contained in axis' % labels[mask])
ValueError: labels ["] not contained in axis
Community
  • 1
  • 1
drevansa
  • 3
  • 1
  • 3
  • i would make sure to strip() the first line after the read() and before the split() – Ryan G Aug 07 '14 at 14:38
  • Thank you. I couldn't figure out how to do that. Are there more memory efficient ways of achieving the column drop? I am hitting memory issues on my poor laptop. – drevansa Aug 10 '14 at 01:22

2 Answers2

1

To get your code to work all you need to do is drop the empty string from your dropcols list. Something like this:

dropcols = [x for x in dropcols if x != '']

Of if you want to handle the case where your dropcols list works even if you specify a column not in the larger dataframe, you could do something like this - taking the intersection of your dropcols and the columns in the dataframe.

dropcols = set(dropcols) & set(largefile.columns)
chrisb
  • 49,833
  • 8
  • 70
  • 70
  • Thank you very much for both solutions - I've spent days trying to get this to work - I wondered about the empty string - but I thought the issue was to do with data structure. – drevansa Aug 06 '14 at 22:59
  • +1 for the intersection method! great way to preserve usability in cases where ID's in smaller file are not guaranteed to be in the column header of the large file. – Ryan G Aug 07 '14 at 14:40
0

A more memory efficient way to do this. The key is to apply usecols in pd.read_csv.

import pandas as pd
import numpy as np
dropcols = open("smallerfile",'r').read().split('\n')
cols = open("large",'r').read().rstrip().split('\t')
usecols = [ i for i in range(cols) if cols[i] not in dropcols]

Tell pd.read_csv to only load usecols and specify date type as object. Next save the loaded file.

largefile = pd.read_csv('large',sep='\t',header=0,index_col=0, usecols=usecols, dtype='object')
with open("myout.txt", "w") as outfile:
    largefile.to_csv(outfile,sep="\t")
dct
  • 26
  • 3
  • Thank you very much for that, I will give this a try. I had resorted to splitting my files into managable chunks in order to process them, else used awk to drop columns. – drevansa Jun 10 '15 at 12:51