Python newbie and stackoverflow posting newbie.
My goal is to create a python script which will take two files entered at the command line and drop columns from the first file if column headers are present within the second file, and write the output to a new file.
I've tried several approaches to this, and currently I am attempting to use Pandas DataFrame.drop
On a very small test set, I can achieve the removal of columns by manually specifying headers in a string (thanks to Delete column from pandas DataFrame), but can't figure out how to import a list of column headers from a file and format them correctly for DataFrame.drop.
I have two files One large: quarter of a million rows and up to 21,000 columns The columns relate to samples, and the rows relate to genetic markers.
I also have a smaller file containing up to a 1000 sample IDs, which correspond to the column headers in the large file. These relate to columns that I wish to drop from the large file.
I have attempted many things (creating lists, creating labels), one example below, but failed.
I would be grateful if anyone could point me in the right direction.
large file
Name Chr Position 8077686010_R04C02.GType 8077686010_R04C02.X 8077686010_R04C02.Y 8131566005_R01C02.GType 8131566005_R01C02.X 8131566005_R01C02.Y
exm-rs1000026 21 38934599 NC 0.0144234 1.112413 NC 0.01250324 1.084685
exm-rs1000053 2 12790328 NC 0.04906762 1.495594 NC 0.07344548 1.552252
exm-rs1000110 9 117908721 NC 0.02433169 1.314785 NC 0.05954991 1.356415
exm-rs1000113 5 150240076 NC 0.015468 0.793373 NC 0.02498361 0.8621324
exm-rs1000158 20 36599904 NC 0.01016421 0.7593179 NC 0.4537758 0.5095596
exm-rs1000192 16 6747139 NC 0.01774782 0.8661015 NC 0.01103768 0.9004255
exm-rs1000203 14 40896108 NC 0.7707067 0.006222768 NC 0.7400684 0.003768863
smallerfile
8077686010_R04C02.GType
8077686010_R04C02.X
8077686010_R04C02.Y
outfile
Name Chr Position 8131566005_R01C02.GType 8131566005_R01C02.X 8131566005_R01C02.Y
exm-rs1000026 21 38934599 NC 0.01250324 1.084685
exm-rs1000053 2 12790328 NC 0.07344548 1.552252
exm-rs1000110 9 117908721 NC 0.05954991 1.356415
exm-rs1000113 5 150240076 NC 0.02498361 0.8621324
exm-rs1000158 20 36599904 NC 0.4537758 0.5095596
exm-rs1000192 16 6747139 NC 0.01103768 0.9004255
exm-rs1000203 14 40896108 NC 0.7400684 0.003768863
Working code
import pandas as pd
import numpy as np
outfile = open("myout.txt", "w")
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)
new_data = largefile.drop(['8077686010_R04C02.GType','8077686010_R04C02.X','8077686010_R04C02.Y',], axis=1)
new_data.to_csv(outfile,sep="\t")
Failing code - one of many
import pandas as pd
import numpy as np
outfile = open("myout.txt", "w")
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)
dropcols = open("smallerfile",'r').read().split('\n')
new_data = largefile.drop(dropcols, axis=1)
new_data.to_csv(outfile,sep="\t")
List generated
['8131566005_R01C02.GType', '8131566005_R01C02.X', '8131566005_R01C02.Y', '8131566013_R02C01.GType', '8131566013_R02C01.X', '8131566013_R02C01.Y', '']
Output
Traceback (most recent call last):
File "my.py", line 59, in <module>
new_data = largefile.drop(dropcolslst, axis=1)
File "/usr/lib/pymodules/python2.7/pandas/core/generic.py", line 174, in drop
new_axis = axis.drop(labels)
File "/usr/lib/pymodules/python2.7/pandas/core/index.py", line 881, in drop
raise ValueError('labels %s not contained in axis' % labels[mask])
ValueError: labels ["] not contained in axis