Say I have 55 files with 2 columns per file and a different number of rows for each file. I have concatenated them using the following code.
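import os
import pandas as pd
path = r'/data/user/files'
files = os.listdir(path)
# Keep only the files whose names end in 'tped'
file_score = [os.path.join(path, i) for i in files if i.endswith('tped')]
# Read each tab-separated file, then place the frames side by side
score = [pd.read_csv(x, sep='\t', header=0) for x in file_score]
score = pd.concat(score, axis=1)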
The resulting score data frame looks as follows:
gene file1 gene file2 gene file3 gene file4 gene file5
0 A1BG 5.014479 A1BG 6.268099 A1BG 5.014479 A1BG 5.014479 A1BG 5.014479 ... A1BG 6.268099 A1BG 5.014479 A1BG 5.014479 A1BG 5.014479 A1BG 5.014479
1 A1BG-AS1 7.082578 A1BG-AS1 7.082578 A1BG-AS1 7.082578 A1BG-AS1 7.082578 A1BG-AS1 7.082578 ... A1BG-AS1 7.082578 A1BG-AS1 7.082578 A1BG-AS1 7.082578 A1BG-AS1 7.082578 A1BG-AS1 7.082578
2 A1CF NaN A2M -2.851459 A2M -2.851459 A2M -2.851459 A2M -2.851459 ... A2M -2.604416 A1CF NaN A2M -2.851459 A2M -2.851459 A2M -2.851459
3 A2M -11.405835 A2ML1 -0.007012 A2ML1 -0.010518 A2ML1 -0.010518 A2ML1 -0.007012 ... A2ML1 -0.007012 A2M -2.851459 A2ML1 -0.010518 A2ML1 5.705464 A2ML1 -0.007012
4 A2ML1 0.569222 AAAS NaN AAAS -3.693289 A4GALT NaN AAAS NaN ... A3GALT2 1.174647 A2ML1 -0.007012 A3GALT2 -0.141380 A4GALT NaN A4GALT NaN
What I need is the gene column as my index and the file* columns as the columns of my final data frame. The gene column contains different values in each file. However, I need it as the index, with the missing values in each file column filled with zeros.
I am not sure how I can achieve this. A simple set_index is not working for me.
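To make the goal concrete, here is a minimal sketch of the result I am after, using two small made-up frames in place of the real files. The names file1_df and file2_df and their values are hypothetical, and the sketch assumes each gene appears at most once per file (I believe concat would fail on duplicated index values otherwise).

```python
import pandas as pd

# Hypothetical stand-ins for two of the frames returned by pd.read_csv
# (illustrative values only, not the real data).
file1_df = pd.DataFrame({"gene": ["A1BG", "A1CF", "A2M"],
                         "file1": [5.0, None, -11.4]})
file2_df = pd.DataFrame({"gene": ["A1BG", "A2M", "A2ML1"],
                         "file2": [6.3, -2.9, -0.007]})

# Index every frame by gene first, so concat aligns rows on the gene name
# rather than on row position.
frames = [df.set_index("gene") for df in (file1_df, file2_df)]

# axis=1 puts the file columns side by side; the default outer join keeps
# every gene seen in any file, and fillna(0) turns the gaps into zeros.
result = pd.concat(frames, axis=1).fillna(0)
print(result)
```

Note that fillna(0) also replaces values that were already NaN inside a file (such as A1CF above), not only the genes that are missing from a file.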
Any suggestions are appreciated. Thanks