
Say I have 55 files, each with 2 columns and a different number of rows. I have concatenated them using the following code.

import os
import pandas as pd

path = r'/data/user/files'
files = os.listdir(path)
file_score = [os.path.join(path, i) for i in files if i.endswith('tped')]
score = [pd.read_csv(x, sep='\t', header=0) for x in file_score]
score = pd.concat(score, axis=1)

The resulting score data frame looks as follows:

   gene      file1       gene      file2       gene      file3       gene      file4       gene      file5       ...
0  A1BG      5.014479    A1BG      6.268099    A1BG      5.014479    A1BG      5.014479    A1BG      5.014479    ...
1  A1BG-AS1  7.082578    A1BG-AS1  7.082578    A1BG-AS1  7.082578    A1BG-AS1  7.082578    A1BG-AS1  7.082578    ...
2  A1CF      NaN         A2M       -2.851459   A2M       -2.851459   A2M       -2.851459   A2M       -2.851459   ...
3  A2M       -11.405835  A2ML1     -0.007012   A2ML1     -0.010518   A2ML1     -0.010518   A2ML1     -0.007012   ...
4  A2ML1     0.569222    AAAS      NaN         AAAS      -3.693289   A4GALT    NaN         AAAS      NaN         ...

What I need is a final data frame with the gene column as the index and the file* columns as the columns. The set of genes differs from file to file, but I need a single gene index, with the missing values in each file column filled with zeros. I am not sure how to achieve this; a simple set_index is not working for me.
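
For illustration, the target frame would look roughly like this (values taken from the preview above, with genes missing from a file filled with 0):

               file1      file2  ...
gene
A1BG        5.014479   6.268099  ...
A1BG-AS1    7.082578   7.082578  ...
A1CF        0.000000   0.000000  ...
A2M       -11.405835  -2.851459  ...
...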

Any suggestions are appreciated. Thanks

ARJ
  • Your answer seems to be here -> https://stackoverflow.com/questions/28097222/pandas-merge-two-dataframes-with-different-columns. – Pab Oct 08 '21 at 09:15
  • For some reason, that solution does not work for me. However, I could make my code work for me with this small change, `[pd.read_csv(x, sep='\t',index_col=0) for x in file_score]` – ARJ Oct 08 '21 at 09:18

1 Answer

import os
import pandas as pd

path = r'/data/user/files'
files = os.listdir(path)
file_score = [os.path.join(path, i) for i in files if i.endswith('tped')]
score = [pd.read_csv(x, sep='\t', header=0) for x in file_score]
score = pd.concat(score, axis=1)

# This should remove all duplicated columns
# (only columns with duplicated names, not values)
score = score.loc[:, ~score.columns.duplicated()]
# Set the index to the remaining gene column
score.index = score['gene']
# Fill all NaNs with 0
score = score.fillna(0)
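
For illustration, the dedupe step keeps only the first occurrence of each repeated column name; a toy frame (made-up values) shows the effect:

import pandas as pd

# Toy frame mimicking the concatenated result, with a repeated 'gene' column
df = pd.DataFrame([['A1BG', 5.0, 'A1BG', 6.3]],
                  columns=['gene', 'file1', 'gene', 'file2'])

# columns.duplicated() flags the second 'gene'; ~ keeps only the first
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped.columns.tolist())   # ['gene', 'file1', 'file2']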
5idneyD
  • That is not working for me. It gives multiple gene values as a list for each row in the index column. – ARJ Oct 08 '21 at 09:00
  • I've just made a change, hopefully will work, gets rid of all the duplicates before setting the index – 5idneyD Oct 08 '21 at 09:10
  • I appreciate your time. I made a small change in this line and now it's fine. `[pd.read_csv(x, sep='\t',index_col=0) for x in file_score]` – ARJ Oct 08 '21 at 09:16
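
For completeness, a minimal sketch of the full pipeline with the change the asker mentions in the comments (reading each file with its gene column as the index, then filling the gaps with zeros):

import os
import pandas as pd

path = r'/data/user/files'
files = os.listdir(path)
file_score = [os.path.join(path, i) for i in files if i.endswith('tped')]

# index_col=0 makes each file's gene column the index of its frame
score = [pd.read_csv(x, sep='\t', header=0, index_col=0) for x in file_score]

# Concatenating along columns outer-joins on the gene index, so genes
# missing from a file show up as NaN in that file's column
score = pd.concat(score, axis=1)

# Replace the missing scores with zeros
score = score.fillna(0)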