2

I have two files: file1.txt:

ID  Gene    ShortName   TSS
A   ENS1S   Gm16088 TSS82763
B   ENS2S   Gm26206 TSS81070
C   ENS3S   Rp1 TSS11475
D   ENS4S   Gm22848 TSS18078
E   ENS5S   Sox17   TSS56047,TSS74369

file2.txt:

ID  Type    Condition
B   Normal  2
J   Cancer  1
K   Cancer  2
A   Normal  3

My desired output is: file1.txt then add the values from file2 that match the first column only:

ID  Gene    ShortName   TSS Type    Condition
A   ENS1S   Gm16088 TSS82763    Normal  3
B   ENS2S   Gm26206 TSS81070    Normal  2
C   ENS3S   Rp1 TSS11475        
D   ENS4S   Gm22848 TSS18078    
E   ENS5S   Sox17   TSS56047,TSS74369       

hence, the Type and Condition columns of file2.txt will be added. if value is in file1 but not in file2, it will be replaced by just empty cell. if value is in file2 but not file1, it will be ignored. here is what I tried so far and it is not working: Inputting 2 data frames then trying to use data merge or join:

 df1 =  pd.read_csv("file1.txt", index_col=0, sep="\t")
 df2 = pd.read_csv("file2.txt", index_col=0, sep="\t")

result2 = pd.merge(df1, df2, on=df1.index, how ="left")
result2.to_csv("Merged.xls", sep="\t")  

I also tried pd.concat with axis 1 but that also didnto work.

I then tried:

  with open('file1.txt') as f:
       r = csv.reader(f, delimiter='\t')
       dict1 = {row[0]: row for row in r}

  with open('file2.txt') as f:
       r = csv.reader(f, delimiter='\t')
       dict2= {row[0]: row for row in r}

  keys =  set(dict1.keys() + dict2.keys()) #i saw this on stackoverlow, i am not sure why it is sorting the keys by alphabetical order and i am unable to unsort (any side tip on that?)

 with open('output.csv', 'wb') as f:
       w = csv.writer(f, delimiter='\t')
       w.writerows([[key, '\t',dict1.get(key),'\t', dict2.get(key)]
           for key in keys])

This also did not give the desired output and there were lot of "'" between strings.. Is there any suggested method? I know how to merge to data frames if it has same # of rows and index , but i couldnt do it if i only want to do it using first file as a standard index. I know how to do it in R using the merge function then by.x and by.y, but R messes up all my header names (the ones up are just an example). So it is best to do it in python.

BioProgram
  • 684
  • 2
  • 13
  • 28

2 Answers2

3

Reading your files with sep='\t' didn't parse properly for, but sep='\s+' did for your sample lines, and then the standard merge gives your desired result:

df1 = pd.read_csv('text1.txt', sep='\s+')
df2 = pd.read_csv('text2.txt', sep='\s+')
df1.merge(df2, on='ID', how='left')

  ID   Gene ShortName                TSS    Type  Condition
0  A  ENS1S   Gm16088           TSS82763  Normal          3
1  B  ENS2S   Gm26206           TSS81070  Normal          2
2  C  ENS3S       Rp1           TSS11475     NaN        NaN
3  D  ENS4S   Gm22848           TSS18078     NaN        NaN
4  E  ENS5S     Sox17  TSS56047,TSS74369     NaN        NaN

You can of course also move 'ID' to the index and use .join(), .concat(), or .merge(left_index=True, right_index=True) with the appropriate settings for left merge for each.

Stefan
  • 41,759
  • 13
  • 76
  • 81
  • The df1.merge works, except when I use: result1 = df1.merge(df2, on='ID', how='left') ; result1.to_csv("Merge1.csv") ; i end up with an additional column in the beginning that has numbers 0 to 4. How can i get rid of that? Also, Is it possible to add the Type and Condition column AFTER ID immediately? It would make more sense to add it there for me than at the end. Should i post that as a different question or is it an easy step? Appreciate the responses – BioProgram Jan 01 '16 at 01:12
  • 1
    Do to_csv(path, index=False) to get rid of the numbers. – Stefan Jan 01 '16 at 01:14
  • For the column order, do df.loc[:, ['col1', 'col2'..] to get desired order. – Stefan Jan 01 '16 at 01:16
  • Great, that works. Any idea about how to have Type and Condition added early on right after ID and without being sorted? – BioProgram Jan 01 '16 at 01:16
  • so you mean for result1.loc[:, column name order] , right? I tried that but instead of using column names (as these can change), i tried using column numbers as such: final_result = result1.loc[:,[1,5,2,3,4]] in order to order the columns of result1 as 1, 5,2, 3,then the 4th. however that gave this error: 'None of [[1, 5, 2, 3, 4]] are in the [columns]' . cant i do it by numbers? – BioProgram Jan 01 '16 at 01:21
  • 1
    Use .iloc for numbers. – Stefan Jan 01 '16 at 01:43
  • Thanks, that didn't work. Kept giving errors for syntax. but i tried something else : newdf2=result1.ix[:, [0,4,5,1,2,3]] and that worked. Thanks aa lot Stefan. – BioProgram Jan 01 '16 at 01:46
  • Happy to help. Do you think you could accept the answer or do you need anything else? – Stefan Jan 01 '16 at 02:30
  • Hi stefan, yes of course i will accept it. I also was able to figure out the iloc syntax: newdf3=result1.iloc[:,(0,4,5,1,2,3)] – BioProgram Jan 01 '16 at 08:22
  • A quick followup: I am trying to set up the columns in a consecutive pattern using a range of columns as the end column will be a variable in my case. Here is what I tried and it is not working: newdf4=result1.ix[:, [0,4,5,1:3]] #the 1:3 is causing an error. i also tried : newdf3=result1.iloc[:,(0,4,5,1:33)] #that also didn't work. I also tried: list_of_ints = list(range(1,4)) ; newdf6=result1.iloc[:,(0,4,5, list_of_ints)] and that also didn't work. Any idea how to fix this to have a combination of a list of integers, as well as a range of values for the columns to be extracted? – BioProgram Jan 01 '16 at 19:15
  • If you are trying to use 1:33 as a `label` as opposed to a `slice` including `integer` position from `1` to `33`, than you should probably use it as a `string` - `'1:33`. See the documentation which is quite specific as to allowed inputs for the three basic selection methods: `loc`, `iloc`, and `ix`. – Stefan Jan 01 '16 at 19:21
  • I have checked the doc. I am using it as follows: instead of using result1.idx[:, [0,4,5,1,2,3]] # to have the columns reordered as #0,4,5,1,2,3; I want to have the rage for 1:3 (as the '3' will change depending on the number of files i am working with- i want to adapt this into another script. Does it make sense? How would you advise this can be done? – BioProgram Jan 01 '16 at 19:23
  • Would you mind posting this as a new question so you can show your data and desired output along with it? Not sure I fully understand what you are trying to accomplish. – Stefan Jan 01 '16 at 19:28
  • Yup. Will do this in a minute. – BioProgram Jan 01 '16 at 19:29
  • http://stackoverflow.com/questions/34559396/rearranging-a-non-consecutive-order-of-columns-in-pandas-dataframe – BioProgram Jan 01 '16 at 19:41
0

You can use join to merge on the indices:

In [11]: df1
Out[11]:
     Gene ShortName                TSS
ID
A   ENS1S   Gm16088           TSS82763
B   ENS2S   Gm26206           TSS81070
C   ENS3S       Rp1           TSS11475
D   ENS4S   Gm22848           TSS18078
E   ENS5S     Sox17  TSS56047,TSS74369

In [12]: df2
Out[12]:
      Type  Condition
ID
B   Normal          2
J   Cancer          1
K   Cancer          2
A   Normal          3

In [13]: df1.join(df2, how="outer")
Out[13]:
     Gene ShortName                TSS    Type  Condition
ID
A   ENS1S   Gm16088           TSS82763  Normal          3
B   ENS2S   Gm26206           TSS81070  Normal          2
C   ENS3S       Rp1           TSS11475     NaN        NaN
D   ENS4S   Gm22848           TSS18078     NaN        NaN
E   ENS5S     Sox17  TSS56047,TSS74369     NaN        NaN
J     NaN       NaN                NaN  Cancer          1
K     NaN       NaN                NaN  Cancer          2

In [14]: df1.join(df2, how="left")
Out[14]:
     Gene ShortName                TSS    Type  Condition
ID
A   ENS1S   Gm16088           TSS82763  Normal          3
B   ENS2S   Gm26206           TSS81070  Normal          2
C   ENS3S       Rp1           TSS11475     NaN        NaN
D   ENS4S   Gm22848           TSS18078     NaN        NaN
E   ENS5S     Sox17  TSS56047,TSS74369     NaN        NaN

Note: You can fill in the NaN with empty string using fillna, but I like to leave them blank (see this post).

That doesn't get you the following line:

K   ENS4S   Gm22848 TSS18078    Cancer  2

but I don't understand how you'd get that (ENS4S is from D and Cancer 2 is from K).

Community
  • 1
  • 1
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • For some reason, join is giving me error: ValueError: columns overlap but no suffix specified: Index([u'ID'], dtype='object'). In regards to this option though, it looks like outer ensures keeping all samples, while left will use the df1 (i.e. the one on the left?) as reference, Is this what's supposed to do? Do you know how to come around the error? – BioProgram Jan 01 '16 at 01:09
  • You are right about K. it was an error. I fixed that now. – BioProgram Jan 01 '16 at 01:10
  • 1
    @BioProgram if ID is the index then join works, if it's not you need to use merge. – Andy Hayden Jan 01 '16 at 05:01