I have two files: file1.txt:
ID Gene ShortName TSS A ENS1S Gm16088 TSS82763 B ENS2S Gm26206 TSS81070 C ENS3S Rp1 TSS11475 D ENS4S Gm22848 TSS18078 E ENS5S Sox17 TSS56047,TSS74369
file2.txt:
ID Type Condition B Normal 2 J Cancer 1 K Cancer 2 A Normal 3
My desired output is: file1.txt then add the values from file2 that match the first column only:
ID Gene ShortName TSS Type Condition A ENS1S Gm16088 TSS82763 Normal 3 B ENS2S Gm26206 TSS81070 Normal 2 C ENS3S Rp1 TSS11475 D ENS4S Gm22848 TSS18078 E ENS5S Sox17 TSS56047,TSS74369
hence, the Type and Condition columns of file2.txt will be added. if value is in file1 but not in file2, it will be replaced by just empty cell. if value is in file2 but not file1, it will be ignored. here is what I tried so far and it is not working: Inputting 2 data frames then trying to use data merge or join:
df1 = pd.read_csv("file1.txt", index_col=0, sep="\t")
df2 = pd.read_csv("file2.txt", index_col=0, sep="\t")
result2 = pd.merge(df1, df2, on=df1.index, how ="left")
result2.to_csv("Merged.xls", sep="\t")
I also tried pd.concat with axis 1 but that also didnto work.
I then tried:
with open('file1.txt') as f:
r = csv.reader(f, delimiter='\t')
dict1 = {row[0]: row for row in r}
with open('file2.txt') as f:
r = csv.reader(f, delimiter='\t')
dict2= {row[0]: row for row in r}
keys = set(dict1.keys() + dict2.keys()) #i saw this on stackoverlow, i am not sure why it is sorting the keys by alphabetical order and i am unable to unsort (any side tip on that?)
with open('output.csv', 'wb') as f:
w = csv.writer(f, delimiter='\t')
w.writerows([[key, '\t',dict1.get(key),'\t', dict2.get(key)]
for key in keys])
This also did not give the desired output and there were lot of "'" between strings.. Is there any suggested method? I know how to merge to data frames if it has same # of rows and index , but i couldnt do it if i only want to do it using first file as a standard index. I know how to do it in R using the merge function then by.x and by.y, but R messes up all my header names (the ones up are just an example). So it is best to do it in python.