For example, I have two files:
file1:
id chrom start end strand
g1 11 98566330 98566433 -
g2 11 98566295 98566433 -
g3 11 98566581 98566836 -
file2:
id chrom start end strand gene_id gene_name somecol1...somecol10
g1 11 98566330 98566433 - ENSMUSG00000017210 Med24
g2 11 98566295 98566433 - ENSMUSG00000017210 Med24
g3 11 98566581 98566836 - ENSMUSG00000017210 Med24
desired output:
id chrom start end strand gene_id gene_name somecol1...somecol10
g1 11 98566330 98566433 - ENSMUSG00000017210 Med24
g2 11 98566295 98566433 - ENSMUSG00000017210 Med24
g3 11 98566581 98566836 - ENSMUSG00000017210 Med24
What I am basically trying to do is match the id column of both files and, if there is a match, print/write some columns from file1 and file2 to a new file (my current code):
import os

# open() does not expand '~' on its own, so expand it explicitly
with open(os.path.expanduser('~/outfile.txt'), 'w') as w:
    for id1 in c1:  # c1 is the list where I append each line from file1
        for id2 in d1:  # d1 is the list where I append each line from file2
            if id1[0] in id2[0]:  # is this condition faster (condition1)
            # if id1[0] == id2[0]:  # or is this condition faster (condition2)
                out = ('\t'.join(id2[0:6]), id1[1], id1[2], id2[9], id2[10])
                w.write('\t'.join(out) + '\n')
The issue is that this code works as desired with condition2, but it is very slow, maybe because I am comparing id1[0] == id2[0] for every pair of lines
across the lists c1 and d1, and also because file2 has ~500000 rows.
So far I could only come up with these two conditions, and I am trying to learn which one might make the code faster.
Is there better logic I could use to increase the speed?
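For reference, the two conditions are not equivalent, which may be why only condition2 gives the desired output: on strings, `in` is a substring test, while `==` is an exact comparison. A minimal illustration, assuming the ids are plain strings like `g1` (the id `g10` is hypothetical and does not come from the files above):

```python
id_a = "g1"
id_b = "g10"  # hypothetical id, used only to show the difference

# condition1: substring test, so "g1" is found inside "g10"
substring_match = id_a in id_b

# condition2: exact comparison, so "g1" and "g10" differ
exact_match = id_a == id_b

print(substring_match, exact_match)
```

So condition1 can report matches between ids that merely share a prefix, regardless of which one is faster.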
EDIT:
I need to match col0 (id) of file1 against col0 (id) of file2 and, if they match, slice the elements in col0:6, col[1,2] from file1, and col9,10 from file2
desired output:
id(file2) chrom(file2) start(file2) end(file2) strand(file2) gene_id(file2) gene_name(file2) somecol1(file1)...somecol10(file1)
g1 11 98566330 98566433 - ENSMUSG00000017210 Med24
g2 11 98566295 98566433 - ENSMUSG00000017210 Med24
g3 11 98566581 98566836 - ENSMUSG00000017210 Med24
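One way the matching described above could be sketched without the nested loop is to index file1 by id in a dictionary, then make a single pass over file2. This is only a sketch under assumptions: the function name `join_on_id` is hypothetical, the fields are assumed to be whitespace-separated, the ids are assumed to be unique within file1, and the column choice follows the code above (columns 0:6 of file2, columns 1 and 2 of file1, columns 9 and 10 of file2 when present).

```python
def join_on_id(file1_lines, file2_lines):
    """Return tab-joined rows for ids that appear in both inputs."""
    # Index file1 by its id column (assumes ids are unique in file1).
    file1_by_id = {}
    for line in file1_lines:
        fields = line.split()
        if fields:
            file1_by_id[fields[0]] = fields

    rows = []
    for line in file2_lines:
        fields = line.split()
        if not fields:
            continue
        match = file1_by_id.get(fields[0])  # exact id lookup, like ==
        if match is None:
            continue
        # Slicing fields[9:11] yields an empty list when the row is short.
        row = fields[0:6] + [match[1], match[2]] + fields[9:11]
        rows.append("\t".join(row))
    return rows


# Sample rows taken from the question; "g9" is a hypothetical unmatched id.
file1 = [
    "g1 11 98566330 98566433 -",
    "g2 11 98566295 98566433 -",
]
file2 = [
    "g1 11 98566330 98566433 - ENSMUSG00000017210 Med24",
    "g9 11 98566581 98566836 - ENSMUSG00000017210 Med24",
]
result = join_on_id(file1, file2)
for row in result:
    print(row)
```

Each line of file2 then costs one dictionary lookup instead of a scan over all of file1, so the work grows with the sum of the file sizes rather than their product.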