I'm working on a script to find the intersection between large CSV files based on the contents of only two specific columns in each file: Query ID and Subject ID.
The files come in Left/Right pairs, one pair per species. Every file looks something like this:
Similarity (%) Query ID Subject ID
100.000000 BRADI5G01462.1_1 BRADI5G16060.1_36
90.000000 BRADI5G02480.1_5 NCRNA_11838_6689
100.000000 BRADI5G06067.1_8 NCRNA_32597_1525
90.000000 BRADI5G08380.1_12 NCRNA_32405_1776
100.000000 BRADI5G09460.2_17 BRADI5G16060.1_36
90.909091 BRADI5G10680.1_20 NCRNA_2505_6156
The Right files are always longer and larger than the Left ones.
Here's the code snippet I have so far:
import csv

with open('#Left(Brachypodium_Japonica).csv', 'r', newline='') as Afile, open('#Right(Brachypodium_Japonica).csv', 'r', newline='') as Bfile, open('Intrsc-(Brachypodium_Japonica).csv', 'w', newline='') as Intrsct:
    reader1 = csv.reader(Afile, delimiter="\t", skipinitialspace=True)
    next(reader1, None)
    reader2 = csv.reader(Bfile, delimiter="\t", skipinitialspace=True)
    next(reader2, None)
    Intrsct = csv.writer(Intrsct, delimiter="\t", skipinitialspace=True)
    Intrsct.writerow(["Query ID", "Subject ID", "Left Similarity (%)", "Right Similarity (%)"])
    for row1, row2 in zip(Afile, Bfile):
        if ((row1[1] in row2[1] and row1[2] in row2[2])):
            Intrsct.writerow([row1.strip().split('\t')[1], row1.strip().split('\t')[2], row1.strip().split('\t')[0], row2.strip().split('\t')[0]])
The code above iterates over the records of the two files simultaneously and searches for the contents of columns 1 and 2 of the first file in columns 1 and 2 of the second file, i.e. it compares column-wise (Query ID in both files, as well as Subject ID) and writes the matches to a new file in a certain order.
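For illustration, here is a minimal sketch of what the condition in the loop appears to evaluate, assuming the lines are tab-separated as in the sample above. Because `zip(Afile, Bfile)` iterates over the file objects rather than the csv readers, `row1` and `row2` are raw strings, so `row1[1]` is a single character and not the Query ID column. The two lines below are taken from the sample data; the variable names are only for this example.

```python
# Two raw lines, as iterating over the file objects would yield them
# (tab-separated is an assumption based on the delimiter in the script).
line_left = "90.000000\tBRADI5G02480.1_5\tNCRNA_11838_6689\n"
line_right = "90.909091\tBRADI5G10680.1_20\tNCRNA_2505_6156\n"

# Indexing a string yields single characters: here '0' and '.',
# i.e. the 2nd and 3rd characters of the similarity value.
char_check = line_left[1] in line_right[1] and line_left[2] in line_right[2]

# Splitting on tabs first yields the actual columns.
cols_left = line_left.rstrip("\n").split("\t")
cols_right = line_right.rstrip("\n").split("\t")
column_check = cols_left[1] == cols_right[1] and cols_left[2] == cols_right[2]

print(char_check)    # character-level test
print(column_check)  # column-level test
```

With these two lines the character-level test passes even though neither ID matches, while the column-level test does not.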
The results are not exactly what I was expecting; apparently it finds matches for the first wanted column only. I traced the procedure manually and found that, for instance, BRADI5G02480.1_5 exists in both files, but NCRNA_11838_6689 exists only in the Left file, not in the Right.
Aren't the two files supposed to be mirror images of each other, aside from the numerical values?
I used this thread to write the script, but it compares line by line and doesn't check the rest of the column's contents for matches.
I also found this, but it uses dictionaries and lists, which isn't suitable for my files' size.
To handle the simultaneous iteration I used this thread, but what was mentioned there about handling files of different sizes wasn't really clear to me, so I haven't tried it yet.
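One possible way around the different file sizes, sketched here under assumptions rather than as a tested solution: index only the smaller Left file by its (Query ID, Subject ID) pair and stream the larger Right file one row at a time, so rows no longer have to sit at the same position in both files. This assumes the Left file fits in memory and that the files are tab-separated with one header row. The function name `intersect_pairs` and the Right-side rows in the demo are made up for illustration.

```python
import csv
import io


def intersect_pairs(left_file, right_file, out_file):
    """Write rows whose (Query ID, Subject ID) pair occurs in both inputs."""
    left_reader = csv.reader(left_file, delimiter="\t")
    next(left_reader, None)  # skip the header row
    # Index only the (smaller) Left file: (query, subject) -> similarity.
    # If a pair is repeated in Left, the last similarity wins.
    left_index = {(r[1], r[2]): r[0] for r in left_reader if len(r) >= 3}

    right_reader = csv.reader(right_file, delimiter="\t")
    next(right_reader, None)  # skip the header row
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(["Query ID", "Subject ID",
                     "Left Similarity (%)", "Right Similarity (%)"])

    matches = 0
    for row in right_reader:  # the Right file is streamed, never held in memory
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        key = (row[1], row[2])
        if key in left_index:
            writer.writerow([key[0], key[1], left_index[key], row[0]])
            matches += 1
    return matches


# Demo with in-memory files; the Right-side rows are hypothetical.
header = "Similarity (%)\tQuery ID\tSubject ID\n"
left = io.StringIO(header
                   + "100.000000\tBRADI5G01462.1_1\tBRADI5G16060.1_36\n"
                   + "90.000000\tBRADI5G02480.1_5\tNCRNA_11838_6689\n")
right = io.StringIO(header
                    + "95.000000\tBRADI5G01462.1_1\tBRADI5G16060.1_36\n"
                    + "90.000000\tBRADI5G02480.1_5\tNCRNA_2505_6156\n")
out = io.StringIO()
count = intersect_pairs(left, right, out)
print(count)
print(out.getvalue())
```

With real files, the same function could be called with the three handles opened via `open(..., newline='')`. Whether a dictionary over the Left file alone is small enough would need to be checked against the actual file sizes.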
I would really appreciate it if someone could tell me what I'm missing here: is the code correct, or am I using the in condition wrong?
Please, I really need help with this. Thanks in advance :)