
I'm working on a script to find the intersection between large CSV files based on the contents of only two specific columns in each file: Query ID and Subject ID.

The files come in Left and Right pairs, one pair per species. Every file looks something like this:

Similarity (%)  Query ID    Subject ID
100.000000  BRADI5G01462.1_1    BRADI5G16060.1_36
90.000000   BRADI5G02480.1_5    NCRNA_11838_6689
100.000000  BRADI5G06067.1_8    NCRNA_32597_1525
90.000000   BRADI5G08380.1_12   NCRNA_32405_1776
100.000000  BRADI5G09460.2_17   BRADI5G16060.1_36
90.909091   BRADI5G10680.1_20   NCRNA_2505_6156

Right files are always longer and larger in size than Left ones!

Here's the code snippet I have so far:

import csv
with open('#Left(Brachypodium_Japonica).csv', 'r',newline='') as Afile, open('#Right(Brachypodium_Japonica).csv', 'r',newline='') as Bfile, open('Intrsc-(Brachypodium_Japonica).csv','w',newline='') as Intrsct:
    reader1=csv.reader(Afile,delimiter="\t",skipinitialspace=True)
    next(reader1,None)
    reader2=csv.reader(Bfile,delimiter="\t",skipinitialspace=True)
    next(reader2,None)
    Intrsct = csv.writer(Intrsct, delimiter="\t",skipinitialspace=True)
    Intrsct.writerow(["Query ID","Subject ID","Left Similarity (%)","Right Similarity (%)"])
    for row1 ,row2 in zip(Afile,Bfile):
            if ((row1[1] in row2[1] and row1[2] in row2[2])):
                Intrsct.writerow([row1.strip().split('\t')[1],row1.strip().split('\t')[2],row1.strip().split('\t')[0],row2.strip().split('\t')[0]])

The code above iterates over the records of the two files simultaneously and looks for the contents of columns 1 and 2 of each row of the first file in columns 1 and 2 of the second file. That is, it compares column-wise (Query ID in both files, as well as Subject ID) and writes the matches to a new file in a certain order.

The results are not exactly what I was expecting; apparently it finds matches for the first wanted column only. I traced the procedure manually and found that BRADI5G02480.1_5, for instance, exists in both files, but NCRNA_11838_6689 exists only in the Left file, not the Right!

Aren't they supposed to be mirror images of each other, aside from the numerical values?

I used this thread to write the script, but it compares line by line and doesn't check the rest of the column's contents for matches.

I also found this, but it uses dictionaries and lists, which aren't suitable for files of this size.

To handle the simultaneous iteration I used this thread, but what it said about handling files of different sizes wasn't really clear to me, so I haven't tried it yet.

I would really appreciate it if someone could tell me what I'm missing here. Is the code correct, or am I using the in condition wrong?

Please, I really need help with this. Thanks in advance :)

Bara'a

1 Answer


The following solution is a copy of my answer given to your other question, and should hopefully give you an idea how to integrate it with your current solution.

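The script reads two (or more) CSV files and writes the intersection of their rows to a new CSV file. By that I mean that if a row in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on. Note that whole rows are compared, so a row only counts as a match if every column, including the similarity value, is identical in both files.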

import csv

files = ["input1.csv", "input2.csv"]
ldata = []  # one set of rows per input file

for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        # Store each row as a tuple so that it is hashable and can go in a set
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)

with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    # Keep only the rows that are present in every input file
    csv_output.writerows(set.intersection(*ldata))

You will need to add your file name mangling; this format made it easier to test. Tested using Python 2.7. On Python 3, open the output file with "w" and newline="" instead of "wb".
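To match on Query ID and Subject ID only, rather than on whole rows, one possible adaptation is sketched below for Python 3. It is not tested against the real data and rests on a few assumptions: the files are genuinely tab-delimited with a single header row, the columns are in the order shown in the question, and the smaller Left file fits in memory as a dictionary while the larger Right file is streamed row by row. The function name intersect_on_ids and its parameters are made up for illustration.

```python
import csv


def intersect_on_ids(left_path, right_path, out_path):
    """Write the rows whose (Query ID, Subject ID) pair occurs in both files.

    Hypothetical helper: only the Left file is held in memory,
    the Right file is read one row at a time.
    Returns the number of matching rows written.
    """
    # Map each (Query ID, Subject ID) pair in the Left file to its similarity.
    # If a pair occurs more than once, the last similarity wins.
    with open(left_path, newline="") as left_file:
        reader = csv.reader(left_file, delimiter="\t", skipinitialspace=True)
        next(reader, None)  # skip the header row
        left = {(row[1], row[2]): row[0] for row in reader if len(row) >= 3}

    matches = 0
    with open(right_path, newline="") as right_file, \
            open(out_path, "w", newline="") as out_file:
        reader = csv.reader(right_file, delimiter="\t", skipinitialspace=True)
        next(reader, None)  # skip the header row
        writer = csv.writer(out_file, delimiter="\t")
        writer.writerow(["Query ID", "Subject ID",
                         "Left Similarity (%)", "Right Similarity (%)"])
        for row in reader:
            if len(row) < 3:
                continue  # ignore blank or malformed lines
            key = (row[1], row[2])
            if key in left:  # exact match on both IDs, not a substring test
                writer.writerow([key[0], key[1], left[key], row[0]])
                matches += 1
    return matches
```

It could be called as intersect_on_ids('#Left(Brachypodium_Japonica).csv', '#Right(Brachypodium_Japonica).csv', 'Intrsc-(Brachypodium_Japonica).csv'). Unlike zip, which pairs line 1 with line 1, line 2 with line 2 and stops at the end of the shorter file, this checks every Right row against all Left rows, so the order and length of the two files do not matter.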

Martin Evans