I have two large CSV files of thousands of entries; each file contains two columns of IDs of the form:

BRADI5G01462.1_1    NCRNA_34654_1853
BRADI5G01462.1_1    NCRNA_34398_1942
BRADI5G01462.1_1    NCRNA_2871_1959

I've tried this, but it's not giving the expected results:

import csv
files=["#Left(Brachypodium_Japonica).csv","#Right(Brachypodium_Japonica).csv"]
for i in range(len(files)):
    name=files[i][files[i].find("#")+1:files[i].find(".")]
    with open(files[i],"r",newline='') as source:
         rdr= csv.reader( source,delimiter="\t",skipinitialspace=True )
         with open("@"+name+".csv","w",newline='') as result:
              wtr= csv.writer( result,delimiter="\t",skipinitialspace=True )
              for r in rdr:
                  wtr.writerow( (r[1],r[2]) )


l1 = set(open('@Left(Brachypodium_Japonica).csv'))
l2 = set(open('@Right(Brachypodium_Japonica).csv'))
open('Intersection(Brachypodium_Japonica).csv', 'w').writelines(l1 & l2)

What is the most efficient, Pythonic way to find the intersection between the two files, i.e. rows where both columns match exactly in both files?

I've asked this question before, but no one helped.

I'm really stuck on this, and any help would be highly appreciated.

Edit:
File 1 (Left) input sample:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055

File 2 (Right) input sample:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI4G45180.1_240  OS03T0103800-01_2473
BRADI2G12470.2_487  OS04T0470600-00_3504
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78330.1_878  OS06T0155600-01_4411

Intersection file of Left & Right:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI1G73670.1_850  OS11T0481500-01_6621
Bara'a
  • Is the input ordered? – Sirko Jul 20 '15 at 11:19
  • @Sirko No, it's not ordered and the two files are NOT necessarily of the same size – Bara'a Jul 20 '15 at 11:21
  • I have not tried it myself, but I have read in many places that the numpy package is often suggested for working on large sets like yours. – Marco Jul 20 '15 at 11:24
  • @Marco Numpy is for numerical data, and my data is of string type !! – Bara'a Jul 20 '15 at 11:26
  • Hi, could you give us a little more information, please? For example: what are the (maximum) sizes of the datasets, on which system will it run (i.e. Linux 64-bit, Windows 32-bit...), and does it have to be done in "one operation", or are you allowed to make a first pass to "refine" the inputs? – bufh Jul 20 '15 at 11:29
  • Proper indentation of python code is important when trying to read it and identify loops and blocks. Also, the indentation is actually used by python. The line after `for`, `name=` and the with block don't look right. – Paul Jul 20 '15 at 11:35
  • @bufh The maximum data size is 10,000 records in each file, it will run on Windows 7 64-bit, and yes, it has to be done in one operation to avoid pipeline complexity. Thanks :) – Bara'a Jul 20 '15 at 11:39
  • This sounds like a job for sqlite, or some other sql database as it is a simple join on equality. – Paul Jul 20 '15 at 11:41
  • you could also try https://docs.python.org/2/library/difflib.html – Anentropic Jul 20 '15 at 11:42
  • @Paul I've corrected the indentation, sorry about that. Would you please explain more about sqlite and how I can use it here? – Bara'a Jul 20 '15 at 11:43
  • @Bara'a using [sqlite](https://docs.python.org/2/library/sqlite3.html) or a [dbm](https://docs.python.org/2/library/anydbm.html) would require additional steps, and for some reason you didn't want "pipeline complexity" (?). – bufh Jul 20 '15 at 12:57
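To illustrate Paul's SQLite suggestion above, here is a minimal sqlite3 sketch of the join-on-equality idea. The file names, table names and two-column tab-separated layout are assumptions, not part of the original question:

import csv
import sqlite3

# rough sketch: load both files into an in-memory database and join on both columns
conn = sqlite3.connect(":memory:")   # use a file path instead for very large data
cur = conn.cursor()
cur.execute("CREATE TABLE left_ids (a TEXT, b TEXT)")
cur.execute("CREATE TABLE right_ids (a TEXT, b TEXT)")

def load(table, filename):
    # bulk-insert the two tab-separated columns of one file
    with open(filename, newline='') as f:
        rows = csv.reader(f, delimiter="\t", skipinitialspace=True)
        cur.executemany("INSERT INTO {} VALUES (?, ?)".format(table), rows)

load("left_ids", "left.csv")     # placeholder file names
load("right_ids", "right.csv")

# an inner join on both columns gives the whole-row intersection
cur.execute("""SELECT DISTINCT l.a, l.b
               FROM left_ids AS l
               JOIN right_ids AS r ON l.a = r.a AND l.b = r.b""")

with open("intersection.csv", "w", newline='') as out:
    csv.writer(out, delimiter="\t").writerows(cur.fetchall())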

4 Answers


When I looked into your code, it does the following:

-> you have two files

-> you rewrite them into two new files

-> finally you add them to the Intersection file

Let's remove that second step:

import csv

files = ["#Left(Brachypodium_Japonica).csv", "#Right(Brachypodium_Japonica).csv"]
for i in files:
    with open(i, "r", newline='') as source:
        rdr = csv.reader(source, delimiter="\t", skipinitialspace=True)
        # append ("a") so the second file's rows are added after the first file's
        with open('Intersection(Brachypodium_Japonica).csv', "a", newline='') as result:
            wtr = csv.writer(result, delimiter="\t", skipinitialspace=True)
            for r in rdr:
                wtr.writerow((r[0], r[1]))

The steps:

-> read the first file and add its rows to the intersection file

-> read the next file and continue appending

Try this, and if it does not work, let me know. I'm not sure why you used `newline` and `skipinitialspace`, but if you know why, you may keep them. I have rewritten the code properly, following your format.

I tried it on my system and it worked; if it is not working, please post the traceback or the error.

Ja8zyjits
  • Hhhhh you might not have read the original post I wrote 4 months ago. I did rewrite the files into new ones to eliminate the first column (which represented the similarity), in order to use the `set` and `intersection` functions without much trouble. BTW, it's NOT about appending, it's about finding the intersection between two large files of different sizes!! – Bara'a Jul 20 '15 at 12:06
  • @Bara'a I shall try a fix for that soon. – Ja8zyjits Jul 20 '15 at 12:23
  • I would really appreciate that if you could help, I'm waiting for an answer for over 4 months :( – Bara'a Jul 20 '15 at 12:32
  • @Bara'a can you show a small snippet of your expected input of two files and expected output in the single file. edit it in your question. – Ja8zyjits Jul 20 '15 at 12:34
  • Please, check the edit in the thread ... Thank you for your help – Bara'a Jul 20 '15 at 15:07

The following script reads two (or more) CSV files in and writes the intersection of row entries to a new CSV file. By that I mean if row1 in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on.

import csv

files = ["input1.csv", "input2.csv"]
ldata = []

for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        # collect each file's rows as a set of tuples
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)

# write only the rows present in every input file
with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    csv_output.writerows(set.intersection(*ldata))

You will need to add your file name mangling. This format made it easier to test. Tested using Python 2.7.
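For anyone on Python 3, a minimal adaptation of the same idea might look like the sketch below; text mode with newline='' replaces the Python 2 "wb" mode, and the file names are placeholders as above:

import csv

files = ["input1.csv", "input2.csv"]
row_sets = []

for name in files:
    # collect each file's rows as a set of tuples
    with open(name, newline='') as f_input:
        reader = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        row_sets.append({tuple(row) for row in reader})

# keep only the rows that appear in every file
with open("Intersection(Brachypodium_Japonica).csv", "w", newline='') as f_output:
    writer = csv.writer(f_output, delimiter="\t")
    writer.writerows(set.intersection(*row_sets))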

Martin Evans
  • He had the idea of a dictionary and a list, but since it is about 1000 and more lines per file, the OP says he opted out of it. – Ja8zyjits Jul 20 '15 at 12:42

Here I'm making some assumptions; the code will be more complex if they are wrong:

  • the index will be the "two" fields you say are similar between the datasets (in my example the fields used for the index are the first two on the line)
  • the "index" is unique within a dataset (it appears no more than once in left.csv and no more than once in right.csv)
  • you want to minimise memory usage
  • the separator (\t) is not present within the fields

import mmap

indexes = {}

left_fp = open('left.csv', 'r')
left = mmap.mmap(left_fp.fileno(), 0, access=mmap.ACCESS_READ)

while True:
        start = left.tell()
        line = left.readline()
        if not line: break
        # extract only the two columns you check
        cells = line.split('\t')[0:2]
        # store line position in left file
        indexes['\t'.join(cells)] = (start, left.tell() - start)


output = open('output.csv', 'w')

for line in open('right.csv'):
        # recreate the key from the right-hand line
        cells = line.split('\t')[0:2]
        key = '\t'.join(cells)
        # look up the left line position (None if the key is not in left.csv)
        pos = indexes.get(key)
        if pos:
                # got the left line position
                left.seek(pos[0], 0)
                # write the matching left row
                output.write(left.read(pos[1]))
                # write right row
                output.write(line)

output.close()
bufh

I just had a similar issue where I needed to find the intersection of two rather large CSV files (74M lines vs. 4.5M lines). The basic idea is to read the smaller file and build a lookup table which can be used while iterating over the larger file.

The fastest way I found was using the jq command-line tool.

smaller.csv:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055

larger.csv:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI4G45180.1_240  OS03T0103800-01_2473
BRADI2G12470.2_487  OS04T0470600-00_3504
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78330.1_878  OS06T0155600-01_4411
...

Prepare a lookup table from smaller.csv in the form of a JSON dict and write it to smaller.json:

$ jq -s -R 'split("\n")|map({key:.,value:1})|from_entries' smaller.csv | tee smaller.json
{
  "BRADI5G16060.1_36   OS08T0547100-02_5715": 1,
  "BRADI3G00440.1_243  OS03T0274400-01_2650": 1,
  "BRADI3G58610.1_438  OS01T0112500-01_899": 1,
  "BRADI1G73670.1_850  OS11T0481500-01_6621": 1,
  "BRADI1G78150.1_870  OS02T0543300-00_2055": 1
}

Now iterate over larger.csv and filter it, using smaller.json as a lookup table:

$ jq -R -r --slurpfile F smaller.json 'select($F[0][.])' larger.csv | tee intersection.csv
BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI1G73670.1_850  OS11T0481500-01_6621
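
The same smaller-file-lookup idea is easy to express in plain Python as well. A minimal sketch, assuming whole lines are compared verbatim exactly as in the jq version (file names as above):

# build a set of lines from the smaller file, then stream the larger one
with open("smaller.csv") as small:
    lookup = {line.rstrip("\n") for line in small}

# keep only the lines of the larger file that are also in the lookup set
with open("larger.csv") as large, open("intersection.csv", "w") as out:
    for line in large:
        if line.rstrip("\n") in lookup:
            out.write(line)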
muhqu