I have two large CSV files of thousands of entries; each file contains two columns of IDs of the form:

BRADI5G01462.1_1    NCRNA_34654_1853
BRADI5G01462.1_1    NCRNA_34398_1942
BRADI5G01462.1_1    NCRNA_2871_1959

I've tried this, but it's not giving the expected results:

import csv
files=["#Left(Brachypodium_Japonica).csv","#Right(Brachypodium_Japonica).csv"]
for i in range(len(files)):
    name=files[i][files[i].find("#")+1:files[i].find(".")]
    with open(files[i],"r",newline='') as source:
         rdr= csv.reader( source,delimiter="\t",skipinitialspace=True )
         with open("@"+name+".csv","w",newline='') as result:
              wtr= csv.writer( result,delimiter="\t",skipinitialspace=True )
              for r in rdr:
                  wtr.writerow( (r[1],r[2]) )


l1 = set(open('@Left(Brachypodium_Japonica).csv'))
l2 = set(open('@Right(Brachypodium_Japonica).csv'))
open('Intersection(Brachypodium_Japonica).csv', 'w').writelines(l1 & l2)

What is the most efficient, Pythonic way to find the intersection between the two files, i.e. rows where both columns match exactly in both files?

I've asked this question before, but no one helped.

I'm really stuck on this, and any help would be highly appreciated.

Edit:
File 1 (Left) input sample:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055

File 2 (Right) input sample:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI4G45180.1_240  OS03T0103800-01_2473
BRADI2G12470.2_487  OS04T0470600-00_3504
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78330.1_878  OS06T0155600-01_4411

Intersection file of Left & Right:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI1G73670.1_850  OS11T0481500-01_6621
Bara'a
  • Is the input ordered? – Sirko Jul 20 '15 at 11:19
  • @Sirko No, it's not ordered and the two files are NOT necessarily of the same size – Bara'a Jul 20 '15 at 11:21
  • I have not tried it myself, but I have read in many places that the numpy package is often suggested for working on large sets like yours. – Marco Jul 20 '15 at 11:24
  • @Marco Numpy is for numerical data, and my data is of string type !! – Bara'a Jul 20 '15 at 11:26
  • Hi, could you give us a little more information, please? For example: what are the (maximum) sizes of the datasets, on which system will it run (i.e. Linux 64-bit, Windows 32-bit...), and does it have to be done in "one operation", or are you allowed to make a first pass to "refine" the inputs? – bufh Jul 20 '15 at 11:29
  • Proper indentation of python code is important when trying to read it and identify loops and blocks. Also, the indentation is actually used by python. The line after `for`, `name=` and the with block don't look right. – Paul Jul 20 '15 at 11:35
  • @bufh The maximum data size is 10,000 records in each file, it will run on Windows 7 64-bit, and yes, it has to be done in one operation to avoid pipeline complexity. Thanks :) – Bara'a Jul 20 '15 at 11:39
  • This sounds like a job for sqlite, or some other sql database as it is a simple join on equality. – Paul Jul 20 '15 at 11:41
  • you could also try https://docs.python.org/2/library/difflib.html – Anentropic Jul 20 '15 at 11:42
  • @Paul I've corrected the indentation, sorry about that. Would you please explain more about sqlite and how I can use it here? – Bara'a Jul 20 '15 at 11:43
  • @Bara'a using [sqlite](https://docs.python.org/2/library/sqlite3.html) or a [dbm](https://docs.python.org/2/library/anydbm.html) would require additional steps, and for some reason you didn't want "pipeline complexity" (?). – bufh Jul 20 '15 at 12:57
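To illustrate Paul's SQLite suggestion above, here is a minimal sqlite3 sketch of the join-on-equality idea. The file names, table names and two-column tab-separated layout are assumptions, not part of the original question:

import csv
import sqlite3

# rough sketch: load both files into an in-memory database and join on both columns
conn = sqlite3.connect(":memory:")   # use a file path instead for very large data
cur = conn.cursor()
cur.execute("CREATE TABLE left_ids (a TEXT, b TEXT)")
cur.execute("CREATE TABLE right_ids (a TEXT, b TEXT)")

def load(table, filename):
    # bulk-insert the two tab-separated columns of one file
    with open(filename, newline='') as f:
        rows = csv.reader(f, delimiter="\t", skipinitialspace=True)
        cur.executemany("INSERT INTO {} VALUES (?, ?)".format(table), rows)

load("left_ids", "left.csv")     # placeholder file names
load("right_ids", "right.csv")

# an inner join on both columns gives the whole-row intersection
cur.execute("""SELECT DISTINCT l.a, l.b
               FROM left_ids AS l
               JOIN right_ids AS r ON l.a = r.a AND l.b = r.b""")

with open("intersection.csv", "w", newline='') as out:
    csv.writer(out, delimiter="\t").writerows(cur.fetchall())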

4 Answers


When I looked into your code, it does the following:

-> you have two files

-> you rewrite them into two new files

-> finally you add them to the Intersection file

Let's remove that second step:

import csv

files = ["#Left(Brachypodium_Japonica).csv", "#Right(Brachypodium_Japonica).csv"]
for i in files:
    with open(i, "r", newline='') as source:
        rdr = csv.reader(source, delimiter="\t", skipinitialspace=True)
        # append ("a") so the second file's rows are added after the first file's
        with open('Intersection(Brachypodium_Japonica).csv', "a", newline='') as result:
            wtr = csv.writer(result, delimiter="\t", skipinitialspace=True)
            for r in rdr:
                wtr.writerow((r[0], r[1]))

The steps:

-> read the first file and add its rows to the intersection file

-> read the next file and continue appending

Try this, and if it does not work, let me know. I'm not sure why you used `newline` and `skipinitialspace`, but if you know why, you may keep them. I have rewritten the code properly, following your format.

I tried it on my system and it worked; if it is not working, please post the traceback or the error.

Ja8zyjits
  • Hhhhh you might not have read the original post I wrote 4 months ago. I did rewrite the files into new ones to eliminate the first column (which represented the similarity), in order to use the `set` and `intersection` functions without much trouble. BTW, it's NOT about appending, it's about finding the intersection between two large files of different sizes!! – Bara'a Jul 20 '15 at 12:06
  • @Bara'a I shall try a fix for that soon. – Ja8zyjits Jul 20 '15 at 12:23
  • I would really appreciate that if you could help, I'm waiting for an answer for over 4 months :( – Bara'a Jul 20 '15 at 12:32
  • @Bara'a can you show a small snippet of your expected input of two files and expected output in the single file. edit it in your question. – Ja8zyjits Jul 20 '15 at 12:34
  • Please, check the edit in the thread ... Thank you for your help – Bara'a Jul 20 '15 at 15:07

The following script reads two (or more) CSV files in and writes the intersection of row entries to a new CSV file. By that I mean if row1 in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on.

import csv

files = ["input1.csv", "input2.csv"]
ldata = []

for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        # collect each file's rows as a set of tuples
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)

# write only the rows present in every input file
with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    csv_output.writerows(set.intersection(*ldata))

You will need to add your file name mangling. This format made it easier to test. Tested using Python 2.7.
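For anyone on Python 3, a minimal adaptation of the same idea might look like the sketch below; text mode with newline='' replaces the Python 2 "wb" mode, and the file names are placeholders as above:

import csv

files = ["input1.csv", "input2.csv"]
row_sets = []

for name in files:
    # collect each file's rows as a set of tuples
    with open(name, newline='') as f_input:
        reader = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        row_sets.append({tuple(row) for row in reader})

# keep only the rows that appear in every file
with open("Intersection(Brachypodium_Japonica).csv", "w", newline='') as f_output:
    writer = csv.writer(f_output, delimiter="\t")
    writer.writerows(set.intersection(*row_sets))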

Martin Evans
  • He had the idea of a dictionary and a list, but since it is about 1000 and more lines per file, the OP says he opted out of it. – Ja8zyjits Jul 20 '15 at 12:42

Here I'm making some assumptions; the code will be more complex if they are wrong:

  • the index will be the "two" fields you say are similar between the datasets (in my example the fields used for the index are the first two on the line)
  • the "index" is unique within a dataset (it appears no more than once in left.csv and no more than once in right.csv)
  • you want to minimise memory usage
  • the separator (\t) is not present within the fields

import mmap

indexes = {}

left_fp = open('left.csv', 'r')
left = mmap.mmap(left_fp.fileno(), 0, access=mmap.ACCESS_READ)

while True:
        start = left.tell()
        line = left.readline()
        if not line: break
        # extract only the two columns you check
        cells = line.split('\t')[0:2]
        # store line position in left file
        indexes['\t'.join(cells)] = (start, left.tell() - start)


output = open('output.csv', 'w')

for line in open('right.csv'):
        # recreate the key from the right-hand line
        cells = line.split('\t')[0:2]
        key = '\t'.join(cells)
        # look up the left line position (None if the key is not in left.csv)
        pos = indexes.get(key)
        if pos:
                # got the left line position
                left.seek(pos[0], 0)
                # write the matching left row
                output.write(left.read(pos[1]))
                # write right row
                output.write(line)

output.close()
bufh

I just had a similar issue where I needed to find the intersection of two rather large CSV files (74M lines vs. 4.5M lines). The basic idea is to read the smaller file and build a lookup table which can be used while iterating over the larger file.

The fastest way I found was using the jq command-line tool.

smaller.csv:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055

larger.csv:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI4G45180.1_240  OS03T0103800-01_2473
BRADI2G12470.2_487  OS04T0470600-00_3504
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78330.1_878  OS06T0155600-01_4411
...

Prepare a lookup table from smaller.csv in the form of a JSON dict and write it to smaller.json:

$ jq -s -R 'split("\n")|map({key:.,value:1})|from_entries' smaller.csv | tee smaller.json
{
  "BRADI5G16060.1_36   OS08T0547100-02_5715": 1,
  "BRADI3G00440.1_243  OS03T0274400-01_2650": 1,
  "BRADI3G58610.1_438  OS01T0112500-01_899": 1,
  "BRADI1G73670.1_850  OS11T0481500-01_6621": 1,
  "BRADI1G78150.1_870  OS02T0543300-00_2055": 1
}

Now iterate over larger.csv and filter it, using smaller.json as a lookup table:

$ jq -R -r --slurpfile F smaller.json 'select($F[0][.])' larger.csv | tee intersection.csv
BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI1G73670.1_850  OS11T0481500-01_6621
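
The same smaller-file-lookup idea is easy to express in plain Python as well. A minimal sketch, assuming whole lines are compared verbatim exactly as in the jq version (file names as above):

# build a set of lines from the smaller file, then stream the larger one
with open("smaller.csv") as small:
    lookup = {line.rstrip("\n") for line in small}

# keep only the lines of the larger file that are also in the lookup set
with open("larger.csv") as large, open("intersection.csv", "w") as out:
    for line in large:
        if line.rstrip("\n") in lookup:
            out.write(line)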
muhqu