0

I have a large csv file (~250M lines) with the following structure

ID1, ID2, value
A, B, 5
B,C, 8
C,B, 4

I want to get a table which tells me if the pair (ID1,ID2) is reciprocated in the file. So the output should be something like:

ID1, ID2, Reciprocity
A,B,0
B,C,1
C,B,1

I would do this by creating a dictionary and checking if the key ID2+ID1 is in the dictionary, but the dictionary becomes larger than my RAM. I've tried using networkx but can't create the graph because I also run out of RAM.

What is an option that doesn't require loading the whole file into the RAM but is also not prohibitively long in terms of reading from the hard drive in a loop?

Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
Alexis Eggermont
  • 7,665
  • 24
  • 60
  • 93
  • 2
    What do you mean by "reciprocated?" And what code have you tried so far? – Russia Must Remove Putin Mar 11 '14 at 00:08
  • The line B,C, (value) is reciprocated if the line C,B (othervalue) exists in the CSV file – Alexis Eggermont Mar 11 '14 at 00:10
  • 1
    Why would you want the entire file into ram, you are using python, use generators and output the result to a file as you read the other big file continuously. You would need mechanisms to jump inside a file but yeah, loading the entire file is ludicrous, and besides there is no need. – Claudiordgz Mar 11 '14 at 00:14
  • Are you checking for just the presence of a reciprocal pair? Or do you need to count occurrences? – alecbz Mar 11 '14 at 00:18
  • @alecbenzer, that explains it, still, outputting the set of the pairs directly to a file in disk would solve his problems. Just open the file, jump the were you want to edit, edit, and close. Repeat. You can even change the output file to a database and get great results – Claudiordgz Mar 11 '14 at 00:19
  • @Claudiordgz I believe he is trying to trade memory space for time complexity. If he can load everything into a dict then the time complexity for the operation would be just O(N). If he has to search for each item then it's O(N**2). Or did I miss something? Alexis, can you confirm? Also you will probably get more constructive answers if you share the code you have already used. – KobeJohn Mar 11 '14 at 00:30
  • 1
    why not load the file with one command to a database (e.g sqlite) and then run a query to output all the lines for which ID1||ID2=ID2||ID1 (add indexes on ID1,ID2 to speed up the process) ? – Amnon Mar 11 '14 at 00:36
  • @Amnon this is an excellent idea. – Claudiordgz Mar 11 '14 at 01:29

3 Answers3

1

Probably the best and most scalable solution would be to import the CSV into an SQLite database. Then create a new table for "Reciprocity" and from there, find all of the potential pairs following this approximate pseudocode:

Load CSV into a database with Table DATA
Create RECIPROCITY Table with columns ID1,ID2,Reciprocal
Iterate through each row R in DATA:
    Let A,B = DATA.ID1, DATA.ID2
    Search RECIPROCITY for A,B
      If A,B doesnt exist add a new row
    Search RECIPROCITY for B,A
      If B,A exists update add update RECIPROCITY.Reciprocal for A,B and B,A
Community
  • 1
  • 1
1

Are you using a UNIX-ish operating system? Here is a way to list "reciprocal pairs":

$ cat data.txt
A,B, 5
B,C, 8
C,B, 4
$ cat data.txt |awk -F',' '{ if ($1<$2) print $1" "$2; else print $2" "$1}' | sort |uniq -c | awk '$1>1 {print $2" "$3}'
B C
Aric
  • 24,511
  • 5
  • 78
  • 77
0

Try using shelve:

import shelve

pairs = shelve.open('myshelf')

with open('data', 'r') as f:
    for line in f:
        id1, id2, value = [s.strip() for s in line.split(',')]
        pairs[id1 + id2] = True

with open('processed', 'w') as f:
    for (id1, id2) in pairs:
        if id2 + id1 in pairs:
            f.write('%s, %s, 1\n' % (id1, id2))
        else:
            f.write('%s, %s, 0\n' % (id1, id2))

pairs.close()
alecbz
  • 6,292
  • 4
  • 30
  • 50