
I have a large file A.csv (~5 GB) with several columns. One of the columns is Model. There is another large file B.csv (~15 GB) with Vendor, Name and Model columns.

Two questions:

1) How can I create a result file that combines all columns from A.csv and the corresponding Vendor and Name from B.csv (join on Model)? The trick is how to do it when my RAM is only 4 GB, and I'm using Python.

2) How can I create a sample (say, 1 GB) result file that combines a random subsample from A.csv (all columns) joined with Vendor and Name from B.csv? The trick is, again, the 4 GB of RAM.

I know how to do it in pandas, but 4 GB is a limiting factor I can't overcome :(

night_bat
  • they're just csv files. read a line from each file, combine them, write out the new line. unless your lines are multi-gigs each, there's no reason you couldn't handle a near infinite sized file, since you only ever have one line of each in ram at any given time. – Marc B May 25 '15 at 15:00
  • It's the join that seems to be the problem – canyon289 May 25 '15 at 15:00
  • I'm pretty sure he needs to do the join itself before writing anything anywhere. You can't just take line 1 from A and line 1 from B. You have to map corresponding lines from each file first. – konart May 25 '15 at 15:02
  • I agree with Konart. Perhaps the best way is to create a map of the join column of both files and then join one line at a time that way. Once it's joined into one 20 gig csv, then random sampling would work. – canyon289 May 25 '15 at 15:04
  • Can you tell what's the maximum number of rows with the same `Model`? – Stefan Pochmann May 25 '15 at 15:27
  • You can try to load the files into a database as two tables, and then unload the combined data into csv using a sql join. – valentin May 25 '15 at 16:39

3 Answers


Here's an idea:

Step 1: Sort the two files by Model. Mergesort is a good fit: split each file into chunks small enough to sort in RAM, sort each chunk, then merge the sorted chunks into one large sorted file. See my answer to an earlier question for a good way to merge multiple already-sorted files. Update: see the end of my answer for example code.

Step 2: Join the two files by Model. Again, similar to the merge step of Mergesort: traverse the two sorted files "in parallel", advance each file appropriately, and join matching Model values.

Pseudocode for step 2:

open the two sorted files A and B
blockA = read block of same-model rows from A
blockB = read block of same-model rows from B
while True:
    while model of blockA differs from model of blockB:
        if model of blockA is smaller:
            blockA = read block of same-model rows from A
            quit if there isn't any (i.e. end of file reached)
        else:
            blockB = read block of same-model rows from B
            quit if there isn't any (i.e. end of file reached)
    output the cross product of blockA and blockB
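
Here's what a runnable version of that pseudocode might look like. This is a minimal sketch, assuming both files were already sorted by Model in Step 1; the file names A_sorted.csv, B_sorted.csv and joined.csv are placeholders, and the position of the Model column in A (last column) is an assumption.

import csv
from itertools import groupby
from operator import itemgetter

def blocks(reader, model_index):
    # Yield (model, rows) for each run of consecutive same-model rows,
    # so only one block per file is in RAM at any time.
    for model, rows in groupby(reader, key=itemgetter(model_index)):
        yield model, list(rows)

with open('A_sorted.csv', newline='') as fa, \
     open('B_sorted.csv', newline='') as fb, \
     open('joined.csv', 'w', newline='') as fout:
    reader_a, reader_b = csv.reader(fa), csv.reader(fb)
    header_a, header_b = next(reader_a), next(reader_b)
    writer = csv.writer(fout)
    writer.writerow(header_a + ['Vendor', 'Name'])

    # Assumption: Model is the last column of A; in B it is column 2.
    blocks_a = blocks(reader_a, len(header_a) - 1)
    blocks_b = blocks(reader_b, 2)

    block_a = next(blocks_a, None)
    block_b = next(blocks_b, None)
    while block_a and block_b:
        if block_a[0] < block_b[0]:
            block_a = next(blocks_a, None)    # advance A
        elif block_a[0] > block_b[0]:
            block_b = next(blocks_b, None)    # advance B
        else:
            # Matching model: output the cross product of the two blocks.
            for row_a in block_a[1]:
                for row_b in block_b[1]:
                    writer.writerow(row_a + row_b[:2])    # append Vendor, Name
            block_a = next(blocks_a, None)
            block_b = next(blocks_b, None)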

Another idea:

If there are relatively few Models, then it might be better to separate the rows into files by Model. For example, store rows in files A_Model1.csv, A_Model2.csv, etc and B_Model1.csv, B_Model2.csv, etc. Then take the cross product of A_Model1.csv and B_Model1.csv, of A_Model2.csv and B_Model2.csv, etc.
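
A minimal sketch of that partitioning pass over A.csv (the same loop works for B.csv), assuming the number of distinct Models is small enough to keep one open file per model and that the Model values are safe to use in file names:

import csv

writers = {}    # model -> csv.writer for that model's file
handles = []    # the underlying file objects, so we can close them

with open('A.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    model_index = header.index('Model')
    for row in reader:
        model = row[model_index]
        if model not in writers:
            out = open('A_%s.csv' % model, 'w', newline='')
            handles.append(out)
            writers[model] = csv.writer(out)
            writers[model].writerow(header)    # repeat the header per file
        writers[model].writerow(row)

for out in handles:
    out.close()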


For question 2, I'd just count the number of rows, use random.sample to pick row numbers, and then fetch those rows.

>>> import random
>>> number_of_rows = 100
>>> number_of_sample_rows = 10
>>> sorted(random.sample(range(number_of_rows), number_of_sample_rows))
[6, 18, 23, 32, 41, 44, 58, 59, 91, 96]

(then go through the file and fetch those rows)
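
For example, here's a two-pass sketch; the file names and the sample size of 10 are placeholders:

import csv, random

# First pass: count the data rows, minus the header
# (assumes no embedded newlines inside fields).
with open('joined.csv', newline='') as f:
    number_of_rows = sum(1 for _ in f) - 1

wanted = set(random.sample(range(number_of_rows), 10))

# Second pass: keep only the sampled rows.
with open('joined.csv', newline='') as f, \
     open('sample.csv', 'w', newline='') as out:
    reader, writer = csv.reader(f), csv.writer(out)
    writer.writerow(next(reader))    # copy the header
    for i, row in enumerate(reader):
        if i in wanted:
            writer.writerow(row)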

Update: Here's code and a demo for the merging in Step 1 above. I made three files, B1.csv, B2.csv and B3.csv:

B1.csv:

Vendor,Name,Model
vfoo,nhi,m1
vbar,nho,m4
vbaz,nhe,m7

B2.csv:

Vendor,Name,Model
vZ,nX,m2
vY,nZ,m6
vX,nY,m8

B3.csv:

Vendor,Name,Model
v,n3,m3
v,na,m5
v,n_,m9

Here's the merged result file, Bmerged.csv:

Vendor,Name,Model
vfoo,nhi,m1
vZ,nX,m2
v,n3,m3
vbar,nho,m4
v,na,m5
vY,nZ,m6
vbaz,nhe,m7
vX,nY,m8
v,n_,m9

And here's the code:

import csv, heapq

filenames = ('B1.csv', 'B2.csv', 'B3.csv')

# Prepare the input streams
files = [open(name, newline='') for name in filenames]  # newline='' as the csv docs recommend
readers = [csv.reader(file) for file in files]
headers = list(map(next, readers))
def model_and_row(row):
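    # Pair each row with its Model value (column 2) so that
    # heapq.merge orders rows by model.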
    return row[2], row
model_and_row_streams = [map(model_and_row, reader) for reader in readers]

# Merge them into the output file
with open('Bmerged.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(headers[0])
    for _, row in heapq.merge(*model_and_row_streams):
        writer.writerow(row)

# Close the input files
for file in files:
    file.close()

Note that I'm using Python 3. In Python 2, you'd need to use itertools.imap(model_and_row, reader) in order to not read the whole files into memory at once.
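
As a side note, here's a minimal variant assuming Python 3.5 or later, where heapq.merge accepts a key function directly and the (model, row) decoration isn't needed:

import csv, heapq
from operator import itemgetter

filenames = ('B1.csv', 'B2.csv', 'B3.csv')
files = [open(name, newline='') for name in filenames]
readers = [csv.reader(file) for file in files]
headers = [next(reader) for reader in readers]

with open('Bmerged.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(headers[0])
    for row in heapq.merge(*readers, key=itemgetter(2)):    # sort by Model
        writer.writerow(row)

for file in files:
    file.close()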

Stefan Pochmann

As @Marc B said, reading one row at a time is the solution. For the join I would do the following (in pseudocode, since I don't know Python; see the sketch after the steps).

  1. "Select distinct Model from A" on first file A.csv

Read all rows, search for Model field and collect distinct values in a list/array/map

  1. "Select distinct Model from B" on second file B.csv

Same operation as 1, but using another list/array/map

  1. Find matching models

Compare the two lists/arrays/maps finding only matching models (they will be part of the join)

  1. Do the join

Reading rows of file A which match model, read all the rows of file B which match same model and write a file C with join result. To this for all models.

Note: it's not particularly optimized.
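
A rough Python sketch of those four steps might look like this; the file names follow the question, and the per-model rescan mirrors the unoptimized approach above:

import csv

def read_rows(filename):
    # Yield (model, row) pairs one at a time, so only the current
    # row is held in memory.
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        model_index = next(reader).index('Model')
        for row in reader:
            yield row[model_index], row

# Steps 1-3: distinct models of each file, then their intersection.
models_a = {model for model, _ in read_rows('A.csv')}
models_b = {model for model, _ in read_rows('B.csv')}
matching = models_a & models_b

# Step 4: for each matching model, cross the A rows with the B rows.
with open('C.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for wanted in matching:    # note: this re-reads both files per model
        rows_a = [row for model, row in read_rows('A.csv') if model == wanted]
        rows_b = [row for model, row in read_rows('B.csv') if model == wanted]
        for row_a in rows_a:
            for row_b in rows_b:
                writer.writerow(row_a + row_b)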

For point 2, just choose a subset of the matching models and/or read only part of the rows of file A and/or B with matching models.

Fil
  • Fields (and keys) can be unique or not in their files. If the Model fields don't share the same value set, the join will return empty. Otherwise you can join tables even without primary keys defined, or with fields with non-unique values, as can happen with csv files (maybe the result will not be so useful or will be redundant). I don't know if I answered your question :P – Fil May 25 '15 at 15:23

Read the file line by line in Python. Here's a really quick and easy example:

# Copy file.csv to outputfile.csv, buffering at most a
# million lines in memory at a time.
lines = []
with open("file.csv", "r") as infile, open("outputfile.csv", "a") as output:
    for line in infile:
        lines.append(line)
        if len(lines) == 1000000:    # flush the buffer when it fills up
            output.writelines(lines)
            del lines[:]
    if lines:                        # write whatever is left over
        output.writelines(lines)

Adjust the buffer length in the if statement according to your available RAM.

rassa45