Here's an idea:
Step 1: Sort the two files by Model. Mergesort would be good for this. Split each file into smaller files small enough to sort in RAM, then after sorting each of them, merge them into one large sorted file. See my answer to an earlier question for a good way to merge multiple already sorted files. Update: See end of my answer for example/code.
Step 2: Join the two files by Model. Again similar to the merging step of Mergesort, traverse the two sorted files "in parallel", advancing each file appropriately, and join by matching Model values.
Pseudocode for step 2:
open the two sorted files A and B
blockA = read block of same-model rows from A
blockB = read block of same-model rows from B
while True:
while model of blockA differs from model of blockB:
if model of blockA is smaller:
blockA = read block of same-model rows from A
quit if there isn't any (i.e. end of file reached)
else:
blockB = read block of same-model rows from B
quit if there isn't any (i.e. end of file reached)
output the cross product of blockA and blockB
Another idea:
If there are relatively few Models, then it might be better to separate the rows into files by Model. For example, store rows in files A_Model1.csv
, A_Model2.csv
, etc and B_Model1.csv
, B_Model2.csv
, etc. Then take the cross product of A_Model1.csv
and B_Model1.csv
, of A_Model2.csv
and B_Model2.csv
, etc.
For question 2, I'd just count the number of rows, use random.sample
to pick row numbers, and then fetch those rows.
>>> import random
>>> number_of_rows = 100
>>> number_of_sample_rows = 10
>>> sorted(random.sample(range(number_of_rows), number_of_sample_rows))
[6, 18, 23, 32, 41, 44, 58, 59, 91, 96]
(then go through the file and fetch those rows)
Update: Here's code/demo for Step 2 from above. I made three files B1.csv, B2.csv and B3.csv:
Vendor,Name,Model
vfoo,nhi,m1
vbar,nho,m4
vbaz,nhe,m7
Vendor,Name,Model
vZ,nX,m2
vY,nZ,m6
vX,nY,m8
Vendor,Name,Model
v,n3,m3
v,na,m5
v,n_,m9
Here's the merging result file Bmerged.csv:
Vendor,Name,Model
vfoo,nhi,m1
vZ,nX,m2
v,n3,m3
vbar,nho,m4
v,na,m5
vY,nZ,m6
vbaz,nhe,m7
vX,nY,m8
v,n_,m9
And here's the code:
import csv, heapq
filenames = ('B1.csv', 'B2.csv', 'B3.csv')
# Prepare the input streams
files = list(map(open, filenames))
readers = [iter(csv.reader(file)) for file in files]
headers = list(map(next, readers))
def model_and_row(row):
return row[2], row
model_and_row_streams = [map(model_and_row, reader) for reader in readers]
# Merge them into the output file
with open('Bmerged.csv', 'w', newline='') as outfile:
writer = csv.writer(outfile)
writer.writerow(headers[0])
for _, row in heapq.merge(*model_and_row_streams):
writer.writerow(row)
# Close the input files
for file in files:
file.close()
Note that I'm using Python 3. In Python 2, you'd need to use itertools.imap(model_and_row, reader)
in order to not read the whole files into memory at once.