This will do what you ask for without multiprocessing, partly because you most likely don't need it.
A simple benchmark showed option 3 to be the clear winner on speed.
Option 1:
import csv
import random

starting_points = [random.randint(0, 5) for i in range(3)]
read_nbr_of_lines = 2

for sp in starting_points:
    print('random starting line: %s'%sp)
    read_lines = 0
    with open('large_csv.csv') as cf:
        lines = csv.reader(cf)
        for nbr, line in enumerate(lines):
            if nbr < sp - 1: continue
            read_lines += 1
            if read_lines > read_nbr_of_lines: break
            print(nbr, line)
Probably this will turn out to be slow on large amounts of data, but I don't really see the point in even trying to get around that, given your wish to start at a random point while still using the csv module.
You can get around reading the file from byte 0 by seeking to a starting byte with f.seek(start_byte) and then reading a chunk of the file with f.read(my_chunk_size). In that case, to get a fresh line you will have to find the row boundaries yourself via the newline character after your random starting point, write your own parser for the lines, and keep a counter of how many lines you have read.
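A minimal sketch of that byte-seek idea (the file name, chunk size and the naive comma-split parser are placeholders; it assumes no quoted fields containing commas or newlines):

import random

chunk_size = 1024 * 1024              # placeholder chunk size, adjust to taste
with open('large_csv.csv', 'rb') as f:
    f.seek(0, 2)                      # jump to the end to learn the file size
    file_size = f.tell()
    start_byte = random.randint(0, max(file_size - chunk_size, 0))
    f.seek(start_byte)                # land somewhere mid-file, probably mid-line
    chunk = f.read(chunk_size)

# the first element is almost certainly a partial line, so drop it;
# the last element may also be partial if the chunk ended mid-line
fresh_lines = chunk.split(b'\n')[1:]
rows = [line.decode().split(',') for line in fresh_lines if line]
print('parsed %s rows starting from byte %s' % (len(rows), start_byte))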
Option 2:
If your file is less than 1 GB, which is what you have stated: install numpy on your computer, read the file in one go, and select your 1e6 lines by indexing into the complete set of lines.
The result below will have dtype=np.float64; if you want to keep the integers there are ways to do that as well (a short sketch follows the code). For the details I suggest studying the numpy docs.
import random
import numpy as np

mycsv = np.genfromtxt('large_csv.csv', delimiter=',')
starting_lines = [random.randint(0, 5) for i in range(3)]
read_nbr_of_lines = 2

for sl in starting_lines:
    print('lines %s to %s'%(sl, sl+read_nbr_of_lines-1))
    print(mycsv[sl:sl+read_nbr_of_lines])
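To keep integer columns, one possibility (a sketch, assuming every column really is integer-valued; mixed columns would need a structured dtype) is to pass a dtype to genfromtxt:

import numpy as np

# assuming all columns hold integers; see the numpy docs for structured dtypes
mycsv_int = np.genfromtxt('large_csv.csv', delimiter=',', dtype=np.int64)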
Option 3:
I got a bit curious about linecache, so I made a solution for that as well.
Updated with a proper generator set-up.
import linecache as lc
import csv
import random

starting_lines = [random.randint(1, 10) for i in range(3)]
read_nbr_of_lines = 2

for sl in starting_lines:
    iterator = (lc.getline('large_csv.csv', i)
                for i in range(sl, sl+read_nbr_of_lines))
    mycsv = csv.reader(iterator)
    print('lines %s to %s'%(sl, sl+read_nbr_of_lines-1))
    for row in mycsv:
        print(row)
Simple Benchmark (Py36):
A csv with 3.5M lines, starting lines at 1M, 2M and 3M, reading 0.5M lines from each starting point. To make it somewhat fair to numpy, options 1 and 3 include a line converting each read row to a list of floats (roughly sketched below).
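Roughly, that fairness tweak looks like this for the csv-based options (a sketch, not the full benchmark script; the timing print is illustrative):

import csv
import time

t0 = time.perf_counter()
with open('large_csv.csv') as cf:
    for row in csv.reader(cf):
        last_line = [float(v) for v in row]   # mirror numpy's float output for fairness
print('timing: %.3f seconds' % (time.perf_counter() - t0))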
Results:
=====================================
random starting line: 1000000
last_line 1499999 [1.0, 1172.0, 4.0, 1260759205.0]
random starting line: 2000000
last_line 2499999 [1.0, 1263.0, 2.0, 1260759151.0]
random starting line: 3000000
last_line 3499999 [3499999.0, 1287.0, 2.0, 1260759187.0]
option 1 timing: 13.678 seconds
=====================================
random starting line: 1000000
last_line 1499999 [ 1.50000000e+06 1.26300000e+03 2.00000000e+00 1.26075915e+09]
random starting line: 2000000
last_line 2499999 [ 2.50000000e+06 1.28700000e+03 2.00000000e+00 1.26075919e+09]
random starting line: 3000000
last_line 3499999 [ 3.50000000e+06 1.29300000e+03 2.00000000e+00 1.26075915e+09]
option 2 timing: 23.453 seconds
=====================================
lines 1000000 to 1500000
last_line 1500000 [1500000.0, 1263.0, 2.0, 1260759151.0]
lines 2000000 to 2500000
last_line 2500000 [2500000.0, 1287.0, 2.0, 1260759187.0]
lines 3000000 to 3500000
last_line 3500000 [3500000.0, 1293.0, 2.0, 1260759148.0]
option 3 timing: 7.338 seconds
=====================================