1

I have to read 1,000,000 lines from a CSV file (692 MB) consisting of 2,600,000 lines and 4 columns, in multiple threads, each of which starts from a random line and stops when it has read 1 million lines.

My attempt:

from multiprocessing.pool import ThreadPool as Pool
import linecache
import random
import csv
from random import randint
from time import sleep

csvfile=csv.reader(open('sample.csv'))


def process_line(l):
  sleep(randint(0,3))
  print (l)
def get_random_line():    
  lines_to_get=random.randint(0,2600000)
  line = linecache.getline('sample.csv', lines_to_get)

  for lines_to_get, line in enumerate(csvfile):
      print (line)

      if lines_to_get >= 1000000:
        break

      yield (line)

f = get_random_line()

t = Pool(processes=3)

for i in f:
  t.map(process_line, (i,))


t.close()

But in the result, the lines do not start from a random line; it starts from the first line every time.

Result

['1', '31', '2.5', '1260759144']
['1', '1029', '3.0', '1260759179']
['1', '1061', '3.0', '1260759182']
['1', '1129', '2.0', '1260759185']
['1', '1172', '4.0', '1260759205']
['1', '1263', '2.0', '1260759151']
['1', '1287', '2.0', '1260759187']
['1', '1293', '2.0', '1260759148']
['1', '1339', '3.5', '1260759125']

The requirement is strictly that I should start from a random line every time.

Najma
  • why do you have to read chunks of 1 million lines randomly in a 2.6 million line file? There is a quite large probability that you will read the same lines multiple times unless you put in logic to adjust your random choice, and then it's no longer random... Anyway, why not look at iterators and chunking the read process, and read what you actually need. Write an algorithm that reads small chunks in one process in a way you are satisfied with, make sure that this works, then solve the issue of reading large amounts of data by invoking the smaller chunk reader in subprocesses or pools (see the sketch after these comments). – ahed87 Dec 02 '17 at 07:31
  • or if you are ok with dependencies, take a look at pandas. one example of similar problem formulation found [here](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas). – ahed87 Dec 02 '17 at 07:42
  • @ahed87 Thanks for that.. but can you tell me how to get random chunks every time? – Najma Dec 02 '17 at 08:03
  • I don't think you can, you will need to go through the whole file, and just keep what you want, and break the reading when you have got what you want. I still don't understand why you limit yourself to your stated problem formulation of how to read the file? How big is your file and what type of data does it contain? numbers? strings? how many columns? what data in a line are you looking for? All that could lead to a specific answer on how to read efficiently. why do you need random parts of the file? – ahed87 Dec 02 '17 at 08:11
  • @ahed87 I have added the required information – Najma Dec 02 '17 at 08:17
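As a rough illustration of the suggestion in the comments above (my own sketch, not code from the discussion): a small chunk reader built on itertools.islice, invoked for a few random starting lines through a process pool. The file name, line count and chunk size are taken from the question.

import csv
import random
from itertools import islice
from multiprocessing import Pool

CSV_FILE = 'sample.csv'        # file name from the question
TOTAL_LINES = 2_600_000        # line count from the question
CHUNK_SIZE = 1_000_000         # rows each worker should read

def read_chunk(start_line):
    """Read up to CHUNK_SIZE rows starting at start_line and return a small summary."""
    with open(CSV_FILE, newline='') as f:
        rows = islice(csv.reader(f), start_line, start_line + CHUNK_SIZE)
        n, first = 0, None
        for row in rows:
            if first is None:
                first = row
            n += 1             # real per-row processing would go here
        return start_line, n, first

if __name__ == '__main__':
    starts = [random.randint(0, TOTAL_LINES - 1) for _ in range(3)]
    with Pool(processes=3) as pool:
        for start, n, first in pool.map(read_chunk, starts):
            print('start %s: read %s rows, first row %s' % (start, n, first))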

3 Answers

1

This will do what you ask for without multiprocessing, partly because you most likely don't need it.

A simple benchmark made option 3 a winner in speed.

Option 1:

import csv
import random

# pick 3 random starting lines and read 2 lines from each
starting_points = [random.randint(0, 5) for i in range(3)]
read_nbr_of_lines = 2

for sp in starting_points:
    print('random starting line: %s'%sp)
    read_lines = 0
    with open('large_csv.csv') as cf:
        lines = csv.reader(cf)
        for nbr, line in enumerate(lines):
            # skip rows until the random starting line is reached
            if nbr < sp - 1: continue
            read_lines += 1
            if read_lines > read_nbr_of_lines: break
            print(nbr, line)

This will probably turn out to be slow on large amounts of data, but I don't really see a way around that if you insist on starting at a random point while using the csv module.

You can get around reading the file from byte 0 by seeking to a starting byte with f.seek(start_byte) and then reading a chunk of bytes with f.read(my_chunk_size). In that case, to get a fresh line you will have to find the row boundaries yourself by looking for the newline character after your random starting point, write your own parser for the lines, and keep a counter of how many lines you have read.
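A rough sketch of that seek-based idea (my own illustration, not benchmarked against the 692 MB file): it assumes the CSV has no quoted fields containing commas, so a plain split(',') is enough as a parser.

import os
import random

def rows_from_random_offset(path, nbr_of_rows):
    """Seek to a random byte, skip the partial line we land in, then yield full rows."""
    file_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(random.randint(0, file_size - 1))
        f.readline()                 # discard the (most likely partial) current line
        for _ in range(nbr_of_rows):
            raw = f.readline()
            if not raw:              # hit end of file before reading enough rows
                break
            yield raw.decode().rstrip('\n').split(',')

# example: 5 rows starting from a random byte offset
for row in rows_from_random_offset('large_csv.csv', 5):
    print(row)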

Option 2: Your file is less than 1 GB, as you have stated, so install numpy, read the file in one go, and select your 1e6 lines by indexing into the complete set of lines. The code below uses dtype=np.float64; if you want to keep the integers there are ways to do that as well (a sketch follows the code below), for which I suggest studying the numpy docs.

import random
import numpy as np

# read the whole file into one array (default dtype is float64)
mycsv = np.genfromtxt('large_csv.csv', delimiter=',')
starting_lines = [random.randint(0, 5) for i in range(3)]
read_nbr_of_lines = 2

for sl in starting_lines:
    print('lines %s to %s'%(sl, sl+read_nbr_of_lines-1))
    # slicing the array gives the requested block of rows
    print(mycsv[sl:sl+read_nbr_of_lines])
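On keeping the integers: one way (a sketch, with made-up column names for the data shown in the question) is to let genfromtxt infer a per-column dtype, which gives a structured array instead of a float matrix.

import numpy as np

# dtype=None lets numpy infer each column's type, so integer columns stay integers
mycsv = np.genfromtxt('large_csv.csv', delimiter=',', dtype=None,
                      names=('user', 'item', 'rating', 'timestamp'))
print(mycsv[0])            # one record, e.g. (1, 31, 2.5, 1260759144)
print(mycsv['rating'][:5]) # access a single column by name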

Option 3: I got a bit curious about linecache, so I made a solution for that as well. Updated with a proper generator set-up.

import linecache as lc
import csv
import random

starting_lines = [random.randint(1, 10) for i in range(3)]
read_nbr_of_lines = 2

for sl in starting_lines:
    # generator that pulls only the wanted lines straight from linecache
    iterator = (lc.getline('large_csv.csv', i) for
                i in range(sl, sl+read_nbr_of_lines))
    mycsv = csv.reader(iterator)
    print('lines %s to %s'%(sl, sl+read_nbr_of_lines-1))
    for row in mycsv:
        print(row)

Simple Benchmark (Py36):

A csv with 3.5M lines, starting lines 1M, 2M, 3M, and 0.5M lines read from each. To make it somewhat fair with respect to numpy, the other options have a line converting every read row to a list of floats.
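The timing code itself isn't shown here; a rough sketch of how it could be wired up, assuming time.perf_counter and one hypothetical function per option above:

import time

def time_option(label, option_func):
    # run one of the read strategies above and report its wall-clock time
    t0 = time.perf_counter()
    option_func()
    print('%s timing: %.3f seconds' % (label, time.perf_counter() - t0))
    print('=' * 37)

# time_option('option 1', option_1)   # and likewise for options 2 and 3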

Results:

=====================================
random starting line: 1000000
last_line 1499999 [1.0, 1172.0, 4.0, 1260759205.0]
random starting line: 2000000
last_line 2499999 [1.0, 1263.0, 2.0, 1260759151.0]
random starting line: 3000000
last_line 3499999 [3499999.0, 1287.0, 2.0, 1260759187.0]
option 1 timing: 13.678 seconds
=====================================
random starting line: 1000000
last_line 1499999 [  1.50000000e+06   1.26300000e+03   2.00000000e+00   1.26075915e+09]
random starting line: 2000000
last_line 2499999 [  2.50000000e+06   1.28700000e+03   2.00000000e+00   1.26075919e+09]
random starting line: 3000000
last_line 3499999 [  3.50000000e+06   1.29300000e+03   2.00000000e+00   1.26075915e+09]
option 2 timing: 23.453 seconds
=====================================
lines 1000000 to 1500000
last_line 1500000 [1500000.0, 1263.0, 2.0, 1260759151.0]
lines 2000000 to 2500000
last_line 2500000 [2500000.0, 1287.0, 2.0, 1260759187.0]
lines 3000000 to 3500000
last_line 3500000 [3500000.0, 1293.0, 2.0, 1260759148.0]
option 3 timing: 7.338 seconds
=====================================
ahed87
0

Have you tried seeding your random number generator before running it? With code like this:

import time
import random

random.seed(time.time())

add it before any random number generation

Danii-Sh
0

As far as I can understand:

line = linecache.getline('sample.csv', lines_to_get)

this is getting you the random line and storing it.

Immediately after this, in the for loop, you are replacing this "line" variable with the first line of the csv file.

for lines_to_get, line in enumerate(csvfile):
      print (line)

This is causing you to lose the random line which you set earlier.
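A possible fix (my own sketch, not part of the original answer): keep the random starting point and advance the csv reader to it with itertools.islice, instead of overwriting "line" in the loop. The file name, line count and the cap of 1,000,000 lines are taken from the question.

import csv
import random
from itertools import islice

def get_lines_from_random_start(path, max_lines=1_000_000):
    start = random.randint(0, 2_600_000 - 1)     # random 0-based starting line
    with open(path, newline='') as f:
        reader = csv.reader(f)
        # skip straight to the random starting line, then yield up to max_lines rows
        # (fewer rows come back if the start is near the end of the file)
        for line in islice(reader, start, start + max_lines):
            yield line

for row in islice(get_lines_from_random_start('sample.csv'), 5):
    print(row)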

kmcodes