Read random lines from huge CSV file

Question

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows to iterate sequentially in the file.

It's very memory consuming to read the all file into memory to use some random choosing and it's very time consuming to go trough all the file and discard some values and choose others, so is there any way to choose some random line from the CSV file and read only that line?

I tried without success:

import csv

with open('linear_e_LAN2A_F_0_435keV.csv') as file:
    reader = csv.reader(file)
    print reader[someRandomInteger]

A sample of the CSV file:

331.093,329.735
251.188,249.994
374.468,373.782
295.643,295.159
83.9058,0
380.709,116.221
352.238,351.891
183.809,182.615
257.277,201.302
61.4598,40.7106

maybe duplicate http://stackoverflow.com/questions/10605532/get-random-line-from-csv-file-in-python-and-find-corresponding-word-like-a-quiz — VP., May 30 '12 at 15:59
@VP I believe the solutions there required loading the entire file into memory. — Andrew Buss, May 30 '12 at 16:01
@VP No, in that thread you read the all file into memory before choosing the random line and that's exactly what I need to avoid. — jbssm, May 30 '12 at 16:01
What is the format of the file? Are all lines the same length? Are there literal commas in any of the lines? — Andrew Buss, May 30 '12 at 16:05
@Andre All the lines have 2 decimal values separated by a comma. But the values may have more or less numbers. I updated the question with a sample of the file. — jbssm, May 30 '12 at 16:09
See http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c — Mark Ransom, May 30 '12 at 16:14
What's your longest line? What's your shortest line? Do you have 1 billion lines? Or is the file or 15MB long? Something doesn't add up :) — Maria Zverina, May 30 '12 at 16:15
@MarkRansom: I considered this too, but it will return a highly variable number of records. — Andrew Buss, May 30 '12 at 16:15
Do you need to do this more than once? Does the data change between executions? If the csv file changes are the changes only additions or are there also edits and deletions? — Stephen Paulger, May 30 '12 at 16:16
@AndreBoos, sorry it's been a long time since I looked at that question/answer. It only returns *one* line from the file. I can fix it to do a million though. — Mark Ransom, May 30 '12 at 16:21
@MariaZverina longest line will be 15 characters, smallest 3. The file is really 15 GB and used to do some statistical analysis that will take about 1 to 10 million random lines at the time. — jbssm, May 30 '12 at 16:23
Maybe you could write an intermediate problem to rewrite the file into a fixed record format so that each line is exactly 15 bytes. Then you could pick a random number `line` from 0 to `n_lines`, and then `f.seek(15*line); readline()` — Andrew Buss, May 30 '12 at 16:26
How about putting each line of the CSV file into a record in a database and then sampling the records in the database? — the wolf, May 30 '12 at 17:06
After thinking a bit more about it, it looks like the only way possible without iterating trough the all file, is by using Maria Zverina method and making sure all lines have an equal number of chars. — jbssm, May 30 '12 at 18:23
If you can control the creation of the data, have you thought about putting these values in a sqlite3 database? The data will be smaller, and selecting random rows from the total is fast and trivial. — dawg, May 30 '12 at 21:51
@drewk That seems the most advanced solution yes, the thing is, unfortunately I won't be the only one using this numbers to do science and I'm really certain most of the other people will have no idea what is an sql database much less how to use it. — jbssm, May 30 '12 at 23:55

Maria Zverina · Accepted Answer · 2012-05-30T20:08:45.567

34

import random

filesize = 1500                 #size of the really big file
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  #go to random position
f.readline()                    # discard - bound to be partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead

As @AndreBoos pointed out, this approach will lead to biased selection. If you know min and max length of line you can remove this bias by doing the following:

Let's assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line.

Then if Lp = 3, the line is most biased against. Hence we should take it 100% of the time If Lp = 15, the line is most biased towards. We should only take it 20% of the time as it is 5* more likely selected.

We accomplish this by randomly keeping the line X% of the time where:

X = min / Lp

If we don't keep the line, we do another random pick until our dice roll comes good. :-)

edited May 30 '12 at 20:08

answered May 30 '12 at 16:03

Maria Zverina

10,863
3
44
61

4

Clever, but this will provide biased results on files with variable-length lines. – Andrew Buss May 30 '12 at 16:04
because the first readline will almost always land in the middle of a line. – parselmouth May 30 '12 at 16:06
4

@thg435: That does not fix anything. A line after a long line will be disproportionately represented compared to one after a short line. Furthermore, the first line will _never_ be read. – Andrew Buss May 30 '12 at 16:07
Actually AndreBoss is correct :) ... really long lines are more likely to get selected than really short line ... but if you sampling 15GB file ... maybe you are willing to accept this bias? – Maria Zverina May 30 '12 at 16:07
@Maria thanks, this works, but like Andre says, it has some statistical bias towards big lines. – jbssm May 30 '12 at 16:13
1

+1 for fixing the last line bug before I could post a comment ;) – Stephen Paulger May 30 '12 at 16:14
Have added extra few lines so that first line will be also selected and you won't go ever the end of file. The critique of biased selection still holds - if you absolutely need unbiased selection than you'll need to rewrite the file in fixed record format. – Maria Zverina May 30 '12 at 16:14
@AndreBoos: Looking at the sample OP provided, I don't think there will be issues. In any case, `seek` is the only way to achieve what they want in one pass. – georg May 30 '12 at 16:23
thg435: The sampling will remain biased, although it may not be a problem, depending on jbssm's purposes. Two passes is not an unreasonable approach, however. – Andrew Buss May 30 '12 at 16:29
I believe that you can do unbiased selection if you know minimum length of the line - as discussed above. But the code will be left as exercise the the reader :-) – Maria Zverina May 30 '12 at 16:38
3

@MariaZverina I think that to make an unbiased selection you need to know not only the minimum length of the line, but also the frequency of each line length in the file. Anyway, in the file, the only values that have less than 7 characters are the 0. I will try to convert my file to have 0 replaced by 0.00000 and then your routine should work perfectly. – jbssm May 30 '12 at 17:44
2

@MariaZverina, agree with jbssm's comment. To correct for the bias, you would need to know the distribution of the line lengths, then normalize, accordingly. Your normalization technique works if the distribution of line lengths 3 ... 15 is uniform, but to KNOW that this distribution is in fact uniform, we would need to read the whole file. Which puts us back to the same starting square that if we're going to read the file at least once, why not write it back out into something more amenable to fast manipulation in the future (e.g. in fixed field or as an sqlite3 table, etc.) – parselmouth May 30 '12 at 19:18
1

@parselmouth I beg to differ ... it will work regardless of uniformity. The length of the preceding line gives the probability that the line itself will be selected and it Lp/F (F = file length). Hence if you normalise the chance of each line getting selected to min/F you're done. Consider extreme case of 1 line with 3 chars and all remaining lines are 15. Then with my scheme each line has 3/F chance of getting selected. Line lengths are not uniformly distributed yet each line has equal chance of being selected. As only 20% of advantaged lines will be selected. – Maria Zverina May 30 '12 at 20:08
2

@MariaZverina ...Ok...I "did the math", and the probabilities do work out to be uniform using your second step filter. I stand corrected. (takes his hat off and bows to you) – parselmouth May 30 '12 at 20:31
Stanislav Yaglo (@stanislav-yaglo) came up with an elegant solution that will work when number of lines to be selected is relatively small. You can achieve a random unbiased line by repeatedly seeking to a random point in the file until you hit carriage return. Then simply read off the following line. :) – Maria Zverina Jul 02 '13 at 14:26
Note that seeking the file in random order will thrash disk (not as much of a problem with SSD storage, but sill unfriendly to some storage backends). When collecting many sample lines, it makes sense to collect batches of random offsets and sort the batches before fetching the lines. That way the backend storage sees a more linear progression of access. For a million lines, most machines will easily have enough memory to do this with just one batch of sorted random offsets. – Walker Hale IV Nov 11 '15 at 17:32

Shawn Chin · Answer 2 · 2012-05-30T16:59:43.313

10

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

Assuming you don't need exactly 1 million lines and know then number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and for each line determine the chances of the line being selected. That way you only need a single pass of your data.

This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).

chances_selected = desired_num_results / total_entries
for line in csv.reader(file):
   if random() < chances_selected:
        result.append(line)

edited May 30 '12 at 16:59

answered May 30 '12 at 16:17

Shawn Chin

84,080
19
162
191

Yes, reservoir sampling. But how do they find `total_entries`? – georg May 30 '12 at 16:18
2

@thg435 hence the statement *"... and know the number of lines in your CSV file"*. Such a scheme works if you're doing the sampling often and you need only count the dataset size once. – Shawn Chin May 30 '12 at 16:22
@thg435 And thanks for the proper term. I couldn't for the life of me recall it. – Shawn Chin May 30 '12 at 16:23
aha, didn't notice that. BTW, if all lines are of similar length, you can estimate `total_entries = filesize / length_of_first_line` – georg May 30 '12 at 16:29
Using `for line in CSV result` will load the file into memory anyway, which should be avoided. – Andrew Buss May 30 '12 at 16:32

score 7 · Answer 3 · edited May 23 '17 at 12:02

You can use a variation of the probabilistic method for choosing a random line in a file.

Instead of just keeping a single number that gets chosen, you can keep a buffer of size C. For each line number, n, in the file with N lines, you want to choose that line with probability C/n (rather than the original 1/n. If the number is selected, you then choose a random location from the C-length buffer to evict.

Here's how it works:

import random

C = 2
fpath = 'somelines.txt'
buffer = []

f = open(fpath, 'r')
for line_num, line in enumerate(f):
    n = line_num + 1.0
    r = random.random()
    if n <= C:
        buffer.append(line.strip())
    elif r < C/n:
        loc = random.randint(0, C-1)
        buffer[loc] = line.strip()

This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file. Each line will have probability C/N of being selected.

To verify that the above works, I created a file with 5 lines containing a,b,c,d,e. I ran the code 10,000 times with C=2. This should produce about an even distribution of the 5 choose 2 (so 10) possible choices. The results:

a,b: 1046
b,c: 1018
b,e: 1014
a,c: 1003
c,d: 1002
d,e: 1000
c,e: 993
a,e: 992
a,d: 985
b,d: 947

Hi, thanks. But that way you have to read the all file before and that takes a lot for this huge file. — jbssm, May 30 '12 at 18:17
@jbssm You have to read through the entire file at least once to get an unbiased result. — jterrace, May 30 '12 at 18:30

score 5 · Answer 4 · answered Mar 03 '16 at 23:36

5

If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can create a list of line indeces and use seek to quickly grab the lines (based off of Maria Zverina's answer).

# Overhead:
# Read the line locations into memory once.  (If the lines are long,
# this should take substantially less memory than the file itself.)
fname = 'big_file'
s = [0]
linelocs = [s.append(s[0]+len(n)) or s.pop(0) for n in open(fname)]
f = open(fname) # Reopen the file.

# Each subsequent iteration uses only the code below:
# Grab a 1,000,000 line sample
# I sorted these because I assume the seeks are faster that way.
chosen = sorted(random.sample(linelocs, 1000000))
sampleLines = []
for offset in chosen:
  f.seek(offset)
  sampleLines.append(f.readline())
# Now we can randomize if need be.
random.shuffle(sampleLines)

answered Mar 03 '16 at 23:36

Marctar

59
1
4

Just noticed that this is the code for what @parselmouth already described. – Marctar Mar 03 '16 at 23:49
2

Based on your answer, I built a little package for getting reading arbitrary lines in a file. Check it out [here](https://github.com/jegesh/python-random-access-file-reader) – ygesher Feb 15 '17 at 16:55
Cool. I actually needed this again, so I'll use you package. Thanks! – Marctar Mar 15 '17 at 22:54
1

since my last comment I made a pip package out of it: [random-access-file-reader](https://pypi.python.org/pypi?name=random-access-file-reader&version=0.3.0&:action=display) – ygesher Mar 19 '17 at 06:53

score 2 · Answer 5 · answered May 30 '12 at 16:05

2

If the lines are truly .csv format and NOT fixed field, then no, there's not. You can crawl through the file once, indexing the byte offsets for each line, then when later needed only use the index set, but there's no way to a priori predict the exact location of the line-terminating \n character for arbitrary csv files.

answered May 30 '12 at 16:05

parselmouth

1,598
8
8

I was afraid that this was the case, unless we know more about the possible values of each line. – Andrew Buss May 30 '12 at 16:09
@AndrewBuss There is ,check out my answer https://stackoverflow.com/questions/10819911/read-random-lines-from-huge-csv-file-in-python/67206594#67206594 – iouvxz Apr 22 '21 at 04:07

Thomas K · Answer 6 · 2012-09-22T11:23:58.443

2

Another solution is possible if you know the total number of lines - generate 1 million random numbers (random.sample(xrange(n), 1000000)) up to the total number of lines as a set, then use:

for i, line in enumerate(csvfile):
    if i in lines_to_grab:
        yield line

This will get you exactly 1 million lines in an unbiased way, but you need to have the number of lines beforehand.

edited Sep 22 '12 at 11:23

answered May 30 '12 at 16:41

Thomas K

39,200
7
84
86

But that way I still have to read the all file up to the line I actually want to read. – jbssm May 30 '12 at 16:43
@jbssm: You have to run through the entire file, yes, but you don't have to load it all into memory. – Thomas K May 30 '12 at 16:46
Will work, but with the additional memory overheads of a million numbers. And you'll still need to make a pass through the data file. – Shawn Chin May 30 '12 at 16:46
1

Yep. That's about 30MB as a set on my computer, so it's probably not a deal breaker. – Thomas K May 30 '12 at 16:51
Use `random.sample` to create `lines_to_grab`. – Mark Ransom May 30 '12 at 18:30
@ShawnChin: there is no "additional memory overheads of a million numbers". You never need to store it in memory. It is I/O the file twice, which should not be a problem in many cases. – Causality Sep 21 '12 at 19:12
@ThomasK, the code should be random.sample(xrange(N), K), to sample K numbers (no replicates) in range 0-(N-1). – Causality Sep 21 '12 at 19:19
@causality you'd need to store `lines_to_grab` in memory. How else would you do `if i in lines_to_grab`? – Shawn Chin Sep 21 '12 at 19:36
@Causality: Well spotted, I've fixed the brackets. – Thomas K Sep 22 '12 at 11:24

dawg · Answer 7 · 2012-05-30T22:24:27.530

If you can place this data in a sqlite3 database, selecting some number of random rows is trivial. You will not need to pre-read or pad lines in the file. Since sqlite data files are binary, you data file will be 1/3 to 1/2 smaller than CSV text.

You can use a script like THIS to import the CSV file or, better still, just write your data to a database table in the first place. SQLITE3 is part of the Python distribution.

Then use these statements to get 1,000,000 random rows:

mydb='csv.db'
con=sqlite3.connect(mydb)

with con:
    cur=con.cursor()
    cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")

    for row in cur.fetchall():
        # now you have random rows...

Than you Drew, it seems the most advanced solution but unfortunately I won't be the only one using these numbers to do science and I'm really certain most of the other people will have no idea what is an sql database, much less how to use it. — jbssm, May 30 '12 at 23:57

score 0 · Answer 8 · answered May 30 '12 at 16:45

0

You can rewrite the file with fixed-length records, and then perform random access on the intermediate file later:

ifile = file.open("inputfile.csv")
ofile = file.open("intermediatefile.csv",'w')
for line in ifile:
    ofile.write(line.rstrip('\n').ljust(15)+'\n')

Then, you can do:

import random
ifile = file.open("intermediatefile.csv")
lines = []
samples = random.sample(range(nlines))
for sample in samples:
    ifile.seek(sample)
    lines.append(ifile.readline())

Requires more disk space, and the first program may take some time to run, but it allows unlimited later random access to records with the second.

answered May 30 '12 at 16:45

Andrew Buss

1,532
2
14
23

I understand your point and it looks valid, but the conversion routine you gave me doesn't work. Looking at my file, all values have 7 chars (counting the .), except for the 0 values. So I actually only need to transform the 0 into 0.00000 and it should work. – jbssm May 30 '12 at 17:41
Oh, I see; that makes sense. Do you have any control over the program that outputs that data? It's probably possible to change the output format to something more regular. – Andrew Buss May 30 '12 at 18:02
I do, but it's written in C++ and after much searching I realized there is no way to do that with printf for every value. So I will have to do the conversion in python probably. – jbssm May 30 '12 at 18:15
It should be doable try somethine like: "%09.4f" % (1/3.0) .... That will give your four fixed decimal points ... and will not wrap for any number under 10000.0 – Maria Zverina May 30 '12 at 20:12
@jbssm also consider exponential format: "%.4e" % (1/3.0) this will give you five significant digits and fixed width of 10 chars. Unless you need to format negative numbers as well .... in which case use the + modifier for printf. And reading exponential formated floats is just as easy as basic floats in python. – Maria Zverina May 30 '12 at 20:18

score 0 · Answer 9 · answered May 30 '12 at 17:38

0

# pass 1, count the number of rows in the file
rowcount = sum(1 for line in file)
# pass 2, select random lines
file.seek(0)
remaining = 1000000
for row in csv.reader(file):
    if random.randrange(rowcount) < remaining:
        print row
        remaining -= 1
    rowcount -= 1

answered May 30 '12 at 17:38

Mark Ransom

299,747
42
398
622

You still have to iterate over each line this way to get to a line your random variable tells you to read. After thinking a bit more about it, the only way possible without iterating trough the all file, is by using Maria Zverina method and making sure all line have and equal number of chars. – jbssm May 30 '12 at 18:22
@jbssm, if each line has an equal number of characters this becomes trivial - just multiply the random line number times the line size, and seek to that point in the file. – Mark Ransom May 30 '12 at 18:24
Yes, I will write some file conversion program and then use that way then. Thanks. – jbssm May 30 '12 at 18:25

score 0 · Answer 10 · edited Sep 14 '15 at 12:46

In this method, we generate a random number set whose number of elements is equal to the number of lines to be read, with its range being the number of rows present in the data. It is then sorted from smallest to largest and stored.

Then the csv file is read line by line and a line_counter is in place to denote the row number. This line_counter is then checked with the first element of the sorted random number list and if they are same then that specific line is written into the new csv file and the first element is removed from the list and the previously second element takes the place of the first and the cycle continues.

import random
k=random.sample(xrange(No_of_rows_in_data),No_of_lines_to_be_read)
Num=sorted(k)    
line_counter = 0

with open(input_file,'rb') as file_handle:
    reader = csv.reader(file_handle)
    with open(output_file,'wb') as outfile:
            a=csv.writer(outfile)
            for line in reader:
                line_counter += 1
                if line_counter == Num[0]:
                a.writerow(line)
                Num.remove(Num[0])
                if len(Num)==0:
                break

Python depends on indentation; your code in the innermost loop (`for line in reader`) is a Python indentation error. It isn't clear whether `Num.remove(Num[0])` should be indented or not — the `a.writerow(line)` must be indented, as must the `break`. — Jonathan Leffler, May 07 '18 at 04:24

score 0 · Answer 11 · answered May 06 '20 at 09:49

If you can use pandas and numpy, I have posted a solution in another question that is pandas specific but very efficient:

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 1000000
batch_size = 5000

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

For more details, please see the other answer.

iouvxz · Answer 12 · 2021-04-22T04:10:20.613

def random_line(path, hint=1):
    with open(path, mode='rb') as file:
        import random
        while file.seek(random.randrange(file.seek(-2, 2))) and not file.readline(hint).endswith(b'\n'):
            pass
        return file.readline().decode().strip()

This is what i wrote for reading a random line from a very large file .

The time complexity is O(k) ,k is the average length of the lines in the text file.

The hint argument is the minimal length of the lines in the text file ,if you know it beforehand ,use it to speed up the function .

score 0 · Answer 13 · answered Jul 16 '21 at 20:30

Always works for me

import csv
import random
randomINT = random.sample(range(1, 72655), 40000)
with open(file.csv,"rU") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"', dialect=csv.excel_tab)
    data_read = [row for idx, row in enumerate(reader) if idx in randomINT]
    for idx, line in enumerate(data_read):
        pass

Read random lines from huge CSV file

13 Answers13

Linked

Related