I have a very large CSV file (8 GB+). I want to use the data in that CSV file for training, testing, and cross-validation sets. How do I read that CSV file randomly into multiple dataframes? I am using Python 3.
- Do you want to divide it ~50/50, or read a sample population of N from it to the other set and so on? What have you tried so far? – Ilja Everilä May 03 '16 at 08:57
- @IljaEverilä I want to divide it into two sets (training and testing sets for a machine learning algorithm). There is no fixed ratio, but generally 80:20 is preferred. – Amb May 03 '16 at 09:14
- Have you any code of which a specific part is problematic? In its current form your question is way too broad. – Ilja Everilä May 03 '16 at 09:23
- Why does it need to be read into two dataframes? sklearn implements a variety of methods for splitting datasets into training and test sets. In fact, just two sets is widely considered to be a poor approach in most circumstances: you get information leaking from the test set. Cross-validation is a much better approach. http://scikit-learn.org/stable/modules/cross_validation.html – Chris May 03 '16 at 09:33
- @Chris, are you sure sklearn can deal with an 8GB file? – xirururu May 03 '16 at 09:44
- @Chris Thanks for mentioning cross-validation. I actually would divide the data into a cross-validation set too, but once my asked problem is solved, having a cross-validation set is a cakewalk. – Amb May 03 '16 at 09:44
- @xirururu sklearn cv-iterators produce indices to slice a numpy array on; in this respect, sklearn is not the issue. If memory is so tight that you cannot handle loading and then splitting, you will almost certainly not have sufficient memory to do any interesting calculations on the data. The question is not whether sklearn can handle the data, but whether your PC can. – Chris May 03 '16 at 09:50
- @Amb why can you not load the full dataset into a dataframe or numpy array and then split? In my experience, if a dataset barely fits into memory when just loading it, you will overflow the memory trying to train a model on that data. – Chris May 03 '16 at 09:52
2 Answers
The critical point is randomly. CSV separates records with line breaks. If you cannot know the length of each record before you have read it, random access has to be done with a trick rather than by picking truly random records.
import os
import random

FILENAME = "foo.txt"
MAX_ROW = 200                           # maximum possible length of one row, in bytes

filesize = os.stat(FILENAME).st_size
block_count = filesize // MAX_ROW       # how many MAX_ROW-sized blocks the file roughly holds

# Visit the blocks in random order
randomkeys = list(range(block_count))
random.shuffle(randomkeys)

with open(FILENAME, "rb") as fo:        # binary mode so arbitrary byte offsets can be seeked safely
    for seeknum in randomkeys:
        fo.seek(seeknum * MAX_ROW)      # jump to a random byte offset
        fo.readline()                   # discard the (probably partial) line we landed in
        line = fo.readline().decode()   # the next complete line
        # handle line here

Shintiger
- I think `block_count = filesize/MAX_ROW` is a problem, because the rows can have different sizes. Do you mean the `block_size` is actually `block_count`? – xirururu May 03 '16 at 09:43
- You are right, thank you for the correction. About the different sizes: the seeknum is only a random entry point; the actual line used is the one after the next EOL. – Shintiger May 03 '16 at 09:45
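To turn lines gathered with the seek trick into the DataFrames the question asks for, one option is to keep the sampled raw lines, split them 80/20, and feed each part to pandas. A minimal sketch; `header` and `sampled_lines` are placeholders standing in for the first line of the real file and the lines collected in the loop above:

import io
import random
import pandas as pd

header = "col_a,col_b,col_c\n"          # placeholder: in practice, the file's first line
sampled_lines = ["1,2,3\n", "4,5,6\n", "7,8,9\n", "10,11,12\n", "13,14,15\n"]  # placeholder

random.shuffle(sampled_lines)
split = int(0.8 * len(sampled_lines))   # 80/20 train/test split, as discussed in the comments

train_df = pd.read_csv(io.StringIO(header + "".join(sampled_lines[:split])))
test_df = pd.read_csv(io.StringIO(header + "".join(sampled_lines[split:])))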
First count how many lines your CSV file has (there are many ways to do this; Stack Overflow already has many related questions). Then create a list of line indices with
indices = range(num_lines)
and randomly select a set of line indices. For example, you can use
your_selected_lineindices = random.sample(indices, 10000)
Then use the following code example:

with open("file") as fp:
    for i, line in enumerate(fp):
        # convert your_selected_lineindices to a set first for fast membership tests
        if i in your_selected_lineindices:
            do_something_with(line)

This code won't overflow your memory. The original code is from here: https://stackoverflow.com/a/2081880/3279996
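If you would rather let pandas do the line filtering, `read_csv` in reasonably recent pandas versions also accepts a callable for `skiprows`, so only the sampled rows are ever materialized. A sketch along the same lines; the file name, the 80/10/10 ratios, and `num_lines` are assumptions for illustration:

import random
import pandas as pd

filename = "file.csv"              # placeholder path
num_lines = 1000000                # placeholder: number of data rows (header excluded)

# Shuffle the data-row indices (row 0 is the header) and carve out 80/10/10.
rows = list(range(1, num_lines + 1))
random.shuffle(rows)
n_train = int(0.8 * num_lines)
n_test = int(0.1 * num_lines)
train_rows = set(rows[:n_train])
test_rows = set(rows[n_train:n_train + n_test])
cv_rows = set(rows[n_train + n_test:])

def load_subset(keep):
    # Keep the header (row 0) and every row whose index is in `keep`; skip the rest.
    return pd.read_csv(filename, skiprows=lambda i: i != 0 and i not in keep)

train_df = load_subset(train_rows)
test_df = load_subset(test_rows)
cv_df = load_subset(cv_rows)

Each call streams the whole file once but keeps only the selected rows, so the three DataFrames together never hold more than the full dataset.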