I have a very large CSV file (8 GB+). I want to use the data in that CSV file for training, testing, and cross-validation sets. How do I read that CSV file randomly into multiple dataframes? I am using Python 3.
- Do you want to divide it ~50/50, or read a sample population of N from it to the other set and so on? What have you tried so far? – Ilja Everilä May 03 '16 at 08:57
- @IljaEverilä I want to divide it into two sets (training and testing sets for a machine learning algorithm). There is no fixed ratio, but generally 80:20 is preferred. – Amb May 03 '16 at 09:14
- Have you any code of which a specific part is problematic? In its current form your question is way too broad. – Ilja Everilä May 03 '16 at 09:23
- Why does it need to be read into two dataframes? sklearn implements a variety of methods for splitting datasets into training and test sets. In fact, just two sets is widely considered to be a poor approach in most circumstances: you get information leaking from the test set. Cross-validation is a much better approach. http://scikit-learn.org/stable/modules/cross_validation.html – Chris May 03 '16 at 09:33
- @Chris, are you sure sklearn can deal with an 8GB file? – xirururu May 03 '16 at 09:44
- @Chris Thanks for mentioning cross-validation. I actually would divide the data into a cross-validation set too, but once my asked problem is solved, having a cross-validation set is a cakewalk. – Amb May 03 '16 at 09:44
- @xirururu sklearn cv-iterators produce indices to slice a numpy array on; in this respect, sklearn is not the issue. If memory is so tight that you cannot handle loading and then splitting, you will almost certainly not have sufficient memory to do any interesting calculations on the data. The question is not whether sklearn can handle the data, but whether your PC can. – Chris May 03 '16 at 09:50
- @Amb why can you not load the full dataset into a dataframe or numpy array and then split? In my experience, if a dataset barely fits into memory when just loading it, you will overflow the memory trying to train a model on that data. – Chris May 03 '16 at 09:52
2 Answers
The critical point is randomly. CSV separates records with line breaks. If you cannot know the length of each record before you have read it, random access has to be done with a trick rather than by picking truly random records.
import os
import random

FILENAME = "foo.txt"
MAX_ROW = 200                           # maximum possible length of one row, in bytes

filesize = os.stat(FILENAME).st_size
block_count = filesize // MAX_ROW       # how many MAX_ROW-sized blocks the file roughly holds

# Visit the blocks in random order
randomkeys = list(range(block_count))
random.shuffle(randomkeys)

with open(FILENAME, "rb") as fo:        # binary mode so arbitrary byte offsets can be seeked safely
    for seeknum in randomkeys:
        fo.seek(seeknum * MAX_ROW)      # jump to a random byte offset
        fo.readline()                   # discard the (probably partial) line we landed in
        line = fo.readline().decode()   # the next complete line
        # handle line here

Shintiger
- I think `block_count = filesize/MAX_ROW` is a problem, because the rows can have different sizes. Do you mean the `block_size` is actually `block_count`? – xirururu May 03 '16 at 09:43
- You are right, thank you for the correction. About the different sizes: the seeknum is only a random entry point; the actual line used is the one after the next EOL. – Shintiger May 03 '16 at 09:45
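To turn lines gathered with the seek trick into the DataFrames the question asks for, one option is to keep the sampled raw lines, split them 80/20, and feed each part to pandas. A minimal sketch; `header` and `sampled_lines` are placeholders standing in for the first line of the real file and the lines collected in the loop above:

import io
import random
import pandas as pd

header = "col_a,col_b,col_c\n"          # placeholder: in practice, the file's first line
sampled_lines = ["1,2,3\n", "4,5,6\n", "7,8,9\n", "10,11,12\n", "13,14,15\n"]  # placeholder

random.shuffle(sampled_lines)
split = int(0.8 * len(sampled_lines))   # 80/20 train/test split, as discussed in the comments

train_df = pd.read_csv(io.StringIO(header + "".join(sampled_lines[:split])))
test_df = pd.read_csv(io.StringIO(header + "".join(sampled_lines[split:])))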
First count how many lines your CSV file has (there are many ways to do this; Stack Overflow already has many related questions). Then create a list of line indices with
indices = range(num_lines)
and randomly select a set of line indices. For example, you can use
your_selected_lineindices = random.sample(indices, 10000)
Then use the following code example:

with open("file") as fp:
    for i, line in enumerate(fp):
        # convert your_selected_lineindices to a set first for fast membership tests
        if i in your_selected_lineindices:
            do_something_with(line)

This code won't overflow your memory. The original code is from here: https://stackoverflow.com/a/2081880/3279996
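If you would rather let pandas do the line filtering, `read_csv` in reasonably recent pandas versions also accepts a callable for `skiprows`, so only the sampled rows are ever materialized. A sketch along the same lines; the file name, the 80/10/10 ratios, and `num_lines` are assumptions for illustration:

import random
import pandas as pd

filename = "file.csv"              # placeholder path
num_lines = 1000000                # placeholder: number of data rows (header excluded)

# Shuffle the data-row indices (row 0 is the header) and carve out 80/10/10.
rows = list(range(1, num_lines + 1))
random.shuffle(rows)
n_train = int(0.8 * num_lines)
n_test = int(0.1 * num_lines)
train_rows = set(rows[:n_train])
test_rows = set(rows[n_train:n_train + n_test])
cv_rows = set(rows[n_train + n_test:])

def load_subset(keep):
    # Keep the header (row 0) and every row whose index is in `keep`; skip the rest.
    return pd.read_csv(filename, skiprows=lambda i: i != 0 and i not in keep)

train_df = load_subset(train_rows)
test_df = load_subset(test_rows)
cv_df = load_subset(cv_rows)

Each call streams the whole file once but keeps only the selected rows, so the three DataFrames together never hold more than the full dataset.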