
Say we have a large CSV file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contains data of interest.

Say we define such a condition as having one specific column contain a value from a pre-defined list (e.g. 10K values of interest).

Do odo or pandas provide methods for this type of selective loading of rows into a dataframe?

Amelio Vazquez-Reina
  • If data are only in one column, you can use `usecols` as a filter in [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in `pandas`. [docs](http://pandas.pydata.org/pandas-docs/stable/io.html#filtering-columns-usecols) – jezrael Feb 04 '16 at 15:53
  • There is also the `skiprows` argument; unfortunately, you have to know the line numbers of the rows you want to skip beforehand (see the sketch after these comments). – user3017048 Feb 04 '16 at 16:08
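
A minimal sketch of both suggestions (`large_file.csv` and `column_to_check` are placeholder names, not from the question):

import pandas as pd

# usecols: read only the named column(s); note this filters columns, not rows
df_cols = pd.read_csv('large_file.csv', usecols=['column_to_check'])

# skiprows: skip explicit 0-indexed file lines, which must be known in
# advance (line 0 is the header here, so this drops the first 4 data rows)
df_rows = pd.read_csv('large_file.csv', skiprows=[1, 2, 3, 4])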

1 Answer


I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.

The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.

import pandas as pd

# the pre-defined list of values of interest (placeholder values)
pre_defined_list = ['foo', 'bar', 'baz']

good_data = []
# read the file one million rows at a time, so the full 200 GB never
# has to fit in memory at once
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # keep only the rows whose column of interest matches the list
    chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
    good_data.append(chunk)

# combine the surviving rows into a single dataframe
df = pd.concat(good_data)

Add or alter parameters for `pd.read_csv` and `pd.concat` as necessary for your specific situation.
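
For instance, if only a couple of columns matter, a `usecols` filter (as jezrael suggested in the comments) shrinks each chunk before the `.isin` check even runs. A minimal sketch, assuming a hypothetical second column named `value` that you want to keep alongside `column_to_check`:

import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']

# read only the two columns we care about; 'value' is a made-up column
# name standing in for whatever payload columns you actually need
df = pd.concat(
    chunk[chunk['column_to_check'].isin(pre_defined_list)]
    for chunk in pd.read_csv('large_file.csv',
                             usecols=['column_to_check', 'value'],
                             chunksize=10**6)
)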

If performance is an issue, you may be able to speed things up by using an alternative to `.isin`, as described in this answer.
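
That answer isn't reproduced here, but one common substitute (an assumption on my part, not necessarily what the linked answer describes) is numpy's `np.isin`, which builds the membership mask on the raw array; whether it beats `Series.isin` depends on your data, so benchmark both:

import numpy as np
import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']
lookup = np.array(pre_defined_list)

good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # compute the boolean mask on the underlying numpy array, then use
    # it to index the chunk as before
    mask = np.isin(chunk['column_to_check'].to_numpy(), lookup)
    good_data.append(chunk[mask])

df = pd.concat(good_data)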

root
  • Thanks @root. I thought this option would come in handy for this, but never had a chance to look into it. Do you know how I can determine how many chunks a given `pd.read_csv(my_file, chunksize=X)` returns? The following doesn't seem to do it: `len(pd.read_csv(my_file, chunksize=X))` – Amelio Vazquez-Reina Feb 04 '16 at 20:44
  • `chunksize` denotes the number of rows in each chunk of the csv file that gets read in each iteration. So the total number of chunks is the ceiling of `total_csv_rows/chunksize` – ceiling because if `chunksize` doesn't divide `total_csv_rows` evenly, the last chunk will just be the remaining rows. Ex: if `total_csv_rows = 102` and `chunksize = 10`, there will be 11 chunks, with the last chunk only having 2 rows. – root Feb 04 '16 at 20:58
  • Right, thanks root. I meant how to obtain the number of chunks that the call will return for a given input chunk size. – Amelio Vazquez-Reina Feb 04 '16 at 21:18
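
For completeness, a sketch of two ways to get that count (not from the original thread; the file path and row count are placeholders):

import math
import pandas as pd

my_file = 'large_file.csv'   # hypothetical path
chunksize = 10**6            # same value you pass to read_csv

# 1) if the total row count is already known, the chunk count is a
#    ceiling division, per root's comment above
total_csv_rows = 102         # assumed known, e.g. from a prior count
n_chunks = math.ceil(total_csv_rows / chunksize)

# 2) otherwise, iterate the reader and count -- note this reads the whole
#    file and exhausts the iterator, so re-create the reader before
#    actually processing the chunks
n_chunks = sum(1 for _ in pd.read_csv(my_file, chunksize=chunksize))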