I don't know of anything in odo
or pandas
that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas
script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd
pre_defined_list = ['foo', 'bar', 'baz']
good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
good_data.append(chunk)
df = pd.concat(good_data)
Add/alter parameters for pd.read_csv
and pd.concat
as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin
, as described in this answer.