I have a very large CSV that takes ~30 seconds to read when using the normal pd.read_csv
command. Is there a way to speed this process up? I'm thinking maybe something that only reads rows that have some matching value in one of the columns.
i.e. only read in rows where the value in column 'A' is the value '5'.

- You could use the `csv` module to filter the rows and write to a temporary file, then use `pandas` (a sketch of this idea follows below the comments). – tdelaney Aug 01 '22 at 21:32
- You may want to check this SO post: https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function – Sheldon Aug 01 '22 at 21:33
- Does this answer your question? [How can I filter lines on load in Pandas read\_csv function?](https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function) – Sheldon Aug 01 '22 at 21:34
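Following tdelaney's suggestion, here is a minimal sketch of that approach. It uses an in-memory buffer rather than a temporary file; the file name 'csv_file.csv' and the string comparison to '5' are assumptions for illustration only.

import csv
import io

import pandas as pd

# Pre-filter the rows with the csv module, then hand the result to pandas.
# 'csv_file.csv' is an assumed file name; csv yields strings, hence the
# comparison against '5' rather than 5.
buffer = io.StringIO()
with open('csv_file.csv', newline='') as src:
    reader = csv.reader(src)
    writer = csv.writer(buffer)
    header = next(reader)
    writer.writerow(header)
    col = header.index('A')        # position of column 'A'
    for row in reader:
        if row[col] == '5':
            writer.writerow(row)
buffer.seek(0)
df = pd.read_csv(buffer)           # pandas now only parses the matching rows
print(len(df))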
3 Answers
The Dask module can do a lazy read of a large CSV file. Nothing is loaded until you call the `.compute()` method; at that point the file is read in chunks and whatever filtering logic you specified is applied.
import dask.dataframe as dd

df = dd.read_csv(csv_file)   # csv_file is the path to the large CSV
df = df[df['A'] == 5]        # filter is recorded lazily, not applied yet
df = df.compute()            # the read and the filter happen here
print(len(df))               # number of matching records
print(df.head())             # first 5 rows to show a sample of the data

If you're looking for rows with a particular value in a CSV file, you still have to scan the entire file and then filter the result. If instead you just want the first five rows, you may be looking for the `nrows` parameter:

nrows : int, optional
    Number of rows of file to read. Useful for reading pieces of large files.

Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
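A minimal sketch of the `nrows` approach, assuming the file is named 'csv_file.csv':

import pandas as pd

# Read only the first five rows of the (assumed) large file.
df = pd.read_csv('csv_file.csv', nrows=5)
print(df)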
Try chunking it, dude! Read the file in pieces with `chunksize` and keep only the matching rows from each chunk:
import pandas as pd

chunks = []
# Read the CSV 10,000 rows at a time and keep only the rows where A == 5.
for chunk in pd.read_csv('csv_file.csv', sep=',', chunksize=10000):
    chunks.append(chunk[chunk.A == 5])
big_data = pd.concat(chunks, axis=0)
del chunks
