I have a very large CSV that takes ~30 seconds to read when using the normal pd.read_csv
command. Is there a way to speed this process up? I'm thinking maybe something that only reads rows that have some matching value in one of the columns.
i.e. only read in rows where the value in column 'A' is the value '5'.

- You could use the `csv` module to filter the rows and write to a temporary file, then use `pandas` (a sketch of this idea follows below the comments). – tdelaney Aug 01 '22 at 21:32
- You may want to check this SO post: https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function – Sheldon Aug 01 '22 at 21:33
- Does this answer your question? [How can I filter lines on load in Pandas read\_csv function?](https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function) – Sheldon Aug 01 '22 at 21:34
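Following tdelaney's suggestion, here is a minimal sketch of that approach. It uses an in-memory buffer rather than a temporary file; the file name 'csv_file.csv' and the string comparison to '5' are assumptions for illustration only.

import csv
import io

import pandas as pd

# Pre-filter the rows with the csv module, then hand the result to pandas.
# 'csv_file.csv' is an assumed file name; csv yields strings, hence the
# comparison against '5' rather than 5.
buffer = io.StringIO()
with open('csv_file.csv', newline='') as src:
    reader = csv.reader(src)
    writer = csv.writer(buffer)
    header = next(reader)
    writer.writerow(header)
    col = header.index('A')        # position of column 'A'
    for row in reader:
        if row[col] == '5':
            writer.writerow(row)
buffer.seek(0)
df = pd.read_csv(buffer)           # pandas now only parses the matching rows
print(len(df))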
3 Answers
The Dask module can do a lazy read of a large CSV file. Nothing is loaded until you call the `.compute()` method; at that point the file is read in chunks and whatever filtering logic you specified is applied.
import dask.dataframe as dd

df = dd.read_csv(csv_file)   # csv_file is the path to the large CSV
df = df[df['A'] == 5]        # filter is recorded lazily, not applied yet
df = df.compute()            # the read and the filter happen here
print(len(df))               # number of matching records
print(df.head())             # first 5 rows to show a sample of the data

If you're looking for rows with a particular value in a CSV file, you still have to scan the entire file and then filter the result. If instead you just want the first five rows, you may be looking for the `nrows` parameter:

nrows : int, optional
    Number of rows of file to read. Useful for reading pieces of large files.

Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
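A minimal sketch of the `nrows` approach, assuming the file is named 'csv_file.csv':

import pandas as pd

# Read only the first five rows of the (assumed) large file.
df = pd.read_csv('csv_file.csv', nrows=5)
print(df)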
Try chunking it, dude! Read the file in pieces with `chunksize` and keep only the matching rows from each chunk:
import pandas as pd

chunks = []
# Read the CSV 10,000 rows at a time and keep only the rows where A == 5.
for chunk in pd.read_csv('csv_file.csv', sep=',', chunksize=10000):
    chunks.append(chunk[chunk.A == 5])
big_data = pd.concat(chunks, axis=0)
del chunks
