4

I'm handling some CSV files with sizes in the range of 1 GB to 2 GB. It takes 20-30 minutes just to load the files into a pandas dataframe, and 20-30 minutes more for each operation I perform, e.g. filtering the dataframe by column names, printing dataframe.head(), etc. Sometimes it also lags my computer when I try to use another application while I wait. I'm on a 2019 MacBook Pro, but I imagine it'll be the same for other devices.

I have tried using modin, but the data manipulations are still very slow.

Is there any way for me to work more efficiently?

Thanks in advance for the responses.

Mammoth
  • 331
  • 3
  • 9
  • What are you doing with the files? Do you need to read in the entire file for each operation? – norie Sep 12 '21 at 15:55
  • Sounds like you have too little free RAM. – tripleee Sep 12 '21 at 16:00
  • 1
    Use the activity monitor to see if you are swapping - you may need more memory. If you don't need all of the columns, `usecols` can limit what is kept. If you use the same csvs multiple times, save them into something like parquet. If some data keeps you from using a small dtype like int64 (say you end up with object because of blank lines), scrub the csv line by line with the `csv` module first. Apache arrow can help https://arrow.apache.org/docs/python/ – tdelaney Sep 12 '21 at 16:01
  • The symptoms are a hint that your processing exhausts the available memory. Solutions: either buy more memory if you can add some to your machine (or buy a computer with more memory), or, if it makes sense, try not to load the whole file into memory. For simple operations, processing the file one row at a time with the csv module is simple and requires only very little memory, whatever the file size... – Serge Ballesta Sep 12 '21 at 16:06
  • @norie I'm selecting columns from a large CSV file to then convert it to a numpy array to use with tensorflow. – Mammoth Sep 12 '21 at 16:44
  • I agree with you guys saying it's because I have too little memory. I'll have to learn a way to store dataframes I use repeatedly somewhere other than my RAM. Thanks for this suggestion. – Mammoth Sep 12 '21 at 16:47
  • If you have a choice over how the data is stored I'd definitely check out [@SMeznaric's answer](https://stackoverflow.com/a/69153412/3888719). I'm particularly a fan of the [parquet format](https://pandas.pydata.org/docs/user_guide/io.html#io-parquet) for columnar data – Michael Delgado Sep 12 '21 at 17:08
  • Linking [this answer](https://stackoverflow.com/questions/25508510/fastest-way-to-parse-large-csv-files-in-pandas) because it's obviously asking the same thing, but it sounds like you've already tried out the accepted answer – Michael Delgado Sep 12 '21 at 17:52
  • Does this answer your question? [Fastest way to parse large CSV files in Pandas](https://stackoverflow.com/questions/25508510/fastest-way-to-parse-large-csv-files-in-pandas) – Michael Delgado Sep 12 '21 at 17:55

4 Answers

12

The pandas docs on Scaling to Large Datasets have some great tips which I'll summarize here:

  1. Load less data. Read in a subset of the columns or rows using the usecols or nrows parameters to pd.read_csv. For example, if your data has many columns but you only need the col1 and col2 columns, use pd.read_csv(filepath, usecols=['col1', 'col2']). This can be especially important if you're loading datasets with lots of extra commas (e.g. the rows look like index,col1,col2,,,,,,,,,,,). In this case, read a small sample with nrows first to check the structure, then use usecols to make sure the result only includes the columns you need.
  2. Use efficient datatypes. By default, pandas stores all integer data as signed 64-bit integers, floats as 64-bit floats, and strings as objects or string types (depending on the version). You can convert these to smaller data types with tools such as Series.astype or pd.to_numeric with the downcast option.
  3. Use chunking. Parsing huge blocks of data can be slow, especially if your plan is to operate row-wise and then write the result out, or to cut the data down to a smaller final form. Use the chunksize parameter to pd.read_csv to iterate over the file in manageable pieces. Alternatively, use the low_memory flag to get pandas to use the chunked iterator on the backend but still return a single dataframe.
  4. Use other libraries. There are a couple of great libraries listed there, but I'd especially call out dask.dataframe, which is built for exactly this use case: it enables chunked, multi-core processing of CSV files, mirrors the pandas API, and has easy ways of converting the data back into a normal pandas dataframe (if desired) after processing. (A short sketch of points 1-4 follows this list.)
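
As a rough sketch of these points (the file name, column names, and filter threshold are placeholders for this example; adapt them to your data):

import pandas as pd

# 1. Load less data: only read the columns you actually need.
cols = ["col1", "col2"]
df = pd.read_csv("data.csv", usecols=cols)

# 2. Use efficient datatypes: downcast 64-bit numbers where the values allow it.
df["col1"] = pd.to_numeric(df["col1"], downcast="integer")
df["col2"] = pd.to_numeric(df["col2"], downcast="float")

# 3. Chunking: process the file piece by piece instead of holding it all in RAM.
filtered_chunks = []
for chunk in pd.read_csv("data.csv", usecols=cols, chunksize=1_000_000):
    filtered_chunks.append(chunk[chunk["col1"] > 0])
result = pd.concat(filtered_chunks, ignore_index=True)

# 4. Or let dask.dataframe do the chunked, multi-core work (pip install "dask[dataframe]").
import dask.dataframe as dd

ddf = dd.read_csv("data.csv", usecols=cols)
result = ddf[ddf["col1"] > 0].compute()  # .compute() returns a regular pandas dataframe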

Additionally, there are some csv-specific things I think you should consider:

  1. Specifying column data types. Especially if chunking, but even if you're not, specifying the column types can dramatically reduce read time and memory usage and highlight problem areas in your data (e.g. NaN indicators or flags that don't meet one of pandas's defaults). Use the dtype parameter with a single data type to apply to all columns, or a dict of column name, data type pairs to indicate the types to read in. Optionally, you can provide converters to format dates, times, or other numerical data if it's not in a format recognized by pandas.
  2. Specifying the parser engine - pandas can read csvs in pure python (slow) or C (much faster). The python engine has slightly more features (e.g. currently the C engine can't read files with complex multi-character delimiters and it can't skip footers). Try using the argument engine='c' to make sure the C engine is being used. If you need one of the unsupported features, I'd try fixing the file(s) manually first (e.g. stripping out a footer) and then parsing with the C engine, if possible.
  3. Make sure you're catching all NaNs and data flags in numeric columns. This can be a tough one, and specifying specific data types in your inputs can be helpful in catching bad cases. Use the na_values, keep_default_na, date_parser, and converters arguments to pd.read_csv. Currently, the default list of values interpreted as NaN is ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null']. For example, if your numeric columns have non-numeric values coded as notANumber, then this would be missed and would either cause an error (if you had dtypes specified) or would cause pandas to re-categorize the entire column as an object column (super bad for memory and speed!).
  4. Read the pd.read_csv docs over and over and over again. Many of the arguments to read_csv have important performance considerations. pd.read_csv is optimized to smooth over a large amount of variation in what can be considered a csv, and the more magic pandas has to be ready to perform (determine types, interpret nans, convert dates (maybe), skip headers/footers, infer indices/columns, handle bad lines, etc.), the slower the read will be. Give it as many hints/constraints as you can and you might see performance increase a lot! And if it's still not enough, many of these tweaks will also apply to the dask.dataframe API, so this scales up further nicely. A short sketch combining some of these hints follows below.
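
To make the csv-specific points concrete, here's a minimal sketch (the column names, dtypes, and extra NaN codes are invented for the example):

import pandas as pd

# Tell pandas as much as possible up front: which columns to keep, their dtypes,
# which strings mean "missing", and which parser engine to use.
dtype_map = {
    "col1": "float32",  # float so that NaN values stay representable
    "col2": "float32",
    "label": "category",
}

df = pd.read_csv(
    "data.csv",
    usecols=list(dtype_map),
    dtype=dtype_map,
    na_values=["notANumber", "-999"],  # extra NaN flags beyond the default list
    engine="c",                        # make sure the fast C parser is used
)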
Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
  • Thank you very much for the detailed reply. I'll try some of these out. – Mammoth Sep 12 '21 at 16:51
  • I've added some more pointers - sorry I accidentally hit publish midway! :) good luck – Michael Delgado Sep 12 '21 at 16:56
  • This is good information, check out this article to see how some of these factors affect performance, especially the newer parser engine options: https://medium.com/itnext/the-fastest-way-to-read-a-csv-file-in-pandas-2-0-532c1f978201 – Finn Andersen Apr 25 '23 at 13:37
3

This may or may not help you, but I found that storing data in HDF files has significantly improved IO speed. If you are ultimately the source of the CSV files, I think you should try storing them as HDF instead. Otherwise, what Michael has already said may be the way to go.
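
For reference, a minimal sketch of round-tripping a dataframe through HDF5 with pandas (requires the PyTables package, i.e. pip install tables; the file names and key are placeholders):

import pandas as pd

# Pay the CSV parsing cost once...
df = pd.read_csv("data.csv")
df.to_hdf("data.h5", key="df", mode="w")

# ...then reload from HDF5 on later runs, which is much faster than re-parsing the CSV.
df = pd.read_hdf("data.h5", "df")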

SMeznaric
  • 430
  • 3
  • 13
  • 2
    there are a ton of good binary file formats, and some work really well with columnar data, including the [apache parquet](https://parquet.apache.org/) format, which [pandas supports](https://pandas.pydata.org/docs/user_guide/io.html#io-parquet) through the [`DataFrame.to_parquet`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html) and [`pd.read_parquet`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) methods, if you install one of the `pyarrow` or `fastparquet` libraries. – Michael Delgado Sep 12 '21 at 16:44
  • 1
    (in case it was unclear - this was meant to add to your answer, not contradict it). @SMeznaric's is a really important point - whenever possible, convert to binary formats when performance or precision are important! CSV is just too messy and inefficient a storage medium. It's stable (due in part to a lack of a technical specification) and it's human readable though, so it's good job security for data scientists lol – Michael Delgado Sep 12 '21 at 17:06
  • @MichaelDelgado I'm glad I bumped into your comment; I've got a `pandas` dataframe of some 500 million rows and 9 columns; `h5` caching was fast but big (4.3G), whereas `parquet` caching was also fast yet the file was only 1.8G! – stucash Oct 01 '22 at 17:43
0

Consider using polars. It is typically orders of magnitude faster than pandas. Here are some benchmarks backing that claim.

If you really want full performance, consider using the lazy API. All the filters you describe may even be applied at the scan level. We can also parallelize over all files easily with pl.collect_all().
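
For illustration, a minimal sketch of the lazy API (the file and column names are placeholders; only the general pattern matters):

import polars as pl

# Nothing is read until .collect() is called, so the column selection and
# the filter can be pushed down into the scan itself.
lazy_df = (
    pl.scan_csv("data.csv")
    .select(["col1", "col2"])
    .filter(pl.col("col1") > 0)
)

df = lazy_df.collect()  # materializes only the selected, filtered data

# Multiple files can be processed in parallel:
# frames = pl.collect_all([pl.scan_csv(path) for path in ["a.csv", "b.csv"]])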

ritchie46
  • 10,405
  • 1
  • 24
  • 43
0

Based on your description, you could be fine with processing these csv files as streams instead of fully loading them into memory/swap just to filter them and call head().

There's a Table helper (docs) in the convtools library (github), which helps with streaming csv-like files and applying transforms, and of course you can pipe the resulting stream of rows to another tool of your choice (polars / pandas).

For example:

import pandas as pd

from convtools import conversion as c
from convtools.contrib.tables import Table


df = pd.DataFrame(
    Table.from_csv("input.csv", header=True)  # stream the csv, treating the first row as the header
    .take("a", "c")                           # keep only columns "a" and "c"
    .update(b=c.col("a") + c.col("c"))        # add a new column "b" = "a" + "c"
    .filter(c.col("b") < -2)                  # keep only rows where "b" < -2
    .rename({"a": "A"})                       # rename column "a" to "A"
    .drop("c")                                # drop column "c"
    .into_iter_rows(dict)  # .into_csv("out.csv") if passing to pandas is not needed
)
westandskif
  • 972
  • 6
  • 9