My input is a 20 GB .txt file, and the code below runs into performance problems when I test it: pd.read_csv alone takes more than 3 hours. I need to optimize the reading stage.
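One direction I'm considering for the read itself is dropping engine='python' for pandas' default C parser, giving it an explicit dtype, and skipping the unused last column. This is a sketch only (column names match my code below; I haven't benchmarked it on the full 20 GB file):

import pandas as pd

# Sketch: default C engine, fixed string dtype, and only the three
# columns the rest of the script actually uses.
df = pd.read_csv(
    'sample.txt',
    sep='|',
    names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
    usecols=['upc_cd', 'chr_typ', 'chr_vl'],
    dtype=str,
)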
Sample input file:
007064770000|SODIUM|95 MILLIGRAM|0
007064770000|MULTI|001|0
007064770000|PET STARCH FREE|NOT APPLICABLE|0
007064770000|GRAIN TYPE|FLOUR|0
003980010200|MULTI|001|0
003980010200|DEAL|NON-DEAL|0
003980010200|PRODUCT SIZE|1 COUNT|0
003980010200|BASE SIZE|1 COUNT|0
757582821517|HW APPLIANCES|001|0
757582821516|HW APPLIANCES|001|0
757582821517|PACKAGE GENERAL SHAPE|BOTTLE|0
757582821517|SYND FORM|CREAM|0
757582821517|FORM|CREAM|0
757582821517|TARGET SKIN CONDITION|DRY SKIN|0
003980010205|HW MEDICINE|NON-DEAL|0
003980010205|PRODUCT SIZE|1 COUNT|0
003980010205|BASE SIZE|1 COUNT|0
007064770054|SODIUM|95 MILLIGRAM|0
007064770054|HW SPORTS|001|0
007064770054|PET STARCH FREE|NOT APPLICABLE|0
007064770054|GRAIN TYPE|FLOUR|0
003980010312|HW DIAMETER|1 COUNT|0
003980010312|BASE SIZE|1 COUNT|0
Output file:

       UPC code  HW APPLIANCES  HW DIAMETER  HW MEDICINE  HW SPORTS
0    3980010205            NaN          NaN     NON-DEAL        NaN
1    3980010312            NaN      1 COUNT          NaN        NaN
2    7064770054            NaN          NaN          NaN        001
3  757582821516            001          NaN          NaN        NaN
4  757582821517            001          NaN          NaN        NaN
Existing code:

import pandas as pd

df = pd.read_csv('sample.txt', sep='|',
                 names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
                 engine='python')
df = df[df['chr_typ'].str.contains('HW ')]  # keep only the 'HW ' characteristics
df = df.sort_values('chr_typ')              # sort rows by characteristic type
df = (
    df.iloc[:, :-1]                         # drop the last column (chr_vl_typ)
    .pivot(index='upc_cd', columns='chr_typ')
    .droplevel(0, axis=1)                   # flatten the column MultiIndex
    .rename_axis('UPC code')
    .rename_axis(None, axis=1)
    .reset_index()
)
print(df)
df.to_csv('output.csv', sep=',', index=False, encoding='utf-8')
Please suggest modifications to the code to reduce the running time.
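For example, would a chunked read along these lines be the right direction, so that the 'HW ' filter runs during the read and only matching rows ever sit in memory? A sketch, not benchmarked on the full file (the chunk size of 1,000,000 rows is a guess):

import pandas as pd

# Stream the file in chunks with the default C engine and keep only
# the 'HW ' rows from each chunk.
reader = pd.read_csv(
    'sample.txt',
    sep='|',
    names=['upc_cd', 'chr_typ', 'chr_vl', 'chr_vl_typ'],
    usecols=['upc_cd', 'chr_typ', 'chr_vl'],
    dtype=str,
    chunksize=1_000_000,
)
filtered = pd.concat(
    chunk[chunk['chr_typ'].str.contains('HW ', na=False)] for chunk in reader
)

# Pivot only the filtered subset, which should be far smaller than 20 GB.
df = (
    filtered.pivot(index='upc_cd', columns='chr_typ', values='chr_vl')
    .rename_axis('UPC code')
    .rename_axis(None, axis=1)
    .reset_index()
)
df.to_csv('output.csv', index=False, encoding='utf-8')

Or would switching the parser itself (e.g. engine='pyarrow', available in newer pandas versions) be a better fix?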