
I have a CSV file that is more than 1 GB in size. This is usually how I convert a CSV to a table using pandas:

import pandas as pd
from sqlalchemy import create_engine

file_path = "/path/to/test.csv"
df = pd.read_csv(file_path)

for col in df.columns:
    # convert each column's values to numeric
    df[col] = pd.to_numeric(df[col], errors='coerce')

engine = create_engine('postgresql://username:password@localhost:port/dbname')
df.to_sql("mytable", engine)

As you can see, I load the CSV file, perform an operation on each column, and dump the result to a table.

Now, since my file is very large, pandas is unable to load it into a DataFrame. So I looked up a solution online that talked about loading the data in chunks and performing operations on it: Using pandas to efficiently read in a large CSV file without crashing. This is what I came up with:

file_path = "/path/to/test.csv" 
chunksize = 100000
for chunk in pd.read_csv(file_path, chunksize=chunksize, iterator=True, low_memory=False):
    columns = chunk.columns

This gives me the columns of every chunk. So does chunksize mean it reads n rows at a time? I can't quite figure out how to ensure I cover all the chunks and keep appending the data to the table so that, in the end, all the data ends up in the table just like it does with smaller CSV files.

Souvik Ray

1 Answer


The updated docs on `iterator` and `chunksize` can be found here: Iterating through files chunk by chunk.

So does chunksize mean it reads n rows at a time?

Yes.

Your code using chunksize is mostly correct; you just need to add each chunk to your dataframe.
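For illustration, a minimal sketch of collecting the chunks and concatenating them into a single DataFrame (assuming the file still fits in memory once loaded; the path and chunk size are the ones from your question):

import pandas as pd

file_path = "/path/to/test.csv"
chunksize = 100000

chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunksize):
    # each chunk is a DataFrame of up to `chunksize` rows
    chunks.append(chunk)

# stitch the pieces back together into one DataFrame
df = pd.concat(chunks, ignore_index=True)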

If all your columns are the same type and don't require any special logic/types, then convert the entire DataFrame instead of doing it column by column. Alternatively, you can specify dtypes to read_csv; however, you'd then lose the ability to specify 'coerce', so I'll leave it as is.
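As a rough sketch of the whole-frame conversion (DataFrame.apply passes errors='coerce' through to pd.to_numeric):

import pandas as pd

for chunk in pd.read_csv("/path/to/test.csv", chunksize=100000):
    # convert every column in one call instead of a per-column loop;
    # errors='coerce' still turns unparseable values into NaN
    chunk = chunk.apply(pd.to_numeric, errors='coerce')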

For extremely large data, it's better to do the entire read-convert-to_sql cycle in chunks. Also note that it doesn't make sense to use low_memory=False in that case; use the default True. You're converting the types later anyway, so any mixed-type inference (which is what might happen) won't matter.

engine = create_engine('postgresql://username:password@localhost:port/dbname')
reader = pd.read_csv(file_path, chunksize=chunksize, low_memory=True)
for rows in reader:
    df = pd.DataFrame(rows)  # each chunk is already a DataFrame of up to `chunksize` rows
    # column conversions
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    # sql export; if_exists='append' so every chunk goes into the same table
    df.to_sql("mytable", engine, if_exists='append')
aneroid
  • Wouldn't adding each chunk to the data frame increase memory consumption? Or is it that for each chunk, I need to keep dumping to the database? – Souvik Ray Apr 23 '20 at 15:37
  • For 1 GB, you could probably hold that in memory and then dump to the DB. But you raise a good point, especially for larger sizes. Better to do it all in chunks: read, convert, to_sql. But `low_memory=False` doesn't make sense. Have edited my answer with the changes. – aneroid Apr 23 '20 at 16:17
  • Just another question: if I have a csv file which is very small, can I still keep the chunk size the same? I am planning to use this code to convert a csv of any size to a table. – Souvik Ray Apr 23 '20 at 16:26
  • Yeah, sure. That works too, because if it's very small, it will be processed in a single loop. So apart from the _microsecond_ overhead of the loop constructs, it will work as well for small files as it does for larger ones. – aneroid Apr 23 '20 at 17:08
  • Btw, I would recommend that you try to specify `dtypes` in `read_csv` and check if you really need the 'coerce' option. The `for col in df.columns: df[col] = pd.to_numeric...` is the least efficient part of the code, so use it only if it's necessary. If your dtypes are all the same, could also use [`df.astype()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html#pandas-dataframe-astype) after `df = pd.DataFrame(rows)`. – aneroid Apr 23 '20 at 17:12
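For completeness, a minimal sketch of the dtype approach mentioned in the last comment. The column names here are hypothetical, and with dtype specified, unparseable values raise an error instead of becoming NaN as they do with 'coerce':

import pandas as pd

# hypothetical column names; replace with the real ones from your CSV
dtypes = {"col_a": "float64", "col_b": "float64"}

for chunk in pd.read_csv("/path/to/test.csv", chunksize=100000, dtype=dtypes):
    # columns already arrive as float64, so no per-column conversion is needed
    pass  # process the chunk here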