I have a csv file which is more than 1 GB in size. Usually this is how I convert a csv to a table using pandas:
import pandas as pd
from sqlalchemy import create_engine
file_path = "/path/to/test.csv"
df = pd.read_csv(file_path)
for col in df.columns:
    # convert each column's values to numeric; invalid entries become NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')
engine = create_engine('postgresql://username:password@localhost:port/dbname')
df.to_sql("mytable", engine)
As you can see, I load up the csv file, perform an operation on each column, and dump it to a table.
Now, since my file is very large, pandas is unable to load it into a data frame. So I looked up a solution online that talks about loading the data in chunks and performing operations on it (Using pandas to efficiently read in a large CSV file without crashing). This is what I came up with:
file_path = "/path/to/test.csv"
chunksize = 100000
for chunk in pd.read_csv(file_path, chunksize=chunksize, iterator=True, low_memory=False):
    columns = chunk.columns
It gives me the columns of every chunk. So does a chunksize of n mean it reads n rows at a time? And how do I make sure I cover all the chunks and keep appending the data to the table, so that in the end all the data is dumped to the table, just like with smaller csv files?
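
Here is a rough sketch of what I was thinking of trying. I'm assuming that passing if_exists='append' to to_sql is the right way to keep adding rows from each chunk (and index=False to avoid writing the data frame index as a column), but I'm not sure this is correct:

import pandas as pd
from sqlalchemy import create_engine

file_path = "/path/to/test.csv"
chunksize = 100000
engine = create_engine('postgresql://username:password@localhost:port/dbname')

for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunksize, low_memory=False)):
    # apply the same per-column conversion as before, on this chunk only
    for col in chunk.columns:
        chunk[col] = pd.to_numeric(chunk[col], errors='coerce')
    # create the table from the first chunk, then append every later chunk
    chunk.to_sql("mytable", engine, if_exists='replace' if i == 0 else 'append', index=False)

Is this the right approach, or is there a better way to handle the chunks?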