Is it possible to gradually read data from sql with read_sql to dataframe and at the same time write to parquet using pandas?

Question

I have a large dataset which is really big to handle as a dataframe. Besides, It takes a long time reading whole data from database every time. Once I tried to write my data to parquet format, and read the data with read_parquet which was very faster.

So my question is: Can I read the data from database in chunk read_sql, write it to parquet with pandas to_parquet, read another chunk (when I delete previous one to save RAM) and append it to the parquet file and so on?

Did you give it a try to find out? Please post your attempt and any issues you face. Consider saving chunk size parquet files and then appending with pyarrow library. See [Using pyarrow how do you append to parquet file?](https://stackoverflow.com/q/47113813/1422451) — Parfait, Aug 27 '22 at 13:01
@Parfait Yes, I've give it a try. But as it's not some problem with code, I didn't share mine. Will try append with pyarrow. — masoud, Aug 27 '22 at 13:05

score 1 · Answer 1 · edited Sep 14 '22 at 14:59

1

We can try using pyarrow to write from database directly to parquet:

import pyarrow as pa
import pyarrow.parquet as pq

chunk = 1_000_000
pqwriter = None

for i, df in enumerate(pd.read_sql(query, con=con, chunksize=chunk):
   table = pa.Table.from_pandas(df)
   if i == 0:
      pqwriter = pq.ParquetWriter('output.parquet', table.schema)
   pqwriter.write_table(table)
   print('One Chunk Written')
if pqwriter:
   pqwriter.close()

edited Sep 14 '22 at 14:59

jtlz2

7,700
9
64
114

answered Sep 14 '22 at 13:31

masoud

535
3
16

Brilliant, works for me! How would you partition the output table as you write it? – jtlz2 Sep 14 '22 at 14:58
How are you declaring/defining `con`? – jtlz2 Sep 14 '22 at 15:52
Did you copy this from here? https://stackoverflow.com/a/47839247/1021819 If so you ought at least to acknowledge it! :-S – jtlz2 Sep 14 '22 at 20:25
1

@jtlz2 I had read that post ( and of course other ones) when I asked the question. But not copy from it. I did just wrote the code that worked for me in my project :) – masoud Sep 15 '22 at 18:22
@jtlz2 you can define a con object with respect to your own database. I've used cx_Oracle to connect to oracle db. – masoud Sep 15 '22 at 18:27

Is it possible to gradually read data from sql with read_sql to dataframe and at the same time write to parquet using pandas?

1 Answers1