1

I have a large dataset which is really big to handle as a dataframe. Besides, It takes a long time reading whole data from database every time. Once I tried to write my data to parquet format, and read the data with read_parquet which was very faster.

So my question is: Can I read the data from database in chunk read_sql, write it to parquet with pandas to_parquet, read another chunk (when I delete previous one to save RAM) and append it to the parquet file and so on?

Vitalizzare
  • 4,496
  • 7
  • 13
  • 32
masoud
  • 535
  • 3
  • 16
  • 3
    Did you give it a try to find out? Please post your attempt and any issues you face. Consider saving chunk size parquet files and then appending with pyarrow library. See [Using pyarrow how do you append to parquet file?](https://stackoverflow.com/q/47113813/1422451) – Parfait Aug 27 '22 at 13:01
  • @Parfait Yes, I've give it a try. But as it's not some problem with code, I didn't share mine. Will try append with pyarrow. – masoud Aug 27 '22 at 13:05
  • @masoud Any luck / yet? – jtlz2 Sep 14 '22 at 11:17
  • @jtlz2 I'll share my solution with you – masoud Sep 14 '22 at 13:24
  • 1
    Perfect, yes please! – jtlz2 Sep 14 '22 at 13:27

1 Answers1

1

We can try using pyarrow to write from database directly to parquet:

import pyarrow as pa
import pyarrow.parquet as pq

chunk = 1_000_000
pqwriter = None

for i, df in enumerate(pd.read_sql(query, con=con, chunksize=chunk):
   table = pa.Table.from_pandas(df)
   if i == 0:
      pqwriter = pq.ParquetWriter('output.parquet', table.schema)
   pqwriter.write_table(table)
   print('One Chunk Written')
if pqwriter:
   pqwriter.close()
jtlz2
  • 7,700
  • 9
  • 64
  • 114
masoud
  • 535
  • 3
  • 16
  • Brilliant, works for me! How would you partition the output table as you write it? – jtlz2 Sep 14 '22 at 14:58
  • How are you declaring/defining `con`? – jtlz2 Sep 14 '22 at 15:52
  • Did you copy this from here? https://stackoverflow.com/a/47839247/1021819 If so you ought at least to acknowledge it! :-S – jtlz2 Sep 14 '22 at 20:25
  • 1
    @jtlz2 I had read that post ( and of course other ones) when I asked the question. But not copy from it. I did just wrote the code that worked for me in my project :) – masoud Sep 15 '22 at 18:22
  • @jtlz2 you can define a con object with respect to your own database. I've used cx_Oracle to connect to oracle db. – masoud Sep 15 '22 at 18:27