
To write a pandas DataFrame to Parquet, I'm doing the following:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(DATA)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'DATA.parquet')

However, this doesn't work well if I have, say, 1B rows that cannot fit in memory. In that case, how would I write the data incrementally? For example, something like:

DATA = []
BATCH_SIZE = 10000
with open('largefile.csv') as f:
    for num, line in enumerate(f):
        if len(DATA) == BATCH_SIZE:
            pq.write_table(pa.Table.from_pandas(pd.DataFrame(DATA)), 'DATA.parquet')
            DATA = []
        DATA.append(line.rstrip('\n').split(','))

if DATA: pq.write_table(pa.Table.from_pandas(pd.DataFrame(DATA)), 'DATA.parquet')

However, I believe the above would just keep overwriting the parquet file. How could I do the equivalent of appending?

  • https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file – Ajay Srivastava Feb 09 '19 at 02:14
  • Does this answer your question? [Using pyarrow how do you append to parquet file?](https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file) – Contango Jan 23 '22 at 12:50
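The linked question describes the closest thing Parquet has to an append: keep one pq.ParquetWriter open and write each batch into the same file as an additional row group. A minimal sketch of that approach, assuming every CSV line has the same number of fields so each batch produces the same schema:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 10000
writer = None
batch = []

with open('largefile.csv') as f:
    for line in f:
        batch.append(line.rstrip('\n').split(','))
        if len(batch) < BATCH_SIZE:
            continue
        table = pa.Table.from_pandas(pd.DataFrame(batch))
        if writer is None:
            # Open the output file once, reusing the first batch's schema
            writer = pq.ParquetWriter('DATA.parquet', table.schema)
        writer.write_table(table)  # adds a new row group to the same file
        batch = []

# Flush the final partial batch
if batch:
    table = pa.Table.from_pandas(pd.DataFrame(batch))
    if writer is None:
        writer = pq.ParquetWriter('DATA.parquet', table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()

Only one batch is ever held in memory; the Parquet file grows one row group at a time.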

1 Answer


Hadoop isn't meant for appends. Just write a new file per batch into a single directory, and almost all Hadoop APIs should be able to read all the Parquet files in it.

DATA = []
BATCH_SIZE = 10000
c = 0
with open('largefile.csv') as f:
    for num, line in enumerate(f):
        if len(DATA) == BATCH_SIZE:
            pq.write_table(pa.Table.from_pandas(pd.DataFrame(DATA)), 'DATA.{}.parquet'.format(c))
            DATA = []
            c += 1
        DATA.append(line.rstrip('\n').split(','))

# write out whatever is left after the loop
if DATA:
    pq.write_table(pa.Table.from_pandas(pd.DataFrame(DATA)), 'DATA.{}.parquet'.format(c))

This is how Spark writes data too: one file per output partition.
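On the reading side, pyarrow can treat a directory of part files as a single dataset. A small sketch, assuming the DATA.N.parquet files were written into a directory of their own (the parquet_out/ name is a placeholder):

import pyarrow.parquet as pq

# Every part file under the directory is read back as one logical table
table = pq.read_table('parquet_out/')
df = table.to_pandas()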

But if you have a large CSV anyway, just put it in HDFS, create a Hive table over it, and then convert it to Parquet from there. No need for pandas at all.
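As a rough sketch of that route using Spark instead of Hive (the HDFS paths below are placeholders), the whole conversion is a distributed read followed by a Parquet write, with no pandas involved:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distributed CSV read, written straight back out as Parquet part files
(spark.read
      .option('header', 'false')
      .csv('hdfs:///data/largefile.csv')
      .write
      .mode('overwrite')
      .parquet('hdfs:///data/largefile_parquet'))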

OneCricketeer