To write a pandas DataFrame to Parquet, I'm doing the following:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(DATA)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'DATA.parquet')
However, this doesn't work well if I have, say, 1B rows and the data cannot fit in memory. In that case, how would I write the data incrementally? For example, something like:
DATA = []
BATCH_SIZE = 10000
with open('largefile.csv') as f:
    for num, line in enumerate(f):
        if len(DATA) == BATCH_SIZE:
            # Write out the current batch, then start a new one
            pq.write_table(pa.Table.from_pandas(pd.DataFrame(DATA)), 'DATA.parquet')
            DATA = []
        DATA.append(line.split(','))
# Flush the final partial batch
if DATA:
    pq.write_table(pa.Table.from_pandas(pd.DataFrame(DATA)), 'DATA.parquet')
However, I believe the above would just keep overwriting the parquet file. How could I do the equivalent of appending?
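Is pyarrow.parquet.ParquetWriter the right tool for this? Below is a rough sketch of what I have in mind: keep one writer open and write each batch as a row group into the same file. I haven't verified this is the idiomatic approach, and the column names and schema here are just placeholders I made up for illustration.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 10000

# Placeholder schema: in practice it would be declared up front (or derived
# from the first batch) so that every batch written matches it.
schema = pa.schema([('col1', pa.string()), ('col2', pa.string())])

writer = pq.ParquetWriter('DATA.parquet', schema)
batch = []
with open('largefile.csv') as f:
    for line in f:
        batch.append(line.rstrip('\n').split(','))
        if len(batch) == BATCH_SIZE:
            df = pd.DataFrame(batch, columns=schema.names)
            # Each write_table call adds another row group to the same file
            writer.write_table(pa.Table.from_pandas(df, schema=schema))
            batch = []
    if batch:
        df = pd.DataFrame(batch, columns=schema.names)
        writer.write_table(pa.Table.from_pandas(df, schema=schema))
writer.close()

Would this effectively give me the "append" behaviour I'm after, or is there a better pattern for writing a dataset that doesn't fit in memory?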