4

Is it possible to convert an .xlsx Excel file to Parquet without first converting it to CSV? I have many Excel files, each with many sheets, and I don't want to convert every sheet to CSV and then to Parquet, so I wonder whether there is a way to convert Excel directly to Parquet. Or is there a way to do it with NiFi? I was going to do it this way, using a Python script:

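import csv
import xlrd

def csv_from_excel():
    # Note: xlrd 2.x reads only legacy .xls workbooks, not .xlsx
    wb = xlrd.open_workbook('your_workbook.xls')
    for sheet_name in wb.sheet_names():
        sh = wb.sheet_by_name(sheet_name)
        # One CSV per sheet, so later sheets do not overwrite earlier ones.
        # Python 3: open in text mode with newline='' for the csv module.
        with open(sheet_name + '.csv', 'w', newline='') as your_csv_file:
            wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
            for rownum in range(sh.nrows):
                wr.writerow(sh.row_values(rownum))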
Scott Holtzman
Orhan Yazar
  • Very related: https://stackoverflow.com/questions/32940416/methods-for-writing-parquet-files-using-python – John Y Jul 31 '17 at 21:37

2 Answers

1

From a NiFi perspective, the two interesting questions here are:

  1. Can NiFi pick up this Excel file?

This should not be too difficult with the XLSX processor, but if your situation is more complex, this detailed HCC article might be helpful.

  2. Can NiFi write to Parquet?

This part is easy: with the PutParquet processor, NiFi can write directly to Parquet.

Dennis Jaheruddin
0

Install the required libraries (pandas, pyarrow, openpyxl), then run the following code:

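import os
import fnmatch
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import openpyxl  # not called directly; pandas uses it to read .xlsx files

path = './'
pattern = 'mydataset.xlsx'  # use '*.xlsx' to match every workbook in path
all_files = os.listdir(path)

for name in all_files:
    if fnmatch.fnmatch(name, pattern):
        full_path = os.path.join(path, name)
        # Note: read_excel reads only the first sheet by default
        df = pd.read_excel(full_path)
        # Convert the DataFrame to an Arrow table and write it as Parquet
        table = pa.Table.from_pandas(df)
        pq.write_table(table, full_path + '.parquet')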