
I have a problem with multi-index column names. I'm using xlrd to convert Excel data to JSON with json.dumps, but it only gives me a single row of column names. I have read about multi-level JSON, but I have no idea how to produce it using xlrd.

Here is a sample of my table's column names:

Sample code:

import json
import xlrd

# skip_row, skip_row2 and check_skip are my own helpers (defined elsewhere)
# that find which row the header starts on for a given sheet.
for i in path:
    with xlrd.open_workbook(i) as wb:
        print([i])
        print(wb.sheet_names())
        for j in range(wb.nsheets):
            worksheet = wb.sheet_by_index(j)
            data = []
            n = 0
            nn = 0
            keyword = 'sage'
            keyword2 = 'adm'
            try:
                skip = skip_row(worksheet, n, keyword)
                keys = [v.value for v in worksheet.row(skip)]
            except Exception:
                try:
                    skip = skip_row2(worksheet, nn, keyword2)
                    keys = [v.value for v in worksheet.row(skip)]
                except Exception:
                    continue
            print(keys)
            for row_number in range(check_skip(skip), worksheet.nrows):
                if row_number == 0:
                    continue
                row_data = {}
                for col_number, cell in enumerate(worksheet.row(row_number)):
                    row_data[keys[col_number]] = cell.value
                data.append(row_data)
            print(json.dumps({'Data': data}))

Oh, by the way, each worksheet has a different number of rows to skip before the column names appear, which is why my code uses the skip-row functions. After skipping those rows I find the exact location of the column names and start reading the values. That's where the problem arises from my point of view, because there are two rows of column names, and I'm still confused about how to build multi-level JSON with xlrd, or at least how to join the two header rows into one (which I guess xlrd can't do by itself).

Desired outcome, multi-level JSON:

{ "Data": [{ "ID": "997", "Tax": [{"Date": "9/7/2019", "Total": 2300, "Grand Total": 340000}], "Tax ID": "ST-000", .... }]}

PS: I've tried to use pandas, but it gives me a lot of trouble since I work with big data.

IAMajid
  • Are your header row index and column names always fixed? – Akash Oct 17 '19 at 02:41
  • @Akash sadly no, they change for every worksheet. For example, sheet 1 may need 3 rows skipped before the header is found, while sheet 2 may need none, and the column names can differ too. But every sheet definitely has multi-index column names :| – IAMajid Oct 17 '19 at 02:46
  • Can you please share a sample Excel file and the required output format? – Akash Oct 17 '19 at 03:03
  • @Akash sorry, I have no idea how to share an Excel file, but I put an image of the example Excel above. Just imagine that different sheets have different numbers of rows to skip before the header is found, and that the column names and their locations aren't fixed. – IAMajid Oct 17 '19 at 03:49

1 Answer


You can use multi-indexing in pandas. First you need to get the header row indexes for each sheet.

header_indexes = get_header_indexes(excel_filepath, sheet_index) #returns list of header indexes

You need to write the get_header_indexes function yourself; it should scan a sheet and return the header row indexes.
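A rough sketch of such a helper, reusing xlrd to scan for the keyword row the question's skip_row logic relies on (the keywords and the assumption that the header spans exactly two consecutive rows are guesses from the question, not anything pandas provides):

import xlrd

def get_header_indexes(excel_filepath, sheet_index, keywords=('sage', 'adm')):
    # Find the first row containing one of the keywords and assume that
    # row plus the one below it form the two-row header.
    wb = xlrd.open_workbook(excel_filepath)
    ws = wb.sheet_by_index(sheet_index)
    for row_idx in range(ws.nrows):
        values = [str(c.value).lower() for c in ws.row(row_idx)]
        if any(k in v for v in values for k in keywords):
            return [row_idx, row_idx + 1]
    raise ValueError('no header row found in sheet %d' % sheet_index)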

Then you can use pandas to get JSON from the DataFrame:

import pandas as pd
df = pd.read_excel(excel_filepath, header=header_indexes, sheet_name=sheet_index)
data = df.to_dict(orient="records")

For multiple header rows, data contains a list of dicts where each dict has a tuple as its key; you can reformat it into the final JSON as per your requirement.
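For example, a rough sketch of that reformatting step (this assumes the tuple keys look like ("Tax", "Date"), and that pandas fills the missing second level of stand-alone columns such as "ID" with an "Unnamed: ..." placeholder, which is its usual behaviour):

import json

def records_to_nested_json(records):
    # records is the output of df.to_dict(orient="records") for a
    # DataFrame with two header rows, so every key is a (group, sub) tuple.
    data = []
    for record in records:
        row = {}
        for (group, sub), value in record.items():
            if str(sub).startswith('Unnamed'):
                row[group] = value                           # plain column, e.g. "ID"
            else:
                row.setdefault(group, [{}])[0][sub] = value  # grouped column
        data.append(row)
    return json.dumps({'Data': data})

# print(records_to_nested_json(data))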

Note: Use chunksize for reading large files.

Akash
  • Hi, your solution is good, but I have a bundle of Excel files, and pandas can read .xlsx but doesn't support .xls anymore. That's why I need to use xlrd, so I can read both extensions. – IAMajid Oct 17 '19 at 05:42
  • pandas supports both xls and xlsx extensions; for reading an `xls` file you need to install the `xlrd` package. Refer to this link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel – Akash Oct 17 '19 at 06:08
  • I got this error: `Out of bounds nanosecond timestamp: 3434-07-06 00:00:00` – IAMajid Oct 17 '19 at 06:20
  • Your timestamp value is outside pandas' timestamp limits: https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a If you are not dealing with datetimes, you can change the column type by specifying `dtype` as `str` (sketch below these comments). – Akash Oct 17 '19 at 06:36
  • Oh, do you know how to delete the entire row if the timestamp is outside pandas' limits? @Akash – IAMajid Oct 17 '19 at 14:22
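A small sketch of the dtype suggestion from the comments (the filename, sheet, and header rows are placeholders; reading everything as str means no column is parsed as datetime, so the out-of-bounds error never triggers):

import pandas as pd

# dtype=str keeps every cell as text, so out-of-range dates are never
# converted to pandas Timestamps. 'workbook.xls' and header=[0, 1] stand in
# for the real file path and header row indexes.
df = pd.read_excel('workbook.xls', sheet_name=0, header=[0, 1], dtype=str)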