0

I have some excel files with this configuration.

Excel sheet that needs to be written to a pandas dataframe

That is, instead of writing Header3/Header31 to column D, Header4/Header41 to column E, and Header5/Header51 to column F, the user wrote in the way shown in the Figure. Note that the users are the same for all variables (headers) and a blank space separate the next 'batch' of headers. Is there any way to read this file as a pandas dataframe? I can't just correct the excel file manually, then read in Python, as there are many files and many columns.

Intended way of the dataframe to be shown in a pandas dataframe

I have been doing a research through Stack Overflow to try to find an answer and I found some interesting similar problems, but not quite what I need to do.

Link 1 This one has different headers through the sheet, but the OP needed only a specific part of dataframe. The accepted answer used open(file) and read the file until the correct header was found. I thought about doing something like this, but how could I stop the reading column by column,as I need them all, not only one and then pass to read_excel.

Link2 This one also is similar, but the OP needed to not include data after the 'break' occur. That is, when the blank line separating the other headers occurs, the loop is ended and results go to read_excel.

I think the way to go will be through file reading with open instead of going directly to read_excel but how to read column by column?

Thank you!

kolrocket
  • 9
  • 3

1 Answers1

0

You can try this :

#https://stackoverflow.com/a/75922169/16120011
import numpy as np
import pandas as pd

#pip install scikit-image
from skimage.measure import label, regionprops

df = pd.read_excel("file.xlsx", header=None)

larr = label(np.array(df.notnull()).astype("int"))

list_dfs = []
for s in regionprops(larr):
    sub_df = (df.iloc[s.bbox[0]:s.bbox[2], s.bbox[1]:s.bbox[3]].set_index(0)
                .pipe(lambda df_: df_.set_axis(pd.MultiIndex.from_frame(
                    df_.iloc[0:2].T, names=[None]*2), axis=1)
                .drop(df_.index[:2])
                     )
             )
    list_dfs.append(sub_df)

out = pd.concat(list_dfs, axis=1).rename_axis(index=None)

Output :

print(out)

       Header1  Header2  Header3  Header4  Header5
      Header12 Header21 Header31 Header41 Header51
user1        1        2       11       12        3
user2        3        4       13       14        4
user3        5        6       10        7        1
user4        7        8        1        2        2
user5        9       10        4        5        3

If you don't wanna a MultiIndex and only care about the output format, you can use :

list_dfs = []
for s in regionprops(larr):
    sub_df = df.iloc[s.bbox[0]:s.bbox[2], s.bbox[1]:s.bbox[3]].set_index(0)
    list_dfs.append(sub_df)

out = pd.concat(list_dfs, axis=1).rename_axis(index=None).reset_index()

out.to_excel("outfile.xlsx", header=False, index=False)

Output :

enter image description here

Spreadsheet used :

enter image description here

Timeless
  • 22,580
  • 4
  • 12
  • 30
  • 1
    Thank you very much! I did not know about `regionprop`, that is a very powerful tool for these messy excel files! Also thanks for the multiindex tip! :) – kolrocket May 23 '23 at 20:38