1

I'm having trouble describing the way my data is structured, hopefully this image will help. enter image description here

As you can see I have 3 columns: Name, Corp #, and Dissagregation Category, that together serve as a unique identifier for each row. There are two super headers, "English/Language Arts Scores", and "Mathematics Scores." I want to create two separate dataframes for each header. This is what I've managed so far:

df1 = pd.read_excel(file, header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'Name'])
df = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)

This generates a dataframe with two levels: df['English/Language Arts Scores'] and df['Mathematics Scores'], that contain the columns under the header and "Name" as the index. I want these dataframes to include: Name, Copr #, and Dissagregation Category as columns or as the index.

What would be the best way to go about doing this?

EDIT

Here is a copy and pasteable snippet of my data:

                                          English        Math   
         A        B          C           X       Y     X       Y
 ADAMS CENTRAL  0015    All Students    83      590    83     579
 ADAMS CENTRAL  0015    General 1       0       ***    0      ***
 ADAMS CENTRAL  0015    Total Gene      71      590    71     579
 ADAMS West     0016    All Students    93      440    83     765
 ADAMS West     0016    General 1       1        33    0      660
 ADAMS West     0016    Total Gene      31      ***    46     572

The output that I want should look like this:

In [1]: df['English']
Out[1]:                                              
         A        B          C           X       Y     
 ADAMS CENTRAL  0015    All Students    83      590    
 ADAMS CENTRAL  0015    General 1       0       ***    
 ADAMS CENTRAL  0015    Total Gene      71      590    
 ADAMS West     0016    All Students    93      440    
 ADAMS West     0016    General 1       1        33    
 ADAMS West     0016    Total Gene      31      ***  

In [2]: df['Math']
Out[2]:                                                 
         A        B          C           X       Y
 ADAMS CENTRAL  0015    All Students    83      579
 ADAMS CENTRAL  0015    General 1       0       ***
 ADAMS CENTRAL  0015    Total Gene      71      579
 ADAMS West     0016    All Students    83      765
 ADAMS West     0016    General 1       0       660
 ADAMS West     0016    Total Gene      46      572
otteheng
  • 594
  • 1
  • 9
  • 27
  • 1
    Please see [how to make good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Posting images of a spreadsheet program is not very helpful. – juanpa.arrivillaga Nov 29 '17 at 17:57
  • Thanks. Let me see if I can make the data into runnable code. – otteheng Nov 29 '17 at 18:01
  • could you maybe provide a sample expected output: the phrase ` I want to create two separate dataframes for each header` seems a little ambiguous to me and I want to make sure the viewers will interpret your need correctly – MattR Nov 29 '17 at 18:04
  • Of course. It is a little vague. – otteheng Nov 29 '17 at 18:15

1 Answers1

1

We can using read_excel, read the file, then using .loc, slice the columns level0 you need

df = pd.read_excel('yourexcel.xlsx',
                   header=[0,1],
                   index_col=[0,1,2],
                   sheetname="Sheet1")

df.loc[:,'English']
Out[837]: 
C                               X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

To make it more neat add rename_axis

df.loc[:,'English'].rename_axis(None,1)
Out[840]: 
                                X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

Our base on your method

vals = df.iloc[3:,:].values
df1 = pd.DataFrame(df.values[3:, 3:], df.values[3:, 0:3])
mux = pd.MultiIndex.from_arrays(df.ffill().ffill(1).values[1:3, 3:])
df1.columns=mux
df1.index = pd.MultiIndex.from_tuples(df1.index)
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I'm getting the error `ValueError: Length of new names must be 1, got 2`. When I read in the excel there are blank columns at the end of the file (in my example after Y under Math). If I manually delete them the error goes away. Is there a way to avoid the error without manually deleting the columns each time? – otteheng Nov 29 '17 at 19:24
  • @otteheng base on your data, I can not reproduce the problem, Maybe you can show the the file ? – BENY Nov 29 '17 at 19:27
  • @otteheng do you know how many columns you want to read into python from excel ? – BENY Nov 29 '17 at 19:37
  • The number of columns varies from excel to excel unfortunately. Is there a way to incrementally remove the last column until the command no longer errors? – otteheng Nov 29 '17 at 19:41
  • @otteheng https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html, you can check the link, I will guess for your case the use.col may help – BENY Nov 29 '17 at 19:43
  • Would you be able to create an answer to this using the steps in my question? If it's not possible let me know and I'll go ahead and accept your answer, as it does work and was very helpful. – otteheng Nov 30 '17 at 23:02
  • Did you run it on the file linked in my comment or the example posted in my question? I'm getting this error `TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument`. I haven't found very much guidance as to what this error means. – otteheng Dec 01 '17 at 14:31
  • @otteheng I am using the example in your post – BENY Dec 01 '17 at 14:52
  • Somethings not working with the excel files but I'm ready to move on. Thanks for your help. – otteheng Dec 01 '17 at 15:11