2

In the following the values _0,_1 have been used where there should be spaces, I wasn't able to create the example with spaces here as pandas didn't allow them. They are present in the excel file that I'm reading from though.

1,2,3 are not values which can be relied on in any solution, they're just filling for this example.

What I would like to do is convert an additional heading into a column, so that there is only one heading for the data.

Some example data:

ef = pd.DataFrame({
  '_0' : ['loc', 1, 2, 3],
  'a' :  ['x', 1, 2, 3],
  '_1' : ['y', 1, 2, 3],
  '_2' : ['z', 1, 2, 3],
  'b' :  ['x', 1, 2, 3],
  '_3' : ['y', 1, 2, 3],
  '_4' : ['z', 1, 2, 3],
  'c' :  ['x', 1, 2, 3],
  '_5' : ['y', 1, 2, 3],
  '_6' : ['z', 1, 2, 3],
})

Which outputs

In [98]: ef
Out[98]:
    _0  a _1 _2  b _3 _4  c _5 _6
0  loc  x  y  z  x  y  z  x  y  z
1    1  1  1  1  1  1  1  1  1  1
2    2  2  2  2  2  2  2  2  2  2
3    3  3  3  3  3  3  3  3  3  3

Without the underscores this is

        a        b        c      
0  loc  x  y  z  x  y  z  x  y  z
1    1  1  1  1  1  1  1  1  1  1
2    2  2  2  2  2  2  2  2  2  2
3    3  3  3  3  3  3  3  3  3  3

And I would like to get it into the form

loc  type  x  y  z  
  1   a    1  1  1  
  1   b    1  1  1  
  1   c    1  1  1  
  2   a    2  2  2  
  2   b    2  2  2  
  2   c    2  2  2  
  3   a    3  3  3  
  3   b    3  3  3  
  3   c    3  3  3  

How can this be done using pandas?

baxx
  • 3,956
  • 6
  • 37
  • 75
  • 1
    Or if possible `print (df.columns.to_frame(index=False))` ? – jezrael Oct 26 '19 at 17:43
  • cheers @jezrael , a solution which leverages the data given in the post rather than relying on reading in information from the excel file that I have locally would be best I think. – baxx Oct 26 '19 at 17:46
  • I ask for data, because this parsing is pain generally and not sure how looks data in `MultiIndex` (it depends of your data, there are some spaces, so not sure if I know create good data for reproduce...) – jezrael Oct 26 '19 at 17:48
  • I'm trying to avoid any ingestion confusion and just focus on manipulation of data in the format that I've given in the OP though? Anything related to the reading of the data can be done in a separate post, I'm aware there are issues as the data I have is very irregular. Usually there's useless information at the top of the file and so on. I don't understand why the example I have given is not suitable. – baxx Oct 26 '19 at 17:51
  • 1
    OK, I create data like you mentioned in question. Give me a sec. – jezrael Oct 26 '19 at 17:52
  • hmmm, so it means `df = df.read_excel(file, header=[0,1], index_col=0)` cannot be used here? – jezrael Oct 26 '19 at 17:54
  • 1
    @jezrael no that didn't work, I could make a separate post about reading this data in, but it really is a mess. I have tried to capture an atomic part of the problem though with the example data in this post (though I couldn't do the spaces so well). I'm interested in how best to manipulate data like this though, that would be good to learn. – baxx Oct 26 '19 at 17:58
  • Answer was edited... – jezrael Oct 26 '19 at 18:10

1 Answers1

3

I think here is best convert to MultiIndex in columns and first column loc to index in read_excel:

df = df.read_excel(file, header=[0,1], index_col=0)

Then is possible columns names are a bit changed with unammed values, so necessary later processing.

Solution for your data:

ef = pd.DataFrame({
  '_0' : ['loc', 1, 2, 3],
  'a' :  ['x', 1, 2, 3],
  '_1' : ['y', 1, 2, 3],
  '_2' : ['z', 1, 2, 3],
  'b' :  ['x', 1, 2, 3],
  '_3' : ['y', 1, 2, 3],
  '_4' : ['z', 1, 2, 3],
  'c' :  ['x', 1, 2, 3],
  '_5' : ['y', 1, 2, 3],
  '_6' : ['z', 1, 2, 3],
})

#added spaces for values with `_`
ef.columns = np.where(ef.columns.str.contains('_'), ' ', ef.columns).tolist()

#create MultiIndex by set first row to columns
ef.columns = [ef.columns, ef.iloc[0]]
#remove first row by iloc and set index by first column - it is MultiIndex, so used tuple
ef = ef.iloc[1:].set_index([(' ', 'loc')])
#created tuples in index - removed tuples
ef.index = ef.index.str[0]
#set index name later for new column
ef.index.name='loc'

#converted MultiIndex to df for forward filling spaces converted to NaNs
df = pd.DataFrame(ef.columns.tolist(), columns=['type', 'c2'])
df['type'] = df['type'].mask(df['type'] == ' ').ffill()
print (df)
  type c2
0    a  x
1    a  y
2    a  z
3    b  x
4    b  y
5    b  z
6    c  x
7    c  y
8    c  z

#set MultiIndex to ef
ef.columns = [df['c2'], df['type']]

#last possible reshape
ef = ef.stack().reset_index().rename_axis(None, axis=1)
print (ef)
   loc type  x  y  z
0    1    a  1  1  1
1    1    b  1  1  1
2    1    c  1  1  1
3    2    a  2  2  2
4    2    b  2  2  2
5    2    c  2  2  2
6    3    a  3  3  3
7    3    b  3  3  3
8    3    c  3  3  3
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    brilliant answer, close to my own answer but I used `pd.MultiIndex.from_arrays([ef.iloc[0, :], ef.columns]` I couldn't melt/stack the dataframe tho. – Umar.H Oct 26 '19 at 18:07
  • for the line with `ef = ef.iloc[1:].set_index([(" ", "loc")]).copy()` the following error is given: `PerformanceWarning: indexing past lexsort depth may impact performance`. I've seen: https://stackoverflow.com/a/53927461 , but am not sure how it would be best handled your example. – baxx Oct 23 '20 at 21:39
  • @baxx - I guess problem should be MultiIndex is not sorted, so try after `ef.columns = [ef.columns, ef.iloc[0]]` add `ef = ef.sort_index(axis=1)` – jezrael Oct 25 '20 at 04:21