
I'm fairly new to Python and scripting, so the solution is probably easy, but I've been racking my brain over this.

I would like to get a dataframe from long format to wide format.

The dataframe contains records on results of lab tests. The results as extracted from the database are categorical and include text, for example 10 mU/L, 1000 mU/L, >100000 mU/L.

As the value is text, pivot_table did not work and raised "DataError: No numeric types to aggregate".
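For context, pivot_table aggregates with the mean by default, which is presumably why it fails on text values. A minimal sketch of one way around that, assuming each specimen/idx pair holds exactly one value so that taking the first entry loses nothing (the `idx` helper column is the same one built further down):

```python
import pandas as pd

data = pd.DataFrame({
    "specimen": [4456, 4456, 4460, 4460, 4480, 4480, 4598, 4656],
    "micro_org": ["Ecoli", "Entcc", "Ecoli", "Staph", "Hsga", "Cfreundi", "Ecoli", "Cfreundi"],
    "value": ["100 mU", "1000 mU", ">10000 mU", "10000 mU", "100 mU", "1000 mU", ">10000 mU", "100 mU"],
})

# Number the rows within each specimen: 0 for the first organism, 1 for the second
data["idx"] = data.groupby("specimen").cumcount()

# aggfunc="first" picks the first entry per cell instead of averaging,
# so it also works on text columns
wide = data.pivot_table(index="specimen", columns="idx",
                        values=["micro_org", "value"], aggfunc="first")
print(wide)
```

The result still has two-level columns, so this only sidesteps the error; it does not flatten the column names.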

Therefore I used pivot.

import pandas as pd

data = pd.DataFrame({"specimen":[4456, 4456, 4460, 4460, 4480, 4480, 4598, 4656], 
                     "patientnumber":[264, 264, 264, 264, 285, 285, 412, 583], 
                     "micro_org":["Ecoli", "Entcc", "Ecoli", "Staph", "Hsga", "Cfreundi", "Ecoli", "Cfreundi"],
                     "value":['100 mU', '1000 mU', '>10000 mU', '10000 mU', '100 mU', '1000 mU', '>10000 mU', '100 mU']})


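# Number the rows within each specimen (0, 1, ...) to use as column labels
data['idx'] = data.groupby('specimen').cumcount()
# Pivot to wide format and keep only the organism and value columns
data_pivot = data.pivot(index='specimen', columns='idx')[['micro_org', 'value']]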

This gives me:

[image: data]

The columns seem to be a MultiIndex, as:

  • data_pivot.index.names gives FrozenList([None]) and
  • data_pivot.columns.names gives FrozenList([None, 'idx']) and
  • data_pivot.columns gives MultiIndex(levels=[['patientnumber', 'micro_org', 'value', 'specimen', 'index'], [0, 1, '']], codes=[[4, 3, 0, 0, 1, 1, 2, 2], [2, 2, 0, 1, 0, 1, 0, 1]], names=[None, 'idx'])

Question: how do I flatten the column MultiIndex so that the column names are ['specimen', 'micro_org1', 'micro_org2', 'value1', 'value2']?
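One possible way to get those names, as a sketch rather than the definitive approach: each label in the column MultiIndex is a (name, idx) tuple, so the two parts can be joined into a single string before resetting the index. The names `flat` and the 1-based numbering are choices made for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "specimen": [4456, 4456, 4460, 4460, 4480, 4480, 4598, 4656],
    "patientnumber": [264, 264, 264, 264, 285, 285, 412, 583],
    "micro_org": ["Ecoli", "Entcc", "Ecoli", "Staph", "Hsga", "Cfreundi", "Ecoli", "Cfreundi"],
    "value": ["100 mU", "1000 mU", ">10000 mU", "10000 mU", "100 mU", "1000 mU", ">10000 mU", "100 mU"],
})
data["idx"] = data.groupby("specimen").cumcount()
data_pivot = data.pivot(index="specimen", columns="idx")[["micro_org", "value"]]

# Work on a copy so the pivoted frame itself stays untouched
flat = data_pivot.copy()

# Each column label is a (name, idx) tuple; join them as name + (idx + 1)
flat.columns = [f"{name}{idx + 1}" for name, idx in flat.columns]

# Move 'specimen' from the row index back into an ordinary column
flat = flat.reset_index()
print(list(flat.columns))
# ['specimen', 'micro_org1', 'micro_org2', 'value1', 'value2']
```

Specimens with only one organism end up with NaN in micro_org2 and value2.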

I have tried data_pivot.reset_index(), but that does not change anything in the columns. I also looked at stack/unstack, but that does not seem to be the answer to this question either. I searched blogs and SO, but having the categorical variable 'value' to pivot on seems to complicate matters. Perhaps I am better off recoding this to integer?
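A small check of what reset_index() does here, assuming the same data as above: it only moves the row index ('specimen') into the columns and pads the second level with an empty string, so the columns remain a two-level MultiIndex. As far as I can tell, pivot does not aggregate anything, so the text values themselves should not be the obstacle and recoding to integer would not be needed for the reshape.

```python
import pandas as pd

data = pd.DataFrame({
    "specimen": [4456, 4456, 4460, 4460, 4480, 4480, 4598, 4656],
    "patientnumber": [264, 264, 264, 264, 285, 285, 412, 583],
    "micro_org": ["Ecoli", "Entcc", "Ecoli", "Staph", "Hsga", "Cfreundi", "Ecoli", "Cfreundi"],
    "value": ["100 mU", "1000 mU", ">10000 mU", "10000 mU", "100 mU", "1000 mU", ">10000 mU", "100 mU"],
})
data["idx"] = data.groupby("specimen").cumcount()
data_pivot = data.pivot(index="specimen", columns="idx")[["micro_org", "value"]]

after_reset = data_pivot.reset_index()

# The columns are still a MultiIndex with two levels
print(after_reset.columns.nlevels)   # 2

# 'specimen' became a column, labelled with an empty second level
print(after_reset.columns[0])        # ('specimen', '')
```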

Any help, suggestions for improvement and explanation greatly appreciated.

HBN