Ive been tasked with cleaning data from a mobile application designed by a charity
In one section, a users Q/A app use session is represented by a row. This section consists of repeated question answer field pairs, where a field represents the question asked and then the field next to it represents the corresponding answer. Together, each question/field and answer column pair represent a unique question with the answer.
Starting data
answers.0.answer answers.0.fieldName answers.1.answer answers.1.fieldName
5 0 avoidexercise 0.0 vomitflag
6 156 height 54.0 weight
7 1 affectedkneeside 3.0 painlocationknee
I have been asked to reformat the section so each question forms a column, the corresponding answer a field in that column
The ideal output
_id avoidexercise enjoyment fatigue2weeks height
5f27f29c362a380d3f9a9e46 1.0 yes 20.0 120.0
5f27f2ac362a380d3f9a9e4b 0.0 no 40.0 180.0
5f27f4d4362a380d3f9a9e52 1.0 yes 50.0 150.0
My plan is to create many pivot tables, from each other Q/A pairs of columns, then concaternate (outer join) then inner join to remove duplications
However, the original dataframe contains a mixture of numeric and object datatypes
Therefore, only some question/answer column pairs appear to be converting to pivot tables. I have tried using various aggregate functions
p1 = ur.pivot_table(index=['_id'],columns= ['answers.0.fieldName'],values=['answers.0.answer'],aggfunc=lambda x: ' '.join(x))
p2 = ur.pivot_table(index=['_id'],columns= ['answers.1.fieldName'],values=['answers.1.answer'],aggfunc=lambda x: ' '.join(x))
p3 = ur.pivot_table(index=['_id'],columns= ['answers.2.fieldName'],values=['answers.2.answer'],aggfunc=lambda x: ' '.join(x))
I have also tried another lambda function
p1 = ur.pivot_table(index=['_id'],columns= ['answers.0.fieldName'],values=['answers.0.answer'],aggfunc=lambda x: ' '.join(str(v) for v in x)
The furthest I have got so far is to run pivots with standard mean aggfunc
p1 = ur.pivot_table(index=['_id'],columns=['answers.0.fieldName'],values=['answers.0.answer'])
ps = [p1,p2,p3]
c = pd.concat(ps)
Then attempting to remove merge rows and columns
df = c.sum(axis=1, level=1, skipna=False)
g = df.groupby('_id').agg(np.sum)
This returns a dataframe with the right shape
However, it looses the values in the object columns, and I'm not sure how accurate all the numeric columns are
To overcome this problem, I was considering converting as much data as I can into numeric
c4 = c.apply(pd.to_numeric, errors='ignore').info()
Then splitting the combined pivot table dataframe into numeric and object type
nu = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
cndf = c4.select_dtypes(include=nu)
o = ['object', 'bool', 'datetime64', 'category']
codf = c4.select_dtypes(include=o)
And running through the same .sum and groupby operations as above on the numeric dataframe
n1 = cndf.sum(axis=1, level=1, skipna=False)
n2 = n1.groupby('_id').agg(np.sum)
However, this still leaves the challenge of dealing with the object columns