1

Ive been tasked with cleaning data from a mobile application designed by a charity

In one section, a users Q/A app use session is represented by a row. This section consists of repeated question answer field pairs, where a field represents the question asked and then the field next to it represents the corresponding answer. Together, each question/field and answer column pair represent a unique question with the answer.

Starting data

    answers.0.answer    answers.0.fieldName answers.1.answer    answers.1.fieldName
5   0                    avoidexercise             0.0            vomitflag
6   156                  height                    54.0         weight
7   1                    affectedkneeside           3.0       painlocationknee

I have been asked to reformat the section so each question forms a column, the corresponding answer a field in that column

The ideal output

_id                    avoidexercise    enjoyment   fatigue2weeks   height
        

5f27f29c362a380d3f9a9e46    1.0           yes            20.0       120.0
5f27f2ac362a380d3f9a9e4b    0.0           no             40.0       180.0
5f27f4d4362a380d3f9a9e52    1.0           yes            50.0       150.0

My plan is to create many pivot tables, from each other Q/A pairs of columns, then concaternate (outer join) then inner join to remove duplications

However, the original dataframe contains a mixture of numeric and object datatypes

Therefore, only some question/answer column pairs appear to be converting to pivot tables. I have tried using various aggregate functions

p1 = ur.pivot_table(index=['_id'],columns= ['answers.0.fieldName'],values=['answers.0.answer'],aggfunc=lambda x: ' '.join(x))
p2 = ur.pivot_table(index=['_id'],columns= ['answers.1.fieldName'],values=['answers.1.answer'],aggfunc=lambda x: ' '.join(x))
p3 = ur.pivot_table(index=['_id'],columns= ['answers.2.fieldName'],values=['answers.2.answer'],aggfunc=lambda x: ' '.join(x))
I have also tried another lambda function

p1 = ur.pivot_table(index=['_id'],columns= ['answers.0.fieldName'],values=['answers.0.answer'],aggfunc=lambda x: ' '.join(str(v) for v in x)
The furthest I have got so far is to run pivots with standard mean aggfunc

p1 = ur.pivot_table(index=['_id'],columns=['answers.0.fieldName'],values=['answers.0.answer'])
ps = [p1,p2,p3]
c = pd.concat(ps)

Then attempting to remove merge rows and columns

df = c.sum(axis=1, level=1, skipna=False)

g = df.groupby('_id').agg(np.sum)

This returns a dataframe with the right shape

However, it looses the values in the object columns, and I'm not sure how accurate all the numeric columns are

To overcome this problem, I was considering converting as much data as I can into numeric

c4 = c.apply(pd.to_numeric, errors='ignore').info()

Then splitting the combined pivot table dataframe into numeric and object type

nu = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
cndf = c4.select_dtypes(include=nu)
o = ['object', 'bool', 'datetime64', 'category']
codf = c4.select_dtypes(include=o)

And running through the same .sum and groupby operations as above on the numeric dataframe

n1 = cndf.sum(axis=1, level=1, skipna=False)
n2 = n1.groupby('_id').agg(np.sum)

However, this still leaves the challenge of dealing with the object columns

Riuh S
  • 29
  • 6

1 Answers1

1

Here's an approach to obtain your desired result:

First define the number of question-answer pairs you have:

num_answers = 2 #Following your 'Starting data' in the question

Then use the following couple of lines to obtain a dataframe as required:

import pandas as pd
df2 = pd.concat([pd.pivot_table(df1, index=['_id'], columns= ['answers.{}.fieldName'.format(i)], values=['answers.{}.answer'.format(i)]) for i in range(num_answers)], axis = 1).fillna('N/A')
df2.columns = [col[1] for col in df2.columns]

Here df1 is assumed to be your dataframe with the starting data.

As you might have noticed, 'N/A' is present in cells where the particular id has no recorded answer for that particular field.

Assuming an ID of [1,2,3] for the three rows respectively, the output df2 for your 'Starting data' would look like this:

      affectedkneeside  avoidexercise   height  painlocationknee    vomitflag   weight
_id                     
0          N/A                0           N/A         N/A              0         N/A
1          N/A               N/A          156         N/A              N/A       54
2           1                N/A          N/A          3               N/A       N/A
Ishwar Venugopal
  • 872
  • 6
  • 17
  • Hey Ishwar, I really like this However, unfortunately the original dataframe had mixed data, so I ham getting a DataError: No numeric types to aggregate when I attempt to contacternate all 68 pairs – Riuh S Feb 16 '21 at 16:38
  • Also I love your code, could I pay you a quid for every solution you have? – Riuh S Feb 16 '21 at 16:38
  • How about we use apply somehow ur.iloc[:,5:136:2].apply(pd.Series.value_counts, axis=1) This just tells us weather or not a user has answered a question Would it be possible to use a different method to .valuecounts() – Riuh S Feb 16 '21 at 16:41
  • This is an open forum, so you dont need to pay :) ! You can upvote and accept answers if it helps you, that would do :) – Ishwar Venugopal Feb 16 '21 at 16:47
  • I am not sure if the .iloc approach would work! – Ishwar Venugopal Feb 16 '21 at 16:48
  • Well, I certain respect and appreciate your genius Any ideas on how we could merge the object columns together – Riuh S Feb 16 '21 at 18:09