9

Consider this Pandas dataframe:

df = pd.DataFrame({
    'User ID': [1, 2, 2, 3],
    'Cupcakes': [1, 5, 4, 2],
    'Biscuits': [2, 5, 3, 3],
    'Score': [0.65, 0.12, 0.15, 0.9]
})

ie.

   User ID  Cupcakes  Biscuits  Score
0        1         1         2   0.65
1        2         5         5   0.12
2        2         4         3   0.15
3        3         2         3   0.90

I want to tidy ("melt") this data so that the dessert type are separate observations. But I also want to keep the score for each user.

Using melt() directly doesn't work:

df.melt(
    id_vars=['User ID'],
    value_vars=['Cupcakes', 'Biscuits'],
    var_name='Dessert', value_name='Enjoyment'
)

...gives:

   User ID   Dessert  Enjoyment
0        1  Cupcakes          1
1        2  Cupcakes          5
2        2  Cupcakes          4
3        3  Cupcakes          2
4        1  Biscuits          2
5        2  Biscuits          5
6        2  Biscuits          3
7        3  Biscuits          3

I've lost the score data!

I can't use wide_to_long() because I don't have a common "stub name" for my dessert types.

I can't join or merge the tidied data with the original data because the tidied data is reindexed and the user ID is not unique for each observation.

How do I tidy this data but retain columns that aren't involved in the tidying?

detly
  • 29,332
  • 18
  • 93
  • 152
  • I should add to this that upon further reading of that "tidy data" paper, it seems like the *truly* correct approach is to separate the score data into another table, with a new shared index (eg. "observation number") between the two tables. – detly Jul 24 '19 at 23:13

1 Answers1

11

Add column Score to id_vars in DataFrame.melt:

id_vars : tuple, list, or ndarray, optional

Column(s) to use as identifier variables.

df1 = df.melt(
    id_vars=['User ID', 'Score'],
    value_vars=['Cupcakes', 'Biscuits'],
    var_name='Dessert', value_name='Enjoyment'
)
print (df1)
   User ID  Score   Dessert  Enjoyment
0        1   0.65  Cupcakes          1
1        2   0.12  Cupcakes          5
2        2   0.15  Cupcakes          4
3        3   0.90  Cupcakes          2
4        1   0.65  Biscuits          2
5        2   0.12  Biscuits          5
6        2   0.15  Biscuits          3
7        3   0.90  Biscuits          3

If need melting all columns without User ID and Score omit value_vars:

df.melt(
    id_vars=['User ID', 'Score'],
    var_name='Dessert', value_name='Enjoyment'
)
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Huh, I missed that `id_vars` took a list. I had assumed from the name that `id_vars` should be an index-like column (ie. not a "dependent" variable), but it doesn't have to be, does it? – detly Jul 24 '19 at 05:25
  • @detly - If check docs - [`DataFrame.melt`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html) - `Column(s) to use as identifier variables.` – jezrael Jul 24 '19 at 05:27
  • Oh I read it, I just didn't know what was strictly meant by "identifier variables." – detly Jul 24 '19 at 05:28
  • 1
    @detly - hmmm, not sure about definition, but one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’. [link](https://stackoverflow.com/questions/34830597/pandas-melt-function) – jezrael Jul 24 '19 at 05:32