Pandas - Map - Dummy Variables - Assign value of 1

Question

I have two dataframes, x.head() looks like this:

top      mid       adc      support jungle
Irelia   Ahri      Jinx     Janna   RekSai
Gnar     Ahri      Caitlyn  Leona   Rengar
Renekton Fizz      Sivir    Annie   Rengar
Irelia   Leblanc   Sivir    Thresh  JarvanIV
Gnar     Lissandra Tristana Janna   JarvanIV

and dataframe fullmatrix.head() that I have created looks like this:

Irelia  Gnar    Renekton    Kassadin    Sion    Jax Lulu    Maokai  Rumble  Lissandra   ... XinZhao Amumu   Udyr    Ivern   Shaco   Skarner FiddleSticks    Aatrox  Volibear    MonkeyKing
0   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
4   0   0   0   0   0   0   0   0   0   0   ...

Now what I cannot figure out is how to assign a value of 1 for each name in the x dataframe to the respective column that has the same name in the fullmatrix dataframe row by row (both dataframes have the same number of rows).

throw us a bone here. simplify this sample dataset to about a 1/10 of what you have here and include your expected output (even if you have to calculate it manually). — Paul H, Dec 24 '17 at 21:32
Apologies Paul, the output should look like the second dataframe only with 1s where the name appears under the column for the respective row. Also I am still trying to figure out how to make my tables show up properly. Irelia Gnar Ahri Renekton Jinx Kassadin Janna Sion RekSai 1 0 1 0 1 0 1 0 1 — bloo, Dec 24 '17 at 22:19

Tai · Answer 1 · 2018-01-01T03:35:38.710

2

The OP wants to create a table of dummy variables from a set of data points. Each data point contains 5 attributes, and there are N unique attributes in total.

We will use a simplified dataset to demonstrate how to do it:

5 unique attributes
3 data entries

each data entry contains 3 attributes.

import pandas as pd

x = pd.DataFrame([['a', 'b', 'c'],
                  ['b', 'd', 'e'], 
                  ['e', 'b', 'a']])
fullmatrix = pd.DataFrame([[0 for _ in range(5)] for _ in range(3)], 
                          columns=['a','b','c','d','e'])
""" fullmatrix:
   a  b  c  d  e
0  0  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
"""

# each element of x_row_joined is one row of x, with its attributes joined by ","
x_row_joined = pd.Series(",".join(row) for _, row in x.iterrows())
fullmatrix = x_row_joined.str.get_dummies(sep=',')

The method is inspired by offbyone's answer. It uses pandas.Series.str.get_dummies. We first join each row of x with a chosen delimiter, then call Series.str.get_dummies with that same delimiter as sep, which generates the dummy-variable table for you. (Caution: don't pick a sep that occurs in any value of x.)
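A minimal end-to-end sketch of the same idea, on made-up toy data. The names all_columns and sep are assumptions introduced for illustration: all_columns stands in for the column list of the original fullmatrix. Since get_dummies returns only the names it has seen, in sorted order, the last step reindexes by label to restore the original column order and to add all-zero columns for names that never occur.

```python
import pandas as pd

# Toy stand-in for x: 3 rows, 3 attributes per row
x = pd.DataFrame([["a", "b", "c"],
                  ["b", "d", "e"],
                  ["e", "b", "a"]])

# Hypothetical full column list, standing in for fullmatrix.columns
all_columns = ["a", "b", "c", "d", "e", "f"]

sep = "|"
# Guard: the delimiter must not occur inside any value of x
assert not x.apply(lambda col: col.str.contains(sep, regex=False)).any().any()

# Join each row into one delimited string, then expand into 0/1 columns
joined = x.apply(sep.join, axis=1)
dummies = joined.str.get_dummies(sep=sep)

# Align to the wanted columns by label; unseen names become all-zero columns
fullmatrix = dummies.reindex(columns=all_columns, fill_value=0)
print(fullmatrix)
```

Aligning by label rather than by position means the result does not depend on the order in which get_dummies happens to emit its columns.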

edited Jan 01 '18 at 03:35

answered Dec 24 '17 at 20:49

Tai


I tried the first solution and I am getting an output that looks like the way it is supposed to be but the 1s are in the wrong places – bloo Dec 24 '17 at 22:17
Maybe your `columns` has different order? @bloo Try to check the order of your columns in fullmatrix. – Tai Dec 24 '17 at 22:24
I am trying to debug the issue through my jupyter notebook and I am going to get back with whether this works. I want to use your first solution though cause it's easy, makes sense and other people can read and understand what's going on. – bloo Dec 24 '17 at 22:57
@bloo Tried to simplify the answer. Hope this helps. Merry Christmas. – Tai Dec 25 '17 at 04:57

score 2 · Answer 2 · answered Dec 24 '17 at 21:29

Consider adding a key = 1 column, then iterating through each column of x to build a list of pivoted DataFrames, which you then merge horizontally with pd.concat. Finally, run DataFrame.update() to update the original fullmatrix with the values from pvt_df, aligned on index and column labels.

x['key'] = 1

dfs = []
for col in x.columns[:-1]:
    dfs.append(x.pivot_table(index=x.index, columns=[col], values='key').fillna(0))

pvt_df = pd.concat(dfs, axis=1).astype(int)

fullmatrix.update(pvt_df)
fullmatrix = fullmatrix.astype(int)

fullmatrix   # ONLY FOR VISIBLE COLUMNS IN ORIGINAL POST
#    Irelia  Gnar  Renekton  Kassadin  Sion  Jax  Lulu  Maokai  Rumble  Lissandra  XinZhao  Amumu  Udyr  Ivern  Shaco  Skarner  FiddleSticks  Aatrox  Volibear  MonkeyKing
# 0       1     0         0         0     0    0     0       0       0          0        0      0     0      0      0        0             0       0         0           0
# 1       0     1         0         0     0    0     0       0       0          0        0      0     0      0      0        0             0       0         0           0
# 2       0     0         1         0     0    0     0       0       0          0        0      0     0      0      0        0             0       0         0           0
# 3       1     0         0         0     0    0     0       0       0          0        0      0     0      0      0        0             0       0         0           0

I am getting ValueError: cannot reindex from a duplicate axis at the fullmatrix.update(pvt_df), is index in (index=df.index, columns=[col], values='key') supposed to be =dfs.index or something else? — bloo, Dec 24 '17 at 22:16
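
One possible explanation for the error in the comment above (an assumption, since the full data isn't shown): if the same name occurs under more than one column of x, the concatenated pvt_df ends up with repeated column labels, and update() cannot align on them. The sketch below reproduces that situation on made-up toy data and collapses the repeated labels before calling update(). To keep it short, the per-column indicator blocks are built with pd.get_dummies instead of pivot_table; the collapsing step is the part that matters.

```python
import pandas as pd

# Toy data: 'a', 'b' and 'e' each occur under more than one column
x = pd.DataFrame({"top": ["a", "b", "e"],
                  "mid": ["b", "d", "b"],
                  "adc": ["c", "e", "a"]})
fullmatrix = pd.DataFrame(0, index=x.index, columns=list("abcde"))

# One 0/1 block per column of x (get_dummies used here instead of pivot_table)
blocks = [pd.get_dummies(x[col]).astype(int) for col in x.columns]
pvt_df = pd.concat(blocks, axis=1)   # column labels now repeat

# Collapse repeated labels: a name counts as present if any block has a 1
collapsed = pvt_df.T.groupby(level=0).max().T

fullmatrix.update(collapsed)
fullmatrix = fullmatrix.astype(int)
print(fullmatrix)
```

Checking pvt_df.columns.duplicated().any() on the real data would show whether this is actually what is happening.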

Seiji Armstrong · Accepted Answer · 2017-12-24T22:54:54.653

2

I'm sure this can be improved but one advantage is that it only requires the first DataFrame, and it's conceptually nice to chain operations until you get the desired solution.

fullmatrix = (x.stack()
               .reset_index(name='names')
               .pivot(index='level_0', columns='names', values='names')
               .notna()
               .astype(int)
               .reset_index(drop=True))

Note that only the names that appear in your x DataFrame will appear as columns in fullmatrix. If you want the additional columns, you can simply perform a join.
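
A small sketch of the chain on made-up toy data, followed by one way to bring in the additional columns. Here all_names is a hypothetical list of every name that should get a column, and reindex with fill_value=0 is used in place of an explicit join; it should give the same result when the extra columns are all zeros. The sketch also assumes no name repeats within a single row, since pivot cannot reshape duplicate (row, name) pairs.

```python
import pandas as pd

# Toy stand-in for x; each row holds distinct names
x = pd.DataFrame({"top": ["a", "b", "e"],
                  "mid": ["b", "d", "c"],
                  "adc": ["c", "e", "a"]})

# Hypothetical full list of names; 'f' never occurs in x
all_names = ["a", "b", "c", "d", "e", "f"]

dummies = (x.reset_index(drop=True)          # make sure row labels are unique
            .stack()                         # one (row, column) entry per name
            .reset_index(name="names")
            .pivot(index="level_0", columns="names", values="names")
            .notna()                         # True where the name occurs in that row
            .astype(int)
            .reset_index(drop=True))

# Add the missing names as all-zero columns, in the wanted order
fullmatrix = dummies.reindex(columns=all_names, fill_value=0)
print(fullmatrix)
```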

edited Dec 24 '17 at 22:54

answered Dec 24 '17 at 22:47

Seiji Armstrong


I am getting: ValueError: Index contains duplicate entries, cannot reshape. It works if I remove index='level_0'. Testing if it had assigned the values correctly. – bloo Dec 24 '17 at 22:58
It's unclear from your post what `x` has as an index as I'm assuming you have just printed the columns.. Could you either please print what your index is, or include the `reset_index(drop=True)` first. So it would look like `x.reset_index(drop=True).stack().....` – Seiji Armstrong Dec 24 '17 at 23:09
Legend. Just quickly tested as well and it's assigning the values the way it's supposed to. Thank you for the quick reply. – bloo Dec 24 '17 at 23:20
Awesome. One last thing on the index issue... if your index also contains names that you want counted (eg if your `top` col is actually the index) you can set `drop=False` in the first `reset_index` and then it will appear in the DataFrame when you stack it and then appear in the final count Frame. – Seiji Armstrong Dec 24 '17 at 23:30
