
I have the following data:

Rank    Platforms        Technology

high    Windows||Linux   Unity
high    Linux             
low     Windows          Unreal 
low     Linux||MacOs     GameMakerStudio||Unity||Unreal
low                      GameMakerStudio

Both Platforms and Technology are categorical variables. The issue is that each cell can hold a single value, be empty, or hold multiple values separated by ||, such as GameMakerStudio||Unity||Unreal. I am building a logistic regression model to predict Rank.

I am trying to encode these variables for my model, but I have not found a solution for list-type categorical values. I have read this page, Encoding Categorical Variables, and found that one-hot encoding is the closest match, but it still does not address my issue.

I could, of course, encode it manually. For example, the Platforms column has around 7 distinct values; if Platforms = Windows||Linux, I could set two columns, is_windows = true and is_linux = true. But the Technology column has 21 distinct values.

Is there a way to encode it automatically?

martineau
hydradon
  • Try looking at this answer https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows, then once you have that done the one-hot should be easier. – mgrollins Nov 18 '19 at 23:41
  • What is the problem with the 21 distinct values? – Dani Mesejo Nov 18 '19 at 23:44
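The split/explode route suggested in the first comment can be sketched as follows. This is a minimal illustration, not code from the linked answer: the inline DataFrame and the variable names are assumptions, and only the Platforms column is shown. The separator is escaped because pandas treats a multi-character split pattern as a regular expression.

```python
import pandas as pd

# Sample data copied from the question (Platforms column only).
df = pd.DataFrame(
    {"Platforms": ["Windows||Linux", "Linux", "Windows", "Linux||MacOs", None]}
)

# Split each cell on the literal "||" and give every value its own row.
# The original row index is repeated for rows that held several values.
exploded = df["Platforms"].str.split(r"\|\|").explode()

# One-hot encode the single values, then collapse back to one row per
# original index; max() keeps a 1 wherever any of the values matched.
platform_dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

print(platform_dummies)
```

Rows whose cell was empty stay in the result as all-zero rows, because `pd.get_dummies` does not create an indicator for missing values by default.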

1 Answer


You never mention pandas in your question, but I'll guess that's what you're using. If so, the page you linked has the answer: get_dummies, applied here through the .str accessor.

In [17]: c = pandas.read_csv("/tmp/asdf.txt", header=0)

In [18]: c
Out[18]: 
   Rank       Platforms                      Technology
0  high  Windows||Linux                           Unity
1  high           Linux                             NaN
2   low         Windows                          Unreal
3   low    Linux||MacOs  GameMakerStudio||Unity||Unreal
4   low             NaN                 GameMakerStudio

In [19]: c.Platforms.str.get_dummies()
Out[19]: 
   Linux  MacOs  Windows
0      1      0        1
1      1      0        0
2      0      0        1
3      1      1        0
4      0      0        0

In [20]: pandas.concat([c, c.Platforms.str.get_dummies(), c.Technology.str.get_dummies()], axis=1)
Out[20]: 
   Rank       Platforms                      Technology  Linux  MacOs  Windows  GameMakerStudio  Unity  Unreal
0  high  Windows||Linux                           Unity      1      0        1                0      1       0
1  high           Linux                             NaN      1      0        0                0      0       0
2   low         Windows                          Unreal      0      0        1                0      0       1
3   low    Linux||MacOs  GameMakerStudio||Unity||Unreal      1      1        0                1      1       1
4   low             NaN                 GameMakerStudio      0      0        0                1      0       0
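
To try the session above without the CSV file, the sample data can be built inline. This is a minimal sketch: the DataFrame literal and the helper name `encode_multivalue` are illustrative and not part of the original answer. It relies on the default separator of `Series.str.get_dummies`, which is `|`; as the output above shows, the doubled `||` does not produce an extra empty-named column.

```python
import pandas as pd

# Sample data copied from the question; None marks the empty cells.
c = pd.DataFrame(
    {
        "Rank": ["high", "high", "low", "low", "low"],
        "Platforms": ["Windows||Linux", "Linux", "Windows", "Linux||MacOs", None],
        "Technology": [
            "Unity",
            None,
            "Unreal",
            "GameMakerStudio||Unity||Unreal",
            "GameMakerStudio",
        ],
    }
)


def encode_multivalue(df, columns):
    """Hypothetical helper: swap each listed column for its 0/1 indicator columns."""
    dummies = [df[col].str.get_dummies() for col in columns]
    return pd.concat([df.drop(columns=columns)] + dummies, axis=1)


encoded = encode_multivalue(c, ["Platforms", "Technology"])
print(encoded)
```

Because the indicator columns are discovered from the data, the same call should cover the 21 distinct Technology values without listing them by hand.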
caxcaxcoatl
  • Is there any way to prefix the new columns' names, like `platform_Windows`, `platform_Linux`, etc. and `technology_GameMakerStudio`, `technology_Unity`...? – hydradon Nov 19 '19 at 14:19
  • I remember seeing something like that, but I can't find it in the documentation right now. Perhaps it is a different method. That might be worth a new question in its own right. Meanwhile, if you do not mind using a multi-index, you can pass a dict as the first argument: `pd.concat({"orig": c, "plat": c.Platforms.str.get_dummies(), "tech": c.Technology.str.get_dummies()}, axis=1)`. The keys will be the column names for the first level of the multi-index – caxcaxcoatl Nov 20 '19 at 02:04
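
The two naming options from this comment thread can be sketched side by side. This is an illustration under assumptions: the inline DataFrame stands in for the CSV file, and `DataFrame.add_prefix` is one possible way to get flat prefixed names, not necessarily the method the commenter had in mind.

```python
import pandas as pd

# Sample data copied from the question; None marks the empty cells.
c = pd.DataFrame(
    {
        "Rank": ["high", "high", "low", "low", "low"],
        "Platforms": ["Windows||Linux", "Linux", "Windows", "Linux||MacOs", None],
        "Technology": [
            "Unity",
            None,
            "Unreal",
            "GameMakerStudio||Unity||Unreal",
            "GameMakerStudio",
        ],
    }
)

plat = c["Platforms"].str.get_dummies()
tech = c["Technology"].str.get_dummies()

# Option 1: flat column names such as platform_Windows and technology_Unity.
flat = pd.concat(
    [c, plat.add_prefix("platform_"), tech.add_prefix("technology_")], axis=1
)

# Option 2: the dict form from the comment; the keys become the first level
# of a column multi-index, e.g. ("plat", "Windows").
multi = pd.concat({"orig": c, "plat": plat, "tech": tech}, axis=1)

print(flat.columns.tolist())
print(multi.columns.tolist())
```

The flat form is usually easier to pass to a model, while the multi-index form keeps each group of indicator columns addressable as a block, for example `multi["plat"]`.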