How do you One Hot Encode columns with a list of strings as values?

Question

I'm basically trying to one hot encode a column with values like this:

  tickers
1 [DIS]
2 [AAPL,AMZN,BABA,BAY]
3 [MCDO,PEP]
4 [ABT,ADBE,AMGN,CVS]
5 [ABT,CVS,DIS,ECL,EMR,FAST,GE,GOOGL]
...

First I collected the set of all tickers (about 467 of them):

all_tickers = list()
for tickers in df.tickers:
    for ticker in tickers:
        all_tickers.append(ticker)
all_tickers = set(all_tickers)

Then I implemented One Hot Encoding this way:

for i in range(len(df.index)):
    for ticker in all_tickers:
        if ticker in df.iloc[i]['tickers']:
            df.at[i+1, ticker] = 1
        else:
            df.at[i+1, ticker] = 0

The problem is that the script runs incredibly slowly when processing 5000+ rows. How can I improve my algorithm?

score 13 · Accepted Answer · answered Dec 13 '17 at 06:31

I think you need str.join with str.get_dummies:

df = df['tickers'].str.join('|').str.get_dummies()

Or:

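import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# fit_transform returns a 0/1 array with one column per distinct ticker;
# mlb.classes_ holds the sorted ticker names used as column labels
df = pd.DataFrame(mlb.fit_transform(df['tickers']), columns=mlb.classes_, index=df.index)
print(df)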
   AAPL  ABT  ADBE  AMGN  AMZN  BABA  BAY  CVS  DIS  ECL  EMR  FAST  GE  \
1     0    0     0     0     0     0    0    0    1    0    0     0   0   
2     1    0     0     0     1     1    1    0    0    0    0     0   0   
3     0    0     0     0     0     0    0    0    0    0    0     0   0   
4     0    1     1     1     0     0    0    1    0    0    0     0   0   
5     0    1     0     0     0     0    0    1    1    1    1     1   1   

   GOOGL  MCDO  PEP  
1      0     0    0  
2      0     0    0  
3      0     1    1  
4      0     0    0  
5      1     0    0
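
For reference, here is a minimal self-contained sketch of the first option (str.join with str.get_dummies). The three-row sample frame and the names joined, encoded and result are only for illustration, and the final join back onto the original frame is an optional extra that is not part of the answer above. It assumes no ticker contains the '|' character, since that is the separator being split on.

```python
import pandas as pd

# Small illustrative frame; the index starts at 1 to mirror the question.
df = pd.DataFrame(
    {"tickers": [["DIS"], ["AAPL", "AMZN", "BABA", "BAY"], ["MCDO", "PEP"]]},
    index=[1, 2, 3],
)

# Join each list into one '|'-delimited string, e.g. "AAPL|AMZN|BABA|BAY".
joined = df["tickers"].str.join("|")

# str.get_dummies splits on '|' (its default separator) and builds
# one 0/1 column per distinct token, with columns in sorted order.
encoded = joined.str.get_dummies()

# Optional: keep the original column next to the indicator columns.
result = df.join(encoded)
print(result)
```

Unlike the row-by-row loop in the question, this does the work in a couple of vectorized calls rather than one df.at assignment per cell.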

It works! That's incredibly faster than my code too! Thanks jezrael! — Castle, Dec 13 '17 at 06:36

score 3 · Answer 2 · answered Dec 13 '17 at 06:35

You can use apply(pd.Series) and then get_dummies():

import pandas as pd

df = pd.DataFrame({"tickers":[["DIS"], ["AAPL","AMZN","BABA","BAY"],
                              ["MCDO","PEP"], ["ABT","ADBE","AMGN","CVS"],
                              ["ABT","CVS","DIS","ECL","EMR","FAST","GE","GOOGL"]]})

pd.get_dummies(df.tickers.apply(pd.Series), prefix="", prefix_sep="")

   AAPL  ABT  DIS  MCDO  ADBE  AMZN  CVS  PEP  AMGN  BABA  DIS  BAY  CVS  ECL  \
0     0    0    1     0     0     0    0    0     0     0    0    0    0    0   
1     1    0    0     0     0     1    0    0     0     1    0    1    0    0   
2     0    0    0     1     0     0    0    1     0     0    0    0    0    0   
3     0    1    0     0     1     0    0    0     1     0    0    0    1    0   
4     0    1    0     0     0     0    1    0     0     0    1    0    0    1   

   EMR  FAST  GE  GOOGL  
0    0     0   0      0  
1    0     0   0      0  
2    0     0   0      0  
3    0     0   0      0  
4    1     1   1      1

This doesn't work if the values in the lists are at different positions in the lists. For example, this fails with the dataframe `pd.DataFrame({"col":[["A","B"], ["B","A"]]})` or even just `pd.DataFrame({"col":[["A","B"], ["B"]]})`. — coderforlife, Apr 11 '22 at 03:01
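
One possible way around the position problem described in this comment is to flatten the lists before encoding, so each value gets a single column regardless of where it sits in its list. The sketch below is not from either answer: it swaps apply(pd.Series) for Series.explode (which needs pandas 0.25 or later), and one_hot_lists is a hypothetical helper name.

```python
import pandas as pd

def one_hot_lists(series):
    """One-hot encode a Series whose values are lists of strings.

    Hypothetical helper, shown as a sketch only.
    """
    # explode gives one row per list element and repeats the original index.
    exploded = series.explode()
    # One 0/1 column per distinct value, independent of list position.
    dummies = pd.get_dummies(exploded, dtype=int)
    # Collapse back to one row per original index; max keeps values at 0/1
    # even if a value is repeated within a single list.
    return dummies.groupby(level=0).max()

# The two cases from the comment above.
swapped = one_hot_lists(pd.DataFrame({"col": [["A", "B"], ["B", "A"]]})["col"])
shorter = one_hot_lists(pd.DataFrame({"col": [["A", "B"], ["B"]]})["col"])
print(swapped)
print(shorter)
```

Both cases should come out with exactly one A column and one B column, rather than a duplicated column per list position.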
