15

I have a reference list

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']

And a dataframe

df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]})
df
    Month_List
0   [July]
1   [August]
2   [July, June]
3   [May, April, March]

I want to check which elements from reference list is present in each row, and convert into binary list

I can achieve this using apply

def convert_month_to_binary(ref,lst):
    s = pd.Series(ref)
    return s.isin(lst).astype(int).tolist()  

df['Binary_Month_List'] = df['Month_List'].apply(lambda x: convert_month_to_binary(ref, x))
df

    Month_List          Binary_Month_List
0   [July]              [0, 0, 1, 0, 0, 0, 0]
1   [August]            [0, 1, 0, 0, 0, 0, 0]
2   [July, June]        [0, 0, 1, 1, 0, 0, 0]
3   [May, April, March] [0, 0, 0, 0, 1, 1, 1]

However, using apply on large datasets is very slow and hence I am looking to use numpy vectorization. How can I improve my performance?

Extension:

I wanted to use numpy vectorization because I need to now apply another function on this list

I am trying like this, but performance is very slow. Similar results with apply

def count_one(lst):
    index = [i for i, e in enumerate(lst) if e != 0] 
    return len(index)

vfunc = np.vectorize(count_one)
df['Value'] = vfunc(df['Binary_Month_List']) 
Hardik Gupta
  • 4,700
  • 9
  • 41
  • 83
  • 1
    FYI - Vectorization works on a case by case basis. There's no magic function that works in all scenarios, if you are looking for one. – Divakar Sep 27 '19 at 15:33

4 Answers4

10

We can using explode with get_dummies, notice explode is available after 0.25

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
BENY
  • 317,841
  • 20
  • 164
  • 234
8

In pandas is better not use lists this way, but it is possible with MultiLabelBinarizer and DataFrame.reindex for added missing categories, last convert values to numpy array and then to lists if performance is important:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
                                           .str.get_dummies()
                                           .reindex(columns=ref, fill_value=0)
                                           .values
                                           .tolist())
print (df)
            Month_List      Binary_Month_List
0               [July]  [0, 0, 1, 0, 0, 0, 0]
1             [August]  [0, 1, 0, 0, 0, 0, 0]
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]

Performance is different:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
4

Here's one with NumPy tools -

def isin_lists(df_col, ref):
    a = np.concatenate(df_col)
    b = np.asarray(ref)

    sidx = b.argsort()
    c = sidx[np.searchsorted(b,a,sorter=sidx)]

    l = np.array([len(i) for i in df_col])
    r = np.repeat(np.arange(len(l)),l)

    out = np.zeros((len(l),len(b)), dtype=bool)
    out[r,c] = 1
    return out.view('i1')

Output for given sample -

In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]: 
array([[0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]: 
            Month_List      Binary_Month_List  Value
0               [July]  [0, 0, 1, 0, 0, 0, 0]      1
1             [August]  [0, 1, 0, 0, 0, 0, 0]      1
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]      2
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]      3

If you can't use the intermediate bin_ar for some reason and have only 'Binary_Month_List' header to work with -

In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)
Divakar
  • 218,885
  • 19
  • 262
  • 358
0

I am not sure if this will be faster. But count-vector can also be used in this case.

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(binary=True)

mys=([(','.join(i)) for i in df['Month_List']])
X=vect.fit_transform(mys)
col_names=vect.get_feature_names()
ndf=pd.SparseDataFrame(X, columns=col_names)
df=df.join(ndf).astype(str)
df['Binary_Month_List'] = df.iloc[:, 1:].values.tolist()
shantanuo
  • 31,689
  • 78
  • 245
  • 403