I want to create a DataFrame from this data:

item_features = {'A': {1, 2, 3}, 'B':{7, 2, 1}, 'C':{3, 2}, 'D':{9, 11} }
pos = {'B', 'C'}
neg = {'A'}

I want to obtain the following dataset:

   1  2  3  7  positive item_id
0  1  1  0  1         1       B
1  0  1  1  0         1       C
2  1  1  1  0         0       A

So I want the df to:

- always have its feature columns ordered numerically during creation. In this case the order is 1, 2, 3, 7, and I want to be sure I never get an order like 7, 1, 3, 2.
- contain only the item_ids that are in one of the two sets (pos or neg).
- have the 'positive' column set to 1 if the item is positive, else 0.
- use as the other column names the values in the item_features dictionary, but only for the items that are in either pos or neg.
- have a value of 1 in a column if the corresponding column name is in the item_features value for that specific item, else 0.

What is an efficient way to do that?

Salsa94

1 Answer

Use:

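import pandas as pd

item_features = {'A': {1, 2, 3}, 'B': {4, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = {'B', 'C'}
neg = {'A'}

# join the sets; convert to a list because pandas 2.0+ rejects a set as a .loc indexer
both = list(pos.union(neg))

# create a Series, filter by both, join each set into a '|'-separated string
# and create indicator columns
df = pd.Series(item_features).loc[both].map(lambda x: '|'.join(map(str, x))).str.get_dummies()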


df['item_id'] = df.index
df['positive'] = df['item_id'].isin(pos).astype(int)
df = df.reset_index(drop=True)
print(df)  # row order can vary, because sets are unordered
   1  2  3  4 item_id  positive
0  0  1  1  0       C         1
1  1  1  0  1       B         1
2  1  1  1  0       A         0

If possible, use lists instead of sets:

item_features = {'A': {1, 2, 3}, 'B':{4, 2, 1}, 'C':{3, 2}, 'D':{9, 11} }
pos = ['B', 'C']
neg = ['A']

both = pos + neg

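# create a Series, filter by both, join each set into a '|'-separated string
# and create indicator columns
df = pd.Series(item_features).loc[both].map(lambda x: '|'.join(map(str, x))).str.get_dummies()

# the column labels are strings, so sort them by their integer value
df = df.sort_index(axis=1, level=0, key=lambda x: x.astype(int))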

df['item_id'] = df.index
df['positive'] = df['item_id'].isin(pos).astype(int)
df = df.reset_index(drop=True)
print(df)
   1  2  3  4 item_id  positive
0  1  1  0  1       B         1
1  0  1  1  0       C         1
2  1  1  1  0       A         0
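
The examples above only contain single-digit features, so the effect of the `key` argument is easy to miss. Here is a minimal sketch, assuming made-up feature strings that include a two-digit value (`11`), of how `str.get_dummies` orders its string labels and how `sort_index` with an integer key changes that:

```python
import pandas as pd

# Hypothetical '|'-joined feature strings; '11' is included on purpose
features = pd.Series({'B': '11|2|1', 'C': '3|2'})

# get_dummies returns string labels in lexicographic order, so '11' lands before '2'
raw = features.str.get_dummies()
print(list(raw.columns))

# sorting the labels by their integer value restores numeric order
ordered = raw.sort_index(axis=1, key=lambda idx: idx.astype(int))
print(list(ordered.columns))
```

The labels stay strings after sorting; only their order changes.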

EDIT: A solution with better performance:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

item_features = {'A': {1, 2, 3}, 'B': {4, 2, 11}, 'C': {3, 2}, 'D': {9, 11}}
pos = ['B', 'C']
neg = ['A']

both = pos + neg

mlb = MultiLabelBinarizer()

# keep only the items that are in pos or neg
d = {k: item_features[k] for k in both}
# mlb.classes_ holds the sorted feature values, so the columns come out in numeric order
df = pd.DataFrame(mlb.fit_transform(d.values()), columns=mlb.classes_)

df['item_id'] = list(d)
df['positive'] = df['item_id'].isin(pos).astype(int)

print(df)
   1  2  3  4  11 item_id  positive
0  0  1  0  1   1       B         1
1  0  1  1  0   0       C         1
2  1  1  1  0   0       A         0
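
If `pos` and `neg` have to stay sets, the same table can also be built with plain Python and pandas only. The following is a minimal sketch, not a benchmarked solution: the helper name `build_indicator_frame` is hypothetical, and it assumes every id in `pos` and `neg` is a key of `item_features`. It only touches the selected items, so the rest of a large `item_features` dict is never converted.

```python
import pandas as pd

def build_indicator_frame(item_features, pos, neg):
    """Hypothetical helper: one row per item in pos or neg, one 0/1 column per feature."""
    # positives first, then negatives; sorted() makes the row order deterministic
    items = sorted(pos) + sorted(neg - pos)
    # collect features from the selected items only, in ascending order
    columns = sorted(set().union(*(item_features[i] for i in items)))
    # 1 if the feature belongs to the item, else 0
    rows = [[int(c in item_features[i]) for c in columns] for i in items]
    df = pd.DataFrame(rows, columns=columns)
    df['positive'] = [int(i in pos) for i in items]
    df['item_id'] = items
    return df

# data from the question
item_features = {'A': {1, 2, 3}, 'B': {7, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = {'B', 'C'}
neg = {'A'}

df = build_indicator_frame(item_features, pos, neg)
print(df)
```

The nested list comprehension does work proportional to items times features, so for very wide data a sparse representation may be a better fit.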
jezrael
  • Thank you so much! Just one more thing: is there a way to have the df columns always ordered by their number during the creation process? In this case it is 1, 2, 3, 4, and I want to be sure that I never get an order like 4, 1, 3, 2. @jezrael – Salsa94 May 26 '22 at 09:44
  • @Salsa94 - It is a set, so there is no defined order. So no. – jezrael May 26 '22 at 09:45
  • @Salsa94 - https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set – jezrael May 26 '22 at 09:45
  • Something like converting the set into a list during the join function? – Salsa94 May 26 '22 at 09:46
  • @Salsa94 - Is it possible to use `list`s instead of `set`s for `pos` and `neg`? – jezrael May 26 '22 at 09:53
  • The input data must stay as you saw it, for various reasons, but I can convert it before creating the df. It is not good to convert the whole item_features dict, though, because it has tons of entries that will not be used. For the two sets pos and neg I can convert the sets into a list like this: all_items=list() all_items.extend(list(pos)) all_items.extend(list(neg)) – Salsa94 May 26 '22 at 09:56
  • @Salsa94 - Added a solution to the answer. The main difference: if you need the same ordered output, avoid duplicates in the lists, like `pos = ['B', 'C', 'B']` – jezrael May 26 '22 at 09:57
  • @Salsa94 - Can you add `df = df.sort_index(axis=1, level=0, key=lambda x: x.astype(int))` like in my second solution? – jezrael May 26 '22 at 10:05
  • It works, but do you think it is faster than df.reindex(natsorted(df.columns), axis=1)? – Salsa94 May 26 '22 at 10:15
  • @Salsa94 - What is the size of `item_features` in the real data? – jezrael May 26 '22 at 10:16
  • Around 1,000,000 entries. @jezrael – Salsa94 May 26 '22 at 10:19
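
The thread leaves the speed question open. One way to check it on your own data is a small `timeit` harness like the sketch below. It makes no claim about which variant wins. The frame is made up (200 string-labelled columns), and the built-in `sorted(..., key=int)` stands in for `natsorted` to avoid the third-party `natsort` dependency, which is an assumption that only holds when every label is a plain integer string.

```python
import timeit
import pandas as pd

# Hypothetical frame: string labels '0'..'199', initially in lexicographic order
labels = sorted(str(n) for n in range(200))
df = pd.DataFrame([[0] * len(labels)], columns=labels)

def with_sort_index(frame):
    # variant from the answer: sort the column labels by their integer value
    return frame.sort_index(axis=1, key=lambda idx: idx.astype(int))

def with_reindex(frame):
    # variant from the comment, with sorted(key=int) standing in for natsorted
    return frame.reindex(sorted(frame.columns, key=int), axis=1)

# both variants must agree before timing them
assert list(with_sort_index(df).columns) == list(with_reindex(df).columns)

t_sort = timeit.timeit(lambda: with_sort_index(df), number=100)
t_reindex = timeit.timeit(lambda: with_reindex(df), number=100)
print(f"sort_index: {t_sort:.4f}s  reindex: {t_reindex:.4f}s")
```

Timings depend on the pandas version and on the number of rows and columns, so it is worth rerunning this with a frame shaped like the real data.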