Matching word in column list in pandas and assign score

Question

I have the following two datasets - a dataset with text:

text = {'Text':[['Nike', 'invests', 'in', 'shoes'], ['Adidas', 'invests', 'in',  't-shirts']]}
text_df = pd.DataFrame(text)
text_df

and a dataset with words and respective scores and topics.

points = {'Text':['invests', 'shoes', 'Adidas'], 'Score':[1, 2, 1], 'Topic':['not_name', 'not_name', 'name' ] }
points_df = pd.DataFrame(points)
points_df

For each row in the text dataset I would like to see if the word exists and, if the word is there, create a column named after the category and create a new list with the score for the relevant word. In case the word is not there, assign a zero.

This is the outcome

text_results = {'Text':[['Nike', 'invests', 'in', 'shoes'], ['Adidas', 'invests', 'in',  't-shirts']], 'not_name': [[0, 1, 0, 2], [0, 1, 0, 0]], 'name': [[0, 0, 0, 0], [1, 0, 0, 0]]}
results_df = pd.DataFrame(text_results)
results_df

Any suggestions? I am a bit lost at sea!

jezrael · Accepted Answer · 2020-06-04T07:21:43.770

1

First are values in points_df pivoting by DataFrame.pivot_table, replaced missing values and created dictionary by DataFrame.to_dict:

df1 = points_df.pivot_table(index='Text',
                            columns='Topic',
                            values='Score', 
                            fill_value=0, 
                            aggfunc='sum')
d = df1.to_dict('index')
print (d)
{'Adidas': {'name': 1, 'not_name': 0}, 
 'invests': {'name': 0, 'not_name': 1}, 
 'shoes': {'name': 0, 'not_name': 2}}

From columns names is created dictionary filled by 0 values used for non exist values:

missd = dict.fromkeys(df1.columns, 0)
print (missd)
{'name': 0, 'not_name': 0}

Then for each value of list in text_df['Text'] are mapped values by dict.get, so if no match is possible use default misssing values dictionary:

L = [[d.get(y, missd) for y in x] for x in text_df['Text']]

Then are change format from list of dicts to dict of lists in list comprehension by this solution:

L = [{k: [dic[k] for dic in x] for k in x[0]} for x in L]
print (L)
[{'name': [0, 0, 0, 0], 'not_name': [0, 1, 0, 2]}, 
 {'name': [1, 0, 0, 0], 'not_name': [0, 1, 0, 0]}]

Last is created DataFrame and added to text_df:

df = text_df.join(pd.DataFrame(L, index=text_df.index))
print (df)
                              Text          name      not_name
0       [Nike, invests, in, shoes]  [0, 0, 0, 0]  [0, 1, 0, 2]
1  [Adidas, invests, in, t-shirts]  [1, 0, 0, 0]  [0, 1, 0, 0]

edited Jun 04 '20 at 07:21

answered Jun 04 '20 at 05:46

jezrael

822,522
95
1,334
1,252

hi there, thanks for this! I get an error right away when I pivot the data in the first line "ValueError: Index contains duplicate entries, cannot reshape". If it can be helpful, this is the actual point_df data https://docs.google.com/spreadsheets/d/e/2PACX-1vQlJ1VW3IbIpfuSCuFC4_6xlP_dbVS9DtCPFoP4yC3-B4t7NEycaED0dtzVqfA5h38fBgmKQwvM9MU-/pubhtml – Filippo Sebastio Jun 04 '20 at 07:18
@FilippoSebastio - Problem is with duplicates, here use `pivot_table` instead `pivot` and I hope solution working well. – jezrael Jun 04 '20 at 07:22
thanks! when running the following I encounter the error: L = [[d.get(y, missd) for y in x] for x in train_data.tokenized_sentences] TypeError: unhashable type: 'list' – Filippo Sebastio Jun 04 '20 at 07:25
@FilippoSebastio - hmmm, it means there is some problem in data. Is possible share both DataFrames/files? – jezrael Jun 04 '20 at 07:27
tained_data.tokenized_sentence is text_df['Text'] and is the tokenized sentence result (a ist) – Filippo Sebastio Jun 04 '20 at 07:27
@FilippoSebastio - Is possible share real data `text_df.to_pickle('text_df.pkl')` and `points_df.to_pickle('points_df.pkl')` if not confidental data? Via wetransform, dropbox, or sent to my email. – jezrael Jun 04 '20 at 07:29
https://docs.google.com/spreadsheets/d/e/2PACX-1vQlJ1VW3IbIpfuSCuFC4_6xlP_dbVS9DtCPFoP4yC3-B4t7NEycaED0dtzVqfA5h38fBgmKQwvM9MU-/pubhtml not confidential! – Filippo Sebastio Jun 04 '20 at 07:29
@FilippoSebastio - I think pickles, not csv. Because it seems some proble with lists, so the best pickle insted csv files. – jezrael Jun 04 '20 at 07:31
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/215281/discussion-between-filippo-sebastio-and-jezrael). – Filippo Sebastio Jun 04 '20 at 07:32
1

it worked! sorry, it was my error, your code is fine. thanks again for your help! – Filippo Sebastio Jun 04 '20 at 07:43

Ch3steR · Answer 2 · 2020-06-04T06:51:44.200

1

Another solution using df.reindex

Create a custom function. First, set 'Text' as index using df.set_index, then using df.reindex them. Now using df.where extract 'Score' column where 'Topic' is not_name and name, convert them to either list or NumPy array pd.Series.tolist or pd.Series.to_numpy() Then using df.join join them.

points_df.set_index('Text',inplace=True)
def func(x):
    x = points_df.reindex(x)
    m = x['Score'].where(x['Topic']=='not_name',0).to_numpy()
    n = x['Score'].where(x['Topic']=='name',0).to_numpy()
    return pd.Series([n,m],index=['name','not_name'])

t = text_df['Text'].apply(func)

text_df.join(t) # or df.merge(t,left_index=True,right_index=True)
                              Text                  name              not_name
0       [Nike, invests, in, shoes]  [0.0, 0.0, 0.0, 0.0]  [0.0, 1.0, 0.0, 2.0]
1  [Adidas, invests, in, t-shirts]  [1.0, 0.0, 0.0, 0.0]  [0.0, 1.0, 0.0, 0.0]

edited Jun 04 '20 at 06:51

answered Jun 04 '20 at 06:05

Ch3steR

20,090
4
28
58

thanks for this! it returns me an error when I apply the function to text_df: ValueError: all arrays must be same length, perhaps because the series returned have a different length? – Filippo Sebastio Jun 04 '20 at 07:00
Can you post the data for which this failed? @FilippoSebastio – Ch3steR Jun 04 '20 at 07:07
https://docs.google.com/spreadsheets/d/e/2PACX-1vQlJ1VW3IbIpfuSCuFC4_6xlP_dbVS9DtCPFoP4yC3-B4t7NEycaED0dtzVqfA5h38fBgmKQwvM9MU-/pubhtml here, thanks! – Filippo Sebastio Jun 04 '20 at 07:15

Henry Yik · Answer 3 · 2020-06-04T06:33:20.513

Just another way using explode and merge:

s =  text_df.explode("Text").reset_index().merge(points_df, on="Text", how="left").set_index("index").fillna(0)

print (s.assign(Score=np.where(s["Topic"].eq("name"),0,s["Score"]))
        .replace({"Topic":{"not_name":0, "name":1}})
        .rename(columns={"Score":"not_name","Topic":"name"})
        .groupby(level=0).agg(list))

                                  Text              not_name          name
index                                                                     
0           [Nike, invests, in, shoes]  [0.0, 1.0, 0.0, 2.0]  [0, 0, 0, 0]
1      [Adidas, invests, in, t-shirts]  [0.0, 1.0, 0.0, 0.0]  [1, 0, 0, 0]

hi there, thanks for this, but it returns an errors when running the first line, "TypeError: unhashable type: 'list' " — Filippo Sebastio, Jun 04 '20 at 07:12

score 0 · Answer 4 · answered Jun 04 '20 at 06:02

First it would be better to index the points_df using the Text column

points_df.set_index('Text', inplace=True)

Next we create the result res dataframe by copying text_df and creating separate columns for all the topics

res = text_df.copy()
for category in list(points_df['Topic'].unique()):
    res[category] = res['Text']

for i in range(len(res)):
    for j in res.columns[1:]:
        res.at[i, j] = [0] * len(res.loc[i,'Text'])

The below logic is to change the values in the list as per your needs


for i in range(len(res)):
    l = res.loc[i]

    for i,word in enumerate(l['Text']):
        if word in list(points_df.index):
            cat = points_df.loc[word]['Topic']
            l[cat][i] = points_df.loc[word, 'Score']

Finally the res dataframe is as below :

    Text    not_name    name
0   [Nike, invests, in, shoes]  [0, 1, 0, 2]    [0, 0, 0, 0]
1   [Adidas, invests, in, t-shirts] [0, 1, 0, 0]    [1, 0, 0, 0]

Matching word in column list in pandas and assign score

4 Answers4