
This is a follow-up to a question I asked here, so I have decided to post it as a separate question.

Is there a way to add a relevance value beside each matched list name in the column matched_list_names? The relevance formula would be (number of matched words from the list / total number of words in that list) * 100, so that I can tell which list name is most relevant. For the first row, the relevance for politics would be (1/3)*100 ≈ 30%, i.e. 1 word matched out of the 3 words in the politics list. Likewise, for sports it would be (1/3)*100 ≈ 30%. The miscellaneous value is 100 minus the sum of the other values, i.e. 100 - (30 + 30) = 40. So the output would look like this:

    word_list                                   matched_list_names
    ['nuclear','election','usa','baseball']     politics 30, sports 30, miscellaneous 40
    ['football','united','thriller']            sports 30, movies 30, miscellaneous 40
    ['marvels','spiderman','hockey']            movies 60, sports 30
    ...                                         ...
Learner

1 Answer


Use:

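from collections import Counter

# df is assumed to be the DataFrame from the question, with a 'word_list' column
movies = ['spiderman', 'marvels', 'thriller']
sports = ['baseball', 'hockey', 'football']
politics = ['election', 'china', 'usa']
d = {'movies': movies, 'sports': sports, 'politics': politics}
# Invert the mapping: word -> name of the list it belongs to
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

def f(x):
    # Count the words of the row per list name; unmatched words count as 'miscellaneous'
    a = Counter([d1.get(y, 'miscellaneous') for y in x])
    # Share of each list name among the words of the row, as a percentage
    return ', '.join(['{} {}'.format(k, v / sum(a.values()) * 100) for k, v in a.items()])

df['matched_list_names'] = df['word_list'].apply(f)
print(df)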
                            word_list  \
0  [nuclear, election, usa, baseball]   
1        [football, united, thriller]   
2     [marvels, hollywood, spiderman]   

                                  matched_list_names  
0     miscellaneous 25.0, politics 50.0, sports 25.0  
1  sports 33.33333333333333, miscellaneous 33.333...  
2  movies 66.66666666666666, miscellaneous 33.333...  
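
Note that the percentages above are shares of the words in each row, whereas the question asks for matched words divided by the size of the matched list. A minimal sketch of that variant could look like the following. The names `relevance_scores` and `format_scores` are hypothetical, and treating `miscellaneous` as the non-negative remainder of 100 is an assumption taken from the question's description:

```python
# Category lists copied from the code above.
categories = {
    'movies': ['spiderman', 'marvels', 'thriller'],
    'sports': ['baseball', 'hockey', 'football'],
    'politics': ['election', 'china', 'usa'],
}

def relevance_scores(words, categories):
    """Hypothetical helper: percentage of each list that is matched by `words`."""
    unique_words = set(words)
    scores = {}
    for name, members in categories.items():
        # Number of distinct words of the row that belong to this list
        matched = len(unique_words & set(members))
        if matched:
            scores[name] = round(matched / len(members) * 100, 2)
    # Assumption: 'miscellaneous' is whatever is left of 100, and is dropped if not positive
    remainder = round(100 - sum(scores.values()), 2)
    if remainder > 0:
        scores['miscellaneous'] = remainder
    return scores

def format_scores(scores):
    """Join the scores into the 'name value, name value' string format."""
    return ', '.join('{} {}'.format(name, value) for name, value in scores.items())

print(format_scores(relevance_scores(['football', 'united', 'thriller'], categories)))
```

With pandas this could be applied as `df['word_list'].apply(lambda x: format_scores(relevance_scores(x, categories)))`. Be aware that with this formula the values of one row can add up to more than 100 when several lists are matched heavily, in which case the sketch simply omits `miscellaneous`.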
jezrael
  • I have multiple list categories, so adding a new column for each list name would create a memory issue. – Learner Jul 30 '18 at 08:00
  • @Mavrick - Sorry, that was a really bad bug: `*100` was placed after the last `)`, so the string was being repeated 100 times. It is now corrected. – jezrael Jul 30 '18 at 08:52
  • @Mavrick - Do you mean something like `return ', '.join(['{} {}'.format(k, round(v / sum(a.values())* 100, 2)) for k, v in a.items()])`, i.e. rounding to 2 decimal places? – jezrael Jul 30 '18 at 10:36
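
For reference, the rounding suggested in the last comment can be tried without pandas. This is only a sketch: the name `f_rounded` is made up here, and the lists are the ones from the answer.

```python
from collections import Counter

d = {
    'movies': ['spiderman', 'marvels', 'thriller'],
    'sports': ['baseball', 'hockey', 'football'],
    'politics': ['election', 'china', 'usa'],
}
# Invert the mapping: word -> name of the list it belongs to
d1 = {word: name for name, words in d.items() for word in words}

def f_rounded(x):
    # Count the words of the row per list name; unmatched words count as 'miscellaneous'
    a = Counter(d1.get(y, 'miscellaneous') for y in x)
    total = sum(a.values())
    # Percentage share per list name, rounded to 2 decimal places
    return ', '.join('{} {}'.format(k, round(v / total * 100, 2)) for k, v in a.items())

print(f_rounded(['football', 'united', 'thriller']))
# sports 33.33, miscellaneous 33.33, movies 33.33
```

Because each value is rounded separately, the rounded percentages of a row may not add up to exactly 100.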