
I'm trying to categorise a list of strings called keywords based on whether they contain one or more substrings from another DataFrame.

Each substring is weighted so that, at the end, each keyword has a weight in each of four columns: loan, mortgage, accounts, and cards.

I've attempted to use join(), concat() and merge() to combine multiple DataFrames, and have also tried where(), isin() and contains().

I've come close a few times to achieving my desired goal, but have been thwarted either by errors or by scalability. I have limited experience with Python and Pandas, so there may be a completely different approach that I've not considered.

import pandas as pd

df = pd.read_csv('Inputs/keyword.csv', header=0)
df['loan'] = 0
df2 = pd.read_csv('Essentials/Groups/loans.csv', header=0)

#review the data in df
print(df.head())

output:
           keywords  search_volume  loan
0             loans         132000     0
1    funding circle          81000     0
2        government          36000     0
3  short term loans          30000     0
4           company          27000     0

In the above code I'm loading my list of keywords along with an associated search volume. The data has no loan column yet, so I've added one with the default value set to 0.
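Eventually I'll need all four category columns mentioned above (loan, mortgage, accounts, cards); I assume initialising them would look something like this untested sketch:

#sketch: start every keyword with a weight of 0 in each category
for col in ['loan', 'mortgage', 'accounts', 'cards']:
    df[col] = 0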

Below, I have another DataFrame that includes a list of terms and an associated weighting. The weights are arbitrary; for the term loan, for example, I've picked an integer of 5 that I would like added to the existing total whenever it matches.

#review the data in df2
print(df2.head())

output:
      terms  weight
0      loan       5
1     loans       5
2  personal       3
3  business       3
4       apr       4

There is a problem that I have spotted but don't know how to resolve.

My list includes both loan and loans; it is fine for me to have these duplicates on my end. However, row 3 has a value of 'short term loans' and is marked as False.

Since the terms loan and loans both appear in short term loans, I would have expected it to be marked as True. I've tried reversing the .isin() statement so that I'm searching for df2['terms'] in df['keywords'], but the result came out the same.

#review the true/false for overlaps
print(df['keywords'].isin(df2['terms']).head())

output:
0     True
1    False
2    False
3    False
4    False
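From what I can tell, .isin() tests whole-string equality rather than substrings. A substring check, which I believe is what .str.contains() gives, would presumably look something like this untested sketch:

#sketch: flag keywords that contain any of the terms as a substring
pattern = '|'.join(df2['terms'])
print(df['keywords'].str.contains(pattern).head())

which should mark both row 0 and row 3 as True.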

Lastly, once this boolean issue is fixed, I don't know how to vectorise summing into df['loan'] based on matches. I'm trying to avoid for loops because I'm expecting the keyword list to include around 100,000+ rows, and each category DataFrame may include 1,000 terms.

The desired output would look something like this:

output:
           keywords  search_volume  loan  mortgage  accounts  cards
0             loans         132000    10         0         0      0
1    funding circle          81000     0         0         0      0
2        government          36000     0         0         0      0
3  short term loans          30000    10         0         0      0
4           company          27000     0         0         0      0
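For illustration, the kind of vectorised update I have in mind (an untested sketch, one pass over all keywords per term) is:

#sketch: add each term's weight to every keyword containing it as a substring
df['loan'] = sum(
    df['keywords'].str.contains(term, regex=False).astype(int) * weight
    for term, weight in zip(df2['terms'], df2['weight'])
)

That would be roughly 1,000 vectorised passes rather than a Python-level loop over 100,000 rows.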
• How is loan suddenly 10 in your expected output? – Erfan May 26 '19 at 11:43
• @Erfan Both loan and loans would fit into 'loans' so it would be 5 + 5. – Rowan Collins May 26 '19 at 12:05
• are you looking for `df1.keywords.str.contains(r'\b{}\b'.format('|'.join(df2.terms)))` which gives `[True, False, False, True, False]`? how is the expected output based on the input you have provided? – anky May 26 '19 at 13:37
• @anky_91 that looks to be the correct output, but I'm not entirely sure what's going on in the code that you've written. The idea is that I've got a list of around `100,000 keywords` and around `1000 terms` that will then be used to categorise those keywords. For example, if the keyword is a type of loan, it would include loan related terms. Here's an example of a longer keyword that might make it more obvious why I'm weighting each term: `how to get a business start up loan with bad credit`. Words such as "how to" indicate a question, while "bad credit" and "loans" indicate a loan. – Rowan Collins May 26 '19 at 13:41
• @RowanCollins got it, but one question: how are you planning to create the new columns here? Would one column which matches the `term` and populates the value be enough for you? – anky May 26 '19 at 13:44
• @anky_91 the end output will be that some `keywords` might fall into multiple categories, such as "small business mortgage loan rates", which includes words related to both `loan` and `mortgage`. In this case, I would use whichever column had the highest `weight` to decide whether the excel output will categorise it as a `mortgage` or as a `loan`. I may even go on to create a secondary categorisation for clarity. – Rowan Collins May 26 '19 at 13:49

2 Answers


Considering df1 as:

           keywords  search_volume
0             loans         132000
1    funding circle          81000
2        government          36000
3  short term loans          30000
4           company          27000

you can use Series.str.extract() and df.assign() to do something like:

#map each term to its weight
d = df2.set_index('terms')['weight']
#build an alternation pattern from the terms
pat = r'({})'.format('|'.join(df2.terms))
#'(loan|loans|personal|business|apr)'
#extract the first matching term and look up its weight
df1 = df1.assign(**{'term_match': df1.keywords.str.extract(pat, expand=False),
                    'weight': df1.keywords.str.extract(pat, expand=False).map(d)})
print(df1)

Output

           keywords  search_volume term_match  weight
0             loans         132000       loan     5.0
1    funding circle          81000        NaN     NaN
2        government          36000        NaN     NaN
3  short term loans          30000       loan     5.0
4           company          27000        NaN     NaN

EDIT: str.extract() returns only the first match in the pattern, which is why 'loans' above comes back as term_match 'loan'. To find all matching strings, let's update df2 with df2.loc[5] = ['term', 3] just for testing, rebuild d and pat so they include the new row, and use series.str.findall():

#rebuild the weight lookup and the pattern to include the new 'term' row
d = df2.set_index('terms')['weight']
pat = r'({})'.format('|'.join(df2.terms))
#collect every matching term per keyword, then sum those terms' weights
s = df1.keywords.str.findall(pat)
df1 = df1.assign(**{'term_match': s.apply(','.join),
                    'weight': s.apply(lambda x: sum(d.get(item, 0) for item in x))})
print(df1)

           keywords  search_volume  loan term_match  weight
0             loans         132000     0       loan       5
1    funding circle          81000     0                  0
2        government          36000     0                  0
3  short term loans          30000     0  term,loan       8
4           company          27000     0                  0
• this looks to be really good. I modified the `weight` column to fall into `loan` while matching the `df2` terms. It uses the highest value at the moment rather than a sum. For example, if there was `[[loans, 5], [term, 3]]` then the phrase "short term loans" should result in 8. Using the above method, I'm almost where I need to be - but it's not the sum of all matching terms - just the weight of a singular matched term. – Rowan Collins May 26 '19 at 14:05
• @RowanCollins you can add a column `df1['sum_weights']=df1.groupby('term_match').weight.transform('sum')`. Hope this helps. – anky May 26 '19 at 14:39
• I think there's a misunderstanding. I'm looking for _all_ the terms that match, not just one. The `term_match` column lists a _singular_ term that matches. Otherwise, what you suggested here would work, but it's mapping out a single value from `terms` that it matches rather than aggregating a value from all of the `terms` that match. – Rowan Collins May 26 '19 at 14:51
• @RowanCollins I have updated my solution further. :) – anky May 26 '19 at 15:02

Here's a way to find the number of matches, which might help point you in the right direction.

First, explode the keyword strings into separate rows using an `explode_str` helper (it isn't defined in this answer; a possible implementation is sketched just below):
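A minimal sketch of such a helper, assuming a space separator and reconstructed here rather than taken from the original answer:

import numpy as np

def explode_str(df, col, sep):
    #repeat each row once per token in df[col], then replace the
    #column with the flattened tokens
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})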

df = explode_str(df, 'keywords', ' ')

     keywords  search_volume  loan
0       loans         132000     0
1     funding          81000     0
1      circle          81000     0
2  government          36000     0
3       short          30000     0
3        term          30000     0
3       loans          30000     0
4     company          27000     0

Then we use the difflib module to get the closest matches, for example loan with loans:

import difflib

#for each exploded word, find the closest matching df2 terms (e.g. 'loans' ~ 'loan')
df['loan'] = df.set_index('keywords').index.map(lambda x: difflib.get_close_matches(x, df2.set_index('terms').index))

     keywords  search_volume           loan
0       loans         132000  [loans, loan]
1     funding          81000             []
1      circle          81000             []
2  government          36000             []
3       short          30000             []
3        term          30000             []
3       loans          30000  [loans, loan]
4     company          27000             []

Then we group by the index to get our original DataFrame back and count the number of matches:

df = df.groupby(df.index).agg({'keywords':' '.join,
                               'search_volume':'last',
                               'loan':'last'})

df['count'] = df['loan'].str.len()

           keywords  search_volume           loan  count
0             loans         132000  [loans, loan]      2
1    funding circle          81000             []      0
2        government          36000             []      0
3  short term loans          30000  [loans, loan]      2
4           company          27000             []      0

All that's left is to multiply the count by the weight and you have what you want.
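Since each term carries its own weight (as the comments below point out, a flat multiplier won't work), a variation that sums the matched terms' individual weights could look like this sketch, where the `loan` column still holds the lists of difflib matches:

#sketch: sum each matched term's own weight instead of count * flat weight
weights = df2.set_index('terms')['weight']
df['loan'] = df['loan'].apply(lambda matches: sum(weights.get(m, 0) for m in matches))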

• it's a good solution from the looks of things, but each of the `terms` is weighted differently so a multiplier wouldn't work. – Rowan Collins May 26 '19 at 15:41
• I see, knew this wasn't a sufficient answer, just to help you more in the right direction or give you ideas :) @RowanCollins – Erfan May 26 '19 at 15:43
• btw if you want to try some more out, decided to write an answer on `fuzzy joining`, you can find it [here](https://stackoverflow.com/a/56315491/9081267) @RowanCollins – Erfan May 26 '19 at 17:13