2

I have a large dataframe containing a 'Description' column.

I've compiled a sizeable dictionary of lists, where the key is basically the Category, and the items are lists of possible (sub)strings contained in the description column. I want to use the dictionary to classify each entry in the dataframe based on this description... Unfortunately I can't figure out how to apply a dictionary of lists to map to a dataframes (feels like it would be some sort of concoction of map, isin and str.contains but I have had no joy). I've included code to generate a model dataset below:

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))

df['Description'] = ['White Ford Escort', 'Irish Draft Horse', 'Springer \
spaniel (dog)', 'Green Vauxhall Corsa', 'White Van', 'Labrador dog',\
'Black horse' ,'Blue Van','Red Vauxhall Corsa','Bear']

This model dataset would then ideally be somehow mapped against the following dictionary:

dict = {'Car':['Ford Escort','Vauxhall Corsa','Van'],
        'Animal':['Dog','Horse']}

to generate a new column in the dataframe, with the result as such:

|   | A                    | Description            | Type   |
|---|----------------------|------------------------|--------|
| 0 | -1.4120290137842615  | White Ford Escort      | Car    |
| 1 | -0.3141036399049358  | Irish Draft Horse      | Animal |
| 2 | 0.49374344901643896  | Springer spaniel (dog) | Animal |
| 3 | 0.013654965767323723 | Green Vauxhall Corsa   | Car    |
| 4 | -0.18271952280002862 | White Van              | Car    |
| 5 | 0.9519081000007026   | Labrador dog           | Animal |
| 6 | 0.403258571154998    | Black horse            | Animal |
| 7 | -0.8647792960494813  | Blue Van               | Car    |
| 8 | -0.12429427259820519 | Red Vauxhall Corsa     | Car    |
| 9 | 0.7695980616520571   | Bear                   | -      |

The numbers are obviously irrelevant here, but there are other columns in the dataframes and I wanted this reflecting. I'm happy to use regex, or a perhaps change my dictionary to a dataframe and do a join (i've considered multiple routes).

This feels similar to a recent question, but it's not the same and certainly the answer hasn't helped me.

Sorry if I've been stupid somewhere and this is really simple - it does feel like it should be, but i'm missing something.

Thanks

Abhi
  • 4,068
  • 1
  • 16
  • 29
BAC83
  • 811
  • 1
  • 12
  • 27

2 Answers2

2

You can use fuzzywuzzy library to solve this. Make sure to install it via pip install fuzzywuzzy

from fuzzywuzzy import process

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))

df['Description'] = ['White Ford Escort', 'Irish Draft Horse', 'Springer \
spaniel (dog)', 'Green Vauxhall Corsa', 'White Van', 'Labrador dog',\
'Black horse' ,'Blue Van','Red Vauxhall Corsa','Bear']

d = {'Car':['Ford Escort','Vauxhall Corsa','Van'],
    'Animal':['Dog','Horse']}

# Construct a dataframe from the dictionary
df1 = pd.DataFrame([*d.values()], index=d.keys()).T.melt().dropna()

# Get relevant matches using the library.
m = df.Description.apply(lambda x: process.extract(x, df1.value)[0])

# concat the matches with original df
df2 = pd.concat([df, m[m.apply(lambda x: x[1]>80)].apply(lambda x: x[0])], axis=1)

df2.columns = [*df.columns, 'matches']

# After merge it with df1
df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')

# Drop columns that are not required and rename.
df2 = df2.drop(['matches','value'],1).rename(columns={'variable':'Type'})

print (df2)

          A             Description    Type
0 -0.423555       White Ford Escort     Car
1  0.294092       Irish Draft Horse  Animal
2  1.949626  Springer spaniel (dog)  Animal
3 -1.315937    Green Vauxhall Corsa     Car
4 -0.250184               White Van     Car
5  0.186645            Labrador dog  Animal
6 -0.052433             Black horse  Animal
7 -0.003261                Blue Van     Car
8  0.418292      Red Vauxhall Corsa     Car
9  0.241607                    Bear     NaN
Abhi
  • 4,068
  • 1
  • 16
  • 29
  • 1
    Love this - thanks very much for the intro to fuzzywuzzy as well. Expect there'll be some edge cases it doesn't catch, but this is really simple and effective - already implemented it, and also quick, so that's a win. Thanks – BAC83 Nov 28 '18 at 09:31
  • @BAC83 Your welcome. I'm glad I was able to help. :) – Abhi Nov 28 '18 at 09:35
0
  1. Consider inverting your dictionary first, while making everything lowercase

  2. Then per row, split Description into words and make them lowercase

    • e.g., 'Springer spaniel (dog)' -> ['springer', 'spaniel', '(', 'dog', ')']
  3. For each lower case word from (2), look it up in the inverted dictionary from (1); using apply

dyz
  • 127
  • 8
  • Thanks @dyz - that is very similar to my current plan (i didn't know about dict inverting, thanks for that), but it still feels like a few steps too many – BAC83 Nov 27 '18 at 21:53