Finding matching words with ngrams

Question

Dataset:

from nltk import word_tokenize
from nltk.util import ngrams
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]

Id       bigram
1952043  [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top),
1918916  [(Luxury,Apartments),(Apartments,consisting),(consisting,11),
1645751  [(Flat,available),(available,sale),(sale,Medavakkam),
1270503  [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks),
1495638  [(near,medavakkam),(medavakkam,junction),(junction,calm),

I have a Python file (Categories.py) containing the unsupervised classification of the property/land features.

category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
        ('Swimming Pool', 'IN','Recreation_Ammenities'),
        ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
        ('Jogging Tracks', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']

To find the words that match between the bigram column and the category list:

tokens=pd.Series(df["bigram"])
Lid=pd.Series(df["Id"])
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))

While running the above code, I am getting this error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Need help on this.

My desired output is:

 Id       bigram                                  Recreation_Amenities
1952043  [(Swimming,Pool),(Pool,in),(in,the),..   Swimming Pool
1918916  [(Luxury,Apartments),(Apartments,..      Luxury Apartments
1645751  [(Flat,available),(available,sale)..     
1270503  [(Toddler,Pool),(Jogging,Tracks)..      Toddler Pool,Jogging Tracks
1495638  [(near,medavakkam),..

score 1 · Accepted Answer · answered Aug 27 '17 at 08:05

Something along these lines should work for you:

def match_bigrams(row):
    categories = []

    for bigram in row.bigram:
        joined = ' '.join(list(bigram))
        if joined in Recreation:
            categories.append(joined)

    return categories

df['Recreation_Amenities'] = df.apply(match_bigrams, axis=1)
print(df)


   Id       bigram                                              Recreation_Amenities
0  1952043  [(Swimming, Pool), (Pool, in), (in, the), (the...   [Swimming Pool]
1  1918916  [(Luxury, Apartments), (Apartments, consisting...   [Luxury Apartments]
2  1645751  [(Flat, available), (available, sale), (sale, ...   []
3  1270503  [(Toddler, Pool), (Pool, with), (with, Jogging...   [Toddler Pool, Jogging Tracks]
4  1495638  [(near, medavakkam), (medavakkam, junction), (...   []

Each bigram is joined with a space so that it can be tested for membership in your list of categories (i.e. `if joined in Recreation`).
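As a small illustration of that join-and-test step, here is a sketch in plain Python. The `Recreation` list and the sample bigram are copied from the question; the lowercase bigram at the end is an assumed example added only to show that the comparison is exact and case-sensitive.

```python
# Values copied from the question's data.
Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']

bigram = ('Swimming', 'Pool')
joined = ' '.join(bigram)      # tuple of two tokens -> one space-separated string

print(joined)                  # Swimming Pool
print(joined in Recreation)    # True

# Assumed extra example: the match is exact, so different casing does not match.
print(' '.join(('swimming', 'pool')) in Recreation)  # False
```

If the text can vary in case, lowercasing both the joined bigram and the category names before comparing is one possible adjustment.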

Jan Trienes

Can you explain the 'row' parameter passed to the function? I also want to use this function for each category (Recreation, Healthcare, Security, etc.), so that I can call the same function for n categories. How can I do so? – Rajitha Naik Sep 09 '17 at 16:39
The function `match_bigrams` is applied row-wise (each row of the data frame is passed into the function). Regarding your second question: the function matches against the categories in the list `Recreation`, so if you extend this list with additional categories it should work for n categories. – Jan Trienes Sep 09 '17 at 16:59
Yes, but currently the condition in the function is 'if joined in Recreation:'. Likewise, I have multiple categories, and I want to avoid rewriting the entire function for each one. Can I call the same function and pass the category name in the call, here: df.apply(match_bigrams, axis=1)? – Rajitha Naik Sep 09 '17 at 17:03
Could you please help me with this? – Rajitha Naik Sep 10 '17 at 04:46
Can you update your question so that it demonstrates what you want to achieve? That would really help in understanding your question. If it is too different from what you were originally asking, also consider asking a new question. – Jan Trienes Sep 10 '17 at 07:33
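One possible way to reuse a single function for several categories, sketched here in plain Python, is to pass the category list in as a parameter instead of reading the global `Recreation`. This is not the answerer's code: the two-argument signature, the `category_lists` dictionary and the `Security` entries are all hypothetical. Only the recreation entries come from the question.

```python
def match_bigrams(bigrams, categories):
    """Return the space-joined bigrams that appear in `categories`."""
    return [' '.join(b) for b in bigrams if ' '.join(b) in categories]

# Hypothetical lookup of output column name -> category phrases.
# Only the recreation list is from the question; 'Security' is made up.
category_lists = {
    'Recreation_Amenities': ['Luxury Apartments', 'Swimming Pool',
                             'Toddler Pool', 'Jogging Tracks'],
    'Security': ['Security Guard', 'CCTV Cameras'],
}

# One row's bigram list, taken from the question's data.
row = [('Toddler', 'Pool'), ('Pool', 'with'), ('with', 'Jogging'), ('Jogging', 'Tracks')]

result = {name: match_bigrams(row, cats) for name, cats in category_lists.items()}
print(result)
# {'Recreation_Amenities': ['Toddler Pool', 'Jogging Tracks'], 'Security': []}
```

With pandas, `Series.apply` forwards extra keyword arguments to the function, so a loop such as `for name, cats in category_lists.items(): df[name] = df['bigram'].apply(match_bigrams, categories=cats)` should add one column per category.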

Bharath M Shetty · Answer 2 · 2017-08-27T08:54:30.633

You can join the tuples with a space and then find the entries of Recreation that are present, using a nested list comprehension with apply, i.e.

df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in  [' '.join(i) for i in x]])

Let's say you have a dataframe:

    Id      bigram
0   1270503 [(Toddler, Pool), (Pool, with), (with, Jogging), (Jogging, Tracks)]
1   1952043 [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top)]
2   1918916 [(Luxury, Apartments), (Apartments, consisting), (consisting, 11)]
3   1495638 [(near, medavakkam), (medavakkam, junction), (junction, calm)]
4   1645751 [(Flat, available), (available, sale), (sale, Medavakkam)]

And you have the list Recreation, i.e.

Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']

Then

df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in  [' '.join(i) for i in x]])

Output: df['Recreation_Amenities']

0    [Toddler Pool, Jogging Tracks]
1    [Swimming Pool]               
2    [Luxury Apartments]           
3    []                            
4    []                            
Name: Recreation_Amenities, dtype: object
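The one-liner above rebuilds the list of joined bigrams for every entry of `Recreation`. A possible variant, sketched here under the assumption that a named helper is acceptable (`match_categories` is a hypothetical name), joins each row's bigrams once into a set and then filters the category list against it. The results keep the order of `Recreation`, as in the one-liner.

```python
Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']

def match_categories(bigrams, categories):
    """Return the entries of `categories` that occur among the joined bigrams."""
    joined = {' '.join(b) for b in bigrams}   # built once per row
    return [c for c in categories if c in joined]

# First row of the example dataframe above.
row = [('Toddler', 'Pool'), ('Pool', 'with'), ('with', 'Jogging'), ('Jogging', 'Tracks')]
print(match_categories(row, Recreation))  # ['Toddler Pool', 'Jogging Tracks']
```

It could be applied the same way, for example `df['bigram'].apply(match_categories, categories=Recreation)`.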
