0

I am doing text mining on the dataset below:

import pandas as pd

data_set = {
    'client_id': [
        100,101,102,103,104,105,106,107,108],
    'Category': [
        'house_Status','house_Status','house_Status','house_Status',
        'house_Status','house_Status','house_Status','house_Status','house_Status'],
    'description': [
        'i am homeless',
        'i am struggling to find home i am homeless',
        'i have home',
        'i am homeless and living in basement',
        'i have in public housing as rental are hard to find',
        'i sleep in garage',
        'I and my family are  homeless',
        'we live in public housing',
        'i am happy']}

data_set = pd.DataFrame(data_set)
data_set

I have another table with the words to search for:

word_search={
    'Category' : ['house_Status','house_Status'],
    'Word_to_search':['homeless','public housing']}

word_search = pd.DataFrame(word_search)
word_search

Based on the category, I have to find which descriptions contain that word.

The code I am using is very basic: I join the two datasets on category and create a simple filter, which is

test = pd.merge(
    data_set, word_search, left_on='Category', right_on='Category', how='inner')

On this I search for substrings:

test['status'] = test.apply(lambda x: x.Word_to_search in x.description, axis=1)
test

and keep only the True values.
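For clarity, that last step might look like this (a minimal sketch, keeping only the rows whose status is True on the merged frame above):

# Keep only the rows where the searched word was found in the description
matches = test[test['status']]
matches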

The problem with my solution is that it is slow. I have many categories and many words to search over a very large dataset. If there is a better way to solve this, please let me know,

in Python or SQL.


2 Answers

0

You can simply try using an np.where condition and flag the rows where both the category and the keyword match:

import numpy as np

# Flag rows whose Category matches any search category AND whose
# description contains any of the words to search
data_set['status'] = np.where(
    (data_set['Category'].str.contains(
        "|".join(set(word_search['Category'])), regex=True)) &
    (data_set['description'].str.contains(
        "|".join(set(word_search['Word_to_search'])), regex=True)),
    1, 0)

data_set



   client_id      Category                                        description  status
0        100  house_Status                                      i am homeless       1
1        101  house_Status         i am struggling to find home i am homeless       1
2        102  house_Status                                        i have home       0
3        103  house_Status               i am homeless and living in basement       1
4        104  house_Status  i have in public housing as rental are hard to...       1
5        105  house_Status                                  i sleep in garage       0
6        106  house_Status                      I and my family are  homeless       1
7        107  house_Status                          we live in public housing       1

To keep only the flagged rows, just subset the data like:

data_set = data_set[data_set['status'] == 1]

Let me know if this matches your expected output. Usually np.where is faster than apply functions (refer to "Why is np.where faster than pd.apply" for timing-related info).
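As a rough timing sketch of that claim, assuming the data_set and word_search frames from the question (the concat only inflates the sample data so the difference becomes visible):

import timeit

import numpy as np
import pandas as pd

# Inflate the sample data so the timing difference is measurable
big = pd.concat([data_set] * 10_000, ignore_index=True)
pattern = "|".join(set(word_search['Word_to_search']))

# Vectorised: str.contains + np.where over the whole column at once
t_vectorised = timeit.timeit(
    lambda: np.where(big['description'].str.contains(pattern, regex=True), 1, 0),
    number=10)

# Row-wise: a Python-level apply, as in the question
t_apply = timeit.timeit(
    lambda: big.apply(lambda x: 'homeless' in x.description, axis=1),
    number=10)

print(t_vectorised, t_apply)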

0

I would advise using the loc method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). With your example that would give:

# True where the row's category appears among the categories to search
mask_category = data_set.Category.isin(word_search.Category.values)

# True where the description contains at least one of the words to search
mask_description = pd.Series([False] * len(data_set), index=mask_category.index)
for s in word_search.Word_to_search:
    mask_description = mask_description | data_set.description.str.contains(s)

data_set.loc[mask_category & mask_description, :]

Explanation: this method should perform quicker than the lambda function.

data_set.loc[mask_category & mask_description, :] filters all rows where both mask_category and mask_description are True (& is the logical AND, while | is the logical OR). This check is done by index: mask_category and mask_description are simply pandas Series whose index corresponds to the index of data_set, so with the loc method we pass a True or False for each row.
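A tiny illustration of that index alignment, with made-up values:

import pandas as pd

s = pd.Series(['a', 'b', 'c'])
mask = pd.Series([True, False, True])  # same index as s

# loc keeps only the rows whose mask value is True (indexes 0 and 2)
print(s.loc[mask])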

data_set.Category.isin(word_search.Category) returns a pandas Series with the same index as data_set, holding True where the value in the Category column of data_set is an item of word_search.Category (here just 'house_Status', since the two values are identical). You could speed up the process by calling unique to remove duplicates: data_set.Category.isin(word_search.Category.unique())

The next block produces the Series to filter on the description. mask_description will be True if the corresponding row of data_set (correspondence by index) contains one of the words in the Word_to_search column of the word_search dataframe.

mask_description = pd.Series([False]*len(data_set), index=mask_category.index)
for s in word_search.Word_to_search:
    mask_description = mask_description | data_set.description.str.contains(s)

You could shorten this using a comprehension, but I think the above is more readable; see the sketch below.
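For completeness, one possible shorter form (a sketch assuming the same data_set and word_search frames, combining the per-word masks with functools.reduce):

import functools
import operator

# OR together one contains-mask per word to search
mask_description = functools.reduce(
    operator.or_,
    (data_set.description.str.contains(s) for s in word_search.Word_to_search))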

Do you get better results with this?

LioWal