I am doing text mining on the dataset below:
import pandas as pd

data_set = {
    'client_id': [
        100, 101, 102, 103, 104, 105, 106, 107, 108],
    'Category': [
        'house_Status', 'house_Status', 'house_Status', 'house_Status',
        'house_Status', 'house_Status', 'house_Status', 'house_Status',
        'house_Status'],
    'description': [
        'i am homeless',
        'i am struggling to find home i am homeless',
        'i have home',
        'i am homeless and living in basement',
        'i have in public housing as rental are hard to find',
        'i sleep in garage',
        'I and my family are homeless',
        'we live in public housing',
        'i am happy']}
data_set = pd.DataFrame(data_set)
data_set
I have another table with the words to search for:
word_search = {
    'Category': ['house_Status', 'house_Status'],
    'Word_to_search': ['homeless', 'public housing']}
word_search = pd.DataFrame(word_search)
word_search
Based on the category, I have to find which descriptions contain that word.
The code I am using is very basic: I join the two datasets on category and create a simple filter, which is
test = pd.merge(data_set, word_search, on='Category', how='inner')
On this I search for substrings:
test['status'] = test.apply(lambda x: x.Word_to_search in x.description, axis=1)
test
and then I keep only the rows where status is True.
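That is, something like:

test = test[test['status']]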
The problem with my solution is that it is slow: I have many categories and many words to search, and the dataset is very large. If there is a better way to solve this, in Python or SQL, please let me know.
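One direction I considered, but am not sure about, is to collapse each category's search words into a single regex and run one vectorized str.contains per category, instead of the merge plus row-wise apply. A rough sketch (the names patterns, parts, and matches are mine, and I have not benchmarked this at scale):

import re

# one combined regex per category, e.g. 'homeless|public housing'
patterns = (
    word_search
    .groupby('Category')['Word_to_search']
    .apply(lambda words: '|'.join(re.escape(w) for w in words))
)

parts = []
for cat, grp in data_set.groupby('Category'):
    if cat not in patterns.index:
        continue  # no search words defined for this category
    # one vectorized scan over the whole group's descriptions
    mask = grp['description'].str.contains(patterns[cat], regex=True)
    parts.append(grp[mask])

matches = pd.concat(parts, ignore_index=True)
matches

This avoids blowing the merged table up to rows x words per category and replaces the Python-level apply with a compiled regex pass per category, which I assume is faster, but I would welcome confirmation or a better approach.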