I'm working on a homework assignment. I have read similar threads and found this one particularly useful: Find string between two substrings
My goal is to use Python to search uncategorized text files for three particular patterns. I need to:
1) start searching from the keyword 'more info' (ignoring everything before it)
2) classify the documents as follows:
   A1) 'big home' found, with its price
   A2) 'big home' found, no price found
   B1) 'small home' found, with its price
   B2) 'small home' found, no price found
   C1) both 'big home' AND 'small home' found, with their prices
   C2) both 'big home' AND 'small home' found, prices missing
   D) neither string found ('big home' or 'small home')
For A, B and C, find the price and print it, e.g. 'Big home price 50USD'; if no price is found, say so.
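To make the rules above concrete, here is a rough sketch of the logic I have in mind, using only the standard `re` module. It rests on assumptions of mine that may not hold for the real files: a price is written as the word 'price' followed by a number (optionally 'USD'), it belongs to the closest home keyword before it, a document with no 'more info' marker counts as D, and C1 requires a price for both homes. The names `find_prices`, `classify` and `describe` are placeholders I made up, not functions from any library.

```python
import re

MARKER = "more info"
# Assumed price format: "price", a number, then an optional "USD".
PRICE_RE = re.compile(r"price\s*:?\s*(\d+(?:\.\d+)?)\s*(?:USD)?", re.IGNORECASE)
HOME_RE = re.compile(r"big home|small home", re.IGNORECASE)


def find_prices(text):
    """Return {home keyword: price string or None} for the text after MARKER."""
    start = text.lower().find(MARKER)
    if start == -1:
        return {}  # assumption: without the marker there is nothing to search
    tail = text[start + len(MARKER):]
    matches = list(HOME_RE.finditer(tail))
    found = {}
    for i, m in enumerate(matches):
        home = m.group(0).lower()
        # Only look for a price between this keyword and the next home keyword.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(tail)
        price = PRICE_RE.search(tail, m.end(), end)
        if found.get(home) is None:
            found[home] = price.group(1) if price else None
    return found


def classify(text):
    """Map a document to A1/A2/B1/B2/C1/C2/D under the assumptions above."""
    prices = find_prices(text)
    if not prices:
        return "D"
    has_big = "big home" in prices
    has_small = "small home" in prices
    if has_big and has_small:
        letter = "C"
    elif has_big:
        letter = "A"
    else:
        letter = "B"
    # Suffix 1 only when every home that was found also has a price.
    suffix = "1" if all(p is not None for p in prices.values()) else "2"
    return letter + suffix


def describe(text):
    """Build the messages to print, e.g. 'Big home price 50USD'."""
    lines = []
    for home, price in find_prices(text).items():
        label = home.capitalize()
        if price:
            lines.append(f"{label} price {price}USD")
        else:
            lines.append(f"{label}: no price found")
    return lines


sample = "town, more info, big home, forrest, price 50"
print(classify(sample), describe(sample))
```

Note that with rule 1 applied strictly, a 'big home' that appears only before 'more info' is ignored, so such a document ends up in D.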
I'm doing text analysis with Python, and it currently returns the taxonomy of keywords found. I need to classify the documents (text files) based on patterns A, B, C and D above.
# Count the items in each row's 'text' value that start with 'classi'
data_train['classi'] = data_train['text'].apply(lambda tokens: len([t for t in tokens if t.startswith('classi')]))
data_train[['text', 'classi']].head()
Here's the output:
   text                                      classi
0  [big home, forrest, suburb, more info,         0
1  [town, pool, more info,                        0
2  [small home,more info, forrest, suburb         1
3  [big home, more info, forrest, price 50        1
4  [big home, forrest, more info, city            0
I expect to: 1) start searching from the keyword 'more info', and 2) classify the text documents into A, B, C or D (returning the matched strings with their price, or noting that no price was found).
Any help is highly appreciated!
EDIT:
Maybe it would be worth using NLTK here. Any ideas?
I'm currently experimenting with https://pythex.org/