
I'm doing homework and have read similar threads; I found one particularly interesting one here: Find string between two substrings

My aim is to use Python to search uncategorized text files for three particular patterns. I need to:

1) start searching from the keyword 'more info' (ignore everything before it)

2) classify documents based on:

   A1) string 'big home' and its price
   A2) string 'big home', no price found
   B1) string 'small home' and its price
   B2) string 'small home', no price found
   C1) strings 'big home' AND 'small home' and their prices
   C2) strings 'big home' AND 'small home', prices missing
   D) no strings found (neither 'big home' nor 'small home')

For A, B and C, find the price and print it, e.g. 'Big home price 50USD'; if no price is found, say so.

I'm doing text research with Python, and so far it returns the taxonomy of keywords found. I need to classify the documents (text files) based on the above-mentioned patterns A, B, C and D.

data_train['classi'] = data_train['text'].apply(lambda x: len([x for x in x if x.startswith('classi')]))
data_train[['text','classi']].head()

Here's the output:

text    classi
0   [big home, forrest, suburb, more info,          0
1   [town, pool, more info,                         0
2   [small home,more info,  forrest, suburb         1
3   [big home, more info,  forrest, price 50        1
4   [big home, forrest,  more info,  city           0

I expect to: 1) start searching from the keyword 'more info'; 2) classify the text documents I search into A, B, C or D (get the strings with the price; if there is no price, mention that).

Any support is highly appreciated!

EDIT:

  • Maybe it would be interesting to use NLTK here, any ideas?

  • Currently playing with https://pythex.org/

obscure
HappyMan
  • Share the sample text file – min2bro Apr 16 '19 at 08:10
  • Hi, thanks for connecting. Of course, here are the examples. 1.txt (big home price only): https://pastebin.com/YHfWwYG7 List created from 1.txt (I use lists since I filter the text file and remove white space and other noise): https://pastebin.com/ziyGnBgZ 2.txt (small home and big home with their prices): https://pastebin.com/hLD9RJM1 and the list created from 2.txt (both small home and big home there with their prices): https://pastebin.com/fMaBYjiJ – HappyMan Apr 16 '19 at 08:46
  • Edit: the list created from 2.txt: https://pastebin.com/fMaBYjiJ – HappyMan Apr 16 '19 at 08:52
  • Edit: important info: sometimes the information about 'Big house' or 'small house' is at the top or in the middle of the file instead of the text 'Presentation of the project' shown in the example files... this could complicate the whole thing... – HappyMan Apr 16 '19 at 09:00
  • There is an example of code that could be used (finding multiple elements in a list): find = lambda searchList, elem: [[i for i, x in enumerate(searchList) if x == e] for e in elem] Example: find([1,4,1,4,6,5,5,5,4,2,3],[1,3,5]) will return: [[0, 2], [10], [5, 6, 7]] Maybe there is a solution for my project using this? – HappyMan Apr 16 '19 at 11:09

1 Answer


I would do something similar to this:

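import re
from pathlib import Path

for file in Path("my_folder").glob("*.txt"):
    with file.open('r') as f:
        more_info_flag = False
        for line in f:
            # Skip everything up to the first line containing "more info"
            if not more_info_flag:
                if "more info" in line:
                    more_info_flag = True
                else:
                    continue
            # The question searches for 'big home'; adjust to match your files
            if "big home" in line:
                # Assumes the price is written as "price is <number>"
                match = re.search(r"price is\s*(\d+)", line)
                price = int(match.group(1)) if match else None
                # Replace this print with whatever handling you need
                print(file.name, "big home price:", price)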

I think this would work for the file you posted, but it would need adapting if the other files use a different format.
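To cover the full A1 to D scheme from the question, the same idea can be extended into a small function. This is only a rough sketch, not tested against the pastebin files. The names `PRICE_RE`, `find_price` and `classify` are made up for illustration, and it rests on a few assumptions: the price sits on the same line as the keyword, written as "price 50" or "price is 50"; a text without 'more info' is treated as D; and C1 is returned only when both prices are found.

```python
import re

# Assumed price format: "price 50" or "price is 50" on the keyword's line.
PRICE_RE = re.compile(r"price(?:\s+is)?\s*:?\s*(\d+)", re.IGNORECASE)


def find_price(lines, keyword):
    """Return (found, price): whether keyword occurs, and its price if any."""
    found = False
    for line in lines:
        if keyword in line.lower():
            found = True
            match = PRICE_RE.search(line)
            if match:
                return True, int(match.group(1))
    return found, None


def classify(text):
    """Return (category, messages), ignoring everything before 'more info'."""
    start = re.search("more info", text, re.IGNORECASE)
    # Assumption: without 'more info' there is nothing to search, hence D.
    lines = text[start.start():].splitlines() if start else []

    results = {}
    for keyword in ("big home", "small home"):
        found, price = find_price(lines, keyword)
        if found:
            results[keyword] = price

    if not results:
        return "D", ["No strings found"]

    messages = [
        f"{kw.capitalize()} price {price}USD" if price is not None
        else f"{kw.capitalize()}: no price found"
        for kw, price in results.items()
    ]
    if len(results) == 2:
        letter = "C"
    elif "big home" in results:
        letter = "A"
    else:
        letter = "B"
    # Assumption: the "1" variants require a price for every keyword found.
    all_priced = all(p is not None for p in results.values())
    return letter + ("1" if all_priced else "2"), messages


print(classify("town, pool\nmore info\nbig home, price is 50"))
```

You could call `classify(file.read_text())` inside the `Path(...).glob("*.txt")` loop above. If your data is already a list of tokens per document, joining them with newlines first should work as well, provided the price ends up in the same token as the keyword.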

albeksdurf