
I have a txt file and I want to search it for specific words and save them in another txt file along with the number of times each one appears. For example, I want to search for these words: "jardim guanabara", "jd guanabara", "jd gb", "norte", "zona norte", "vale dos sonhos", "asa branca" and "joao paulo".

This is what I've tried so far, but I don't really know how to handle this. Can you help me write the right regex to find these words? I appreciate any help.

import re

regex = r"((?<=zona )norte\w+|(?<=jardim )guanabara|(?<=jardim )gb\w+)|((?<=joao )paulo\w+|(?<=zn)norte|(?<=gato)dorm\w+)"

with open('file.txt', 'r') as f:
    #input_file = f.readlines()

    for line in f:
        x = re.search(regex, line)
        print(x)

I expect something like this to be saved into another txt file.

2 Answers


I'm guessing that you might want to design an expression somewhat similar to:

^(?=.*(?:\bjardim\s+guanabara\b|\bjd\s+guanabara\b|\bjd\s+gb\b|\bnorte\b|\bzona\s+norte\b|\bvale\s+dos\s+sonhos\b|\basa\s+branca\b|\bjoao\s+paulo\b)).*$

The expression is explained on the top right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.

TEST

import re

regex = r"^(?=.*(?:\bjardim\s+guanabara\b|\bjd\s+guanabara\b|\bjd\s+gb\b|\bnorte\b|\bzona\s+norte\b|\bvale\s+dos\b\s+sonhos\b|\basa\s+branca\b|\bjoao\s+paulo\b)).*$"

test_str = """
I want to search for this words jardim guanabara.
I want to search for this words jd guanabara.
I want to search for this words jd gb.
I want to search for this words norte.
I want to search for this words zona norte.
I want to search for this words vale dos sonhos.
I want to search for this words asa branca and joao paulo.

I don't want to search for this words nojardim guanabara.
I don't want to search for this words nojd guanabara.
I don't want to search for this words nojd gb.
I don't want to search for this words nonorte.
I don't want to search for this words nozona norte.
I don't want to search for this words novale dos sonhos.
I don't want to search for this words noasa branca and joao paulo.
"""

print(re.findall(regex, test_str, re.M))

OUTPUT

['I want to search for this words jardim guanabara.', 'I want to search for this words jd guanabara.', 'I want to search for this words jd gb.', 'I want to search for this words norte.', 'I want to search for this words zona norte.', 'I want to search for this words vale dos sonhos.', 'I want to search for this words asa branca and joao paulo.', "I don't want to search for this words nozona norte.", "I don't want to search for this words noasa branca and joao paulo."]
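
If you also want to save the matching lines to another txt file, a minimal sketch along these lines might work, reusing the expression above; the file names file.txt and matches.txt are just assumptions here:

import re

regex = r"^(?=.*(?:\bjardim\s+guanabara\b|\bjd\s+guanabara\b|\bjd\s+gb\b|\bnorte\b|\bzona\s+norte\b|\bvale\s+dos\s+sonhos\b|\basa\s+branca\b|\bjoao\s+paulo\b)).*$"

# Copy every line that mentions one of the search terms into a second file.
with open("file.txt") as fin, open("matches.txt", "w") as fout:
    for line in fin:
        if re.search(regex, line):
            fout.write(line)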

RegEx Circuit

jex.im visualizes regular expressions.


Emma
  • Thank you very much @Emma. Now it's clearer to me how to build a regex expression. I have this code: `with open('file.txt', 'r') as file: for line in file: for match in re.findall(regex, line): #finditer print(match)` How can I save the result into another txt file, @Emma? Again, thank you for your explanation, it was very clear :) – Guilherme Schults Jul 30 '19 at 03:21

A way to do this could be as follows (assuming your .txt file is called in.txt):

search_terms = [
    "asa branca",
    "joao paulo",
]

with open("in.txt") as f:
    text = f.read()

    occurrence_map = {term: text.count(term) for term in search_terms}

This uses a "dict comprehension", a feature available since Python 2.7 and 3.0. Basically, it builds a dictionary: for every term we want to search for, it uses the term as the key and the count of that term in the text as the value.

A little less succinct, but you can do this in a more straightforward manner like so:

with open("in.txt") as f:
    text = f.read()

    occurrence_map = dict()

    for term in search_terms:
        occurrence_map[term] = text.count(term)

You could then write that to file using the format that you prefer. For example:

with open("out.txt", "w") as f:
    for term, count in occurrence_map.items():
        f.write("{}: {}\n".format(term, count))

Note: this solution is only suitable if you want exact matches of the string and don't need them to be separated by word boundaries. In other words, the following will match when searching for foo bar:

  • Somethingfoo barsomething.
  • Something foo bar something.

...and these will not:

  • Something foo  bar. (with more than one space between foo and bar)
  • foo\tbar
  • Foo bar.
  • foo Bar.

If this is necessary, it's better to use regular expressions. I can edit my answer if this is the case.
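
For reference, a rough sketch of the regular-expression variant: it builds a word-bounded, case-insensitive pattern for each term with re.escape, which handles the problematic cases above; the in.txt name is the same assumption as before.

import re

search_terms = [
    "asa branca",
    "joao paulo",
]

with open("in.txt") as f:
    text = f.read()

occurrence_map = {}
for term in search_terms:
    # \b rules out run-on matches like "noasa branca",
    # \s+ accepts any amount of whitespace (including tabs) between the words,
    # re.IGNORECASE also counts "Asa Branca" and "foo Bar"-style variants.
    pattern = r"\b" + r"\s+".join(map(re.escape, term.split())) + r"\b"
    occurrence_map[term] = len(re.findall(pattern, text, re.IGNORECASE))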

Thomas
  • Thank you for your answer. I tried it just the way you showed and it works, but I have to search for strings written in different ways. Take a look: Asa Branca: 1 João Paulo: 43 João Paulo 2: 4 João Paulo II: 12 Vera Cruz: 14 vera cruz: 1 vale dos sonhos: 20 Vale dos Sonhosregião norte: 0 norte: 3 jardim Guanabara: 13 jd. guanabara: 0 Jardim Guanabara: 17 Jardim Guanabara 1: 0 Jardim Guanabara 2: 0 jardim Guanabara 2: 0 Jardim Guanabara 3: 1 jardim Guanabara 3: 1 guanabara: 30. I think with regex this could be easier, but I'm very new to that. – Guilherme Schults Jul 30 '19 at 15:33
  • You can still do this without a regular expression, though it becomes a little less intuitive/explicit. You could first call `.lower()` on `text` to make it all lower-case then replace out the non-ASCII characters manually (or by using a library like `unicodedata`). You might want to see [this post](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string). – Thomas Jul 31 '19 at 01:33
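
A minimal sketch of that idea, assuming the same in.txt: normalize both the text and the search terms (lower-case plus accent stripping via unicodedata) before counting, so variants like "João Paulo" and "joao paulo" end up in the same count.

import unicodedata

def normalize(s):
    # Lower-case and strip accents, e.g. "João Paulo" -> "joao paulo".
    s = unicodedata.normalize("NFKD", s.lower())
    return "".join(c for c in s if not unicodedata.combining(c))

search_terms = ["joao paulo", "jardim guanabara", "vale dos sonhos"]

with open("in.txt") as f:
    text = normalize(f.read())

occurrence_map = {term: text.count(normalize(term)) for term in search_terms}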