1

I'm trying to count how often patterns generated from a string such as "hello awesome world" occur in a large text. The patterns are generated by permuting the words and replacing one word in between with *. In this example I use only a handful of patterns to keep things simple. I'm not really familiar with regex, so my code doesn't match everything I need yet. I'll probably figure it out soon, but I'm not sure it would scale well when I feed it real data.

The questions are: how do I fix my code, and are there better or faster ways to achieve my goal? Here's my code below, with explanations.

import re
from collections import Counter


# Input text. Could consist of hundreds of thousands of sentences.
txt = """
Lorèm ipsum WORLD dolor AWESOME sit amèt, consectetur adipiscing elit. 
Duis id AWESOME HELLO lorem metus. Pràesent molestie malesuada finibus. 
Morbi non èx a WORLD HELLO AWESOME erat bibendum rhoncus. Quisque sit 
ametnibh cursus, tempor mi et, sodàles neque. Nunc dapibus vitae ligula at porta. 
Quisque sit amet màgna eù sem sagittis dignissim et non leo. 
Quisque WORLD, AWESOME dapibus et vèlit tristique tristique. Sed 
efficitur dui tincidunt, aliquet lèo eget, pellentesque felis. Donec 
venenatis elit ac aliquet varius. Vestibulum ante ipsum primis in faucibus
orci luctus et ultrices posuere cubilia Curae. Vestibulum sed ligula 
gravida, commodo neque at, mattis urna. Duis nisl neque, sollicitudin nec 
mauris sit amet, euismod semper massa. Curabitur sodales ultrices nibh, 
ut ultrices ante maximus sed. Donec rutrum libero in turpis gravida 
dignissim. Suspendisse potenti. Praesent eu tempor quam, id dictum felis. 
Nullam aliquam molestie tortor, at iaculis metus volutpat et. In dolor 
lacus, AWESOME sip HELLO volutpat ac convallis non, pulvinar eu massa.
"""

txt = txt.lower()

# Patterns generated from a 1-8 word input string. Could also consist of hundreds of 
# thousands of patterns
patterns = [
    'world',
    'awesome',
    'awesome hello', 
    'world hello awesome',
    'world (.*?) awesome'   # '*' - represents any word between
]

regex = '|'.join(patterns)
result = re.findall(regex, txt)
counter = Counter(result)
print(counter)
# >>> Counter({'awesome': 5, 'world': 3})

# For some reason I can't get strings with more than one word to match

# Expected output
found_pattern_counts = {
    'world': 3,
    'awesome': 5,
    'awesome hello': 1, 
    'world hello awesome': 1,
    'world * awesome': 2
}
Superbman

2 Answers

1

You didn't use the wildcard properly. I fixed it, and now it works as you described. Each pattern is compiled and counted on its own rather than joined into one alternation. You can also wrap this operation in a function:

import re

patterns = [
    'world',
    'awesome',
    'awesome hello',
    'world hello awesome',
    'world (.*?) awesome'
]

# txt is the lowercased input text from the question
result = {}
for pattern in patterns:
    # Compile and count each pattern separately
    rex = re.compile(pattern)
    count = len(rex.findall(txt))
    result[pattern] = result.get(pattern, 0) + count

print(result)
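The loop above could be wrapped in a function along these lines. This is a minimal sketch, not the answer author's code: the name `count_patterns` and the convention of writing the wildcard as a bare `*` token between two words are assumptions for illustration. It translates `*` to `\S+` (exactly one whitespace-delimited word), escapes the literal words, and uses a lookahead so that overlapping occurrences are all counted.

```python
import re

def count_patterns(patterns, text):
    """Count each pattern separately; '*' stands for exactly one word."""
    counts = {}
    for pattern in patterns:
        # Escape literal words; turn the '*' token into one run of non-space characters.
        parts = [r'\S+' if word == '*' else re.escape(word) for word in pattern.split()]
        # \s+ lets a pattern span a line break; the lookahead makes each match
        # zero-width, so overlapping occurrences are counted too.
        rex = re.compile(r'(?=\b' + r'\s+'.join(parts) + r'\b)')
        counts[pattern] = sum(1 for _ in rex.finditer(text))
    return counts

sample = "world hello awesome. duis id awesome hello lorem. world dolor awesome"
print(count_patterns(['world', 'awesome hello', 'world * awesome'], sample))
```

Because `\S+` also accepts punctuation attached to a word, a stricter definition of "word" (for example `\w+`) may suit real data better.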
EntGriff
  • Cool, thank you. Is there a way to speed it up? When I multiply txt by 10,000 and increase the number of patterns to 1,000 it gets really slow: it took 5 minutes to finish. – Superbman Feb 02 '19 at 21:48
  • Of course it will be slow; regex adds some overhead. I think it would be better to use a normal string search ('word' in txt) where you don't have wildcards. – EntGriff Feb 02 '19 at 22:08
  • Also, see this article: https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f – EntGriff Feb 02 '19 at 22:09
  • Wow, FlashText looks like what I need, because my number of patterns can blow up to 1,000,000 in some cases. I'm going to test it out. Thank you. – Superbman Feb 02 '19 at 22:14
  • EntGriff, I just tested it. It can't handle overlaps and has no wildcard support, although it's fast. – Superbman Feb 02 '19 at 22:46
  • I just linked it to help understand how regex works and how to improve on it. Maybe it's not exactly a solution for your problem, but you can consider the idea from the article. By the way, you can try what I said in my previous comment: use regex only where you have wildcards, and plain text search in the other cases. – EntGriff Feb 02 '19 at 22:50
  • This looks like a solution: https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3 but the problem is that I have multiple words in a pattern instead of a single word. – Superbman Feb 02 '19 at 23:00
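The suggestion made in the comments (plain substring search for literal patterns, regex only where a wildcard is present) might look roughly like this. It is a sketch under assumptions: `count_hybrid` is a hypothetical name, the wildcard is written as `*`, and `str.count` counts non-overlapping substring occurrences without regard to word boundaries, so `world` would also be counted inside `worlds`.

```python
import re

def count_hybrid(patterns, text):
    """Use str.count for literal patterns and a regex only for wildcard patterns."""
    counts = {}
    for pattern in patterns:
        if '*' not in pattern:
            # Plain substring search: no regex engine involved.
            counts[pattern] = text.count(pattern)
        else:
            # Wildcard pattern: '*' becomes one run of non-space characters.
            parts = [r'\S+' if w == '*' else re.escape(w) for w in pattern.split()]
            rex = re.compile(r'\s+'.join(parts))
            counts[pattern] = sum(1 for _ in rex.finditer(text))
    return counts

sample = "world hello awesome and awesome hello, then world dolor awesome"
print(count_hybrid(['awesome', 'awesome hello', 'world * awesome'], sample))
```

Whether this is actually faster on real data would need to be measured; it still makes one pass over the text per pattern.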
0

You could look into

re.finditer()

Iterators save a lot of resources if you don't need all the data at once (which you rarely do), because you don't have to hold every match in memory. See also the question "Do iterators save memory in Python?"
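As a rough illustration of the point above: `re.findall` builds the full list of matches before they can be counted, while `re.finditer` yields match objects one at a time, so a count can be taken without storing every match. The sample text is made up for the example.

```python
import re

text = "world hello awesome, world dolor awesome"
rex = re.compile(r'awesome')

# findall materialises every match in a list first.
count_list = len(rex.findall(text))

# finditer yields matches lazily; summing 1 per match never stores them.
count_lazy = sum(1 for _ in rex.finditer(text))

print(count_list, count_lazy)
```

This reduces memory use; it does not by itself make the matching faster.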

Auss
  • I'll look into it, but I still can't match patterns with a space between them. I tried replacing `*` with `(.*?)` to get an "any word between" pattern, but it got even worse. – Superbman Feb 02 '19 at 20:29
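Two behaviours of `re.findall` probably explain what the comment above describes. When a pattern contains a single capturing group, `findall` returns the text captured by the group instead of the whole match. And in an alternation, the first alternative that matches at a position wins, so a shorter pattern listed first hides a longer one. A small demonstration on a made-up string:

```python
import re

text = "world hello awesome"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r'world (.*?) awesome', text)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r'world (?:.*?) awesome', text)

# In an alternation the first alternative that matches wins, so the longer
# phrase is never reported when a shorter alternative is listed first.
shadowed = re.findall(r'world|world hello awesome', text)

print(captured)  # ['hello']
print(whole)     # ['world hello awesome']
print(shadowed)  # ['world']
```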