regular expression code size limit exceeded python

Question

I'm using a dict file and regular expressions to replace some words in a script, but I have now run into this error:

Exception caught in plugin < class 'pagerprinter.plugins.tts.TTS' >
regular expression code size limit exceeded

My dict has about 5300 entries, laid out as:

'SE': 'South East',
'NE': 'North East',

You get the idea: changing abbreviations to full words. On average, 6 to 8 abbreviations are changed.

For this I'm using:

import re
from abbreviations import abbreviations  # my dict
pattern = re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')
msg = pattern.sub(lambda x: abbreviations[x.group()], msg)

I also use 4 more regexes for other tasks, such as removing words and numbers from a number of strings.

What causes the error I get? If I remove my dict it works, and if I have 300 entries it works.

From what I found on Google, most people say there is no limit on dict sizes.

I tried to reproduce your error using a 99,000 element dict (based on a list of English words), but the code worked fine. A more complete example would help, but that's tricky given the 5000-entry dictionary. — Rory Yorke, Oct 11 '15 at 10:01
The limit is on the length of regular expressions, if I'm not mistaken. Just go through the dictionary in smaller chunks and do the replacements for each of them. — L3viathan, Oct 11 '15 at 10:22
@Rory Yorke the dict can be downloaded from GitHub if required — shaggs, Oct 11 '15 at 10:27
I'm not quite sure, but I think there's simply a size limit for regular expressions. — L3viathan, Oct 11 '15 at 10:28
@L3viathan any idea what the limit is? My test re string is 938853 chars — Rory Yorke, Oct 11 '15 at 10:28
@RoryYorke What does your test string look like? [It appears](http://stackoverflow.com/questions/1998261/pythons-regular-expression-source-string-length) that there is a limit of an individual item, not on the entire string, but I don't know what that looks like exactly. — L3viathan, Oct 11 '15 at 10:37
@L3viathan it's r'\b(word1|word2|....|word99000)\b', much like the question. — Rory Yorke, Oct 11 '15 at 10:43
@shaggs, I think having a link to github in the question would help — Rory Yorke, Oct 11 '15 at 10:46
My string or dict is nowhere near that size, so why do I get an error? Would I maybe have to split up my dict into groups, e.g. {'north east': ['NE', 'N/E']}, so do it in reverse? — shaggs, Oct 11 '15 at 10:46
Wait a sec, I will change the code so it runs without you needing to mess with the .ini file — shaggs, Oct 11 '15 at 10:49
I've got abbreviations.py. It has a bug (missing comma on line 740) ? Hm. [edit: not all] A few of the lines after 740 are missing a trailing comma. — Rory Yorke, Oct 11 '15 at 10:52
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/91950/discussion-between-rory-yorke-and-shaggs). — Rory Yorke, Oct 11 '15 at 10:56

Sjuul Janssen · Answer 1 · 2015-10-11T12:12:03.310

2

Just as L3viathan mentions, you're building a regex pattern that is too long. This line is your problem:

re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')

The longer your abbreviations dict grows, the longer the regex pattern grows. You'll have to either split it into two or more regexes or use another solution.
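One possible "other solution", as a rough sketch rather than anything taken from the question's actual script: keep the regex tiny and let the dict do the matching. The pattern below matches every run of word characters and looks each one up, so the pattern size no longer depends on how many abbreviations there are. The sample dict and the names `WORD` and `expand` are made up for illustration, and this only works if every key consists purely of word characters (a key like `N/E` would never be matched as one word).

```python
import re

# Hypothetical stand-in for the question's ~5300-entry dict.
abbreviations = {'SE': 'South East', 'NE': 'North East', 'RD': 'Road'}

# One small, fixed pattern: match each run of word characters.
WORD = re.compile(r'\b\w+\b')

def expand(msg, table=abbreviations):
    # Look every matched word up in the dict; words that are not
    # abbreviations are put back unchanged.
    return WORD.sub(lambda m: table.get(m.group(), m.group()), msg)

print(expand('Fire at NE corner of Smith RD'))
```

Because the whole message is handled in a single pass, replaced text is never scanned again for further abbreviations.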

Edit, to answer a question below: you could do it like this:

from abbreviations import dct1, dct2, dct3
import re

for dct in (dct1, dct2, dct3):
    pattern = re.compile(r'\b(' + '|'.join(dct.keys()) + r')\b')
    msg = pattern.sub(lambda x: dct[x.group()], msg)

where dct1, dct2 and dct3 are your categories.
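If there is no natural way to split the dict into categories by hand, the same idea can be applied automatically. The following is only a sketch under assumptions: the helper names `chunked_patterns` and `expand` are invented here, and the chunk size of 500 is an arbitrary guess (the question only reports that 300 entries worked). Keys are passed through `re.escape` so that any regex metacharacters in an abbreviation are matched literally.

```python
import re
from itertools import islice

def chunked_patterns(table, size=500):
    """Compile one alternation pattern per chunk of at most `size` keys."""
    # Longest keys first, so a short key cannot shadow a longer one.
    keys = iter(sorted(table, key=len, reverse=True))
    patterns = []
    while True:
        chunk = list(islice(keys, size))
        if not chunk:
            break
        alternation = '|'.join(re.escape(key) for key in chunk)
        patterns.append(re.compile(r'\b(' + alternation + r')\b'))
    return patterns

def expand(msg, table, patterns):
    # Apply each chunk's pattern in turn to the same message.
    for pattern in patterns:
        msg = pattern.sub(lambda m: table[m.group()], msg)
    return msg

# Hypothetical sample data standing in for the real abbreviations dict.
table = {'SE': 'South East', 'NE': 'North East', 'RD': 'Road'}
patterns = chunked_patterns(table, size=2)
print(expand('NE of Smith RD', table, patterns))
```

One thing to watch with this approach: the passes run one after another, so if a replacement text happens to contain a key from a later chunk, it will be expanded again.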

edited Oct 11 '15 at 12:12

answered Oct 11 '15 at 11:31


OK, so I moved the code above into one part of the script to look for only 3 things in the list, and I still got the error? – shaggs Oct 11 '15 at 11:38
Is it possible to split the dict up, and say look for `road_use = {'RD': 'Road'}` and `Directions = {'NE': 'North East'}`? – shaggs Oct 11 '15 at 11:40
I'm guessing you don't have any context by which you can split the dict into the categories you suggest. You will either have to do that manually or [split the dict into chunks](http://stackoverflow.com/questions/22878743/how-to-split-dictionary-into-multiple-dictionaries-fast) – Sjuul Janssen Oct 11 '15 at 11:55
Doing it manually isn't hard, as the dict is already divided into "sections" by # comments. If I changed it that way, how would I then use it? – shaggs Oct 11 '15 at 11:57
