
I'm using a dict file and regular expressions to change some words in a script, but have now come across this error:

Exception caught in plugin < class 'pagerprinter.plugins.tts.TTS' >
regular expression code size limit exceeded

My dict has some 5300 entries, set out as:

'SE': 'South East',
'NE': 'North East',

You get the idea: changing abbreviations to full words. On average, 6 to 8 abbreviations are changed per message.

For this I'm using:

import re
from abbreviations import abbreviations  # my dict
pattern = re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')
msg = pattern.sub(lambda x: abbreviations[x.group()], msg)

I also use 4 more regexes for other tasks, like removing words and numbers from a number of strings.

What is the cause of the error I get? If I remove my dict it works, and if I have 300 entries it works.

Looking into it on Google, most people say that there are no limits on dict sizes.

shA.t
shaggs
  • I tried to reproduce your error using a 99,000 element dict (based on a list of English words), but the code worked fine. A more complete example would help, but that's tricky given the 5000-entry dictionary. – Rory Yorke Oct 11 '15 at 10:01
  • The limit is on the length of regular expressions, if I'm not mistaken. Just go through the dictionary in smaller chunks and do the replacements for each of them. – L3viathan Oct 11 '15 at 10:22
  • How do you mean length? As in code in one line? – shaggs Oct 11 '15 at 10:26
  • @RoryYorke the dict can be downloaded from GitHub if required – shaggs Oct 11 '15 at 10:27
  • I'm not quite sure, but I think there's simply a size limit for regular expressions. – L3viathan Oct 11 '15 at 10:28
  • @L3viathan any idea what the limit is? My test re string is 938853 chars – Rory Yorke Oct 11 '15 at 10:28
  • @L3viathan I looked on google but no defined answer – shaggs Oct 11 '15 at 10:30
  • @RoryYorke What does your test string look like? [It appears](http://stackoverflow.com/questions/1998261/pythons-regular-expression-source-string-length) that there is a limit of an individual item, not on the entire string, but I don't know what that looks like exactly. – L3viathan Oct 11 '15 at 10:37
  • @L3viathan it's r'\b(word1|word2|....|word99000)\b', much like the question. – Rory Yorke Oct 11 '15 at 10:43
  • @shaggs, I think having a link to github in the question would help – Rory Yorke Oct 11 '15 at 10:46
  • My string or dict is nowhere near that size, so why do I get an error? Would I have to maybe split up my dict into groups e.g {'north east': ['NE', N/E, ] so do it in reverse? – shaggs Oct 11 '15 at 10:46
  • github.com/Shaggs/cfsprinter – shaggs Oct 11 '15 at 10:47
  • Wait a sec will change code to run without you needing to mess with .ini file – shaggs Oct 11 '15 at 10:49
  • I've got abbreviations.py. It has a bug (missing comma on line 740) ? Hm. [edit: not all] A few of the lines after 740 are missing a trailing comma. – Rory Yorke Oct 11 '15 at 10:52
  • Yeah, pushing an update; I've fixed that – shaggs Oct 11 '15 at 10:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/91950/discussion-between-rory-yorke-and-shaggs). – Rory Yorke Oct 11 '15 at 10:56

1 Answer


Just as L3viathan mentions, you're building a regex pattern that is too long. This line is your problem:

re.compile(r'\b(' + '|'.join(abbreviations.keys()) + r')\b')

The more entries your abbreviations dict has, the longer the regex pattern grows. You'll have to either split the work across 2 or more regexes or use another solution.
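One possible "other solution", sketched here as an illustration rather than something from the question: match every word with one small, fixed pattern and look each word up in the dict, so the pattern size no longer depends on the number of entries. The sample `abbreviations` dict and the `expand` helper are hypothetical stand-ins, and the sketch assumes every key is a single run of word characters (a key like `N/E` would not be matched).

```python
import re

# Hypothetical stand-in for the real 5300-entry dict.
abbreviations = {'SE': 'South East', 'NE': 'North East', 'RD': 'Road'}

# One small, fixed pattern that matches each whole word.
# Its compiled size does not grow with the dict.
WORD = re.compile(r'\b\w+\b')

def expand(msg, table=abbreviations):
    """Replace each word that is a key in `table`; leave other words as-is."""
    return WORD.sub(lambda m: table.get(m.group(), m.group()), msg)

print(expand('FIRE AT 12 SMITH RD NE OF TOWN'))
# prints: FIRE AT 12 SMITH Road North East OF TOWN
```

Each word costs one dict lookup, so this should stay cheap however many abbreviations are added.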

Edit to answer a question below, you could do it like this:

from abbreviations import dct1, dct2, dct3
import re

for dct in (dct1, dct2, dct3):
    pattern = re.compile(r'\b(' + '|'.join(dct.keys()) + r')\b')
    msg = pattern.sub(lambda x: dct[x.group()], msg)

where dct1, dct2, and dct3 are your categories.
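If the dict has no natural categories, it could also be split automatically. The following is a minimal sketch: `chunked` and `expand` are hypothetical helpers, and the default chunk size of 300 is only an assumption taken from the question's remark that 300 entries worked. `re.escape` is added in case a key contains a regex metacharacter.

```python
import re
from itertools import islice

def chunked(d, size):
    """Yield successive sub-dicts of at most `size` items from `d`."""
    it = iter(d.items())
    while True:
        chunk = dict(islice(it, size))
        if not chunk:
            return
        yield chunk

def expand(msg, table, size=300):
    """Apply the abbreviation table one chunk (one smaller regex) at a time."""
    for part in chunked(table, size):
        # Escape keys so characters like '.' or '/' are matched literally.
        pattern = re.compile(r'\b(' + '|'.join(map(re.escape, part)) + r')\b')
        msg = pattern.sub(lambda m, part=part: part[m.group()], msg)
    return msg

table = {'SE': 'South East', 'NE': 'North East', 'RD': 'Road'}
print(expand('SMITH RD NE', table, size=2))
# prints: SMITH Road North East
```

One caveat of running several passes: text produced by an earlier chunk can be matched again by a later one, so a replacement value that itself contains another abbreviation may be expanded a second time.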

Sjuul Janssen
  • OK, so I moved the code above to one part of the script to find only 3 things in the list and I still got the error? – shaggs Oct 11 '15 at 11:38
  • is it possible to split the dict up ? and say look for `road-use= {'RD': 'Road'} Directions= {'NE': 'North East'}` – shaggs Oct 11 '15 at 11:40
  • I'm guessing you don't have any context by which you can split the dict into the categories you suggest. You will either have to do that manually or [split the dict into chunks](http://stackoverflow.com/questions/22878743/how-to-split-dictionary-into-multiple-dictionaries-fast) – Sjuul Janssen Oct 11 '15 at 11:55
  • Doing it manually isn't hard, as it's already in "sections" marked by #, so if I changed it, how would I accomplish it that way? – shaggs Oct 11 '15 at 11:57