I'm not certain that regex is the best approach for this, but it seems to be fairly well suited. Essentially I'm currently parsing some pdfs using pdfminer, and the drawback is that these pdf's are exported powerpoint slides, which means that all animations show up as fairly long copies of strings. Ideally I would like just one copy of each of these strings instead of a copy for each stage of an animation. Right now the current regex pattern I'm using is this:
re.sub(r"([\w^\w]{10,})\1{1,}", "\1", string)
For some reason though, this doesn't seem to change the input string. I feel like for some reason python isn't recognizing the capture group, but I'm not sure how to remedy that issue. Any thoughts appreciated.
Examples:
I would like this
text to be
reduced
I would like this
text to be
reduced
output:
I would like this
text to be
reduced
Update: To get this to pass the pumping lemma I had to specifically make the assertion that all duplicates were adjacent. This was implied before, but I am now making it explicit to ensure that a solution is possible.