Remove all replicas of a string more than x characters long (regex?)

Question

I'm not certain that regex is the best approach for this, but it seems to be fairly well suited. Essentially I'm currently parsing some pdfs using pdfminer, and the drawback is that these pdf's are exported powerpoint slides, which means that all animations show up as fairly long copies of strings. Ideally I would like just one copy of each of these strings instead of a copy for each stage of an animation. Right now the current regex pattern I'm using is this:

re.sub(r"([\w^\w]{10,})\1{1,}", "\1", string)

For some reason though, this doesn't seem to change the input string. I feel like for some reason python isn't recognizing the capture group, but I'm not sure how to remedy that issue. Any thoughts appreciated.

Examples:

I would like this
text to be

reduced
I would like this
text to be

reduced

output:

I would like this
text to be

reduced

Update: To get this to pass the pumping lemma I had to specifically make the assertion that all duplicates were adjacent. This was implied before, but I am now making it explicit to ensure that a solution is possible.

How do you expect someone to answer your question if you don't give a sample of what the regexp is supposed to match? — michaelmeyer, Jun 19 '13 at 22:15
Would this help : http://stackoverflow.com/questions/88615/what-algorithm-can-you-use-to-find-duplicate-phrases-in-a-string ? — SylvainD, Jun 19 '13 at 22:16
@Josay I think using a suffix tree here overcomplicates the problem, that said, if I can't figure this out I might head down that path. — Slater Victoroff, Jun 19 '13 at 22:21
I'm not a python coder, but maybe [this will help](http://regex101.com/r/fP6kH2): `(?s)(.{10,})\s*(?:\1)+`. `(?s)` match new lines with dots, `(.{10,})` group and match anycharacter 10 or more times, `\s*` match 0 or more whitespaces, `(?:\1)+` use group 1 and repeat 1 or more times. Of course you should replace by group 1 `\1`. — HamZa, Jun 19 '13 at 22:47

score 3 · Accepted Answer · edited May 23 '17 at 12:05

regexps are not the right tool for that task. They are based on the theory of context free languages, and they can't match if a string contains duplicates and remove the duplicates. You may find a course on automata and regexps interesting to read on the topic.

I think Josay's suggestion can be efficient and smart, but I think I got a more simple and pythonic solution, though it has its limits. You can split your string into a list of lines, and pass it through a set():

>>> s = """I would like this
... text to be
... 
... reduced
... I would like this
... text to be
... 
... reduced"""
>>> print "\n".join(set(s.splitlines()))
I would like this

text to be
reduced
>>>

The only thing with that solution is that you will loose the original order of the lines (the example being a pretty counter example). Also, if you have the same line in two different contexts, you will end up having only one line.

To fix the first problem, you may have to then iterate over your original string a second time to put that set back in order, or simply use an ordered set.
If you got any symbol separating each slide, it would help you merge only the duplicates, fixing the second problem of that solution.

Otherwise a more sophisticated algorithm would be needed, so you can take into account proximity and context. For that a suffix tree could be a good idea, and there are python libraries for that (cf that SO answer).

edit:

using your algorithm I could make it work, by adding support of multiline and adding spaces and endlines to your text matching:

>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)

Though, afaict the \1 notation is not a regular regular expression syntax in the matching part, but an extension. But it's getting late here, and I may as well be totally wrong. Maybe shall I reread those courses? :-)

I guess that the regexp engine's pushdown automata is able to push matches, because it is only a long multiline string that it can pop to match. Though I'd expect it to have side effects...

The example I gave was simplistic, but sadly in my actual data set I won't be able to count on the string being so neatly separated by newlines. Correct me if I'm wrong, but I believe these two grammars are of precisely the same level and issue of context-free grammars does not apply here. For example the regex: `re.sub(r"(.)\1{4,}", r"\1", string)` would work for single characters and I believe the two problems are of the same grammar level and the proof that one is possible should prove that this is possible. Let me know if I've overlooked something though. — Slater Victoroff, Jun 19 '13 at 22:46
I just checked this against the pumping lemma (http://en.wikipedia.org/wiki/Pumping_lemma_for_context-free_languages) and this description is in fact a context-free grammar, so I believe that it can be done in theory. Though I could only prove it context-free under the assumption that the repeating strings are adjacent, which they are. Not sure if that changes your answer. — Slater Victoroff, Jun 19 '13 at 22:54
Going to check out your update answer, but I've got to go home now so it'll take a bit. — Slater Victoroff, Jun 19 '13 at 23:03

Remove all replicas of a string more than x characters long (regex?)

1 Answers1

Linked