Suppose I have a text like this:

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
EMC 

It is part of a PDF file. The line

[(\()-2(Y)7(o)7(u've )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ

contains the text "(You've got it)". So I first need to match text lines

^\[(.*)\]TJ$

With that capture group in hand, I can apply \(((.*?)\)[-0-9]*) to it and replace every match with \2.
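As a minimal sketch of the two-step approach described above, using Python's standard `re` module: the helper name `extract_text` and the variable names are illustrative, and the square brackets in the first pattern are escaped on the assumption that literal `[` and `]` are meant.

```python
import re

# Sample line from the question (raw string so the PDF escapes are kept as-is).
line = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ"

LINE_RE = re.compile(r"^\[(.*)\]TJ$")           # step 1: find a text line
PIECE_RE = re.compile(r"\(((.*?)\)[-0-9]*)")    # step 2: one "(text)kerning" piece


def extract_text(pdf_line):
    """Return the text of a TJ line, or None if the line is not a TJ line."""
    m = LINE_RE.match(pdf_line)
    if m is None:
        return None
    inner = m.group(1)
    # Keep only the content of each parenthesised piece (group 2).
    return PIECE_RE.sub(r"\2", inner)


inner = LINE_RE.match(line).group(1)
text = extract_text(line)
print(text)
```

On this sample line the result is `\(You gotit\)`: the escaped parentheses stay in the output, and no space appears between "got" and "it" because the sample contains none.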

Is it possible to do this in one step?

Cœur
Martin Thoma
  • Not possible with `re` in Python. Possible with the `regex` package, but [you don't want to do it](https://stackoverflow.com/questions/15268504/collapse-and-capture-a-repeating-pattern-in-a-single-regex-expression/15418942#15418942) unless you have no choice but to use a single regex. I'm not sure if there is any exotic feature in `regex` that would help, though. – nhahtdh Jul 20 '17 at 14:43
  • @nhahtdh: the `regex` module has all the features of your most crazy dreams. – Casimir et Hippolyte Jul 20 '17 at 14:46
  • @nhahtdh I see. Could you please post a link to the documentation of the `regex` module? – Martin Thoma Jul 20 '17 at 14:51
  • https://pypi.python.org/pypi/regex/ – nhahtdh Jul 20 '17 at 14:51
  • So it is unusual to capture the matches of nested groups? – Martin Thoma Jul 20 '17 at 14:52
  • Normally, most languages only give you the last thing a capturing group captured. .NET and the `regex` package support getting all of them. Replacement is another issue - I don't know if `regex` supports replacement with the multiple capture results of a capturing group. I guess you would need to do that manually. – nhahtdh Jul 20 '17 at 14:56
  • Anyway, your current approach (apart from the lax regex - but if it works for your purpose, it's fine) is the simplest approach to the problem. – nhahtdh Jul 20 '17 at 14:58
  • What do you mean by "lax regex"? – Martin Thoma Jul 20 '17 at 15:29
  • Nested groups are nearly impossible with regex. If you do manage to accomplish it, it is a terrible sight. The next step is a recursive descent parser such as `parsimonious`. It is elegant and simple. – pylang Jul 29 '17 at 08:16
  • @nhahtdh when you say "not possible", do you mean nested capturing groups only? I tested non-capturing groups and they seem to work on online Python regex testers (I don't have access to python3 right now). – FarO Mar 28 '22 at 09:29
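
To illustrate the point made in the comments that a repeated capturing group normally keeps only its last capture, here is a small sketch with the standard `re` module. The simplified sample line (no escaped parentheses) and the variable names are assumptions for illustration only.

```python
import re

# Simplified TJ line without escaped parentheses (illustrative input).
line = r"[(Y)7(o)7(u )-3(g)7(o)-2(t)]TJ"

# One pattern for the whole line; group 1 sits inside a repeated group.
match = re.fullmatch(r"\[(?:\(([^)]*)\)[-0-9]*)+\]TJ", line)

# `re` remembers only the final repetition of group 1.
last_only = match.group(1)

# Collecting every piece therefore needs a second pass over the line.
pieces = re.findall(r"\(([^)]*)\)", line)

print(last_only)        # the last piece only
print("".join(pieces))  # all pieces joined
```

This is why a single `re` pattern cannot hand back every piece at once, and why the two-pass approach from the question is a reasonable default.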

2 Answers

Using regular expressions to parse nested groups can be difficult, illegible or impossible to achieve.

One approach for addressing nested groups is to use a parsing grammar. Here is a 3-step example using the parsimonious library by Erik Rose.

Given

import itertools as it

import parsimonious as pars


# Raw string, so the backslashes in the PDF text are kept verbatim
source = r"""/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ
EMC"""

Code

  1. Define a Grammar
rules = r"""

    root            = line line message end

    line            = ANY NEWLINE
    message         = _ TEXT (_ TEXT*)* NEWLINE
    end             = "EMC" NEWLINE*

    TEXT            = ~r"[a-zA-Z ]+" 
    NEWLINE         = ~r"\n"
    ANY             = ~r"[^\n\r]*"

    _               = meaninglessness*
    meaninglessness = ~r"(TJ)*[^a-zA-Z\n\r]*"    

    """
  2. Parse source text and build an AST
grammar = pars.grammar.Grammar(rules)
tree = grammar.parse(source)
# print(tree)
  3. Resolve the AST

class Translator(pars.NodeVisitor):
    
    def visit_root(self, node, children):
        return children

    def visit_line(self, node, children):
        return node.text
    
    def visit_message(self, node, children):
        _, s, remaining, nl = children
        return (s + "".join(it.chain.from_iterable(i[1] for i in remaining)) + nl)
        
    def visit_end(self, node, children):
        return node.text
    
    def visit_meaninglessness(self, node, children):
        return children
    
    def visit__(self, node, children):
        return children[0]
    
    def visit_(self, node, children):
        return children
    
    def visit_TEXT(self, node, children):
        return node.text
    
    def visit_NEWLINE(self, node, children):
        return node.text
    
    def visit_ANY(self, node, children):
        return node.text

Demo

tr = Translator().visit(tree)
print("".join(tr))

Output

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
You got it
EMC

Details

  1. Instead of a rigid (and sometimes illegible) regular expression, we define a set of regex/EBNF-like grammar rules (see the parsimonious docs for details). Once a grammar is defined, it is much easier to adjust when required.
  • Note: the original text was modified by adding a space after the t in -2(t) (line 3), as the space appeared to be missing from the OP's text.
  2. The parsing step is simple: parse the source text based on the grammar. If the grammar is sufficiently defined, an AST is created with nodes that reflect the structure of your source. Having an AST is key, as it makes this approach flexible.
  3. Define what to do when each node is visited. One can resolve an AST using any desired technique. As an example, here we demonstrate the Visitor Pattern by subclassing NodeVisitor from parsimonious.

Now for new or unexpected texts encountered in your PDFs, simply modify the grammar and parse again.

pylang

With the regex module you can use this pattern:

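import regex  # third-party module (pip install regex), not the standard re

pat = r'(?:\G(?!\A)\)|\[(?=[^]]*]))[^](]*\(([^)\\]*(?:\\.[^)\\]*)*)(?:\)[^(]*]TJ)?'
regex.sub(pat, r'\1', s)  # s holds the PDF text to process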

pattern details:

(?: # two possible starts
    \G     # contiguous to a previous match
    (?!\A) # not at the start of the string
    \)     # a literal closing round bracket
  | # OR
    \[          # an opening square bracket
     (?=[^]]*]) # followed by a closing square bracket
)
[^](]* # all that isn't a closing square bracket or an opening round bracket
\(     # a literal opening round bracket
(      # capture group 1
    [^)\\]* # all characters except a closing round bracket or a backslash
    (?:\\.[^)\\]*)* # handles any escaped characters
)
(?: \) [^(]* ] TJ )? # optional end of the square bracket part
Casimir et Hippolyte