Parsing nested groups with regular expressions can be difficult, illegible, or outright impossible.
One approach for addressing nested groups is to use a parsing grammar. Here is a 3-step example using the parsimonious
library by Erik Rose.
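To make the difficulty concrete, here is a small sketch using only the standard library. It applies a naive "everything between parentheses" pattern to the text-showing line from the sample used in this answer. The pattern and the variable names are my own illustration, not something from the question.

```python
import re

# The text-showing line from the sample; \( and \) are escaped parentheses.
line = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ"

# Naive pattern: capture whatever sits between "(" and the next ")".
naive_pieces = re.findall(r"\(([^)]*)\)", line)
naive_text = "".join(naive_pieces)

# The escaped parentheses confuse the pattern and leave stray characters.
print(naive_text)  # \(You got it\
```

The pattern can be patched to cope with the escapes, but each patch makes it harder to read, which is the motivation for a grammar.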
Given
import itertools as it
import parsimonious as pars
source = """\
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ
EMC"""
Code
- Define a Grammar
rules = r"""
root = line line message end
line = ANY NEWLINE
message = _ TEXT (_ TEXT*)* NEWLINE
end = "EMC" NEWLINE*
TEXT = ~r"[a-zA-Z ]+"
NEWLINE = ~r"\n"
ANY = ~r"[^\n\r]*"
_ = meaninglessness*
meaninglessness = ~r"(TJ)*[^a-zA-Z\n\r]*"
"""
- Parse source text and Build an AST
grammar = pars.grammar.Grammar(rules)
tree = grammar.parse(source)
# print(tree)
- Resolve the AST
class Translator(pars.NodeVisitor):
    def visit_root(self, node, children):
        return children

    def visit_line(self, node, children):
        return node.text

    def visit_message(self, node, children):
        _, s, remaining, nl = children
        # Each item in `remaining` is one (_ TEXT*) group; keep only its TEXT strings.
        return (s + "".join(it.chain.from_iterable(i[1] for i in remaining)) + nl)

    def visit_end(self, node, children):
        return node.text

    def visit_meaninglessness(self, node, children):
        return children

    def visit__(self, node, children):
        # Handles the rule named "_".
        return children[0]

    def visit_(self, node, children):
        # Handles anonymous nodes (no rule name), such as the parenthesized group.
        return children

    def visit_TEXT(self, node, children):
        return node.text

    def visit_NEWLINE(self, node, children):
        return node.text

    def visit_ANY(self, node, children):
        return node.text
Demo
tr = Translator().visit(tree)
print("".join(tr))
Output
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
You got it
EMC
Details
- Instead of a rigid (and sometimes illegible) regular expression, we define a set of regex/EBNF-like grammar rules (see the docs for details). Once a grammar is defined, it can be much easier to adjust if required.
- Note: the original text was modified by adding a space to 2(t) (line 3 of the source), as it was believed to be missing from the OP.
- The parsing step is simple: just parse the source text based on the grammar. If the grammar is sufficiently defined, an AST is created with nodes that reflect the structure of your source. Having an AST is key, as it makes this approach flexible.
- Define what to do when each node is visited. One can resolve an AST using any desired technique. As an example, here we demonstrate the Visitor Pattern by subclassing NodeVisitor from parsimonious.
Now, for new or unexpected text encountered in your PDFs, simply modify the grammar and parse again.
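For readers unfamiliar with the Visitor Pattern used in the third step, here is a simplified, standard-library-only sketch of name-based dispatch. The Node, MiniVisitor and Joiner classes are hypothetical stand-ins written for this illustration; they are not parsimonious's actual code. As I understand it, the library looks up a method called visit_ plus the rule name, which is why the rule named _ is handled by visit__ and anonymous nodes by visit_.

```python
class Node:
    """Hypothetical stand-in for a parse-tree node."""

    def __init__(self, expr_name, text, children=()):
        self.expr_name = expr_name  # rule name; empty string for anonymous nodes
        self.text = text            # the slice of source this node matched
        self.children = list(children)


class MiniVisitor:
    """Simplified illustration of dispatching on the rule name."""

    def visit(self, node):
        # Resolve the children first (bottom-up), then call the matching handler.
        visited = [self.visit(child) for child in node.children]
        handler = getattr(self, "visit_" + node.expr_name, self.generic_visit)
        return handler(node, visited)

    def generic_visit(self, node, visited_children):
        raise NotImplementedError("no handler for rule %r" % node.expr_name)


class Joiner(MiniVisitor):
    def visit_message(self, node, visited_children):
        return "".join(visited_children)

    def visit_TEXT(self, node, visited_children):
        return node.text

    def visit_(self, node, visited_children):
        # "visit_" + "" resolves here, so this handles anonymous nodes.
        return ""


tree = Node("message", "(Y)7(o)", [
    Node("TEXT", "Y"),
    Node("", ")7("),  # anonymous filler between the two letters
    Node("TEXT", "o"),
])
result = Joiner().visit(tree)
print(result)  # Yo
```

Because each rule gets its own small method, changing how one kind of node is resolved does not disturb the others.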