
I have a lot of XML documents with mixed content, i.e. they contain paragraphs of normal text interspersed with XML tags (for formatting etc.) that are unrelated to the document structure.

I need to segment these paragraphs within the existing XML document:

  • identify permissible "breakpoints" based on textual criteria (sentence boundaries - full stop, tab etc.)

  • divide the paragraph into segments defined by adjacent pairs of breakpoints (segment start and end points), i.e. wrap everything between the two breakpoints in a <seg> tag.

  • The paragraph start and end are also valid breakpoints.

  • But a breakpoint pair cannot be used if it clashes with the XML structure.
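The first two steps can be sketched on the flattened paragraph text, before worrying about the tags. This is only a minimal sketch: the `BOUNDARY` pattern (whitespace after `.`, `!` or `?`, or a run of tabs) and the helper name `candidate_spans` are assumptions made for illustration, not a real sentence splitter. It uses the standard library's `xml.etree.ElementTree`; lxml's `itertext()` should behave the same way here.

```python
import re
import xml.etree.ElementTree as ET

# Assumed breakpoint rule: whitespace that follows sentence-final
# punctuation, or a run of tabs. Replace with your own criteria.
BOUNDARY = re.compile(r"(?<=[.!?])\s+|\t+")


def candidate_spans(par):
    """Return (start, end) text offsets of candidate segments in `par`.

    Offsets index into the concatenated text of the paragraph, ignoring
    all tags. The paragraph start and end count as breakpoints too.
    """
    text = "".join(par.itertext())
    spans = []
    start = 0
    for gap in BOUNDARY.finditer(text):
        if gap.start() > start:          # skip empty spans
            spans.append((start, gap.start()))
        start = gap.end()                # next segment starts after the gap
    if start < len(text):
        spans.append((start, len(text)))
    return spans


par = ET.fromstring("<par>Hello <x>you</x>. How are you?</par>")
print(candidate_spans(par))  # [(0, 10), (11, 23)]
```

These spans are only candidates: whether a given pair of offsets can actually be wrapped in a <seg> tag still depends on the XML structure.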

The simplest example goes like this:

<par>Hello <x>you</x>. How are you?</par> might be segmented into <par><seg>Hello <x>you</x>.</seg> <seg>How are you?</seg></par>

But when the interspersed tags span across a potential breakpoint:

<par>Hello <x>you. How are you</x>?</par> cannot be split up, and the only valid result is <par><seg>Hello <x>you. How are you</x>?</seg></par>

A complication is that a breakpoint, if defined simply as a text index, is ambiguous in terms of the XML structure, e.g.:

<par><x>Hello you. How are you?</x></par> can only be split with all breakpoints inside the <x> tag as <par><x><seg>Hello you.</seg> <seg>How are you?</seg></x></par>
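One way to make this ambiguity explicit is to map each text offset to every tree "level" that is open at that offset, and to accept a breakpoint pair only if both offsets share a level. The sketch below is an illustration of that idea, not a complete segmenter: the helper names `candidate_paths` and `can_wrap` are hypothetical, a level is represented as a tuple of child indices counted from the paragraph element, and the offsets are assumed to index into the concatenated paragraph text.

```python
import xml.etree.ElementTree as ET


def candidate_paths(par, offsets):
    """Map each text offset to the set of levels available at it.

    A level is a tuple of child indices from `par` down to the element
    whose text or tail covers the offset. An offset that touches a tag
    boundary belongs to several levels, which is the ambiguity.
    """
    found = {o: set() for o in offsets}
    pos = 0  # running offset into the concatenated text

    def note(lo, hi, path):
        # A text or tail node spanning [lo, hi] offers `path` at both ends.
        for o in found:
            if lo <= o <= hi:
                found[o].add(path)

    def walk(elem, path):
        nonlocal pos
        text = elem.text or ""
        note(pos, pos + len(text), path)
        pos += len(text)
        for i, child in enumerate(elem):
            walk(child, path + (i,))
            tail = child.tail or ""      # the tail sits at the parent's level
            note(pos, pos + len(tail), path)
            pos += len(tail)

    walk(par, ())
    return found


def can_wrap(par, start, end):
    """True if offsets `start` and `end` share at least one level."""
    paths = candidate_paths(par, [start, end])
    return bool(paths[start] & paths[end])


spanning = ET.fromstring("<par>Hello <x>you. How are you</x>?</par>")
print(can_wrap(spanning, 0, 10))   # False: the break falls inside <x>
print(can_wrap(spanning, 0, 23))   # True: the whole paragraph

nested = ET.fromstring("<par><x>Hello you. How are you?</x></par>")
print(can_wrap(nested, 0, 10))     # True: both ends can sit inside <x>
```

Because levels are identified by position in the tree rather than by tag name, two offsets that share a level always enclose balanced markup. Choosing where to insert the <seg> element and merging spans that cannot be wrapped are left out of this sketch.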

I've been trying to do this with lxml, but that quickly became rather complicated. Each segment's start and end breakpoints have to be at the same "level" within the tree, but that could mean being in the text property of one tag and the tail of another; inserting a new tag means moving some of the surrounding text to other tags; the "level" is ambiguous for empty text/tails; and so on. It didn't feel very natural at all.

What's a better way to do this?

Thank you so much!

Monkimo

1 Answer


The best option for transforming XML is XSLT. As a starting point, take a look at "How to transform an XML file using XSLT in Python?"

The question "Tokenize mixed content in XSLT" also explains some of the basics you might need.

Siebe Jongebloed