Numbering the sentences inside a <P> in a .xml file?

Question

I'm a beginner programmer and I'm stuck on this possibly easy problem: I want to automatically add numbers to the sentences contained in the P tags of an .xml file. A sample paragraph in the .xml file looks like:

<P>Sentence1. Sentence2. Sentence3.</P>

I want to transform this into:

<P><SUP>1</SUP>Sentence1.<SUP>2</SUP> Sentence2.<SUP>3</SUP> Sentence3.</P>

However, only the P tags containing at least 2 sentences should be numbered. If a P tag contains only 1 sentence, I want to leave it unchanged.

Here is the approach I have come up with so far, using regular expressions:

\.\s.*
# Reliably finds the second sentence, Insert <SUP>2</SUP> after it.
<P>[^>]*<SUP>2
# Finds the beginning of the first sentence if a second sentence exists.

However, I feel like this is a really awkward approach that I wouldn't know how to extend to paragraphs containing 20 sentences or more, or to .xml documents containing many paragraphs. Is there a better regular expression to achieve this, or a better (Python) tool than regular expressions?

[Regular expressions can't really parse XML.](http://www.stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Martin Ender, Sep 28 '12 at 16:35
pyparsing would likely be a much better option for this ... or something other than regex... regex is not a good solution for this problem — Joran Beasley, Sep 28 '12 at 16:36
Not only can regular expressions not count, they can't grok XML, either. Use an XML library to process XML - it's the sensible choice! `lxml` is good - flexible and unobtrusive. — D_Bye, Sep 28 '12 at 16:42
Thanks for the comments! I edited the question to remove the regex focus. — Elip, Sep 28 '12 at 16:46

score 2 · Accepted Answer · answered Sep 28 '12 at 16:46

Something like this (very untested) might work

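import xml.etree.ElementTree as ET

tree = ET.parse(XML_FILE)  # XML_FILE is the path to the .xml document
root = tree.getroot()

# XML tag names are case-sensitive, so match the <P> tags from the question.
for p in root.iter('P'):
    # Naive split on '.'; p.text is None for an empty element.
    sentences = (p.text or '').split('.')
    # Caveat: markup assigned to .text is escaped when the tree is written
    # (&lt;sup&gt;...), so this inserts literal text, not real elements.
    p.text = ".".join([("<sup>%i</sup>" % count) + sentence for count, sentence in enumerate(sentences)])

tree.write(XML_FILE)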
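
ElementTree escapes any markup assigned to `.text`, so getting real `<SUP>` elements in the output means creating them as child elements and putting each sentence in the child's `.tail`. The following is a minimal sketch of that idea, not the answerer's code. It assumes each `<P>` contains only plain text and that every sentence ends in a period; the helper name `number_sentences` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def number_sentences(p):
    """Prefix each sentence of a <P> element with a <SUP>n</SUP> child.

    Assumptions: the paragraph holds plain text only (no child elements)
    and every sentence ends with a period.
    """
    pieces = [s for s in (p.text or "").split(".") if s.strip()]
    if len(pieces) < 2:
        return  # fewer than two sentences: leave the paragraph unchanged
    p.text = None
    for count, piece in enumerate(pieces, start=1):
        sup = ET.SubElement(p, "SUP")
        sup.text = str(count)
        sup.tail = piece + "."  # the text that follows </SUP>


# Hypothetical sample document, parsed from a string instead of a file.
root = ET.fromstring(
    "<DOC><P>Sentence1. Sentence2. Sentence3.</P><P>Only one.</P></DOC>"
)
# Materialise the list first so the tree is not modified while iterating.
for p in list(root.iter("P")):
    number_sentences(p)

result = ET.tostring(root, encoding="unicode")
print(result)
```

With the sample input, the first paragraph gains three `<SUP>` children and the single-sentence paragraph is left as it was. For a file on disk, `ET.parse(...)` and `tree.write(...)` would replace the `fromstring`/`tostring` calls.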

answered Sep 28 '12 at 16:46 by JeffS

Obviously you'd want to do something smarter than .split() to actually decide what the sentence boundaries are – JeffS Sep 28 '12 at 16:49
Thank you very much for your answer! ElementTree is certainly the way to go. I'm unfamiliar with the "% count) + sentence for count, sentence in enumerate(sentences)" bit, what would I need to do if I wanted to use the value count+1? (Since at the moment sup starts at 0) Simply adding +1 leads to problems because of concatenation of integer and string. – Elip Sep 28 '12 at 17:45
`("<sup>%i</sup>" % (count + 1))` would work. The % is string formatting – JeffS Sep 28 '12 at 18:04
to split the text on sentence boundaries more robustly you could use `nltk.tokenize.punkt.PunktSentenceTokenizer`. Here's a [usage example](http://stackoverflow.com/a/12030877/4279). – jfs Sep 30 '12 at 12:43
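
The two points raised in these comments (splitting more carefully than `.split('.')` and numbering from 1) can be illustrated with the standard library alone. This is a rough sketch, not a substitute for a trained tokenizer such as Punkt: the hypothetical helper `split_sentences` splits at whitespace that follows `.`, `!` or `?`, so a decimal like `3.50` stays intact but an abbreviation such as "Dr. Smith" would still be split incorrectly.

```python
import re


def split_sentences(text):
    """Split at whitespace that follows '.', '!' or '?'.

    The punctuation stays attached to its sentence. This is a heuristic:
    abbreviations followed by a space are wrongly treated as boundaries.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sentences = split_sentences("Is it done? Yes. It costs 3.50 dollars!")

# enumerate(..., start=1) numbers from 1, so no "count + 1" is needed.
# The strings are for display only; they are not XML elements.
numbered = ["<SUP>%i</SUP>%s" % (count, s)
            for count, s in enumerate(sentences, start=1)]

for line in numbered:
    print(line)
```

Passing `start=1` to `enumerate` is an alternative to writing `count + 1` inside the format expression.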
