I'm a beginner programmer and I'm stuck on this possibly easy problem: I want to automatically add numbers to the sentences contained in the P tags of an .xml file. A sample paragraph in the .xml file looks like this:

<P>Sentence1. Sentence2. Sentence3.</P>

I want to transform this into:

<P><SUP>1</SUP>Sentence1.<SUP>2</SUP> Sentence2.<SUP>3</SUP> Sentence3.</P>

However, only P tags containing at least 2 sentences should be numbered; if a tag contains only 1 sentence, I want to leave it unchanged.

Here is the approach I have come up with so far, using regular expressions:

\.\s.*
# Reliably finds the second sentence; insert <SUP>2</SUP> after it.
<P>[^>]*<SUP>2
# Finds the beginning of the first sentence if a second sentence exists.

However, I feel like this is a really awkward approach, and I wouldn't know how to extend it to paragraphs containing 20 sentences or more, or to .xml documents containing many paragraphs. Is there a better regular expression to achieve this, or a better (Python) tool than regular expressions?

Elip
  • Regular expressions can't really count. – JeffS Sep 28 '12 at 16:32
  • [Regular expressions can't really parse XML.](http://www.stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Martin Ender Sep 28 '12 at 16:35
  • pyparsing would likely be a much better option for this ... or something other than regex... regex is not a good solution for this problem – Joran Beasley Sep 28 '12 at 16:36
  • Not only can regular expressions not count, they can't grok XML, either. Use an XML library to process XML - it's the sensible choice! `lxml` is good - flexible and unobtrusive. – D_Bye Sep 28 '12 at 16:42
  • Thanks for the comments! I edited the question to remove the regex focus. – Elip Sep 28 '12 at 16:46

1 Answer

Something like this (very untested) might work:

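import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

tree = ET.parse(XML_FILE)
root = tree.getroot()

for p in root.iter('P'):  # tag names are case-sensitive
    text = (p.text or '').strip()
    body = text.rstrip('.')  # set the final period aside so split() leaves no empty piece
    sentences = body.split('.')  # naive sentence split
    if len(sentences) < 2:
        continue  # leave single-sentence paragraphs unchanged
    # count starts at 0; escape() guards '&' and '<' in the sentence text
    markup = ".".join([("<SUP>%i</SUP>" % count) + escape(sentence) for count, sentence in enumerate(sentences)]) + text[len(body):]
    # Markup assigned to p.text would be escaped on write, so parse it into real elements
    p.text = None
    p.extend(list(ET.fromstring("<P>%s</P>" % markup)))

tree.write(XML_FILE)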
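
One thing to watch with ElementTree: a string assigned to `.text` is treated as character data, so any tags written inside it come out escaped when the tree is serialized. The small sketch below (the variable names are made up for illustration) contrasts that with creating a real child element and putting the sentence in its `.tail`:

```python
import xml.etree.ElementTree as ET

# Markup assigned to .text is treated as plain character data and escaped.
escaped = ET.Element('P')
escaped.text = "<SUP>1</SUP>Sentence1."
escaped_xml = ET.tostring(escaped, encoding='unicode')

# Creating a child element yields real markup; the sentence goes in its tail.
built = ET.Element('P')
sup = ET.SubElement(built, 'SUP')
sup.text = "1"
sup.tail = "Sentence1."
built_xml = ET.tostring(built, encoding='unicode')

print(escaped_xml)  # <P>&lt;SUP&gt;1&lt;/SUP&gt;Sentence1.</P>
print(built_xml)    # <P><SUP>1</SUP>Sentence1.</P>
```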
JeffS
  • Obviously you'd want to do something smarter than .split() to actually decide what the sentence boundaries are – JeffS Sep 28 '12 at 16:49
  • Thank you very much for your answer! ElementTree is certainly the way to go. I'm unfamiliar with the "% count) + sentence for count, sentence in enumerate(sentences)" bit, what would I need to do if I wanted to use the value count+1? (Since at the moment sup starts at 0) Simply adding +1 leads to problems because of concatenation of integer and string. – Elip Sep 28 '12 at 17:45
  • `("%i" % (count + 1))` would work. The % is string formatting – JeffS Sep 28 '12 at 18:04
  • To split the text on sentence boundaries more robustly you could use `nltk.tokenize.punkt.PunktSentenceTokenizer`. Here's a [usage example](http://stackoverflow.com/a/12030877/4279). – jfs Sep 30 '12 at 12:43
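
Pulling the points from this thread together, here is a minimal sketch of one way the whole task might look: 1-based numbering via `count + 1`, paragraphs with fewer than two sentences left alone, and the sentence splitter passed in as a function so it can be swapped out. The names `split_sentences` and `number_sentences` are hypothetical, the regex splitter is a simple stand-in rather than Punkt, and the sketch assumes each P element holds only text (paragraphs with child elements are skipped).

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def number_sentences(xml_string, split=split_sentences):
    """Put <SUP>n</SUP> before each sentence of every <P> with two or more sentences."""
    root = ET.fromstring(xml_string)
    for p in root.iter('P'):
        if len(p):
            continue  # assumption: only handle paragraphs without child elements
        sentences = split(p.text or '')
        if len(sentences) < 2:
            continue  # single-sentence paragraphs stay unchanged
        p.text = None
        for count, sentence in enumerate(sentences):
            sup = ET.SubElement(p, 'SUP')
            sup.text = "%i" % (count + 1)  # 1-based numbering
            # The sentence text lives in the tail of its <SUP> marker.
            sup.tail = sentence if count == 0 else ' ' + sentence
    return ET.tostring(root, encoding='unicode')


doc = "<doc><P>Sentence1. Sentence2. Sentence3.</P><P>Only one.</P></doc>"
result = number_sentences(doc)
print(result)
```

Any callable that takes a string and returns a list of sentences should work as `split`; for example, the `tokenize` method of a `PunktSentenceTokenizer` instance could presumably be passed in place of the regex splitter if nltk is installed.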