0

Here's one XML doc I'm working with:

<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I'm looping through a set of XML docs to retrieve all sentences that begin with a space. I have no trouble capturing all the errors (leading spaces) with this:

>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}

>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'],ending=u'.xml') # my function to grab all XML files

>>> for docAddr in xmlAddresses:
>>>    parser = etree.XMLParser(encoding=u'utf-8') 
>>>    tree = etree.parse(docAddr, parser=parser) 
>>>    sentences = getTokenTextFeature(docAddr,tree,sentences) 

>>> rgxLeadingSpace = re.compile('^\"? .')
>>> for sent in sentences.keys():
>>>    text = sentences[sent]['sentence']
>>>    if rgxLeadingSpace.findall(text):    
>>>        print text                        # the second sentence is from the above XML doc

" It rallied on ideas the market was oversold , " a trader said . 

" The result of the second year-half is expected to improve on the early part of the year , " Atria said .

" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow . 

What I need to do is, after finding the errors, loop through all the XML files which contain those errors and adjust their START attributes. For example, this is a sentence from the above XML doc that contained a leading space:

<charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

It should look like this:

<charseq START="207" END="310">The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I think I provided all the necessary code. If someone can help me I will create a million StackOverflow accounts and upvote you a million times! :) Thanks!

tmthyjames
  • 1,588
  • 2
  • 22
  • 30
  • I'm using a parser and not soley relying on regex so don't post this link about parsing XML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – tmthyjames Sep 18 '14 at 20:18

2 Answers2

1

The approach I would use would be to not extract out and then search the matching sentences in a separate array as you're doing, but instead while traversing the nodes of the dom check each sentence element against your pattern. That way when you find one, you can use the element object you're visiting directly and modify its START attribute, and then simply write out the modified dom to a new (or replacement) XML file.

Corbell
  • 1,283
  • 8
  • 15
1

I don't know what getTokenTextFeature does, but here is a program that modifies the XML in the manner you asked for.

xml='''<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>
</extent></span></document>
'''

import re
import xml.etree.ElementTree as etree

root = etree.XML(xml)
for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
  match = re.match('^("? +)(.*)', charseq.text)
  if match:
    space,text = match.groups()
    charseq.set('START', str(int(charseq.get('START')) + len(space)))
    charseq.text = text
print etree.tostring(root)
Robᵩ
  • 163,533
  • 20
  • 239
  • 308
  • Thank you! Is there a way to do this without stripping out the namespace from the XML? I used your method to write new XML docs without the errors, but the new docs don't have the original namespace. – tmthyjames Sep 19 '14 at 20:27
  • That is a whole new question. Please prepare a short, stand-alone demonstration program (like the one I put in the answer), and ask a new question, copy-pasting that entire short program into your new question. (The answer will be: "They actually do have the original namespace, just not the same prefix. But the prefix doesn't matter, anyway.") – Robᵩ Sep 19 '14 at 20:34
  • 1
    @tmthyjames - And, if you want specific prefixes, look at the [`register_namespace`](https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.register_namespace) function. – Robᵩ Sep 19 '14 at 20:40