
I'm working on a program that parses the SGML files of the Reuters dataset. The files I found have no root node enclosing all the children; each one is just a sequence of <reuters>..</reuters> elements after the DOCTYPE declaration. As a result, parsing the file and calling getroot() returns only the first <reuters> element, not the whole list. How can I work around this without changing the input files? My code is given below:

import os
from lxml import etree as ET

dirname = 'dataset'

for filename in os.listdir(dirname):
    filepath = os.path.join(dirname, filename)

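    # recover=True lets lxml continue past markup errors in the SGML input
    parser = ET.XMLParser(encoding='utf-8', recover=True)

    tree = ET.parse(filepath, parser)

    # Returns only the first <reuters> element, since the file has no single root
    root = tree.getroot()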

This root element is just the first <reuters> element, while the SGML file looks like this:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<reuters> .. </reuters>
<reuters> .. </reuters>
<reuters> .. </reuters>

What I want is to iterate over all <reuters> elements, one at a time, and work on their contents.
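One possible workaround, sketched here under a few assumptions, is to read each file into memory, drop the DOCTYPE line, and wrap the rest in a synthetic root element before handing it to lxml. The input files stay untouched on disk. The helper names `wrap_in_root` and `iter_reuters` and the root tag `collection` are hypothetical, chosen only for illustration. The sketch also assumes the tag is spelled `reuters` in lowercase, as in the sample above, and that undecodable bytes may be replaced rather than preserved.

```python
import re

# Matches the single DOCTYPE declaration at the top of each file.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)


def wrap_in_root(sgml_text, root_tag="collection"):
    """Remove the DOCTYPE declaration and enclose the text in one root element.

    root_tag is a made-up name; any tag not used in the data would do.
    """
    body = DOCTYPE_RE.sub("", sgml_text, count=1)
    return "<{0}>{1}</{0}>".format(root_tag, body)


def iter_reuters(filepath, encoding="utf-8"):
    """Yield each <reuters> element of one file, in document order."""
    # Imported here so that wrap_in_root can be used without lxml installed.
    from lxml import etree as ET

    # errors="replace" is an assumption: bad bytes become U+FFFD instead of failing.
    with open(filepath, encoding=encoding, errors="replace") as handle:
        wrapped = wrap_in_root(handle.read())

    # recover=True asks lxml to keep going past markup it cannot parse cleanly.
    parser = ET.XMLParser(recover=True)
    root = ET.fromstring(wrapped.encode("utf-8"), parser)

    # Tag matching is case-sensitive; adjust "reuters" if the files use another case.
    for doc in root.iter("reuters"):
        yield doc
```

In the loop from the question, `for doc in iter_reuters(filepath):` would then visit the documents one at a time. Each file is loaded into memory in full, which should be acceptable for files of moderate size but is a trade-off to keep in mind.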

  • You could try converting the SGML to XML and working with that: http://stackoverflow.com/a/12534420/407651. – mzjn Sep 13 '13 at 18:32
  • Or maybe you can use BeautifulSoup: http://stackoverflow.com/a/10508687/407651. – mzjn Sep 15 '13 at 09:40
  • @mzjn I tried BeautifulSoup. While it works fine on my system, it doesn't seem to work on my friends', even with the same code. It might be a dependency problem. But thanks :) – ggauravr Sep 16 '13 at 09:55
