
I have created 10 XML documents of different types: one holds book details, another movie details, another news headlines, and so on. One of these documents is books.xml, which looks like this:

<bookstore>
    <book category="COOKING">
         <title lang="english">Everyday Italian</title>
         <author>Giada De Laurentiis</author>
         <year>2005</year>
         <price>200.00</price>
    </book>

    <book category="CHILDREN">
         <title lang="english">Harry Potter</title>
         <author>J K. Rowling </author>
         <year>2005</year>
         <price>625.00</price>
    </book> 
</bookstore>

I want to count how often a word entered by the user occurs across all 10 XML documents. The words to search for are the elements, their attributes, and the respective attribute values.

For example, the user enters: category

In the example above, category is an attribute and it appears twice, so the output should be "2". If the word category is also present in the other 9 documents, the count should be increased accordingly. How can I do this for a single document without specifying the element name? It is basically XML parsing, so how should I go about it? This is new to me and I'm running into problems.

//////////////////////////////////////////////////////////////////////////////////

What if I want to use a plain document without a schema? This is related to XML parsing; can you tell me how to use the NodeList object in the DOM model?

Please help.

POOJA GUPTA
  • NB: This is an extension of the question from http://stackoverflow.com/questions/11279589/extract-text-from-xml-documents-in-python – Jon Clements Jul 01 '12 at 10:42
  • I have tried parsing with ElementTree and its methods such as keys() and items(), e.g.: tree = ElementTree(); tree.parse(); root = tree.getroot(); root[0].keys(), which gives me the output [('category':COOKING),('category':CHILDREN)]. However, I'm not getting the inner data, such as the title's attribute and its value "english". Similarly, if deeper child nodes have attributes, how can I detect them? – POOJA GUPTA Jul 01 '12 at 10:45
  • Correction: it's tree.parse("file.xml") – POOJA GUPTA Jul 01 '12 at 10:59
  • What if I want to use a plain document without the schema, then? – POOJA GUPTA Jul 02 '12 at 06:17

1 Answer


If you're going to have lots of such XML documents, you can take the following steps:

  1. Get rid of data in attributes. Change the document format to:

    <book>
         <category>CHILDREN</category>
         <lang>english</lang>
         <title>Harry Potter</title>
         <author>J K. Rowling </author>
         <year>2005</year>
         <price>625.00</price>
    </book> 
    
  2. Use Sphinx to index the documents using the xmlpipe data source.
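
Step 1 could be scripted rather than done by hand. Below is a minimal sketch using Python's standard xml.etree.ElementTree. The function name flatten_attributes and the placement rule are my own assumptions, not part of any library: attributes of an element that has children become its leading child elements, and attributes of a leaf element such as title become siblings placed just before it, which matches the format shown in step 1.

```python
import xml.etree.ElementTree as ET

# Shortened copy of books.xml from the question, used only for illustration.
SAMPLE = """
<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <year>2005</year>
    </book>
    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <year>2005</year>
    </book>
</bookstore>
"""


def flatten_attributes(root):
    """Turn every attribute into an element, modifying the tree in place.

    Hypothetical helper: the name and the placement rule are assumptions.
    """
    # ElementTree keeps no parent links, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Snapshot the elements up front because the tree is modified in the loop.
    for elem in list(root.iter()):
        if not elem.attrib:
            continue
        parent = parent_of.get(elem)
        if len(elem) == 0 and parent is not None:
            # Leaf element: new elements go into the parent, just before it.
            target, index = parent, list(parent).index(elem)
        else:
            # Element with children (or the root): new elements go first.
            target, index = elem, 0
        for offset, (name, value) in enumerate(list(elem.attrib.items())):
            new_elem = ET.Element(name)
            new_elem.text = value
            target.insert(index + offset, new_elem)
        elem.attrib.clear()
    return root


root = flatten_attributes(ET.fromstring(SAMPLE))
# Once the attributes are elements, counting a word such as "category"
# is just counting elements with that tag, with no element name hard-coded.
category_count = sum(1 for _ in root.iter("category"))
```

With the two-book sample above, category_count ends up as 2. To cover all documents, the same function could be applied to each parsed file and the per-file counts summed before (or instead of) handing the files to an indexer.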

Maksym Polshcha