2

The link below gives us the list of ingredients in recipelist. I would like to extract the names of the ingredient and save it to another file using python. http://stream.massey.ac.nz/file.php/6087/Eva_Material/Tutorials/recipebook.xml

So far I have tried using the following code, but it gives me the complete recipe not the names of the ingredients:

from xml.sax.handler import ContentHandler
import xml.sax
import sys
def recipeBook(): 
    path = "C:\Users\user\Desktop"
    basename = "recipebook.xml"
    filename = path+"\\"+basename
    file=open(filename,"rt")
    # find contents 
    contents = file.read()

    class textHandler(ContentHandler):
      def characters(self, ch):
      sys.stdout.write(ch.encode("Latin-1"))
    parser = xml.sax.make_parser()
    handler = textHandler( )
    parser.setContentHandler(handler)
    parser.parse("C:\Users\user\Desktop\\recipebook.xml")



  file.close()

How do I extract the name of each ingredient and save them to another file?

dgund
  • 3,459
  • 4
  • 39
  • 64
Neha Nalawade
  • 41
  • 1
  • 1
  • 3
  • 1
    Unrelated: You should look at `os.path` and raw strings rather than dealing with windows paths like that. – Daenyth May 07 '12 at 01:35
  • 1
    Put the XML string as example in the question; it is easier to give specific answer when we can see what data you are working with. Your URL is inaccessible to those without a Massey ID. Also consider navigating the XML tree using [ElementTree](http://docs.python.org/library/xml.etree.elementtree.html). – daedalus May 07 '12 at 04:28
  • 2
    PS: [This answer, using ElementTree](http://stackoverflow.com/a/5533742/1290420) should get you there. – daedalus May 07 '12 at 04:58

3 Answers3

3

@Neha

I guess you have solved your request by now, here is a little piece I put together using the tutorial at http://lxml.de/tutorial.html. The XML file is saved in 'rough_data.xml'

import xml.etree.cElementTree as etree

xmlDoc = open('rough_data.xml', 'r')
xmlDocData = xmlDoc.read()
xmlDocTree = etree.XML(xmlDocData)

for ingredient in xmlDocTree.iter('ingredient'):
    print ingredient[0].text

To all experienced Python programmers reading this, kindly improve this "newbie" code.

Note: The lxml package looks very good, it definitely worth using. Thanks

squirrell
  • 41
  • 2
1

Please place the relevant XML text in order to receive a proper answer. Also please consider using lxml for anything xml specific (including html) .

try this :

from lxml import etree

tree=etree.parse("your xml here")
all_recipes=tree.xpath('./recipebook/recipe')
recipe_names=[x.xpath('recipe_name/text()') for x in all_recipes]
ingredients=[x.getparent().xpath('../ingredient_list/ingredients') for x in recipe_names]
ingredient_names=[x.xpath('ingredient_name/text()') for x in ingredients]

Here is the beginning only, but i think you get the idea from here -> get the parent from each ingredient_name and the ingredients/quantities from there and so on. You can't really do any other kind of search i think due to the structured nature of the document.

you can read more on [www.lxml.de]

omu_negru
  • 4,642
  • 4
  • 27
  • 38
  • Omelette eggs 2 milk 50 ml tomato 1 oil 1 ts Sandwich bread 2 slices butter – Neha Nalawade May 08 '12 at 08:08
  • thats the xml file I am working on and i just need the ingredients from the file i.e. milk, eggs etc. i need to extract the ingredients and write them to another file. thanks – Neha Nalawade May 08 '12 at 08:09
0

Some time ago I made a series of screencasts that explain how to collect data from sites. The code is in Python and there are 2 videos about XML parsing with the lxml library. All the videos are posted here: http://railean.net/index.php/2012/01/27/fortune-cowsay-python-video-tutorial

The ones you want are:

  • XPath experiments and query examples
  • Python and LXML, examples of XPath queries with Python
  • Automate page retrieving via HTTP and HTML parsing with lxml

You'll learn how to write and test XPath queries, as well as how to run such queries in Python. The examples are straightforward, I hope you'll find them helpful.

ralien
  • 1,448
  • 11
  • 24