0

I have a file which contains the names of scientists in the following format: <scientist_names> <scientist>abc</scientist> </scientist_names> I want to use Python to extract the scientists' names from this format. How should I do it? I would like to use regular expressions but don't know how to use them. Please help.

Marcin
username_4567

4 Answers

2

This is XML, and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).

Here is an example:

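from lxml import etree

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

# parse the string into an element tree and get its root element
tree = etree.fromstring(text)
# select every <scientist> element anywhere in the document
for scientist in tree.xpath("//scientist"):
    print(scientist.text)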
Uku Loskit
2

DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])

Use an XML/HTML parser; take a look at BeautifulSoup.
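If installing a third-party library is not an option, a parser from the standard library can do the same job. The following is only a minimal sketch using xml.etree.ElementTree; the second name, "def", is made up for illustration and is not in the question.

```python
import xml.etree.ElementTree as ET

# Sample input in the question's format; "def" is a hypothetical second entry.
text = "<scientist_names> <scientist>abc</scientist> <scientist>def</scientist> </scientist_names>"

# Parse the string and get the root <scientist_names> element.
root = ET.fromstring(text)

# Collect the text of every <scientist> element, at any depth.
names = [element.text for element in root.iter("scientist")]
print(names)
```

For a file on disk, ET.parse(path).getroot() should give the same kind of root element.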

Community
Rik Poggi
  • You might want to have a look into BS sources. You'll be surprised. – georg Feb 13 '12 at 13:27
  • @thg435: You are comparing apples with oranges. No-one should write their own custom parsing using regexes, because that approach is fragile. Beautiful soup uses regexes to deal with malformed markup as part of an effort to write one, well-tested, well-designed library to do just that. – Marcin Feb 13 '12 at 16:07
0

As noted, this appears to be XML. In that case, you should use an XML parser to parse this document; I recommend lxml ( http://lxml.de ).

Given your requirements, you may find it more convenient to use SAX-style parsing rather than DOM-style. SAX parsing simply involves registering handlers that are called when the parser encounters a particular tag. This works well as long as the meaning of a tag does not depend on context, and it pays off most when you have more than one type of tag to process (which may not be the case here).
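To make the SAX idea concrete, here is a rough sketch using the standard library's xml.sax module rather than lxml. The class name ScientistHandler and the second name, "def", are invented for this example.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text content of every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None while we are outside a <scientist> element

    def startElement(self, name, attrs):
        if name == "scientist":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "scientist":
            self.names.append("".join(self._buffer).strip())
            self._buffer = None


# Sample input in the question's format; "def" is a hypothetical second entry.
text = "<scientist_names> <scientist>abc</scientist> <scientist>def</scientist> </scientist_names>"

handler = ScientistHandler()
xml.sax.parseString(text.encode("utf-8"), handler)
print(handler.names)
```

The handler never holds the whole document in memory, which is the main reason to prefer this style for very large files.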

In case your input document may be incorrectly formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML

7heo.tk
Marcin
0

Here is a simple example that should handle the XML tags for you:

# import the library for HTTP requests (urllib2 was renamed urllib.request in Python 3):
import urllib.request

# import an easy-to-use XML parser called minidom:
from xml.dom.minidom import parseString
# all these imports are part of the Python standard library

# download the file if it's not on the same machine; otherwise just open it by path:
with urllib.request.urlopen('http://www.somedomain.com/somexmlfile.xml') as response:
    # read the whole response; it is closed automatically when the block ends:
    data = response.read()
# parse the XML you downloaded:
dom = parseString(data)
# retrieve the first XML tag (<tag>data</tag>) that the parser finds with name tagName,
# in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
# strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
# print out the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# just print the data
print(xmlData)

If you find anything unclear, just let me know.
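If the file holds more than one name, the same minidom approach can loop over every match and read the text nodes directly instead of stripping tags with string replacement. This is only a sketch on an in-memory string; the second name, "def", is made up for illustration.

```python
from xml.dom.minidom import parseString

# Sample input in the question's format; "def" is a hypothetical second entry.
data = "<scientist_names> <scientist>abc</scientist> <scientist>def</scientist> </scientist_names>"

dom = parseString(data)

names = []
for node in dom.getElementsByTagName("scientist"):
    # Join the text nodes that sit directly inside this <scientist> element.
    text = "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )
    names.append(text)

print(names)
```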

Lucian Enache