I have a file that contains the names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>
I want to use Python to extract the scientists' names from this format. How should I do it?
I would like to use regular expressions, but I don't know how to use them. Please help.

-
This looks like XML. Check out [xml.dom.minidom](http://docs.python.org/library/xml.dom.minidom.html). – Tim Pietzcker Feb 13 '12 at 11:55
-
If I have continuous lines such as `abc xzz`, can anyone please tell me the fastest way to extract the data? – username_4567 Feb 13 '12 at 18:47
4 Answers
This is XML, and you should use an XML parser such as lxml
instead of regular expressions (because XML is not a regular language).
Here is an example:

    from lxml import etree
    text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""
    tree = etree.fromstring(text)
    # The XPath expression matches every <scientist> element in the document.
    for scientist in tree.xpath("//scientist"):
        print(scientist.text)
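If lxml is not installed, a similar approach can be sketched with the standard library's `xml.etree.ElementTree`. This is only an illustration, not part of the answer above; the second name in the sample input is hypothetical and added to show that several entries are handled.

```python
import xml.etree.ElementTree as ET

# Sample input in the question's format; "xzz" is a made-up second entry.
text = """<scientist_names>
<scientist>abc</scientist>
<scientist>xzz</scientist>
</scientist_names>"""

root = ET.fromstring(text)

# iter() walks the whole tree and yields every <scientist> element.
names = [element.text for element in root.iter("scientist")]
print(names)
```

The result is a plain list of strings, so the names can be processed further without touching the XML again.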

DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])
Use an XML/HTML parser instead; take a look at BeautifulSoup.
-
-
@thg435: You are comparing apples with oranges. No one should write their own custom parsing using regexes, because that approach is fragile. Beautiful Soup uses regexes to deal with malformed markup as part of an effort to write one well-tested, well-designed library to do just that. – Marcin Feb 13 '12 at 16:07
As noted, this appears to be XML. In that case, you should use an XML parser to parse this document; I recommend lxml ( http://lxml.de ).
Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing: with SAX you simply register handlers that the parser calls whenever it encounters a particular tag. This works well as long as the meaning of a tag does not depend on context, and it pays off most when you have more than one type of tag to process (which may not be the case here).
If your input document may be malformed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML
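As a rough illustration of the SAX style described above, here is a minimal sketch using the standard library's `xml.sax` module rather than lxml. The class name `ScientistHandler` and the second sample name are assumptions made for this example.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text found inside every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False
        self._chunks = []

    def startElement(self, name, attrs):
        # Called by the parser at every opening tag.
        if name == "scientist":
            self._inside = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self._inside:
            self._chunks.append(content)

    def endElement(self, name):
        # Called at every closing tag; store the collected text.
        if name == "scientist":
            self.names.append("".join(self._chunks))
            self._inside = False


# Sample input in the question's format; "xzz" is a made-up second entry.
text = b"""<scientist_names>
<scientist>abc</scientist>
<scientist>xzz</scientist>
</scientist_names>"""

handler = ScientistHandler()
xml.sax.parseString(text, handler)
print(handler.names)
```

Because the handler only keeps the names it has seen, the whole document never has to be held in memory as a tree, which is the main attraction of SAX for large files.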
-
-
I would like to use it, but I can't find a quickstart guide for learning SAX – username_4567 Feb 13 '12 at 16:10
Here is a simple example that should handle the XML tags for you:

    # import the library for HTTP requests (Python 2; in Python 3 use urllib.request instead):
    import urllib2
    # import an easy-to-use XML parser called minidom:
    from xml.dom.minidom import parseString
    # all these imports are standard on most modern Python implementations

    # download the file if it's not on the same machine;
    # if it is local, use open() on its path instead:
    file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
    # read the contents into a string:
    data = file.read()
    # close the file because we don't need it anymore:
    file.close()
    # parse the XML you downloaded:
    dom = parseString(data)
    # retrieve the first XML tag (<tag>data</tag>) that the parser finds with name tagName,
    # in your case <scientist>:
    xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
    # strip off the tag (<tag>data</tag> ---> data):
    xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
    # print out the XML tag and data in this format: <tag>data</tag>
    print(xmlTag)
    # just print the data
    print(xmlData)

If you find anything unclear, just let me know.
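The example above reads only the first `<scientist>` tag. A variation that collects every name, and reads the text node directly instead of stripping the tags with `replace`, might look like the sketch below. The inline sample data, including the second name, is an assumption for illustration.

```python
from xml.dom.minidom import parseString

# Sample input in the question's format; "xzz" is a made-up second entry.
data = """<scientist_names>
<scientist>abc</scientist>
<scientist>xzz</scientist>
</scientist_names>"""

dom = parseString(data)

# firstChild is the text node inside each <scientist> element;
# empty elements have no child, so they are skipped.
names = [
    node.firstChild.data
    for node in dom.getElementsByTagName("scientist")
    if node.firstChild is not None
]
print(names)
```

For a file that is already on disk, `xml.dom.minidom.parse` accepts a filename or an open file object, so no download step is needed.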

-
Error while executing `data = file.read()`: 'str' object has no attribute 'read' – username_4567 Feb 13 '12 at 12:43