Stripping (XML?) markup from a document using python

Question

I have a file that contains the names of scientists in the following format: <scientist_names> <scientist>abc</scientist> </scientist_names>. I want to use Python to extract the scientists' names from this format. How should I do it? I would like to use regular expressions but don't know how to use them. Please help.

This looks like XML. Check out [xml.dom.minidom](http://docs.python.org/library/xml.dom.minidom.html). — Tim Pietzcker, Feb 13 '12 at 11:55
If I have continuous lines such as `abcxzz`, can anyone please tell me the fastest way to extract the data? — username_4567, Feb 13 '12 at 18:47

Uku Loskit · Answer 1 · 2012-02-13T11:59:49.577

2

This is XML, and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).

Here is an example:

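from lxml import etree

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

# parse the string into an element tree
tree = etree.fromstring(text)
# select every <scientist> element anywhere in the document
for scientist in tree.xpath("//scientist"):
    print(scientist.text)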
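
If lxml is not installed, roughly the same thing can be sketched with the standard library's `xml.etree.ElementTree`. The helper name `scientist_names` is made up for illustration and is not part of the answer above.

```python
import xml.etree.ElementTree as ET

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""


def scientist_names(xml_text):
    """Return the text of every <scientist> element in document order."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter
    return [element.text for element in root.iter("scientist")]


print(scientist_names(text))  # ['abc']
```

For a document stored on disk, `ET.parse(path).getroot()` can take the place of `ET.fromstring(text)`.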

edited Feb 13 '12 at 11:59

answered Feb 13 '12 at 11:54

Uku Loskit


score 2 · Answer 2 · edited May 23 '17 at 11:43

2

DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])

Use an XML/HTML parser instead; take a look at BeautifulSoup.
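
To make the fragility concrete, here is a small sketch comparing a naive regex with a parser. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages. The input is a made-up variation of the question's sample (a comment, an attribute, and an escaped ampersand), not data from the question.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: still valid XML, but with a comment, an attribute,
# and an escaped ampersand that the question's sample did not contain.
text = """<scientist_names>
  <!-- <scientist>old entry</scientist> -->
  <scientist field="physics">abc</scientist>
  <scientist>def &amp; ghi</scientist>
</scientist_names>"""

# Naive pattern written for the exact format shown in the question.
regex_names = re.findall(r"<scientist>(.*?)</scientist>", text)

# The parser ignores the comment, tolerates the attribute, and decodes entities.
parsed_names = [el.text for el in ET.fromstring(text).iter("scientist")]

print(regex_names)   # ['old entry', 'def &amp; ghi']
print(parsed_names)  # ['abc', 'def & ghi']
```

The regex picks up the commented-out entry, misses the element that carries an attribute, and leaves the entity undecoded. Each of these could be patched in the pattern, but that amounts to reimplementing a parser piece by piece.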

edited May 23 '17 at 11:43

Community


answered Feb 13 '12 at 11:55

Rik Poggi


You might want to have a look into BS sources. You'll be surprised. – georg Feb 13 '12 at 13:27
@thg435: You are comparing apples with oranges. No-one should write their own custom parsing using regexes, because that approach is fragile. Beautiful soup uses regexes to deal with malformed markup as part of an effort to write one, well-tested, well-designed library to do just that. – Marcin Feb 13 '12 at 16:07

score 0 · Accepted Answer · edited Apr 24 '15 at 14:39

0

As noted, this appears to be XML. In that case, you should use an XML parser to parse this document; I recommend lxml ( http://lxml.de ).

Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing. With SAX, you simply register handlers that the parser calls whenever it encounters a particular tag. This works well as long as the meaning of a tag does not depend on its context and you have more than one type of tag to process (which may not be the case here).
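
As a rough quickstart, a SAX handler for the question's sample might look like the sketch below. It uses the standard library's `xml.sax` rather than lxml, and the class name `ScientistHandler` is invented for illustration.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text inside every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Called for every opening tag
        if name == "scientist":
            self._inside = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one run of text in several chunks
        if self._inside:
            self._buffer.append(content)

    def endElement(self, name):
        # Called for every closing tag
        if name == "scientist":
            self.names.append("".join(self._buffer))
            self._inside = False


text = b"<scientist_names> <scientist>abc</scientist> </scientist_names>"

handler = ScientistHandler()
xml.sax.parseString(text, handler)
print(handler.names)  # ['abc']
```

For a file on disk, `xml.sax.parse(path, handler)` streams the document instead of loading it all into memory, which is the main attraction of SAX for large inputs.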

If your input document may be malformed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML

edited Apr 24 '15 at 14:39

7heo.tk


answered Feb 13 '12 at 11:58

Marcin


@user997704: You don't. Learn to use the right tool for the job. – Marcin Feb 13 '12 at 16:05
I would like to use SAX, but I can't find a quickstart guide for learning it – username_4567 Feb 13 '12 at 16:10

score 0 · Answer 4 · answered Feb 13 '12 at 12:07

Here is a simple example that should handle the XML tags for you:

# import the function used for HTTP requests (Python 3; in Python 2 this was urllib2.urlopen):
from urllib.request import urlopen

# import an easy-to-use XML parser called minidom:
from xml.dom.minidom import parseString
# both modules are part of the Python standard library

# download the file if it's not on the same machine;
# for a local file, use open(path, 'rb') here instead, not a bare path string:
file = urlopen('http://www.somedomain.com/somexmlfile.xml')
# read the whole document into memory:
data = file.read()
# close the file because we don't need it anymore:
file.close()
# parse the XML you downloaded:
dom = parseString(data)
# retrieve the first XML tag (<tag>data</tag>) that the parser finds with name tagName,
# in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
# strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
# print out the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# just print the data
print(xmlData)

If you find anything unclear just let me know
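
A variation on the same idea, as a sketch only: when the file is local, `xml.dom.minidom.parse` accepts a file name directly, and the text can be read from each element's child text nodes instead of stripping tags with `replace`. The second scientist `def` and the temporary file are assumptions added so the snippet is self-contained.

```python
import os
import tempfile
from xml.dom.minidom import parse

# Write a sample document to a temporary file so the sketch can run anywhere.
# The second <scientist> entry is made up for illustration.
sample = ("<scientist_names> <scientist>abc</scientist> "
          "<scientist>def</scientist> </scientist_names>")
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    dom = parse(path)  # accepts a file name or an open file object
    names = []
    # Loop over every <scientist> element, not just the first one
    for node in dom.getElementsByTagName("scientist"):
        # An element's text lives in its child text nodes
        names.append("".join(child.data for child in node.childNodes
                             if child.nodeType == child.TEXT_NODE))
finally:
    os.remove(path)

print(names)  # ['abc', 'def']
```

Passing a plain path string where a file object is expected, and then calling `.read()` on it, fails because strings have no `read` method.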

Error while executing `data = file.read()`: str object has no attribute 'read' — username_4567, Feb 13 '12 at 12:43
