Parsing an XML file using regex in Python

Question

I have an XML file stored as a text file, as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>http://dx.doi.org/10.1007/BF01257084</ee>
</article>

If I type 'Parallel', I should get the full title of the matching article along with its 'author', 'pages', 'year', 'volume', and 'journal'.

Sample output:

Sanjeev Saxena
Parallel Integer Sorting and Simulation Amongst CRCW Models.
607-619
1996
33
Acta Inf.

How can I do this using regex? Please help!

Thanks in advance!

Please [read this answer](http://stackoverflow.com/a/1732454/918959) - it applies to XML as well. For Python you can use `lxml`, BeautifulSoup 4 (`bs4`), or about anything from [the standard library](https://docs.python.org/3/library/xml.html) — Antti Haapala -- Слава Україні, Mar 16 '15 at 12:47

Mazdak · Answer 1 · 2015-03-16T13:28:08.517

The best way to parse an XML or HTML document is to use a proper parser, such as BeautifulSoup or lxml, but as an alternative you can use the following pattern:

>>> s="""<?xml version="1.0" encoding="ISO-8859-1"?>
... <!DOCTYPE dblp SYSTEM "dblp.dtd">
... <dblp>
... <article mdate="2011-01-11" key="journals/acta/Saxena96">
... <author>Sanjeev Saxena</author>
... <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
... <pages>607-619</pages>
... <year>1996</year>
... <volume>33</volume>
... <journal>Acta Inf.</journal>
... <number>7</number>
... <url>db/journals/acta/acta33.html#Saxena96</url>
... <ee>http://dx.doi.org/10.1007/BF03036466</ee>
... </article>
... <article mdate="2011-01-11" key="journals/acta/Simon83">
... <author>Hans-Ulrich Simon</author>
... <title>Pattern Matching in Trees and Nets.</title>
... <pages>227-248</pages>
... <year>1983</year>
... <volume>20</volume>
... <journal>Acta Inf.</journal>
... <url>db/journals/acta/acta20.html#Simon83</url>
... <ee>http://dx.doi.org/10.1007/BF01257084</ee>
... </article>"""
>>> import re
>>> l=['author','pages','year','volume','journal']
>>> pat=r'|'.join(('<{}>(.*)</{}>'.format(i,i) for i in l))
>>> [j  for i in re.findall(pat,s) for j in i if j]
['Sanjeev Saxena', '607-619', '1996', '33', 'Acta Inf.', 'Hans-Ulrich Simon', '227-248', '1983', '20', 'Acta Inf.']

If you want to read the tag names from user input instead, add the following:

names = raw_input('Enter the tag names (separated by spaces): ')  # Python 2; use input() on Python 3
l = names.split()
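
The pattern above collects the fields of every article and never looks at the title, so it does not yet do the 'Parallel' lookup from the question. A minimal regex-only sketch of that lookup might look like the following. The names `SAMPLE`, `FIELDS` and `lookup` are made up for this illustration, and the code assumes the flat layout shown in the question: no attributes on the child tags, no nested markup, and only the first `<author>` of an article is returned.

```python
import re

# Shortened copy of the question's data, used only for this demo.
SAMPLE = """<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
</article>
</dblp>"""

# Order follows the sample output in the question.
FIELDS = ['author', 'title', 'pages', 'year', 'volume', 'journal']


def lookup(xml_text, keyword):
    """Return the FIELDS of the first article whose title starts with keyword."""
    # DOTALL lets '.' cross newlines, so one match covers a whole article block.
    for block in re.findall(r'<article\b.*?</article>', xml_text, flags=re.DOTALL):
        values = {}
        for field in FIELDS:
            m = re.search(r'<{0}>(.*?)</{0}>'.format(field), block, flags=re.DOTALL)
            values[field] = m.group(1) if m else None
        if (values['title'] or '').startswith(keyword):
            return [values[field] for field in FIELDS]
    return None  # no title matched


print(lookup(SAMPLE, 'Parallel'))
```

Matching one article block first keeps each field tied to its own record, which the single alternation pattern cannot do. It is still fragile compared to a real parser, for example with escaped entities or CDATA sections.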

@adsalila No, not at all! The proper way is to use an XML parser! Anyway, you can try it and benchmark it yourself to see the result. — Mazdak, Mar 16 '15 at 13:00
Can you please tell me which variable above searches for 'Parallel' or any other user input? — adsa lila, Mar 16 '15 at 13:13
@adsalila You mean that instead of an initialized list of names (`l`) you want to get them from input? — Mazdak, Mar 16 '15 at 13:17

score 0 · Answer 2 · answered Mar 16 '15 at 12:54

Use an XML parser instead.

Working example using lxml:

import lxml.etree as ET

data = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
        <article mdate="2011-01-11" key="journals/acta/Saxena96">
                <author>Sanjeev Saxena</author>
                <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
                <pages>607-619</pages>
                <year>1996</year>
                <volume>33</volume>
                <journal>Acta Inf.</journal>
                <number>7</number>
                <url>db/journals/acta/acta33.html#Saxena96</url>
                <ee>http://dx.doi.org/10.1007/BF03036466</ee>
                </article>
                <article mdate="2011-01-11" key="journals/acta/Simon83">
                <author>Hans-Ulrich Simon</author>
                <title>Pattern Matching in Trees and Nets.</title>
                <pages>227-248</pages>
                <year>1983</year>
                <volume>20</volume>
                <journal>Acta Inf.</journal>
                <url>db/journals/acta/acta20.html#Simon83</url>
                <ee>http://dx.doi.org/10.1007/BF01257084</ee>
        </article>
</dblp>
"""

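# On Python 3, lxml rejects a str that carries an encoding declaration, so pass bytes
root = ET.fromstring(data.encode('ISO-8859-1'))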

title = 'Parallel'
article = root.xpath('.//article[starts-with(title, "%s")]' % title)[0]

for prop in ['author', 'pages', 'year', 'volume', 'journal']:
    print(article.findtext(prop))

Prints:

Sanjeev Saxena
607-619
1996
33
Acta Inf.
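
If installing lxml is not an option, roughly the same lookup can be sketched with the standard library's `xml.etree.ElementTree`. Its XPath subset has no `starts-with()`, so the prefix test is done in Python here. The names `DATA`, `PROPS` and `find_article` are made up for this illustration, and the title is included in the printed fields to match the sample output in the question.

```python
import xml.etree.ElementTree as ET

# Bytes input, so the parser can honour the encoding declaration.
DATA = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
</article>
</dblp>"""

PROPS = ['author', 'title', 'pages', 'year', 'volume', 'journal']


def find_article(root, prefix):
    """Return the first <article> whose <title> text starts with prefix, or None."""
    for article in root.iter('article'):
        if article.findtext('title', default='').startswith(prefix):
            return article
    return None


root = ET.fromstring(DATA)
article = find_article(root, 'Parallel')
fields = [article.findtext(prop) for prop in PROPS]
for value in fields:
    print(value)
```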

It's a large XML file of 1 GB that I need to process. Can I use this for that too? — adsa lila, Mar 16 '15 at 12:58
@adsalila yup, you may want to switch to [`iterparse()`](http://lxml.de/parsing.html#iterparse-and-iterwalk) in this case. — alecxe, Mar 16 '15 at 13:01
@adsalila If you've stuck with `lxml` here, please accept the answer. Thanks. — alecxe, Mar 16 '15 at 15:03
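
For the 1 GB case raised in the comments, the `iterparse()` suggestion could be sketched roughly as below. This version uses the standard library's `xml.etree.ElementTree.iterparse` so that it runs without extra packages; lxml offers an `iterparse` with a similar interface. The names `SAMPLE`, `FIELDS` and `find_articles` are made up for this illustration, and the code assumes every `<article>` is a direct child of the root element, as in the question's data.

```python
import io
import xml.etree.ElementTree as ET

FIELDS = ['author', 'title', 'pages', 'year', 'volume', 'journal']


def find_articles(source, prefix):
    """Yield a dict of FIELDS for each <article> whose title starts with prefix.

    source may be a filename or a binary file object.
    """
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'article':
            title = elem.findtext('title') or ''
            if title.startswith(prefix):
                yield {field: elem.findtext(field) for field in FIELDS}
            # Drop the articles parsed so far so memory use stays bounded.
            root.clear()


# Small in-memory stand-in for the real file, for demonstration only.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
</article>
</dblp>"""

for record in find_articles(io.BytesIO(SAMPLE), 'Parallel'):
    print([record[field] for field in FIELDS])
```

With a real file, the call would be something like `find_articles('dblp.xml', 'Parallel')`, where `dblp.xml` is a placeholder filename. Because the function is a generator, matches are produced as the file is read rather than after the whole tree has been built.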
