Regular expression for attribute value in XML

Question

I need help with a regular expression. I have XML text like this:

<w><ana lex="совершенно" gr="ADV"></ana>соверш`енно</w>

and I need to extract совершенно, ADV and соверш`енно. I have tried, but I do not know regular expressions well.

See [How do I parse XML in Python?](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) — Wiktor Stribiżew, Apr 14 '17 at 09:04
Regular expressions are not the tool for handling XML. They are incapable (as in: it is technically impossible to do it correctly) of this task. Use an XML parser. There is one built-in in Python and there are others more you can install as a module. Use the right tool for the job. — Tomalak, Apr 14 '17 at 09:14
@Tomalak Regex is a tool for text. XML is (mostly) text. Sometimes it IS the right tool for the job. If I were looking for the definition of a Python function, I _could_ partially compile it and extract the code from the function object, or I could do a text search for `def func_name(`. The same principle holds for regex and markup languages. Regexes are a bad tool for parsing XML into a tree. Regexes are not (always) a bad tool for extracting information from XML. — leewz, Apr 14 '17 at 09:22
@leewz Yes, regex are a tool for text. No, XML is not text. Regex is *never* the right tool for XML and if you think otherwise then I'm sorry to say that you are lacking quite some experience with both. The fact that it's possible to write a regex that deals with the meager snippet the OP posted does not prove anything. — Tomalak, Apr 14 '17 at 09:26
If you impose arbitrary restrictions on the possible input, then yes, you can parse text with angle brackets that somewhat looks like XML with regex. But then you are not parsing XML anymore but a sub-set. Plus your code will break anytime that subset changes in ways that are legal in XML and not anticipated in your code. Writing code that way is not particularly smart. And there's a free XML parser in Python, there is no reason to not use it. — Tomalak, Apr 14 '17 at 09:29
@Tomalak Then is it safe to assume you parse your source code into syntax trees to find definitions? Do you use an English grammar compiler to find words and phrases in articles? Of course not. If you know for sure that the data is in a regular pattern, you can consider regexes. XML isn't a regular language, but _some_ XML documents have _some_ data in a pattern. — leewz, Apr 14 '17 at 09:36
"English" and "code" are not a fair comparison. Natural languages operate on a different level than programming languages. But yes, if I want to find a definition in code (and not just something in a string literal or comment that resembles a definition) then I parse code into an AST. There is no reason not to, other than inexperience or laziness. — Tomalak, Apr 14 '17 at 09:40
That being said, you can't generalize from "*some* documents have *some* properties that allow me to short-circuit proper handling in a very confined set of circumstances" to "therefore it's ok to parse XML with regex". That's intellectually dishonest. — Tomalak, Apr 14 '17 at 09:47

score 0 · Answer 1 · answered Apr 14 '17 at 10:12

You can try BeautifulSoup.

Stefano


score 0 · Answer 2 · answered Apr 14 '17 at 19:56

In your case it is better to use BeautifulSoup than regular expressions.

>>> import BeautifulSoup as bs
>>> xml = '<w><ana lex="совершенно" gr="ADV"></ana>соверш`енно</w>'
>>> soup = bs.BeautifulSoup(xml)
>>> print(soup.find('ana', {'lex':unicode}).get('lex'))
совершенно

nowox


What is unicode here? NameError: name 'unicode' is not defined – Nastja Kr Apr 26 '17 at 17:03
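
On the NameError: `unicode` is a Python 2 built-in type. Python 3 only has `str`, and `import BeautifulSoup` is the older BeautifulSoup 3 import, so the snippet as posted targets Python 2. As a rough Python 3 alternative that needs no third-party package, the sketch below uses the standard library's `xml.etree.ElementTree`. The names `xml_text` and `extract` are illustrative only, and it assumes the input is a single well-formed `<w>` element containing one `<ana>` child, as in the question.

```python
import xml.etree.ElementTree as ET

# Sample element from the question.
xml_text = '<w><ana lex="совершенно" gr="ADV"></ana>соверш`енно</w>'


def extract(text):
    """Return (lex, gr, word) from one <w> element (hypothetical helper)."""
    root = ET.fromstring(text)   # parse the <w> element
    ana = root.find('ana')       # first <ana> child of <w>
    # Text that follows </ana> inside <w> is stored as the child's "tail".
    return ana.get('lex'), ana.get('gr'), ana.tail


lex, gr, word = extract(xml_text)
print(lex, gr, word)
```

Unlike a regular expression, this does not depend on the order of the attributes or on the spacing inside the tag.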

score -1 · Answer 3 · answered Apr 16 '17 at 07:57

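The following uses re.search() from Python's regular expression module (re) to find the text you are looking for and report where it occurs.

import re
text = '<w><ana lex="совершенно" gr="ADV"></ana>соверш`енно</w>'
# re.search() returns a match object, or None if the pattern is not found
data = re.search("соверш`енно", text)

re.search() returns a match object: data.start() and data.end() give the position of the string in the text, and data.group() gives the matched text. You can extract the other strings in the same way.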

score -3 · Answer 4 · answered Apr 14 '17 at 09:07

lex=\"(.*)\" gr=\"(.*)\"></ana>(.*)</w>

Regex101.com
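
A minimal sketch of how a pattern of this shape could be applied with Python's `re` module. It swaps the greedy groups for non-greedy ones (`.*?`), which is an assumption made here so that a match cannot run past the closing quote when several words sit on one line. The names `pattern` and `extract` are illustrative. The pattern only works while `lex` comes directly before `gr` with exactly one space between them, which is the fragility the comments on the question warn about.

```python
import re

# Sample line from the question.
xml = '<w><ana lex="совершенно" gr="ADV"></ana>соверш`енно</w>'

# Three capture groups: the lex attribute, the gr attribute, the word itself.
# Assumes lex is immediately followed by gr, separated by a single space.
pattern = re.compile(r'lex="(.*?)" gr="(.*?)"></ana>(.*?)</w>')


def extract(text):
    """Return (lex, gr, word) for the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.groups() if match else None


lex, gr, word = extract(xml)
print(lex, gr, word)
```

If the attributes are reordered, which is still legal XML, `extract` returns None rather than the values.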

Dennisvdh


@t.m.adam That's a stupid reason to upvote a bad answer. – Tomalak Apr 14 '17 at 09:11
