-2

I have this XML string:

<aof xmlns="http://tsng.jun.net/jppos/conig/hello"><num>3</num><desc>addy02</desc><tpcs>5</tpcs></aof>

I need to extract 5 using regex.

What I have done is:

regex = re.compile(r'tag+</.+>\s*(.+)\s*<.+>')

where tag is 'tpcs', but it's returning an empty match.

Can someone please help?

aydinugur
Adrija

2 Answers

3

Don't use regexps for XML / HTML! Read this, one of the most voted & highest ranked answers on this site!

Use XPath instead:

//tpcs/text()

or (namespace-agnostic):

//*[local-name()='tpcs']/text()

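This prints 5, as expected. Because the document declares a default namespace, a namespace-aware XPath 1.0 processor matches only the second form.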
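
As a rough sketch of the same idea in Python (the language of the question), the standard library's xml.etree.ElementTree can read the value. It supports only a subset of XPath, so local-name() and text() are not available there; this sketch maps the namespace to a prefix instead. The helper name tag_text and the prefix h are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = ('<aof xmlns="http://tsng.jun.net/jppos/conig/hello">'
       '<num>3</num><desc>addy02</desc><tpcs>5</tpcs></aof>')

# Hypothetical prefix "h" bound to the document's default namespace.
NS = {"h": "http://tsng.jun.net/jppos/conig/hello"}


def tag_text(xml_string, tag):
    """Return the text of the first <tag> element in the namespace, or None."""
    root = ET.fromstring(xml_string)
    # ".//h:tag" searches all descendants of the root element.
    element = root.find(".//h:{0}".format(tag), NS)
    return element.text if element is not None else None


print(tag_text(XML, "tpcs"))
```

The full XPath expressions above, including local-name(), need a library with complete XPath 1.0 support rather than ElementTree.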

General Grievance
madhead
-1

As posted in the comments, this regex does the trick:

(?<=<tpcs>).*?(?=<\/tpcs>)

As seen in this demo.

Explanation:

  • (?<=<tpcs>) is a positive lookbehind (?<=...): it asserts that the string <tpcs> comes immediately before the text to match.
  • .*? the dot matches any character (except a newline, by default), zero or more times because it's followed by a *. The ? next to it makes the quantifier lazy, which means it matches as few characters as possible and stops at the first occurrence of what comes next.
  • (?=<\/tpcs>) is a positive lookahead (?=...): it asserts that </tpcs> comes immediately after the matched text.
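
To make the explanation concrete, here is a minimal sketch that applies the pattern to the XML string from the question with Python's re module. The variable names are illustrative only, and the / is left unescaped because Python does not require escaping it.

```python
import re

xml = ('<aof xmlns="http://tsng.jun.net/jppos/conig/hello">'
       '<num>3</num><desc>addy02</desc><tpcs>5</tpcs></aof>')

# Lookbehind for the opening tag, lazy match, lookahead for the closing tag.
pattern = re.compile(r'(?<=<tpcs>).*?(?=</tpcs>)')

match = pattern.search(xml)
# group(0) holds only the text between the tags; the lookarounds consume nothing.
value = match.group(0) if match else None
print(value)
```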
Paul-Etienne
  • Did you figure out how to pass the tag through a variable? I was about to look into it (I haven't done much Python in my life) – Paul-Etienne Oct 30 '17 at 17:11
  • Yes I got it. This is the one '(?<=<{0}>).*(?=<\/{0}>)'.format(tag) – Adrija Oct 30 '17 at 17:15
  • Thanks again :) Do you know of any tutorial where I can learn regex? – Adrija Oct 30 '17 at 17:20
  • Hmm, you can try [this website](https://regexone.com/); it looks like there are some exercises to help you understand and learn. Otherwise, you can find regex cheat sheets quite easily, like [this one](http://www.rexegg.com/regex-quickstart.html) for example. They're a good way to keep the regex syntax at hand. Lastly, I'd recommend just practicing on an online tester; [Regex101.com](https://regex101.com/) is really good, I use it all the time. There's a quick reference in the bottom right corner, and it also details the matching process in the top right corner. – Paul-Etienne Oct 30 '17 at 17:23
  • Thanks so much. I have one last question: if there are multiple tpcs in the text, will I be able to find them using this particular regex? – Adrija Oct 30 '17 at 17:24
  • Yeah, I actually updated the regex to use a non-greedy search. Now it works on multiple <tpcs> tags. You can look at the updated demo; I've added some of these tags to the sample text. You'd just have to iterate through the matches to process the data. Actually, let me add some explanations to my answer, so it's easier for you to understand how this all works. – Paul-Etienne Oct 30 '17 at 17:27
  • For some reason it's actually taking only one, unfortunately. – Adrija Oct 30 '17 at 17:35
  • I don't know what function you're using, but there should be an equivalent that matches all occurrences and lets you iterate through them. Maybe look [here](https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python). – Paul-Etienne Oct 30 '17 at 17:36
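
The two points raised in the comments (passing the tag through a variable and collecting every match) can be sketched together with re.findall. This is only an illustration: the function name find_tag_values and the sample text are assumptions, not something from the thread, and re.escape is added so that an unusual tag name cannot alter the pattern.

```python
import re


def find_tag_values(text, tag):
    """Return the text between every <tag>...</tag> pair found in text."""
    safe_tag = re.escape(tag)
    # The lazy .*? keeps each match inside a single pair of tags.
    pattern = r'(?<=<{0}>).*?(?=</{0}>)'.format(safe_tag)
    return re.findall(pattern, text)


# Hypothetical sample with two tpcs elements.
sample = '<a><tpcs>5</tpcs><num>3</num><tpcs>7</tpcs></a>'
print(find_tag_values(sample, 'tpcs'))
```

With the greedy .* from the earlier comment, the same sample would yield a single match running from the first opening tag to the last closing tag, which may explain seeing only one result.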