Regex against markup after XPath?

Question

I have been searching for a solution to my problem for a while now and have been experimenting on regex101.com, but I cannot find one.

The problem is that I have to select a substring from several different inputs, so I wanted to use regular expressions to extract the wanted data from these strings. The regular expression will come from a configuration, separately for each string (since they differ).

The string below is obtained with an XPath: //body/div/table/tbody/tr/td/p[5] but I cannot dig any deeper than this to retrieve the right data, or can I?

The string I am currently using as an example is the following:

<strong>Kontaktdaten des Absenders:</strong> 
<br> 
<strong>Name:</strong> Wanted data 
<br> 
<strong>Telefon:</strong> 
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a> 
<br>

From this string I am trying to get the "Wanted data".

My regular expression so far is the following:

(?<=<\/strong> )(.*)(?= <br>)

But this returns the whole:

<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>

I thought I could solve this with a repeat group:

((:?(?<=<\/strong> )(.*)(?= <br>))+)

But this returns the same output as without the repeat group.

I know I could build a for { } loop around this regex to get the same output, but since this is the only regular expression that needs it (and a loop would mean changing the handling of all the other data as well), I was wondering whether it is possible to do this within a single regular expression.

Thank you for your support so far.

Obligatory "an HTML parser like HTML Agility Pack is the best way to parse HTML" comment. – Alex K. Mar 16 '18 at 12:39
I am already using HTML Agility Pack. As I said, this is the deepest I can dig into my HTML, so I cannot get the "Wanted data" out that way. I have edited the HTML in the question so that you can see what I mean (the line breaks are only there for readability; the real input is a single string). – svenQ Mar 16 '18 at 12:47
@AlexK. is right. [***Never parse markup with regex.***](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) You even have XPath at your disposal. [**You can finish the job with XPath alone.**](https://stackoverflow.com/a/49321162/290085) – kjhughes Mar 16 '18 at 12:56
We can't tell you how to find the string "Wanted data" without knowing what pattern to look for. Presumably it won't always say "Wanted data"; it might say something else (or you wouldn't be searching for it). So the question is, which parts of your content are fixed and which are variable? – Michael Kay Mar 16 '18 at 16:54

score 1 · Accepted Answer · answered Mar 16 '18 at 12:48

Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:

This XPath,

strong[.='Name:']/following-sibling::text()[1]

when appended to your original XPath,

//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]

will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
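
For readers who want to try the idea outside HTML Agility Pack, here is a minimal sketch in Python using only the standard library. It rests on several assumptions: the snippet is a hand-written, well-formed version of the question's markup (`<br/>` instead of `<br>`), the helper name `text_after_label` is hypothetical, and because `xml.etree.ElementTree` does not implement the `following-sibling` axis, the sketch reads the element's `.tail` (the text directly after the element) as a stand-in for `following-sibling::text()[1]`.

```python
import xml.etree.ElementTree as ET

# Assumption: the <p> content from the question, rewritten as well-formed XML
# (<br/> instead of <br>) so the standard-library parser accepts it.
snippet = (
    "<p>"
    "<strong>Kontaktdaten des Absenders:</strong> <br/> "
    "<strong>Name:</strong> Wanted data <br/> "
    "<strong>Telefon:</strong> <a href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br/>"
    "</p>"
)
p = ET.fromstring(snippet)


def text_after_label(parent, label):
    """Return the stripped text directly after <strong>label</strong>, or None."""
    # The [.='...'] predicate selects the <strong> child whose text equals label.
    strong = parent.find(f"strong[.='{label}']")
    if strong is None or strong.tail is None:
        return None
    # .tail is the text between this element's end tag and the next sibling.
    return strong.tail.strip()


name = text_after_label(p, "Name:")
print(name)
```

The label is the only fixed part the lookup depends on, so the same helper can be reused for other fields by passing a different label from configuration.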

kjhughes

106,133
27
181
240

Thank you, this is indeed a much cleaner solution to my problem. I did not know that it was also possible to match on text with an XPath. I will see whether I can use this on more data, because so far I had solved everything with regex :) – svenQ Mar 16 '18 at 13:04
Following your comment and your link to the great post [Never parse markup with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), I have been replacing as many unneeded regular expressions as I can. For the following case, though, I am wondering whether there is also a way to solve it with XPath. Data: `Name: Herr FirstName LastName`. XPath so far: `//body//div/div/table/tr/td/div/table/tr[3]/td/div/table/tr/td/p[1]/span`. On the result I use the following regex: `(?<=Herr |Frau ).*` – svenQ Mar 16 '18 at 13:26
@svenQ: Would be happy to help you find an XPath solution to your other problem but please post it as a new question -- it gets too messy trying to do too much in comments. Thank you. – kjhughes Mar 16 '18 at 13:33
I am sorry, this is my first question on SO, so I thought it was done this way :) I have posted my question here as an answer, since I cannot open a new question yet. – svenQ Mar 16 '18 at 14:04
@svenQ: Np, but you'll want to post a new *question*, not [**a new *answer***](https://stackoverflow.com/a/49322209/290085). Thanks. – kjhughes Mar 16 '18 at 16:01

score -1 · Answer 2 · answered Mar 16 '18 at 12:48

You can try to match everything but tag markers:

(?<=<\/strong> )([^<>]*)(?= <br>)

Demo
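
To see why the negated character class changes the result, here is a small sketch using Python's `re` module. It assumes the sample from the question joined into a single string (with the `<a>` attributes shortened for readability) and applies the question's original pattern and the pattern above side by side. The variable names are illustrative only.

```python
import re

# Assumption: the question's sample joined into one string, as the asker
# describes the real input; the <a> attributes are shortened for readability.
html = (
    "<strong>Kontaktdaten des Absenders:</strong> <br> "
    "<strong>Name:</strong> Wanted data <br> "
    "<strong>Telefon:</strong> "
    "<a dir='ltr' href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>"
)

# Greedy .* starts after the first "</strong> " and extends to the last " <br>",
# so it swallows every tag in between.
greedy = re.search(r"(?<=<\/strong> )(.*)(?= <br>)", html)

# [^<>]* cannot cross a "<" or ">", so the match has to sit between two tags.
bounded = re.search(r"(?<=<\/strong> )([^<>]*)(?= <br>)", html)

print(greedy.group(1))
print(bounded.group(1))
```

Note that both patterns depend on the exact spacing around `</strong>` and `<br>`, so they break if the markup's whitespace changes.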

mrzasa

22,895
11
56
94

Thank you! This did indeed solve my problem and gave me the wanted data. – svenQ Mar 16 '18 at 12:54
Consider accepting the answer by clicking the green tick next to the arrows and the number. – mrzasa Mar 16 '18 at 12:56
I am sorry, I had not seen @kjhughes's solution when I responded to yours, and his solution is a cleaner way to solve my problem. I was just searching in the wrong direction. I still appreciate your correct answer. – svenQ Mar 16 '18 at 13:12
