I am having a hard time writing a regular expression that matches specific parts of some HTML code. A snippet is shown below.

<td valign="top">23.01.2019</td>
<td valign="top">DOE/ELT</td>
<td valign="top">Laser Projection Subunits for the Extremely Large Telescope</td>

I am trying to find the last part, the one with "Laser Projection". This is the closest I have been able to come to that result:

<td valign=\"top\">[^[0-9]{2}.[0-9]{2}.[0-9]{4}]|[^[A-Z]{3}]|[A-Z a-z]*</td>
CKMA
  • Please add some code :) – Gianmarco Varriale Mar 01 '19 at 09:32
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Liam Mar 01 '19 at 09:33
  • RegEx is not the best choice for parsing HTML. Better to use a real HTML parser, for example `HTML Agility Pack` – Flat Eric Mar 01 '19 at 09:34
  • [...he comes](https://stackoverflow.com/a/1732454/542251) – Liam Mar 01 '19 at 09:34
  • Do you know where I can find any documentation on `HTML Agility Pack`? I have not been able to find anything useful. – CKMA Mar 01 '19 at 09:43
  • I would start here: https://html-agility-pack.net/documentation. The chapter "selectors" should be relevant for you – Flat Eric Mar 01 '19 at 10:02
  • Usually, the best way to find a specific tag with unknown content is to look in the original HTML at what data it is _surrounded_ by, and include that in your pattern so it matches only a block with the correct structure. In this case it is a table, so you could specifically check for the third column by adding a match for the tags before it. – Nyerguds Mar 01 '19 at 10:04
  • @Liam I know, obligatory link is obligatory, but this question is about _finding_ something in HTML, not necessarily _parsing_ it. I've done this for web scraping too, and given the messy state of HTML, it's actually a very valid option if adapted specifically to the scraped page. – Nyerguds Mar 01 '19 at 10:07
  • I suppose the question then is (given @Nyerguds' comment) what does *"last part with the "Laser Projection""* mean? You'll need to be clearer about what your criteria are here – Liam Mar 01 '19 at 10:14
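
Nyerguds' suggestion of matching the surrounding cells could be sketched roughly as follows. This is only an illustration in Python (the question does not state a language), and it assumes every row looks exactly like the snippet in the question: a date cell, a second cell, then the cell of interest. The names `SAMPLE`, `ROW_PATTERN` and `third_column` are made up for the example.

```python
import re

# The three cells from the question, used here as sample input.
SAMPLE = '''<td valign="top">23.01.2019</td>
<td valign="top">DOE/ELT</td>
<td valign="top">Laser Projection Subunits for the Extremely Large Telescope</td>'''

# Assumed row layout: a dd.mm.yyyy date cell, any second cell,
# then the third cell, whose text is captured in the "title" group.
ROW_PATTERN = re.compile(
    r'<td valign="top">\d{2}\.\d{2}\.\d{4}</td>\s*'
    r'<td valign="top">[^<]*</td>\s*'
    r'<td valign="top">(?P<title>[^<]*)</td>'
)


def third_column(html):
    """Return the text of every third cell that follows a date cell and one other cell."""
    return [match.group("title") for match in ROW_PATTERN.finditer(html)]


print(third_column(SAMPLE))
```

Because the pattern is tied to this exact markup (the `valign="top"` attribute, no nested tags inside the cells), it would need adjusting for any page that differs, which is the trade-off the comments above discuss.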

1 Answer


If I understand the question correctly, what you actually want is the value to the right of the text "Laser Projection Subunits for the".

If that's the case, and assuming that text is always followed by a closing `</td>` tag, I'd use this regex:

Laser Projection Subunits for the (?<Extracted_value>.*?)<\/td>

https://regex101.com/r/p6Sr7k/1
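
As a rough illustration, the same pattern could be applied from Python like this. The question does not say which language is being used, so Python is an assumption here, as are the names `SNIPPET`, `PATTERN` and `extract_value`. Note that Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, and that the `/` does not need escaping.

```python
import re

# The last cell from the question, used as sample input.
SNIPPET = '<td valign="top">Laser Projection Subunits for the Extremely Large Telescope</td>'

# Same idea as the regex above: capture lazily up to the closing </td>.
PATTERN = re.compile(r"Laser Projection Subunits for the (?P<extracted_value>.*?)</td>")


def extract_value(html):
    """Return the text after the fixed prefix, or None if the prefix is not found."""
    match = PATTERN.search(html)
    return match.group("extracted_value") if match else None


print(extract_value(SNIPPET))  # prints: Extremely Large Telescope
```

The lazy `.*?` stops at the first `</td>` after the prefix, so the capture cannot run on into later cells of the row.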