I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>

I need to extract: ABCDE

Could anybody please help me with the regular expression that I should use?

    As far as I know, [you cannot use regex to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – mkrieger1 Jun 08 '20 at 11:50
    Does this answer your question? [Parsing HTML using Python](https://stackoverflow.com/questions/11709079/parsing-html-using-python) – mkrieger1 Jun 08 '20 at 11:51

2 Answers

For your specific example, you can try this regular expression:

/">(.*)<\/A><\/td><td>/g

Tested on string:

Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum

the full match is:

">ABCDE</A></td><td>

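From there, pull ABCDE out of each match in whatever programming language you are using. The simplest way is to read capture group 1, which already holds only the text between "> and </A>. Alternatively, strip the first 2 characters and the last 13 characters from the full match. Note that .* is greedy, so if the pattern occurs more than once on the same line, a single match will run up to the last </A></td><td>.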
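Since the linked duplicate is about Python, here is a minimal sketch of both extraction approaches in Python. It assumes the pattern is rewritten as a raw string (no / delimiters, no g flag, and no need to escape the slashes); the variable names are only illustrative.

```python
import re

# Sample string taken from the test above.
text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# Same pattern as above, written as a Python raw string.
pattern = re.compile(r'">(.*)</A></td><td>')

match = pattern.search(text)
full_match = match.group(0)  # the whole matched text, delimiters included
inner = match.group(1)       # only what the parentheses captured

# Dropping 2 leading and 13 trailing characters gives the same text.
sliced = full_match[2:-13]

print(full_match)
print(inner)
print(sliced)
```

Reading group 1 is usually the safer option, because the slice offsets silently break if the surrounding markup ever changes.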

I also tried:

/">([^<]*)<\/A><\/td><td>/g

It gives the same result on this example, but it will not match when there is additional HTML between "> and </A>. ([^<]*) is a capture group containing a negated character class: it matches any run of characters other than <, so it cannot span other tags in that region. This gives you finer control when you are searching for specific text and need to exclude nested HTML.
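To see the difference between the two patterns, here is a small Python sketch. The input row is a made-up example (not from the question) with two cells on one line, which is where the greedy version starts to misbehave.

```python
import re

# Hypothetical input: two matching cells on the same line.
row = '<A href="x">ABCDE</A></td><td><A href="y">FGHIJ</A></td><td>'

# Greedy .* runs to the last </A></td><td> on the line.
greedy = re.findall(r'">(.*)</A></td><td>', row)

# [^<]* stops at the first <, so each cell is matched separately.
negated = re.findall(r'">([^<]*)</A></td><td>', row)

print(greedy)
print(negated)
```

With one capture group in the pattern, re.findall returns the group contents rather than the full matches, so no slicing is needed here.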

Fabio Crispino

Building on this answer: https://stackoverflow.com/a/40908001/11450166

(?<=(<A>))[A-Za-z]+(?=(<\/A>))

This expression works as long as the opening tag is exactly <A>, with no attributes.

This variant matches the format of your input:

(?<=(>))[A-Za-z]+(?=(<\/A>))
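As a rough sketch, the second expression can be tried in Python like this. I dropped the inner capture groups (an assumption on my part, to keep the output simple): with them in place, re.findall would return tuples of the groups instead of the matched text.

```python
import re

# Sample string in the same form as the question's input.
text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# Lookbehind for > and lookahead for </A>; neither is part of the match.
lookaround = re.compile(r'(?<=>)[A-Za-z]+(?=</A>)')

result = lookaround.findall(text)
print(result)
```

Because the lookarounds consume nothing, the match itself is already the bare text. Keep in mind that [A-Za-z]+ only accepts letters, so link text with digits or spaces would not match.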
Biowav