Regular expression to extract info from HTML file

Question

I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>

I need to extract: ABCDE

Could anybody please help me with the regular expression that I should use?

As far as I know, [you cannot use regex to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — mkrieger1, Jun 08 '20 at 11:50
Does this answer your question? [Parsing HTML using Python](https://stackoverflow.com/questions/11709079/parsing-html-using-python) — mkrieger1, Jun 08 '20 at 11:51

Fabio Crispino · Answer 1 · 2020-06-08T12:07:52.530

You can try using this regular expression in your specific example:

/">(.*)<\/A><\/td><td>/g

Tested on string:

Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum

extracts:

">ABCDE</A></td><td>

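Then it's just a matter of extracting the substring from each match in whatever programming language you use. One way is to remove the first 2 characters and the last 13 characters of the matched string, which leaves only ABCDE. Since the pattern already contains a capture group, most regex APIs also let you read group 1 directly, which gives the same result without counting characters.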
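For illustration, here is a minimal Python sketch of both extraction approaches. Python is my assumption here, since the question does not name a language; the `/.../g` delimiters are dropped because Python's `re` module does not use them, and `/` needs no escaping there.

```python
import re

# Test string from the answer above.
text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# Same pattern as above, written as a Python raw string.
pattern = re.compile(r'">(.*)</A></td><td>')

match = pattern.search(text)

full = match.group(0)      # the whole match: '">ABCDE</A></td><td>'
by_group = match.group(1)  # capture group 1: 'ABCDE'
by_slice = full[2:-13]     # drop the first 2 and last 13 characters: 'ABCDE'

print(by_group, by_slice)
```

Reading the capture group is usually less fragile than slicing, because it keeps working if the surrounding literal text in the pattern changes length.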

I also tried:

/">([^<]*)<\/A><\/td><td>/g

It has the same effect on this input, but it won't match when there is additional HTML between "> and </A>. As far as I understand it, ([^<]*) is a negated character class that refuses to match < characters, so it can't swallow other tags inside that region. This is useful for finer control when you're searching for specific text and need to exclude nested HTML.
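To make the difference concrete, here is a small Python sketch comparing the two patterns. The input string is a made-up example (not from the question) with two cells, the second of which contains a nested tag.

```python
import re

# Hypothetical input: two links, the second one wraps its text in <b>.
html = '<A href="x">ABCDE</A></td><td><A href="y"><b>FGH</b></A></td><td>'

# Greedy .* runs to the LAST '</A></td><td>' and swallows everything in between.
greedy = re.findall(r'">(.*)</A></td><td>', html)

# [^<]* stops at the first '<', so only tag-free text can match.
strict = re.findall(r'">([^<]*)</A></td><td>', html)

print(greedy)  # ['ABCDE</A></td><td><A href="y"><b>FGH</b>']
print(strict)  # ['ABCDE']
```

Note that the second pattern skips the cell with the nested `<b>` tag entirely, which is the filtering behaviour described above.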

score 0 · Accepted Answer · answered Jun 08 '20 at 12:01


Leaning on this, https://stackoverflow.com/a/40908001/11450166

(?<=(<A>))[A-Za-z]+(?=(<\/A>))

That expression works fine, assuming your tag is exactly <A>...</A> with no attributes.

This other one matches the form of your input:

(?<=(>))[A-Za-z]+(?=(<\/A>))
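As a rough sketch of how this behaves in Python (my choice of language, not stated in the answer): because the lookbehind and lookahead contain capture groups, `re.findall` returns those groups rather than the text between them, so the whole match has to be read with `group(0)`. The second pattern below is my own variant without the inner groups.

```python
import re

text = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# Pattern from the answer, with its capture groups inside the lookarounds.
with_groups = re.compile(r'(?<=(>))[A-Za-z]+(?=(</A>))')

# findall returns the captured groups, not the matched text.
groups = with_groups.findall(text)                        # [('>', '</A>')]

# group(0) is the text between the lookbehind and the lookahead.
whole = [m.group(0) for m in with_groups.finditer(text)]  # ['ABCDE']

# Variant without inner groups (an assumption, not from the answer):
# findall then returns the matched text directly.
plain = re.findall(r'(?<=>)[A-Za-z]+(?=</A>)', text)      # ['ABCDE']

print(groups, whole, plain)
```

Keep in mind that `[A-Za-z]+` only matches letters, so text containing digits, spaces, or punctuation would need a wider character class.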

answered Jun 08 '20 at 12:01

Biowav

