How can I parse this HTML with regex to get what I need

Question

<strong>Description</strong>                                    This is some test description 1<strong>Areas</strong>

I want to get the text between (strong) Description (/strong) and the next (strong) Something (/strong) element, where the second label varies and is not always Areas.

I have been trying with this regex 'Description (.+)' but without results.

What would be the right expression to get 'This is some test description 1'?

*I'm using Python's regex library

post the full html code.. — Avinash Raj, Nov 18 '16 at 10:50

score 0 · Answer 1 · edited May 23 '17 at 12:24

It's not recommended to parse HTML using regex

If the task is very simple and not really parsing, you can try regex, but I would suggest using an HTML/XML parser. You can use Python's built-in HTML parser (html.parser) instead, or a library like BeautifulSoup.
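As a rough illustration of the parser route, here is a minimal sketch using only the standard library's html.parser. The class name DescriptionExtractor and the helper extract_description are made up for this example, and it assumes the label is exactly "Description" inside a strong tag and that the wanted text runs until the next strong tag opens.

```python
from html.parser import HTMLParser


class DescriptionExtractor(HTMLParser):
    """Collects the text after <strong>Description</strong> up to the next <strong>.

    Hypothetical helper class: the name and the matching rule are assumptions.
    """

    def __init__(self):
        super().__init__()
        self.in_strong = False   # True while inside a <strong> element
        self.capturing = False   # True once the Description label has been seen
        self.chunks = []         # text pieces collected after the label

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.capturing = False  # any new <strong> ends the captured text

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            # Start capturing only after the Description label itself.
            if data.strip() == "Description":
                self.capturing = True
        elif self.capturing:
            self.chunks.append(data)


def extract_description(html):
    """Return the stripped text following the Description label ('' if absent)."""
    parser = DescriptionExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()


sample = ('<strong>Description</strong>'
          '                                    This is some test description 1'
          '<strong>Areas</strong>')
print(extract_description(sample))
```

Because the closing label is never named, this works whatever the second strong element contains, which matches the requirement that it is not always Areas.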

Anyway, if you want to try to extract the data between tags, you need to be clearer. I'm not sure whether what you want is always the text between a </strong> and a <strong> tag. If so, you should be able to do something like:

import re
matches = re.search(r'</strong>(.+)<strong>', '<strong>Description</strong>                                    This is some test description 1<strong>Areas</strong>')
matches.group(1) # '                                    This is some test description 1'

If you want something more specific, with Description as the opening label and any other text as the closing label, you can use the regex:

<strong>Description<\/strong>(.+)<strong>(.+)<\/strong>
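For reference, a sketch of how that more specific pattern could be applied with Python's re module. The non-greedy group, the optional whitespace around the label, the re.DOTALL and re.IGNORECASE flags, and the function name description_text are all additions made for this example, on the assumption that the real pages may contain newlines or several strong elements.

```python
import re

# Text after <strong>Description</strong>, up to the next <strong>.
# Non-greedy (.*?) stops at the first following <strong>;
# DOTALL lets the description span several lines.
DESCRIPTION_RE = re.compile(
    r'<strong>\s*Description\s*</strong>(.*?)<strong>',
    re.DOTALL | re.IGNORECASE,
)


def description_text(html):
    """Return the stripped description text, or None if the label is not found."""
    match = DESCRIPTION_RE.search(html)
    return match.group(1).strip() if match else None


html = ('<strong>Description</strong>'
        '                                    This is some test description 1'
        '<strong>Areas</strong>')
print(description_text(html))
```

Checking for None before calling group(1) avoids an AttributeError on pages where the Description label is missing.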

But again, I would advise you to look into an actual HTML/XML parser.

I am using one, but for this specific bit I need regex. I didn't have any issues with the rest, but the whole set of pages is super unstructured, and for this specific thing I am finding regex useful. — Alex moro fernandez, Nov 18 '16 at 11:05
