Python Regex (pattern + wildcard + pattern)[return](pattern)

Question

I'm scraping with Selenium and parsing with Python's re module. From the string

<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>

I'm looking to return

756 W Peachtree St NW Atlanta GA 30308

This regex

("copy3").*?(?=</div>)

Gives me back

"copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308

But I'd like to exclude everything up to and including the > before the 756.

How do I add this to the pattern?

Obligatory [don't parse HTML using regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — Fynn Becker, Jan 22 '19 at 22:57
What are you scraping and parsing with? there might be a better option than a complicated RegEx — Dalvenjia, Jan 22 '19 at 22:58

score 2 · Accepted Answer · answered Jan 22 '19 at 23:01

Since you're already scraping with Selenium, use Selenium to get the text directly:

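from selenium.webdriver.common.by import By  # Selenium 4 replaced find_element_by_css_selector
my_element = driver.find_element(By.CSS_SELECTOR, 'div[type="copy3"]')
address = my_element.text  # the div's visible text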

Dalvenjia

score 1 · Answer 2 · answered Jan 22 '19 at 22:58

Match up to the closing > of the tag, then capture the non-< characters that follow in a group, and extract that group:

type="copy3"[^>]+>([^<]+)

https://regex101.com/r/BX2tVj/1
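
For reference, a minimal sketch of how that pattern could be used with the standard re module. The variable names `html` and `address` are only illustrative, and the sample string is the one from the question:

```python
import re

# Sample markup copied from the question.
html = '<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>'

# [^>]+> skips the rest of the opening tag; ([^<]+) captures the text up to the next <.
match = re.search(r'type="copy3"[^>]+>([^<]+)', html)

# group(0) would be the whole match; group(1) is only the captured address.
address = match.group(1) if match else None
print(address)
```

The whole match still starts at type="copy3", so it is group(1), not group(0), that holds the address.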

If you want the match itself to contain only what comes after the first >, you'll either have to use lookbehind (which, because re only supports fixed-width lookbehind, will only be reliable if you know exactly what the class="" attribute may contain):

(?<=type="copy3" class="sc-bxivhb dHqnfT">)[^<]+

https://regex101.com/r/BX2tVj/2
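
A small sketch of the lookbehind variant with the standard re module, again using the string from the question. The names `html`, `address` and `variable_width_ok` are made up for illustration. The second half shows why the class attribute has to be spelled out: re rejects a lookbehind whose width is not fixed.

```python
import re

# Sample markup copied from the question.
html = '<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>'

# Fixed-width lookbehind: the match contains only the address.
match = re.search(r'(?<=type="copy3" class="sc-bxivhb dHqnfT">)[^<]+', html)
address = match.group(0) if match else None
print(address)

# A lookbehind containing [^>]+ has no fixed width, so re refuses to compile it.
try:
    re.compile(r'(?<=type="copy3"[^>]+>)[^<]+')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re rejected the pattern:", exc)
```

If the class value changes between pages, this pattern stops matching, which is why the capture-group version is usually the safer choice.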

Or use the regex module instead, so you can use \K:

type="copy3"[^>]+>\K[^<]+

https://regex101.com/r/BX2tVj/3

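import regex  # third-party module: pip install regex
html = '<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>'
# \K discards everything matched so far, so the match is only the text after the >
match = regex.search(r'type="copy3"[^>]+>\K[^<]+', html)
print(match.group(0))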
