
I'm scraping with Selenium and parsing with re in Python. From the string

<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>

I'm looking to return

756 W Peachtree St NW Atlanta GA 30308

This regex

("copy3").*?(?=</div>)

Gives me back

"copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308

But I'd like to exclude everything up to and including the > before the 756.

How do I do this?

user2723494
  • Obligatory [don't parse HTML using regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Fynn Becker Jan 22 '19 at 22:57
  • What are you scraping and parsing with? There might be a better option than a complicated regex. – Dalvenjia Jan 22 '19 at 22:58

2 Answers


Since you're already scraping with Selenium, use Selenium to get the text directly:

from selenium.webdriver.common.by import By

my_element = driver.find_element(By.CSS_SELECTOR, 'div[type="copy3"]')
address = my_element.text
Dalvenjia

Match up to the > that closes the opening tag, then capture the run of non-< characters that follows in a group, and extract that group:

type="copy3"[^>]+>([^<]+)

https://regex101.com/r/BX2tVj/1
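
As a minimal sketch of how that pattern could be used with the standard re module (the variable names html and address are just illustrative), the address ends up in group 1:

```python
import re

# Sample markup copied from the question
html = '<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>'

# Skip the rest of the opening tag, then capture everything up to the next <
match = re.search(r'type="copy3"[^>]+>([^<]+)', html)

# group(1) holds only the captured text; guard against no match
address = match.group(1) if match else None
print(address)
```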

If you want the match itself to contain only what comes after the >, you'll either have to use a lookbehind (which, because Python's re only supports fixed-width lookbehinds, will only be reliable if you know exactly what the class="" attribute contains):

(?<=type="copy3" class="sc-bxivhb dHqnfT">)[^<]+

https://regex101.com/r/BX2tVj/2
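
A rough sketch of the lookbehind variant with the standard re module, assuming the class attribute is exactly the one shown in the question (the variable names are illustrative). Here the whole match, group 0, is already just the address:

```python
import re

# Sample markup copied from the question
html = '<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>'

# The lookbehind is fixed-width, so re accepts it; it asserts the prefix
# without including it in the match
pattern = r'(?<=type="copy3" class="sc-bxivhb dHqnfT">)[^<]+'
match = re.search(pattern, html)

# No capture group needed: the full match is the text after the >
address = match.group(0) if match else None
print(address)
```

If the class value changes between page loads, this pattern stops matching, so the capture-group form is likely the safer default.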

Or use the third-party regex module instead of re, so you can use \K, which drops everything matched before it from the result:

type="copy3"[^>]+>\K[^<]+

https://regex101.com/r/BX2tVj/3

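import regex  # third-party module, not the built-in re

# Avoid naming the variable str, which would shadow the built-in type
html = '<div type="copy3" class="sc-bxivhb dHqnfT">756 W Peachtree St NW Atlanta GA 30308</div>'
match = regex.search(r'type="copy3"[^>]+>\K[^<]+', html)
print(match.group())  # the text after the >, up to the next <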
CertainPerformance