I have this text:

a href="#" class="s-navigation--item js-gps-track js-products-menu" aria-controls="products-popover" data-controller="s-popover" data-action="s-popover#toggle" data-s-popover-placement="bottom" data-s-popover-toggle-class="is-selected" data-gps-track="top_nav.products.click({location:2, destination:1})" data-ga="["top navigation","products menu click",null,null,null]" aria-expanded="false"

With this regex:

attr_regex = '(?:\w+[-\.]*)+(?:=+[\'\"][\w\d\s:;,$@#!\[\]^&?%*\/+(){}.=-]*[\'\"])*'

I want to split this text into its individual attributes, each name kept together with its value. (The original post showed the expected result as an image, which is not reproduced here.)

Instead, my Python code produces this list:

['a', 'aria-controls="products-popover"', 'aria-expanded="false"', 'class="s-navigation--item js-gps-track js-products-menu"', 'data-action="s-popover#toggle"', 'data-controller="s-popover"', 'data-ga', 'top', 'navigation', 'products', 'menu', 'click', 'null', 'null', 'null', 'data-gps-track="top_nav.products.click({location:2, destination:1})"', 'data-s-popover-placement="bottom"', 'data-s-popover-toggle-class="is-selected"', 'href="#"']

As you can see, some words (`top`, `navigation`, `null`, and so on) come out as separate items even though they are part of the value of `data-ga`.

Python code:

import re

# text holds the string shown above
elements = re.findall(attr_regex, str(text))
print(elements)

Using a raw string doesn't fix the problem.

How can I fix this, and, better still, how can I make the regex work on any input text?

Mark Rotteveel
RifloSnake
  • Please show your python code. – Barmar Feb 08 '23 at 16:39
  • FYI you don't need to escape `.` inside `[]`, and you don't need to escape quotes in a regexp (you need to escape the quotes in the python string that match the delimiting quotes, but you can avoid that by using triple quotes around the string). – Barmar Feb 08 '23 at 16:41
  • You should generally avoid using regular expressions to parse HTML. Use a DOM parser like Beautiful Soup. – Barmar Feb 08 '23 at 16:45
  • I have to deal with other types of documents too, and using regex is my main objective. – RifloSnake Feb 08 '23 at 16:49
  • When I try your code `elements` begins with `['a', 'href="#"', 'class="s-navigation--item js-gps-track js-products-menu"',` which looks correct. – Barmar Feb 08 '23 at 17:00
  • The code for setting `attr_regex` is not valid, it's missing quotes around the string. I put the regex inside a raw string. Please show your *actual* code. – Barmar Feb 08 '23 at 17:02
  • I forgot the quotes sorry! – RifloSnake Feb 08 '23 at 17:04
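
Picking up the points from the comments (no escaping of `.` inside `[]`, and a raw triple-quoted string so the quotes need no escaping), here is a minimal sketch of one way to keep each attribute together with its value using only `re`. It is not the asker's regex and not a general HTML parser. The names `ATTR_RE`, `split_attributes` and `sample` are made up for illustration, and the sketch rests on one loud assumption: a value ends at the first matching quote that is followed by whitespace or by the end of the text, which is what lets the unescaped inner quotes in `data-ga` stay inside the value.

```python
import re

# Hypothetical pattern: an attribute (or tag) name, optionally followed by
# ="value" or ='value'.
# ASSUMPTION: the value ends at the first matching quote that is followed by
# whitespace or the end of the text, so unescaped inner quotes are tolerated.
ATTR_RE = re.compile(
    r"""
    [\w.-]+              # attribute or tag name
    (?:
        =
        (["'])           # opening quote, remembered as group 1
        .*?              # value, as short as possible
        \1               # the same quote character again
        (?=\s|$)         # only if whitespace or end of text follows
    )?
    """,
    re.VERBOSE | re.DOTALL,
)


def split_attributes(text):
    """Return every name or name="value" item found in text, in order."""
    # finditer is used instead of findall because the pattern has a capturing
    # group, and findall would return only that group.
    return [m.group(0) for m in ATTR_RE.finditer(text)]


# Shortened version of the text from the question.
sample = (
    'a href="#" class="s-navigation--item js-gps-track" '
    'data-ga="["top navigation","products menu click",null,null,null]" '
    'aria-expanded="false"'
)

for item in split_attributes(sample):
    print(item)
```

With this pattern the whole `data-ga="[...]"` attribute comes out as a single item. The heuristic would still split a value that contains a quote followed by a space, so for real HTML a DOM parser remains the more robust choice, as suggested in the comments.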

0 Answers