Need help with a regex in Python

Question

I want to extract the link from this HTML, and I have tried using a regex for it:

<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>

This is what I have so far, but it doesn't match all the way to the end:

/Attraction_Review-\w+-\w+-\w+

It only matches:

/Attraction_Review-g1787072-d2242305-Reviews

How can I make it match up to and including the .html part?

I want it to match the whole link.

Also, the link is generated dynamically, so it has no fixed length.

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — ivan_pozdeev, Nov 01 '15 at 00:26

alecxe · Accepted Answer · 2015-11-01T00:33:40.887

How about an alternative to the regex approach: use an HTML parser to get the onclick attribute value, and a JavaScript parser to extract the last function argument.

Here I'm using BeautifulSoup and slimit parsers:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>"""

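# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")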

# get onclick value
onclick = soup.find("div", class_="title", onclick=True)["onclick"]

# parse onclick js code
parser = Parser()
tree = parser.parse(onclick)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.FunctionCall):
        print(node.args[-1].value)

Prints:

'/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html'

I understand that using a JavaScript parser for such a simple and straightforward piece of JavaScript code might be overkill, so feel free to replace that part with a regex. But make sure the HTML itself is parsed with an HTML parser.
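If you do go the regex route for the JavaScript part, a minimal sketch could look like the one below. It works on the onclick value only (hardcoded here so the snippet runs on its own) and assumes the link is a single-quoted argument that starts with /Attraction_Review- and ends with .html. The names LINK_RE and link are just illustrative.

```python
import re

# The onclick attribute value, as an HTML parser would return it
onclick = (
    "ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, "
    "'/Attraction_Review-g1787072-d2242305-Reviews-"
    "Lake_Travis_Zipline_Adventures-Volente_Texas.html')"
)

# Assumption: the link is a single-quoted string beginning with
# /Attraction_Review- and ending with .html, with no quote inside it.
# [^']+ matches any run of non-quote characters, so the number of
# hyphen-separated parts does not matter.
LINK_RE = re.compile(r"'(/Attraction_Review-[^']+\.html)'")

match = LINK_RE.search(onclick)
link = match.group(1) if match else None
print(link)
```

The original pattern stopped early because it allows exactly three hyphen-separated \w+ groups; matching "anything up to the closing quote" avoids having to count them.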

Yes, it is being parsed with BeautifulSoup. I thought it'd be easier to do regex to get the link. — Syed Shamikh Shabbir, Nov 01 '15 at 00:39
It would not, because HTML cannot be parsed with regexps. Check the question that was linked at the top for detailed explanations. Using an HTML parser like alecxe says is not only easier, but also the only way to have it work correctly. — spectras, Nov 01 '15 at 01:11
