
I want to extract the link from this HTML, and I have tried using a regex for it:

<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>

This is what I have so far, but it doesn't match all the way to the end:

/Attraction_Review-\w+-\w+-\w+

It only matches:

/Attraction_Review-g1787072-d2242305-Reviews

How can I make it match through the .html part, so that it captures the whole link?

Also, the link is generated dynamically, so it has no fixed length.

alecxe
Syed Shamikh Shabbir
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ivan_pozdeev Nov 01 '15 at 00:26

1 Answer

How about an alternative to the regex approach: use an HTML parser to get the onclick attribute value, and a JavaScript parser to extract the last function argument.

Here I'm using the BeautifulSoup and slimit parsers:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>"""

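# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")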

# get onclick value
onclick = soup.find("div", class_="title", onclick=True)["onclick"]

# parse onclick js code
parser = Parser()
tree = parser.parse(onclick)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.FunctionCall):
        print(node.args[-1].value)

Prints:

'/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html'

I understand that a JavaScript parser may be overkill for such a simple, straightforward piece of JavaScript; feel free to replace that part with a regex. But make sure the HTML itself is parsed with an HTML parser.
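If you do swap the JavaScript parser for a regex, something along these lines might work. This is only a sketch under a few assumptions: it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it runs without third-party packages, and the names `OnclickCollector`, `LINK_RE` and `extract_links` are made up for illustration. It also assumes the link is always a quoted argument that starts with `/Attraction_Review-` and ends with `.html`. The pattern from the question stops early because `\w` does not match a hyphen, so `\w+-\w+-\w+` covers only three hyphen-separated parts; `[^'"]+` keeps going until the closing quote.

```python
import re
from html.parser import HTMLParser


class OnclickCollector(HTMLParser):
    """Collect the onclick value of every <div> whose class list contains "title"."""

    def __init__(self):
        super().__init__()
        self.onclicks = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and "title" in classes and attr_map.get("onclick"):
            self.onclicks.append(attr_map["onclick"])


# Assumed link shape: a quoted string starting with /Attraction_Review-
# and ending with .html. [^'"]+ also consumes the hyphens that \w+ stops at.
LINK_RE = re.compile(r"""['"](/Attraction_Review-[^'"]+\.html)['"]""")


def extract_links(html):
    """Return every matching link found in the collected onclick attributes."""
    collector = OnclickCollector()
    collector.feed(html)
    collector.close()
    links = []
    for onclick in collector.onclicks:
        links.extend(LINK_RE.findall(onclick))
    return links


data = """<div class="title" onclick="ta.setEvtCookie('Search_Results_Page', 'POI_Name', '', 0, '/Attraction_Review-g1787072-d2242305-Reviews-Lake_Travis_Zipline_Adventures-Volente_Texas.html')"><span>Lake Travis <span class="highlighted">Zipline</span> Adventures</span></div>"""

print(extract_links(data))
```

The HTML is still handled by an HTML parser; the regex only ever sees the short onclick string, which is why a simple pattern is tolerable there.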

alecxe