I've got the following Python code:

links = []
links.append(re.findall(b'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', body))
return simplejson.dumps({"link": links})

When run, it returns an undefined value within the HTML page.

Any help explaining why this happens would be great.

cBest
  • If you don't show us what is in `body`, how do you expect us to be able to help? – BoarGules Apr 03 '18 at 12:12
  • Have a look at https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and https://stackoverflow.com/questions/11709079/parsing-html-using-python – serv-inc Apr 03 '18 at 12:12
  • Can't you just use BS instead? https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-using-python – Olaf Górski Apr 03 '18 at 12:18

1 Answer

You're on the right track. It looks like you're scraping an HTML body for all links, right? There are certainly other ways to do it, but here's an example I put together using this page as input and your code as a base, with other stuff stripped out for brevity.

Note two things I've changed. First, I'm using json, not simplejson. Second, I'm assigning the return value of re.findall directly to links. There's no need to append: re.findall already returns a list, so appending it to an empty list just gives you a list of lists.

>>> import json
>>> import re
>>> 
>>> body = """
... <body class="question-page new-topbar">
...     <a href="https://stackoverflow.com" class="-logo js-gps-track" data-gps-track="top_nav.click({is_current:false, location:2, destination:8})">
...     <a href="https://stackoverflow.com">current community</a>
...     <a href="https://chat.stackoverflow.com" class="js-gps-track" data-gps-track="site_switcher.click({ item_type:6 })">chat</a>
...     <a href="https://stackoverflow.com/users/logout" class="js-gps-track" data-gps-track="site_switcher.click({ item_type:8 })">log out</a>
...     <a href="https://stackoverflow.com" class="current-site-link site-link js-gps-track" data-id="1" data-gps-track="site_switcher.click({ item_type:3 })">
...     <a href="https://meta.stackoverflow.com" class="site-link js-gps-track" data-id="552" data-gps-track="site.switch({ target_site:552, item_type:3 }),site_switcher.click({ item_type:4 })">
... </body>
... """
>>> 
>>> # raw str pattern rather than bytes: body is a str here, and Python 3 won't mix the two
>>> links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', body)
>>> json.dumps(links)
'["https://stackoverflow.com", "https://stackoverflow.com", 
"https://chat.stackoverflow.com", 
"https://stackoverflow.com/users/logout", 
"https://stackoverflow.com", "https://meta.stackoverflow.com"]'
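
For comparison, here is a small sketch of what the append version from the question produces. The sample `body` and the shortened pattern are made up for illustration (the question never shows the real input), so treat this as a demonstration of the list-of-lists shape rather than a reproduction of your exact data.

```python
import json
import re

# Hypothetical sample input; the real `body` is not shown in the question.
body = '<a href="https://example.com/a">a</a> <a href="https://example.org/b">b</a>'

# Simplified pattern, used here only to keep the example short.
pattern = r'https?://[^\s"<>]+'

# Question's approach: findall returns a list, and append nests it one level deeper.
nested = []
nested.append(re.findall(pattern, body))

# Direct assignment keeps the flat list that findall already returns.
flat = re.findall(pattern, body)

nested_json = json.dumps({"link": nested})
flat_json = json.dumps({"link": flat})

print(nested_json)
print(flat_json)
```

With the nested version, front-end code that expects `link` to be a flat array of strings would find a single inner array at index 0 instead.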

Now, that all looks right if you're trying to return serialized JSON to the front end. You haven't shown your front-end code or said which template library or Python web framework you're using, so we're left guessing as to where else it might be going wrong.
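
One guess that can be checked on the Python side alone: if you are on Python 3 and `body` is a bytes object (the question confirms neither, so both are assumptions), the bytes pattern makes re.findall return bytes matches, and the standard-library json module refuses to serialize those. A minimal sketch with made-up input and a shortened pattern:

```python
import json
import re

# Assumption: `body` arrives as bytes, e.g. a raw HTTP response body.
body = b'<a href="https://example.com/a">a</a>'

# Simplified bytes pattern, used here only to keep the example short.
pattern = rb'https?://[^\s"<>]+'

raw_links = re.findall(pattern, body)  # a list of bytes objects on Python 3

try:
    json.dumps({"link": raw_links})
    serialized_ok = True
except TypeError:
    # The standard-library json module does not serialize bytes.
    serialized_ok = False

# Decoding each match first gives plain strings that json can handle.
links = [link.decode("utf-8") for link in raw_links]
payload = json.dumps({"link": links})

print(serialized_ok)
print(payload)
```

If that is what's happening, the server is likely raising an error rather than returning JSON, which could show up as an undefined value on the page. Checking the server log for a TypeError would confirm or rule it out.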

wholevinski