How to extract a URL GET parameter from a tag in the full HTML text

Question

So I have an HTML page. It's full of various tags, most of which have a sessionid GET parameter in their href attribute. Example:

...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...

So, as you can see, the sessionid is the same everywhere. I just need to get its value into a variable, no matter which link it comes from: x=11692390. I'm a newbie with regex, and Google wasn't helpful. Thanks a lot!

Do not use RegEx to parse HTML. Obligatory Link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Oded, Aug 17 '10 at 09:07

score 11 · Accepted Answer · answered Aug 17 '10 at 09:24


This does not use regexes, but anyway, this is what you would do in Python 2.6:

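from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
import urlparse

# html is assumed to hold the full page source as a string
soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)  # only <a> tags that have an href

for link in links:
    href = link['href']
    url = urlparse.urlparse(href)
    # parse_qs maps each parameter name to a list of values
    params = urlparse.parse_qs(url.query)
    if 'sessionid' in params:
        print params['sessionid'][0]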

answered Aug 17 '10 at 09:24

Constantin


+1 for `urlparse`. This library is wonderful, and it would be a real shame to try to solve a problem like this without it – jwg Mar 19 '15 at 08:13

`import urllib.parse` and `urllib.parse.parse_qs(urllib.parse.urlparse(href).query)` in python3 – AbdealiLoKo Aug 29 '19 at 08:14
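
For reference, a minimal Python 3 sketch of the URL-parsing step from the comment above, applied to one of the href values from the question. The variable names are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

# One of the href values from the question.
href = "move_sum_to_7300001.asp?sessionid=11692390&mode_id=0"

# urlparse splits the URL into components; .query is the part after "?".
query = urlparse(href).query

# parse_qs maps each parameter name to a list of values,
# because a name may appear more than once in a query string.
params = parse_qs(query)

session_id = params["sessionid"][0]
print(session_id)
```

Because every value is a list, the `[0]` is needed to get the plain string.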

score 5 · Answer 2 · answered Aug 17 '10 at 09:09


Parse your HTML with a DOM parsing library and use getElementsByTagName('a') to grab anchors, iterate through them and use getAttribute('href') and then extract the string. Then you can use regex or split on ? to match/retrieve the session id.
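
A minimal sketch of this approach in Python 3, using only the standard library. `html.parser.HTMLParser` is an event-based parser rather than a DOM, so it stands in for getElementsByTagName/getAttribute here. The class name `HrefCollector` and the helper `get_param` are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def get_param(href, wanted):
    """Return the value of `wanted` by splitting on "?", "&" and "=", else None."""
    _, _, query = href.partition("?")
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        if name == wanted:
            return value
    return None


html = """
<a href="struct_view_distrib.asp?sessionid=11692390">...</a>
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">...</a>
"""

collector = HrefCollector()
collector.feed(html)
session_ids = [get_param(h, "sessionid") for h in collector.hrefs]
print(session_ids)
```

The manual split does no percent-decoding. If the values can contain escaped characters, `urllib.parse.parse_qs` is the safer choice for that last step.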

answered Aug 17 '10 at 09:09

meder omuraliev


score 2 · Answer 3 · answered Aug 17 '10 at 09:17

I would do this - before I was told it was a python issue ;)

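<script>
// Parse a query string into an object mapping name -> decoded value.
// With no argument, the current page's location.search is parsed.
function parseQString(loc) {
  var qs = {}; // plain object rather than an Array, since the keys are names
  // drop any #fragment before taking the part after "?"
  loc = (loc == null) ? location.search.substring(1) : loc.split('#')[0].split('?')[1];
  if (loc) {
    var parms = loc.split('&');
    for (var i = 0; i < parms.length; i++) {
      var nameValue = parms[i].split('='); // declared with var so it does not leak as a global
      // decodeURIComponent replaces the deprecated unescape
      qs[nameValue[0]] = (nameValue.length == 2) ? decodeURIComponent(nameValue[1].replace(/\+/g, ' ')) : null; // use null or ""
    }
  }
  return qs;
}
var ids = []; // will hold the IDs
window.onload = function() {
  var links = document.links;
  for (var i = 0, n = links.length; i < n; i++) {
    ids[i] = parseQString(links[i].href)["sessionid"];
  }
  alert(ids); // remove this when happy
  // here you can do
  alert(ids[3]);
  // to get the 4th link's sessionid
}
</script>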

<a href="struct_view_distrib.asp?sessionid=11692390">
...</a>
<a href="SHOW_PARENT.asp?sessionid=11692390">
...</a>
<a href="nakl_view.asp?sessionid=11692390">
...</a>
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
...</a>

edited Aug 17 '10 at 12:52

answered Aug 17 '10 at 09:17

mplungjan


Erm okee, so where did the python rear its head? Not tagged as such when I answered – mplungjan Aug 17 '10 at 09:31
Sorry, this is my first question here. I thought the question was only about regexes and forgot to tag it with Python as well – creitve Aug 17 '10 at 10:20
Interesting. So, there is no standard way to parse a uri in browser js? – Constantin Aug 17 '10 at 13:59
@Constantin: what do you mean? location.protocol, location.hostname, location.port, location.href, location.search and location.hash are what you can use, but location.search and .hash are strings that are not further atomised – mplungjan Aug 17 '10 at 15:12

score 1 · Answer 4 · answered Aug 17 '10 at 09:12


Below is a regex you can use to match href attributes and extract their values:

\b(?<=(href="))[^"]*?(?=")
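
If the page is as regular as the one in the question and only the session id is needed, a hedged alternative is to search for the parameter directly instead of matching whole href values. The sketch below uses Python's `re` module. The pattern assumes a purely numeric id preceded by ? or &, which is an assumption about this particular input, and it is not a general HTML parser.

```python
import re

html = """
<a href="struct_view_distrib.asp?sessionid=11692390">
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
"""

# [?&] ties the name to the start of a parameter, so a name such as
# "oldsessionid=" would not match; (\d+) captures the numeric value.
pattern = r'[?&]sessionid=(\d+)'

match = re.search(pattern, html)
x = match.group(1) if match else None
print(x)
```

If the page source escapes ampersands as `&amp;`, the pattern would need to allow for that as well.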

answered Aug 17 '10 at 09:12

Gopi



I wouldn't encourage using regular expressions to grab the attributes. Won't vote it down, but I wouldn't want to upvote it either. – Rob Aug 17 '10 at 09:27
Unless the DOM was not accessible, I completely agree. You have document.links[x].href and document.getElementsByTagName("a")[x].href right off the bat without using jQuery or regExp – mplungjan Aug 17 '10 at 09:34
Yes, I completely agree that regex is a bad idea for parsing HTML. If you look at my previous regex answers, I have been telling this to everyone. Since someone had already said this in another answer before me, and I am tired of saying the same thing again and again, I just put the regex here. – Gopi Aug 17 '10 at 10:11

score 1 · Answer 5 · answered Nov 07 '19 at 07:27

Complete example for Python 3, inspired by AbdealiJK:

from bs4 import BeautifulSoup
import urllib.parse

response = """...
<a href="struct_view_distrib.asp?sessionid=11692390">
...
<a href="SHOW_PARENT.asp?sessionid=11692390">
...
<a href="nakl_view.asp?sessionid=11692390">
...
<a href="move_sum_to_7300001.asp?sessionid=11692390&mode_id=0">
..."""

soup = BeautifulSoup(response, "lxml")  # needs lxml installed; "html.parser" also works
for link in soup.find_all('a', href=True):
    params = urllib.parse.parse_qs(urllib.parse.urlparse(link['href']).query)
    # parse_qs maps each name to a list of values; skip links without sessionid
    if "sessionid" in params:
        print(params["sessionid"][0])

score 1 · Answer 6 · answered Nov 07 '19 at 08:09

bs4 4.7.1+ has all the functionality you need for this. Use CSS attribute selectors combined with :not to match only links whose URL has sessionid as its sole parameter, use select_one to limit the result to the first match, then split on that parameter name and take the last element of the resulting list:

soup.select_one("[href*='asp?sessionid']:not([href*='&'])")['href'].split('sessionid=')[-1]
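
One thing to keep in mind with splitting on 'sessionid=': it only yields a clean value when nothing follows the parameter, which is why the selector above excludes hrefs containing '&'. A small standard-library sketch of the difference, using href values from the question (no bs4 is needed for this part):

```python
from urllib.parse import urlparse, parse_qs

plain = "nakl_view.asp?sessionid=11692390"
with_extra = "move_sum_to_7300001.asp?sessionid=11692390&mode_id=0"

# Splitting works when sessionid is the last thing in the URL...
split_plain = plain.split("sessionid=")[-1]

# ...but drags the remaining parameters along otherwise.
split_extra = with_extra.split("sessionid=")[-1]

# parse_qs separates the parameters first, so the position does not matter.
parsed_extra = parse_qs(urlparse(with_extra).query)["sessionid"][0]

print(split_plain)
print(split_extra)
print(parsed_extra)
```

If the links on a page might all carry extra parameters, passing the href through `parse_qs` avoids having to filter them out in the selector.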
