I am trying to find a specific value in an HTML response fetched with the requests library:

import requests

while True:
    url = 'https://www.example.com/'
    page = requests.get(url, allow_redirects=True, verify=False)
    var = page.content

The value looks like part of a dictionary, but I cannot convert the whole response to a dict with var = dict(page.content); it fails with the error "dictionary update sequence element #0 has length 1; 2 is required".

I have also tried re.search, like this:

searchObj = re.search(r'(.*)id="X" value=(.*?) .*', var, re.M)
if searchObj:
    print "search --> searchObj.group() : ", searchObj.group()

but that is not what I am looking for. The end goal is to find a specific value in the content returned from a website request. In the content it looks something like <input type="hidden" autocomplete="off" name="test" id="test" value="12345" />, and I need to extract value="12345", or more specifically just the 12345.

Thanks in advance

In the stars
  • Use beautifulsoup, find the tag and extract the attribute – Padraic Cunningham Aug 14 '15 at 18:33
  • I really, really hope you either own this site or have spoken with the owner, because if you don't know what you're doing and you're hitting that page with an infinite loop, someone is going to be very justifiably angry with you. – Two-Bit Alchemist Aug 14 '15 at 18:35
  • @Two-BitAlchemist while you are right that he shouldn't scrape a page with an infinite loop without a timeout, it really won't matter. Most modern sites are built with either Apache or Nginx, and they will close his connection if too many connection attempts are made within a short period of time. – nivix zixer Aug 14 '15 at 18:40
  • @nivixzixer And I run an anti-virus but I'm still going to be justifiably angry if you try to infect my machine. – Two-Bit Alchemist Aug 14 '15 at 19:03

3 Answers

Don't use regex for this; use a library made for parsing HTML, for example BeautifulSoup:

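import bs4 as bs
import requests

resp = requests.get('http://www.google.com')
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs.BeautifulSoup(resp.text, 'html.parser')
element = soup.find(attrs={'id': 'hplogo'})  # will search for the 'google' logo
print(element)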

>> <div align="left" id="hplogo" onload="window.lol&amp;&amp;lol()" style="height:110px;width:276px;background:url(/images/srpr/logo9w.png) no-repeat" title="Google"><div nowrap="" style="color:#777;font-size:16px;font-weight:bold;position:relative;top:70px;left:218px">ישראל</div></div>
DeepSpace
  • One of my favorite posts on the site! – Deacon Aug 14 '15 at 18:39
  • This works to print the whole element - but I am looking only for the 1 value="" etc - works great to print only that subsection that I was looking to dissect but I need to refine it further. Ty so far – In the stars Aug 14 '15 at 18:48
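
To narrow the result down to just the attribute, as the follow-up comment asks, one option is to read the attribute from the tag that find() returns. The following is a minimal sketch, assuming the markup matches the <input> element quoted in the question; the html string and the <form> wrapper are stand-ins for a real response body, which would normally come from requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the <form> wrapper is an assumption.
# A real page body would come from requests.get(url).text.
html = '<form><input type="hidden" autocomplete="off" name="test" id="test" value="12345" /></form>'

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("input", attrs={"id": "test"})

# find() returns None when nothing matches, so guard before reading attributes.
value = tag.get("value") if tag is not None else None
print(value)
```

Using tag.get("value") rather than tag["value"] means a missing attribute yields None instead of raising KeyError.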

This should work for you:

import re
import requests

# Capture the value attribute that follows id="X"
VALUE_RGX = re.compile(r'id="X" value="([A-Za-z0-9_\-]+)"')

url = 'https://www.example.com/'
page = requests.get(url, allow_redirects=True, verify=False)

# search() scans the whole document; match() would only match at the very start
match = VALUE_RGX.search(page.text)
if match:
    print("Found Value: {}".format(match.group(1)))
else:
    print("Did not find value..")
nivix zixer

It is preferable to parse XML and HTML with a specialised library, but for a one-off operation with predictable output a regex is fine. The following pattern should work:

r'id=\"test\"\svalue=\"(.*?)\"'

The (.*) at the beginning of your pattern captures everything that precedes the id attribute, so drop it and read the value from group(1).

Ben Beirut
  • Using your comment I was able to start extracting only what I need from responses, appreciate it: searchObj = re.search( r'id=\"X\"\svalue=\"(.*?)\"', var, re.M|re.I); if searchObj: print searchObj.group(1); else: print "Nothing found!!" – In the stars Aug 14 '15 at 19:09
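
The difference between the two patterns discussed above can be checked offline. This is a rough sketch, assuming the input is the single <input> line quoted in the question and that the id is "test"; the variable names greedy and targeted are made up for illustration.

```python
import re

# The element quoted in the question, used here as a stand-in for the page body.
html = '<input type="hidden" autocomplete="off" name="test" id="test" value="12345" />'

# Pattern shaped like the question's attempt: the leading (.*) swallows
# everything before the id attribute, and group() returns the whole match.
greedy = re.search(r'(.*)id="test" value=(.*?) .*', html)

# Pattern from the last answer: only the text between the quotes is captured.
targeted = re.search(r'id="test"\svalue="(.*?)"', html)

print(greedy.group())     # the entire input line
print(greedy.group(2))    # "12345" with the surrounding quotes
print(targeted.group(1))  # 12345
```

Both patterns depend on the attributes appearing in exactly this order with a single whitespace character between them, which is why a parser is the safer choice for markup that may change.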