I am trying to scrape a list from a website. The list is spread across 4 pages, and the only URL parameter that changes from page to page is "offset". So:
1st page offset = 0
2nd page offset = 100
3rd page offset = 200
4th page offset = 300
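In other words, the four URLs differ only in the offset query parameter. A minimal sketch of how they can be generated, assuming Python 3's urllib.parse (my own code in this post uses the Python 2 urllib module) and a helper name, build_page_urls, that is made up here for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "http://sampleurl.com"  # placeholder host used in this post


def build_page_urls(pages=4, page_size=100):
    """Return one URL per page; only the offset parameter changes."""
    urls = []
    for offset in range(0, pages * page_size, page_size):
        # urlencode turns the dict into "request=1&offset=<n>"
        query = urlencode({"request": 1, "offset": offset})
        urls.append(BASE_URL + "?" + query)
    return urls


for url in build_page_urls():
    print(url)
```

Printing the URLs like this makes it easy to paste each one into a browser and confirm that the page really changes with the offset.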
I have written the following code:
import re
import urllib
urlHandle = urllib.urlopen("http://sampleurl.com?request=1&offset=0")
content = urlHandle.read()
pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>')
for match in pattern1.finditer(content):
    print(match.group(1))
The above code retrieves the values as required for "offset=0", which I have hard-coded in the URL itself. Since the list extends over 4 pages, I then tried the following code:
import re
import urllib
import urllib2
for i in range(0,400,100):
    targeturl = "http://sampleurl.com?request=1&"
    values = {'offset':i}
    data = urllib.urlencode(values)
    # req = urllib2.Request(targeturl,data)
    finalurl = targeturl + data
    urlHandle = urllib.urlopen(finalurl)
    content = urlHandle.read()
    pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>')
    for match in pattern1.finditer(content):
        print(match.group(1))
Somehow it does not return anything. What am I doing wrong?
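One way to rule out the regular expression itself is to run it against a fixed HTML fragment instead of the live page. The sketch below is in Python 3 syntax; the fragment is made up for illustration (the paths and names are hypothetical, not taken from the real site), and the pattern is an equivalent of mine written as a raw string:

```python
import re

# Equivalent of the pattern in this post, written as a raw string so the
# backslashes reach the regex engine unchanged.
PLAYER_LINK = re.compile(r'<a href="/players/\w/\w+\d{2}\.html">([^<]*)</a>')

# Hypothetical sample markup; the real page may differ.
sample_html = (
    '<td><a href="/players/a/abcdef01.html">First Player</a></td>'
    '<td><a href="/players/b/bcdefg02.html">Second Player</a></td>'
    '<td><a href="/teams/xyz.html">Not a player</a></td>'
)

# Collect the link text of every anchor that matches the player pattern.
names = [m.group(1) for m in PLAYER_LINK.finditer(sample_html)]
print(names)
```

If this prints the two player names but the live run prints nothing, the problem is more likely in how the page is fetched inside the loop than in the pattern.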
<< EDIT >>
I also tried the code below, with the offset hard-coded inside the loop. It does not work either:
import re
import urllib
import urllib2
for i in range(0,400,100):
    targeturl = "http://sampleurl.com?request=1&offset=0"
    urlHandle = urllib.urlopen(targeturl)
    content = urlHandle.read()
    pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>')
    for match in pattern1.finditer(content):
        print(match.group(1))
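What I am ultimately after is something along these lines: one request per offset, with the matches from all four pages collected in order. This is only a sketch in Python 3 syntax, not a confirmed fix. The names scrape_all_pages, fetch_html and fake_fetch are made up here, the UTF-8 decoding is an assumption about the site, and the fetch function is passed in so the loop can be tried with canned HTML before hitting the real site:

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://sampleurl.com"  # placeholder host used in this post
PLAYER_LINK = re.compile(r'<a href="/players/\w/\w+\d{2}\.html">([^<]*)</a>')


def fetch_html(url):
    """Download a page and decode it to text (UTF-8 is an assumption)."""
    with urlopen(url) as handle:
        return handle.read().decode("utf-8", errors="replace")


def scrape_all_pages(fetch=fetch_html, offsets=range(0, 400, 100)):
    """Collect the link text of every player link across all offsets."""
    names = []
    for offset in offsets:
        # Everything below is inside the loop, so it runs once per page.
        url = BASE_URL + "?" + urlencode({"request": 1, "offset": offset})
        html = fetch(url)
        names.extend(m.group(1) for m in PLAYER_LINK.finditer(html))
    return names


def fake_fetch(url):
    """Stand-in for the network call: one hypothetical link per page."""
    offset = url.rsplit("=", 1)[1]
    return '<a href="/players/x/sample01.html">Player at offset %s</a>' % offset


print(scrape_all_pages(fetch=fake_fetch))
```

Calling scrape_all_pages() with no arguments would use the urlopen-based fetcher against the real URLs instead of the canned one.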