
The situation: I'm scraping a website whose page URLs follow the pattern:

http://www.pageadress/somestuff/ID-HERE/

Nothing unusual. I have a lot of IDs that I need to scrape, and most of them work correctly. However, the site behaves like a portal: when you enter such an address in a browser, you get redirected to:

http://www.pageadress/somestuff/ID-HERE-title_of_subpage

What might be problematic is that the title sometimes contains non-ASCII characters (in roughly 0.01% of cases), and that (I think) is why I get this exception:

  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 501, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 684, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.4/urllib/request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.4/http/client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.4/http/client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.4/http/client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 38-39: ordinal not in range(128)

The bizarre thing is that the URL I'm redirected to has no non-ASCII characters at positions 38-39; they appear at other positions.
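
Looking at the last two frames of the traceback, what actually gets encoded is the HTTP request line, `GET <path> HTTP/1.1`, not the full URL, so the reported offsets are shifted relative to the address shown in the browser (and the two adjacent positions suggest a single character arriving as two raw UTF-8 bytes). A quick illustration, with a made-up slug standing in for the real title:

# Illustration only: "großer-tipp" is an invented slug, not the page's real title.
# http.client encodes this whole request line, so the positions reported in the
# UnicodeEncodeError are offsets into it, not into the full URL.
request_line = "GET /archive/tip/3207221-großer-tipp HTTP/1.1"
request_line.encode('ascii')  # raises UnicodeEncodeError at the index of "ß" here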

The code being used:

import socket
import urllib.parse
import urllib.request

socket.setdefaulttimeout(30)

# One of the IDs whose redirect target contains non-ASCII characters
url = "https://www.bettingexpert.com/archive/tip/3207221"
headers = {'User-Agent': 'Mozilla/5.0'}
# The UnicodeEncodeError is raised while urllib follows the redirect
content = urllib.request.urlopen(urllib.request.Request(url, None, headers)).read().decode('utf-8')

Is there any way to get around this, preferably without using other libraries?

//Oh, the glorious world of Python, creating thousands of problems I wouldn't even have thought possible if I were writing in Ruby.

  • The code that you provide is not complete (it cannot be executed). What is the value of `id`? Same question for the actual URL (which is not provided here). If you want others to be able to debug the problem, then you need to provide the real URL. – barak manos Aug 14 '16 at 09:59
  • Did you try using `urllib.parse.quote(id.__str__())` instead of plain `id.__str__()`? – Phillip Aug 14 '16 at 10:02
  • `id` is an integer @Phillip – piezol Aug 14 '16 at 10:06
  • Sorry barak, I thought it might be a more generic problem. The URL I'm trying to reach is https://www.bettingexpert.com/archive/tip/3207221 @barakmanos – piezol Aug 14 '16 at 10:07
  • @piezol If it's an option for you, switch to Python 3.5, which has a workaround for your issue. The problem is that bettingexpert formally doesn't return a valid URL in the `Location` header, because it doesn't properly quote the `ß` character. If switching is not an option, you can either use another library like `requests` or install a custom urllib opener that mitigates the issue (see the sketch after these comments). See e.g. [this related question](http://stackoverflow.com/questions/4389572/how-to-fetch-a-non-ascii-url-with-python-urlopen). – Phillip Aug 14 '16 at 10:15
  • The related question only deals with URLs that are passed in, not ones that are redirected to. Also, the `requests` library raises a TooManyRedirects exception (though it works fine with URLs that don't redirect to non-ASCII addresses). Unfortunately, switching is not an option :/ @Phillip – piezol Aug 14 '16 at 10:26
  • @piezol, are you limited to Python 3, or can you use Python 2? – Victor Aug 14 '16 at 11:00
  • Python 3.4 (part of a Django app) @Victor – piezol Aug 14 '16 at 11:01
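
For reference, a minimal sketch of the custom-opener workaround Phillip mentions, kept to the standard library. The handler name is invented for this illustration, and it assumes (as CPython's http.client does) that the raw bytes of the Location header were decoded as Latin-1, so encoding them back to Latin-1 recovers the original bytes before percent-quoting them:

import urllib.parse
import urllib.request

class QuotingRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Hypothetical helper: percent-encode whatever non-ASCII characters
    # the server left raw in the Location header before the new request
    # line is built, so only ASCII ever reaches putrequest().
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        newurl = urllib.parse.quote(
            newurl, safe="%/:=&?~#+!$,;'@()*[]|", encoding='latin-1')
        return super().redirect_request(req, fp, code, msg, headers, newurl)

opener = urllib.request.build_opener(QuotingRedirectHandler)
request = urllib.request.Request(
    "https://www.bettingexpert.com/archive/tip/3207221",
    headers={'User-Agent': 'Mozilla/5.0'})
content = opener.open(request, timeout=30).read().decode('utf-8')

Every redirect hop passes through the same handler, so chained redirects are covered as well.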

1 Answer


So, I've found a solution to my specific problem: I gathered the remaining part of the URL from their API, and after some minor transformations I can access the page without any redirects. That, of course, doesn't solve the general problem, which might come back in the future, so I've also developed a 'solution'.

By posting this code here I've basically guaranteed that I'll never be employed as a programmer, so don't look at it while you're eating.

"Capybara" gem and poltergeist needed because why not?

#test.py
import os
import socket
import urllib.parse
import urllib.request

tip_id = 3207221
socket.setdefaulttimeout(30)
url = "http://www.bettingexpert.com/archive/tip/" + str(tip_id)
headers = {'User-Agent': 'Mozilla/5.0'}

try:
    content = urllib.request.urlopen(urllib.request.Request(url, None, headers)).read().decode('utf-8')
except UnicodeEncodeError:
    # Plan B: let a real browser follow the redirect, dump the HTML into
    # a file named after the tip id, then read it back and clean up.
    print("Overkill activated")
    os.system('ruby test.rb ' + str(tip_id))
    with open(str(tip_id), 'r') as file:
        content = file.read()
    os.remove(str(tip_id))
print(content)

And the Ruby script:

#test.rb
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

# PhantomJS driver with a 30 s timeout; skip assets we don't need
Capybara.register_driver :poltergeist_no_timeout do |app|
  driver = Capybara::Poltergeist::Driver.new(app, timeout: 30)
  driver.browser.url_blacklist = %w(
    http://fonts.googleapis.com
    http://html5shiv.googlecode.com
  )
  driver
end
Capybara.default_driver = :poltergeist_no_timeout
Capybara.run_server = false
include Capybara::DSL

# Keep retrying until the page loads; the browser follows the redirect itself
begin
  page.reset_session!
  page.visit("http://www.bettingexpert.com/archive/tip/#{ARGV[0]}")
rescue
  retry
end

# Hand the rendered HTML back via a file named after the tip id
File.open(ARGV[0], 'w') do |file|
  file.print(page.html)
end
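
The point of the detour is that the non-ASCII `Location` header never reaches Python at all: PhantomJS follows the redirect on its own, and Python only reads the finished HTML back from disk, with the file named after the tip id serving as the handoff between the two processes.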