
I am trying to scrape with BS4 via Tor, using the To Russia With Love tutorial from the Stem project.

I've rewritten the code a bit, drawing on this answer among others, and it now looks like this:

import io

import pycurl
import stem.process
from stem.util import term

SOCKS_PORT = 7000


def query(url):
    # Fetch the URL through the local Tor SOCKS proxy.
    output = io.BytesIO()

    query = pycurl.Curl()
    query.setopt(pycurl.URL, url)
    query.setopt(pycurl.PROXY, 'localhost')
    query.setopt(pycurl.PROXYPORT, SOCKS_PORT)
    query.setopt(pycurl.PROXYTYPE, pycurl.PROXYTYPE_SOCKS5_HOSTNAME)
    query.setopt(pycurl.WRITEFUNCTION, output.write)

    try:
        query.perform()
        return output.getvalue()
    except pycurl.error as exc:
        return "Unable to reach %s (%s)" % (url, exc)


def print_bootstrap_lines(line):
    # Only echo Tor's bootstrap progress messages.
    if "Bootstrapped " in line:
        print(term.format(line, term.Color.BLUE))


print(term.format("Starting Tor:\n", term.Attr.BOLD))

tor_process = stem.process.launch_tor_with_config(
    tor_cmd='/Applications/TorBrowser.app/Contents/MacOS/Tor/tor.real',
    config={
        'SocksPort': str(SOCKS_PORT),
        'ExitNodes': '{ru}',
        'GeoIPFile': r'/Applications/TorBrowser.app/Contents/Resources/TorBrowser/Tor/geoip',
        'GeoIPv6File': r'/Applications/TorBrowser.app/Contents/Resources/TorBrowser/Tor/geoip6',
    },
    init_msg_handler=print_bootstrap_lines,
)

print(term.format("\nChecking our endpoint:\n", term.Attr.BOLD))
print(term.format(query("https://www.atagar.com/echo.php"), term.Color.BLUE))

I am able to establish a Tor circuit, but at "Checking our endpoint" I receive the following error:

Checking our endpoint:

Traceback (most recent call last):
  File "<ipython-input-804-68f8df2c050b>", line 40, in <module>
    print(term.format(query('https://www.atagar.com/echo.php'), term.Color.BLUE))
  File "/Applications/anaconda/lib/python3.6/site-packages/stem/util/term.py", line 139, in format
    if RESET in msg:
TypeError: a bytes-like object is required, not 'str'

What should I change to see the endpoint?

I've temporarily worked around it by replacing the last line of the above code with:

test=requests.get('https://www.atagar.com/echo.php')
soup = BeautifulSoup(test.content, 'html.parser')
print(soup)

but I'd like to know how to get the 'original' line working.

LucSpan
  • you might want to post your code, otherwise people can't help you! Looks like you're giving it a string when it wants a bytes-like object, you can convert e.g. by using `b` . [This](http://stackoverflow.com/questions/14010551/how-to-convert-between-bytes-and-strings-in-python-3) SO post might be helpful. – patrick Mar 18 '17 at 12:40
  • Possible duplicate of [python 3.5: TypeError: a bytes-like object is required, not 'str' when writing to a file](http://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str-when-writing-t) – tripleee Mar 18 '17 at 13:59
  • @patrick. I added the code. – LucSpan Mar 18 '17 at 15:54
  • have you tried changing the `url` variable in here: `query.setopt(pycurl.URL, url)` to a byte string? see here on [input handling](http://pycurl.io/docs/latest/unicode.html#unicode): *Under Python 3, as PycURL invokes the write callback with bytes argument, the response must be written to a BytesIO object* ; also has a template to copy – patrick Mar 18 '17 at 16:12
  • I realise I do not fully understand the code. I've followed your advice and set `url=b'https://www.atagar.com/echo.php'`. Apart from that I've left the code unchanged except for the last line, where I now have `print(term.format(query(url), term.Color.BLUE))` instead of my temporary solution. This results in the same error :( – LucSpan Mar 18 '17 at 16:33
  • Changing your return to `return output.getvalue().decode("utf-8")` should fix it. Note that you may need to change utf-8 to another encoding but I'd try that first. – drew010 Mar 18 '17 at 19:45
  • @drew010: Thanks! Worked like a charm :D – LucSpan Mar 19 '17 at 09:13
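
The fix from drew010's comment can be tried without Tor or pycurl. The following is a minimal sketch under stated assumptions: `read_body` is a hypothetical helper, the response bytes are a placeholder, and `RESET` is only a stand-in for the string constant that `term.format` checks for in the traceback. It shows that once the buffer's bytes are decoded to `str`, the `RESET in msg` style check no longer raises.

```python
import io

# Stand-in for the reset sequence checked in term.format; an assumption
# for illustration, not imported from stem.
RESET = "\x1b[0m"


def read_body(buffer, encoding="utf-8"):
    """Hypothetical helper: return the buffered bytes decoded to str."""
    return buffer.getvalue().decode(encoding)


# Simulate pycurl's WRITEFUNCTION callback, which writes bytes to the buffer.
output = io.BytesIO()
output.write(b"placeholder response body")

body = read_body(output)
contains_reset = RESET in body  # str in str, so no TypeError is raised
print(type(body).__name__, contains_reset)
```

If the page is not UTF-8, `decode` accepts another encoding name, which is why the helper takes `encoding` as a parameter.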

1 Answer

You must be using Python 3, whereas that code was written for Python 2. In Python 2, str and bytes are the same type; in Python 3, str corresponds to Python 2's unicode. To write a byte string literal in Python 3, put a b directly before the opening quote, e.g.:

b"this is a byte string"
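
As a small illustration of the str/bytes split in Python 3, here is a sketch (the variable names are made up for this example) showing that mixing the two types in an `in` test raises the same kind of `TypeError` as in the question, and that the test succeeds once both sides have the same type.

```python
raw = b"this is a byte string"
needle = "byte"

# Mixing str and bytes in a membership test raises TypeError in Python 3.
try:
    needle in raw
    error = None
except TypeError as exc:
    error = str(exc)

# Either side can be converted so that both operands have the same type.
found_bytes = needle.encode("utf-8") in raw
found_text = needle in raw.decode("utf-8")

print(error)
print(found_bytes, found_text)
```

Which side to convert depends on what the consumer expects: code that works with text, such as a formatting function, needs the bytes decoded to `str`.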
Julien
  • Thank you for your answer. However, this doesn't seem to solve my problem. I think it might be something with the `query()` command in the last line of the code. – LucSpan Mar 18 '17 at 16:07