
A bare-bones test script works fine, but in my actual code I get a Unicode-related error:

temp_url = "http://search.jd.com/Search?keyword=" + quote(self.keywords)

File "/usr/lib/python3.5/urllib/parse.py", line 706, in quote
    string = string.encode(encoding, errors)

UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in position 0: surrogates not allowed

I'm using an argument to pass my search string to Scrapy (1.4):

scrapy crawl jdspider -a keywords="电灯"

and the relevant code looks like:

# -*- coding: utf-8 -*-
import scrapy, re
from urllib.parse import quote

def __init__(self, keywords=''):
    self.keywords = keywords.strip()

    temp_url = "http://search.jd.com/Search?keyword=" + quote(self.keywords)
    print ( temp_url )

The `print` call is never reached, so the failure happens inside `quote`.

Python 3.5.2, Scrapy 1.4.0, Kubuntu 16.04

What am I doing wrong?

Chris
  • Not 100% sure about this but if I'm using Chinese characters in the scraper I need to define the string as unicode. Example: `u'电灯'`. I think you can't do something in the crawl command, but maybe you can define it in the function like this: `def __init__(self, keywords=u''):` – Casper Jun 11 '17 at 14:10
  • Also take a look at [this SO question](https://stackoverflow.com/questions/29486331/python-convert-chinese-characters-in-url). You might be able to adopt a similar method before constructing the `temp_url` variable on the `keywords` variable. – Casper Jun 11 '17 at 14:15
  • @Casper This is Python 3 (see the tags), where every string is unicode (in Python-2 terms), unless you explicitly use `b'byte sequences'`. – lenz Jun 11 '17 at 20:32
  • @Chris The `keywords` argument might be decoded wrongly. What does `print(ascii(keywords))` give you (before quoting it)? – lenz Jun 11 '17 at 20:38
  • @lenz Yes, Python 3 - so it should be unicode. Printing the `keywords` gives me: `'\udce7\udc94\udcb5\udce7\udc81\udcaf'` And that's what I see in the error as well: `UnicodeEncodeError: 'utf-8' codec can't encode character '\udce7' in position 0: surrogates not allowed` – Chris Jun 11 '17 at 23:06
  • Ok, I don't understand why this happens (how exactly the keywords parameter is decoded in a wrong way). But I see a pattern: The correct UTF-8 byte sequence for "电灯" is `E7 94 B5 E7 81 AF`, so each UTF-8 byte corresponds to the lower half of one of the useless codepoints you get. – lenz Jun 11 '17 at 23:21
  • Actually .. now that you mention the parameter decoding .. I'm using SSH to copy and paste the scrapy call including parameters on a terminal. Then it might be something related to the bash/ssh config? – Chris Jun 12 '17 at 00:39
  • Thanks for pointing me towards it. I checked the locales (`locale`), commented out `AcceptEnv LANG LC_*` in `/etc/ssh/sshd_config`, and generated the Chinese locale with `sudo locale-gen zh_CN.UTF-8`. Now it's working. – Chris Jun 12 '17 at 03:25
  • Please, provide your solution as an answer to give it visibility. – Gallaecio Jan 31 '19 at 13:33
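
The pattern described in the comments (each UTF-8 byte showing up as the low half of a `\udcXX` code point) can be reproduced in isolation. The sketch below assumes the shell passed the raw UTF-8 bytes while Python decoded the command line as ASCII with the `surrogateescape` error handler, which is what happens under an ASCII-only locale. The names `raw`, `mangled`, and `repair_argument` are illustrative only and are not part of Scrapy or the original spider. The fix reported in the thread was correcting the locale, so treat this as a diagnostic aid or a workaround rather than the solution.

```python
from urllib.parse import quote

# The UTF-8 bytes a terminal would send for the keyword.
raw = "电灯".encode("utf-8")  # E7 94 B5 E7 81 AF

# Assumption: argv was decoded as ASCII with 'surrogateescape', so every
# non-ASCII byte 0xXX becomes the lone surrogate U+DCXX.
mangled = raw.decode("ascii", errors="surrogateescape")
print(ascii(mangled))  # '\udce7\udc94\udcb5\udce7\udc81\udcaf'


def repair_argument(text, real_encoding="utf-8"):
    """Hypothetical helper: undo a surrogateescape decode of UTF-8 input.

    Strings that were decoded correctly are returned unchanged.
    """
    try:
        original_bytes = text.encode("ascii", errors="surrogateescape")
        return original_bytes.decode(real_encoding)
    except UnicodeError:
        return text


keywords = repair_argument(mangled)
temp_url = "http://search.jd.com/Search?keyword=" + quote(keywords)
print(temp_url)  # http://search.jd.com/Search?keyword=%E7%94%B5%E7%81%AF
```

Calling `quote(mangled)` directly raises the same `UnicodeEncodeError: ... surrogates not allowed` as in the question, because lone surrogates cannot be encoded as UTF-8.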

1 Answer


Problems like this are common when you work with Chinese or other non-ASCII characters or symbols.
Try encoding the string with an applicable encoding other than UTF-8:
https://docs.python.org/3/library/codecs.html#standard-encodings

But first, ask yourself whether removing this character would make the information useless or less useful.

If it would not, try removing the character. It seems to be the first character in the string.

Use try and except to catch the exception and then
-- remove the first character,
or, better,
use a for loop to check every character and remove the ones you cannot encode.
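
The loop suggested above might look like the following minimal sketch. The function name `drop_unencodable` is hypothetical. Note that in the asker's case every character of the keyword arrives as a lone surrogate, so this approach would strip the whole search term and leave an empty string; it only helps when a few stray characters are the problem.

```python
def drop_unencodable(text, encoding="utf-8"):
    """Return text without the characters that cannot be encoded."""
    kept = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Skip characters such as lone surrogates.
            continue
        kept.append(ch)
    return "".join(kept)


print(drop_unencodable("\udce8abc"))  # abc
```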

vintol