I am extracting data from a website, and one entry contains a special character: Comfort Inn And Suites�? Blazing Stump. When I try to extract it, the following error is thrown:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
        yield it.next()
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
        for x in result:
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
        print repr(business.select('a[@class="name"]/text()').extract()[0])
      File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
        result = self.xpathev(xpath)
      File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
      File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
      File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
      File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
      File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
      File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
      File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
    exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte

I have tried a number of things suggested on the web, such as decode('utf-8') and unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.

The source URL is "http://www.truelocal.com.au/find/hotels/97/", and the entry I am referring to is the fourth one on that page.

2 Answers


You have a bad Mojibake in the original webpage, probably due to bad handling of Unicode in the data entry somewhere. The actual UTF-8 bytes in the source are C3 3F C2 A0 when expressed in hexadecimal.

I think it was once a U+00A0 NO-BREAK SPACE. Encoded to UTF-8 that becomes C2 A0; interpret those bytes as Latin-1 and encode to UTF-8 again and you get C3 82 C2 A0. But 82 is a control character when read as Latin-1, so somewhere along the way it was substituted by a ? question mark, hex 3F when encoded.
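That corruption chain can be reproduced in a few lines (a sketch in Python 3 syntax; the question itself uses Python 2, where `str`/`bytes` differ, but the byte values are identical):

```python
# Start from the character the page author presumably intended.
nbsp = u"\u00a0"                                   # NO-BREAK SPACE

utf8_once = nbsp.encode("utf-8")                   # b'\xc2\xa0'
# Misinterpret those bytes as Latin-1 and re-encode to UTF-8.
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")
# utf8_twice is now b'\xc3\x82\xc2\xa0'

# 0x82 is a C1 control character in Latin-1; a tool that refuses to
# emit control characters would substitute '?' (0x3f), producing the
# exact byte train seen in the page source: C3 3F C2 A0.
mangled = utf8_twice.replace(b"\x82", b"?")        # b'\xc3?\xc2\xa0'
```

The final `mangled` bytes are no longer valid UTF-8, which is why lxml raises `UnicodeDecodeError` on them.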

When you follow the link to the detail page for that venue you get a different Mojibake in the same name, this time the Unicode characters U+00C3, U+201A and U+00C2 followed by a &nbsp; HTML entity, or Unicode character U+00A0 again. Encode those as Windows Codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 again.
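That equivalence is easy to check (again a Python 3 sketch; `cp1252` is the codec name for Windows Codepage 1252):

```python
# The detail page shows the Unicode characters U+00C3, U+201A, U+00C2,
# followed by a no-break space (the &nbsp; entity).
detail_mojibake = u"\u00c3\u201a\u00c2\u00a0"

# Encoding that text as Windows Codepage 1252 recovers the same four
# bytes, this time with 0x82 intact instead of substituted by '?'.
raw = detail_mojibake.encode("cp1252")             # b'\xc3\x82\xc2\xa0'

# Undoing the double encoding (UTF-8 decode, Latin-1 encode, UTF-8
# decode) gets back to the original no-break space.
recovered = raw.decode("utf-8").encode("latin-1").decode("utf-8")
```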

You can only get rid of it by targeting these bytes directly in the source of the page:

    pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')

This 'repairs' the data by substituting the train wreck with the originally intended UTF-8 bytes.

If you have a scrapy Response object, replace the body:

    body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
    response = response.replace(body=body)
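To see that the substitution really does make the body decodable again, here is a minimal sketch (Python 3 bytes syntax; the fragment of the venue name is illustrative):

```python
# A fragment of the page body containing the broken byte sequence.
bad_body = b"Comfort Inn And Suites\xc3?\xc2\xa0Blazing Stump"

# bad_body.decode("utf-8") would raise UnicodeDecodeError here: 0xc3
# starts a two-byte sequence, but '?' (0x3f) is not a valid
# continuation byte -- the same error as in the traceback above.

# Substitute the intended UTF-8 bytes for a no-break space.
fixed_body = bad_body.replace(b"\xc3?\xc2\xa0", b"\xc2\xa0")
name = fixed_body.decode("utf-8")
# name is now 'Comfort Inn And Suites\xa0Blazing Stump'
```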
Martijn Pieters
  • That being said, it is worse once you follow the link on the original site: `Suites Blazing` (`53 75 69 74 65 73 [c3 83 e2 80 9a c3 82 c2 a0] 42 6c 61 7a 69 6e 67`). Seems like multiple encodings, maybe? – Sylvain Leroux Aug 26 '14 at 09:44
  • @SylvainLeroux: That confirms it is a non-breaking space, actually. That's a [U+201A SINGLE LOW-9 QUOTATION MARK](http://codepoints.net/U+201a) in the middle. In Windows Codepage 1252 that is hex `82`. – Martijn Pieters Aug 26 '14 at 09:48
  • Maybe ',' (`c2 a0 0a 0a`) encoded *twice* from Windows CP-1252 to UTF-8? – Sylvain Leroux Aug 26 '14 at 09:59
  • @SylvainLeroux: no, just a single ``; try `u'\xc3\u201a\xc2\xa0'.encode('cp1252').decode('utf8').encode('latin1').decode('utf8')` (although `cp1252` would work for the second `latin1` too). – Martijn Pieters Aug 26 '14 at 10:01
  • @SylvainLeroux: where `u'\xc3\u201a\xc2\xa0'` is the result of decoding the UTF-8 HTML source to Unicode and replacing the ` ` with `\x0a`. – Martijn Pieters Aug 26 '14 at 10:02
  • @MartijnPieters I tried your solution but it is not working and giving me the same problem – Mubashir Kamran SW Engineer Aug 26 '14 at 10:36
  • @MughalWalana: what is the full traceback of your original error? (Please add that to your question post.) – Martijn Pieters Aug 26 '14 at 10:47
  • @MughalWalana: also, what does `print repr(value_you_need_to_repair)` show? That'll show me the exact contents as well as the type of the object. – Martijn Pieters Aug 26 '14 at 11:53
  • I'm sorry to hijack this comment, but in this case I just *have* to tip my hat to @MartijnPieters. That's some really nice detective work you have done there! – exhuma Aug 26 '14 at 12:06
  • @MartijnPieters Editing done, and it does not print anything for the suggested print statement. However, for other names that come through successfully, it just prints their Unicode representation, such as u'Wompoo Cottage' – Mubashir Kamran SW Engineer Aug 26 '14 at 12:35
  • @MughalWalana: ah! AH! Both BeautifulSoup and my browser 'repaired' the content; the error is in the UTF-8 encoding of the page. – Martijn Pieters Aug 26 '14 at 12:37
  • Of course it is a problem in the UTF-8. How can it be corrected programmatically? – Mubashir Kamran SW Engineer Aug 26 '14 at 12:40
  • @MughalWalana: `lxml` on the other hand choked on the invalid UTF-8 bytes in the source. How does scrapy give you the source, or does it give you the `lxml` parse tree here? – Martijn Pieters Aug 26 '14 at 12:40
  • it basically gives us an HtmlXPathSelector, through which we can extract elements of the webpage by their DOM ids – Mubashir Kamran SW Engineer Aug 26 '14 at 12:45
  • @MughalWalana: I'll try and figure out how to get at the raw bytes in the Scrapy response. Can you show us a little more context for the selection line? I'd like to see what object you have (I am assuming that `business` is a `HtmlResponse` object perhaps?). – Martijn Pieters Aug 26 '14 at 12:53
  • `hxs = HtmlXPathSelector(response)` then `businesses = hxs.select('//div[@class="media-content"]/div[@class="media"]/div[@class="media-content"]')`: we get a list of all matching nodes in `businesses` and then loop through it to get the data, as mentioned in my question above. – Mubashir Kamran SW Engineer Aug 26 '14 at 13:00

Don't use "replace" to fix Mojibake, fix the database and the code that caused the Mojibake.

But first you need to determine whether it is simple Mojibake or "double encoding". With SELECT col, HEX(col) ... you can determine whether a single character turned into 2-4 bytes (Mojibake) or 4-6 bytes (double encoding). Examples:

  • `é` (as utf8) should come back `C3A9`, but instead shows `C383C2A9`
  • The Emoji `` should come back `F09F91BD`, but comes back `C3B0C5B8E28098C2BD`
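Those hex strings can be reproduced outside MySQL; a Python 3 sketch of how double encoding produces them (assuming a Latin-1 misread for `é`, and Windows-1252 for the emoji, since bytes 0x91 and 0x9F only map to printable characters in the latter; `F09F91BD` is the UTF-8 encoding of U+1F47D):

```python
# Correct UTF-8 encoding of 'é': two bytes, hex c3a9.
good = u"\u00e9".encode("utf-8")

# Double encoding: the bytes are misread as Latin-1 text and then
# re-encoded as UTF-8, yielding four bytes, hex c383c2a9.
double = good.decode("latin-1").encode("utf-8")

# The emoji U+1F47D: four UTF-8 bytes normally (f09f91bd)...
emoji = u"\U0001f47d".encode("utf-8")
# ...but nine bytes after a Windows-1252 round trip (c3b0c5b8e28098c2bd).
emoji_double = emoji.decode("cp1252").encode("utf-8")
```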

Review "Mojibake" and "double encoding" here

Then the database fixes are discussed here:

  • CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:

First, let's assume you have this declaration for tbl.col:

    col VARCHAR(111) CHARACTER SET latin1 NOT NULL

Then convert the column without changing the bytes via this 2-step ALTER:

    ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
    ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

Note: if you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) Be sure to keep the other specifications the same: VARCHAR, NOT NULL, etc.

  • CHARACTER SET utf8mb4 with double-encoding:

    UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

  • CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.
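The effect of that double-encoding repair can be sketched in Python (an illustration of the byte round trip, not the SQL itself): re-encoding the doubly-encoded text as Latin-1 recovers the singly-encoded bytes, which then decode cleanly as UTF-8, which is what the `CONVERT(BINARY(CONVERT(...)))` expression does inside MySQL.

```python
# What a doubly-encoded 'é' looks like when read back as text:
# the stored bytes C3 83 C2 A9 decode (as UTF-8) to two characters.
stored = b"\xc3\x83\xc2\xa9".decode("utf-8")           # 'Ã©'

# Undo one layer: encode as Latin-1 to get the raw bytes back,
# then decode those bytes as UTF-8 to recover the real character.
repaired = stored.encode("latin-1").decode("utf-8")    # 'é'
```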

Rick James