I want to scrape the top gainers (%) data from the page at the link below, but the code raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 211: invalid start byte

import requests
from lxml import html
page_indo = requests.get('http://www.sinarmassekuritas.co.id/id/index.asp')
indo = html.fromstring(page_indo.content)
indo = indo.xpath('//tr/td/text()')

I did not find anything unusual on line 211 when I viewed the page source. Please advise how to avoid this error and get the data from the top gainers (%) table.

Updated

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<script type="text/javascript">
<!--
function MM_reloadPage(init) {  //reloads the window if Nav4 resized
  if (init==true) with (navigator) {if ((appName=="Netscape")&&(parseInt(appVersion)==4)) {
    document.MM_pgW=innerWidth; document.MM_pgH=innerHeight; onresize=MM_reloadPage; }}
  else if (innerWidth!=document.MM_pgW || innerHeight!=document.MM_pgH) location.reload();
}
MM_reloadPage(true);

I am not sure what position 211 refers to. tripleee said it is the 211th character from the beginning of the offending line.

  1. If counted from <!DOCTYPE html, the character is the i in (... reloads the window ...)

  2. If counted from <script type="text/javascript">, it is the _ in document.MM_

I am not sure how either of these would cause the error.

tripleee
bkcollection
  • 1
    Possible duplicate of [Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?](https://stackoverflow.com/questions/46180610/python-3-unicodedecodeerror-how-do-i-debug-unicodedecodeerror) – tripleee Feb 26 '18 at 13:14
  • @tripleee I don't see similarities as it is with xpath and I don't find something which is unicode that cause the error – bkcollection Feb 26 '18 at 13:18
  • 1
    At the very least the duplicate should show you what sort of information *we* need to see in order for this to be a question we can help you with. See also the [help] and ideally [edit] your question to turn it into a proper [mcve]. – tripleee Feb 26 '18 at 13:19
  • "Position 211" means the 211th character on the offending line. – tripleee Feb 26 '18 at 13:20
  • is that mean 211 from ` – bkcollection Feb 26 '18 at 13:25
  • That would be extremely helpful for you to find out and include in the question (though with the duplicate and the HTML data we don't have you probably just want to delete this question and take it from here). – tripleee Feb 26 '18 at 13:26

2 Answers

1

I downloaded a copy of this data and found the offending character at offset 103826. The error message from lxml isn't very helpful for debugging this.

The context around that place in the file is (wrapped for legibility):

b'tas Pancasakti Tegal dengan tema : \x93Pasar Modal sebagai'
b' indikator perekonomian negaradan peluang investasi pasar '
b'modal\x94.</td>'

I don't speak this language (Indonesian Malay?), so I have no idea what the offending character is supposed to represent, but https://tripleee.github.io/8bit#93 suggests a left curly quote (U+201C) in some legacy Windows 8-bit encoding. The \x94 at the end of this fragment, which would be the matching right curly quote (U+201D) in those encodings, seems to reinforce this guess.
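For anyone who wants to locate such bytes themselves, here is a minimal sketch of one way to do it with the standard library only. The helper name find_invalid_utf8 and the context parameter are made up for illustration, and the sample bytes are a shortened copy of the fragment above rather than the live page. On real data you would pass page_indo.content instead.

```python
def find_invalid_utf8(data: bytes, context: int = 20):
    """Yield (offset, bad_byte, surrounding_bytes) for each undecodable spot.

    Hypothetical helper: it repeatedly tries to decode the rest of the
    buffer as UTF-8 and uses the exception's start/end attributes to
    report the absolute offset and skip past the bad bytes.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the remainder is valid UTF-8
        except UnicodeDecodeError as err:
            offset = pos + err.start
            snippet = data[max(0, offset - context):offset + context]
            yield offset, data[offset], snippet
            pos += err.end  # resume after the undecodable bytes


if __name__ == "__main__":
    # Shortened sample taken from the fragment quoted in this answer.
    sample = b"tema : \x93Pasar Modal\x94.</td>"
    for offset, bad, snippet in find_invalid_utf8(sample):
        print(offset, hex(bad), snippet)
```

This rescans the tail of the buffer after every hit, so it is fine for a single page but not tuned for very large inputs.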

tripleee
  • Thanks. It is indonesia Malay. How it can be encoded? – bkcollection Feb 26 '18 at 13:43
  • The site I link to enumerates a number of possible interpretations. The speculation about a left curly quote is compatible with a number of Windows code pages in the range 1250 and up. If there are no other offending characters, any one of those encodings should work. Ideally, get in touch with the provider of this data and ask them. – tripleee Feb 26 '18 at 13:45
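As a rough sketch of what "any one of those encodings" could look like in practice: the code below assumes the page is really Windows-1252 (cp1252), which is only a guess based on the curly quotes and has not been confirmed by the site. The helper name decode_legacy is hypothetical. It tries UTF-8 first and falls back to the assumed code page.

```python
def decode_legacy(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 if possible, otherwise as an assumed legacy code page.

    The cp1252 default is an assumption; a few byte values (for example
    0x81) are undefined in Python's cp1252 codec and would still raise.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


if __name__ == "__main__":
    # Shortened sample based on the bytes quoted in the answer above.
    raw = b"tema : \x93Pasar Modal sebagai indikator\x94.</td>"
    print(decode_legacy(raw))
```

The resulting str can then be passed to html.fromstring in place of the raw bytes, so lxml no longer has to guess the encoding.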
0

For anyone else trying to solve this Unicode and XPath issue, the following works for me. Assuming page = requests.get(url), instead of building the lxml HTML tree this way:

tree = html.fromstring(page.content)

Use this:

tree = html.fromstring(page.content.decode("utf-8", "replace"))
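One thing to be aware of with this approach: the "replace" error handler does not recover the original characters, it substitutes U+FFFD (the Unicode replacement character) for each undecodable byte. A small standard-library illustration, using a shortened sample of the bytes from the other answer rather than the live page:

```python
# Shortened sample; on the real page this would be page.content.
raw = b"tema : \x93Pasar Modal\x94."

# Each byte that is not valid UTF-8 becomes U+FFFD in the result.
replaced = raw.decode("utf-8", "replace")
print(replaced)
print(replaced.count("\ufffd"), "bytes were replaced")
```

That is usually acceptable when you only need the numbers from a table, but the quotation marks in the text are lost.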
Saber