I'm trying to retrieve data from an online Ukrainian dictionary. Everything works fine with:

import urllib.request
import lxml.html

url = "http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')

But the same approach does not work with another link:

import urllib.request
import lxml.html
from urllib.parse import urlparse, parse_qs, urlencode

url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)
parameters = parse_qs(parsed_url.query)
url = parsed_url._replace(query=urlencode(parameters)).geturl()
page = urllib.request.urlopen(url)

pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text1 = xmldata.xpath('//div[@itemprop="articleBody"]')

This gives me an empty list. The XPath itself should be fine, since I double-checked it with XPath Helper in Chrome.

Any ideas?

Martijn Pieters
Khrystyna
  • There is no such `<div itemprop="articleBody">` in the HTML loaded from the URL. The *browser* can be served different HTML based on headers, or JavaScript could have altered the object tree. – Martijn Pieters Apr 04 '15 at 13:30
  • The URL redirects to http://sum.in.ua/s/[.avtor.] which contains no results. – Martijn Pieters Apr 04 '15 at 13:31
  • The XPath as seen by the browser can differ from what other XML and HTML parsers see; for more detail, read this question: http://stackoverflow.com/questions/29220031/convert-the-xpath-gotten-from-browser-to-usable-xpath-for-scrapy – Mazdak Apr 04 '15 at 13:33
  • @MartijnPieters Thanks a second time, but is there another way to deal with either the Ukrainian symbols in the URL or the XPath itself? – Khrystyna Apr 04 '15 at 13:52
  • @KhrystynaSkopyk: you misunderstand me; this is nothing to do with encodings. The returned page simply doesn't contain the element you are looking for; the XPath expression is otherwise fine. – Martijn Pieters Apr 04 '15 at 13:54
  • @MartijnPieters Compare http://sum.in.ua/s/avtor and sum.in.ua/s/[.avtor.]: the former leads to what is needed, the latter to nothing. I believe the second link was produced after you helped me with encoding/decoding it. – Khrystyna Apr 04 '15 at 13:58
  • @KhrystynaSkopyk: ah! there is actually a small bug in my previous answer. – Martijn Pieters Apr 04 '15 at 14:01
  • @KhrystynaSkopyk: corrected my [other answer](https://stackoverflow.com/a/29436298) to add `doseq=True`; the `parse_qs()` result always uses lists for the values (sequences) and you need to explicitly tell `urlencode()` to support that style. Mea Culpa! – Martijn Pieters Apr 04 '15 at 14:03
  • @KhrystynaSkopyk: That bug was the underlying issue for your question here. I'll dupe this to your previous question to indicate this; feel free to delete it if you feel it is not otherwise useful. – Martijn Pieters Apr 04 '15 at 14:04
  • @MartijnPieters Saw your correction) Thank you so much, but I feel bad to say that it still gives an empty list( – Khrystyna Apr 04 '15 at 14:15
  • @KhrystynaSkopyk: the code works for me, now, actually. – Martijn Pieters Apr 04 '15 at 14:16
  • @KhrystynaSkopyk: https://gist.github.com/mjpieters/0a72eaf8332da9c4166f – Martijn Pieters Apr 04 '15 at 14:20
  • @MartijnPieters Unbelievable) IT WORKS:) There are not enough thanks I want to give You) – Khrystyna Apr 04 '15 at 14:27
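
The `doseq=True` fix described in the comments can be illustrated with a minimal sketch. It uses only the standard library and makes no network request; the variable names `broken` and `fixed` are chosen here for illustration and do not come from the original posts. Because `parse_qs()` returns a list for each value, plain `urlencode()` encodes the string form of the whole list (brackets and quotes included), which would be consistent with the `[.avtor.]` redirect seen above.

```python
from urllib.parse import urlparse, parse_qs, urlencode

url = 'http://sum.in.ua/?swrd=автор'
parsed_url = urlparse(url)

# parse_qs() always maps each key to a *list* of values.
parameters = parse_qs(parsed_url.query)

# Without doseq, urlencode() encodes str(['автор']), so the
# brackets and quotes of the list end up in the query string.
broken = urlencode(parameters)

# With doseq=True, each list element is encoded as its own value.
fixed = urlencode(parameters, doseq=True)

fixed_url = parsed_url._replace(query=fixed).geturl()

print(parameters)
print(broken)
print(fixed)
print(fixed_url)
```

Comparing the two printed query strings shows the difference: the first contains the percent-encoded `[`, `'` and `]` characters around the word, the second contains only the percent-encoded word itself.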
