
I want to read the content of a few pages on one website. For a few of them my code works fine, but for the rest it does not. Strange characters appear: łą and more.

import urllib
from bs4 import BeautifulSoup

articles = ""
url = "http://www.someurl.com"
sock = urllib.urlopen(url)

content = sock.read()
sock.close()
soup = BeautifulSoup(content)

div = soup.find("div", class_="col-d")
ps = div.find_all("p")
for p in ps:
    print type(p.get_text())
    print type(p.get_text().encode('utf-8'))
    print p.get_text()

The output is:

<type 'unicode'>
<type 'str'>
różni się znacząco. Dziś, zgodnie z danymi Lion’s House i Home Brokera, przeciętnego  zapłacić niespełna 2,1 tys. zł miesięcznie. Gdyby taką samą nieruchomość kupić na kredyt, to w pierwszym roku część ods

Do you know any solutions to make this work?

user985541
  • What do you get when you call `print(sock.headers['content-type'])` ? – Jason Sperske Jan 09 '13 at 22:00
  • @Jason Sperske: text/html; charset=utf-8 – user985541 Jan 09 '13 at 22:02
  • Using the [Requests library](http://docs.python-requests.org/en/latest/) this seems to work: `print requests.get("http://pl.bab.la/slownik/polski-niemiecki/zgodnie-z").text` Maybe your URL is incorrectly reporting its character encoding – Jason Sperske Jan 09 '13 at 22:08
  • http://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues this seems to be an issue with conflicting information – Jason Sperske Jan 09 '13 at 22:13

1 Answer


Here is an approach that uses the Requests library (and a random Polish website).

import requests
from bs4 import BeautifulSoup

r = requests.get("http://pl.bab.la/slownik/polski-niemiecki/zgodnie-z")

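# Pass the raw bytes (r.content): from_encoding is ignored for text that is already decoded.
soup = BeautifulSoup(r.content, "html.parser", from_encoding="utf-8")
print(soup.find(id="showMoreCSDiv").text)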

This code looks for this HTML:

<div id="showMoreCSDiv"><a class="btn" id="showMoreCS" href="javascript:babGetMoreCS(20,'zgodnie z');">więcej</a></div>

It returns this:

więcej
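
The characters in the question (łą, różni) look like what you get when UTF-8 bytes are decoded with a single-byte Western codec such as cp1252. That is an inference from the output, not something confirmed for the original site. Below is a minimal sketch of the symptom and of decoding the raw bytes with the charset the server declares. It uses only the standard library; the helper names `charset_from_content_type`, `garble` and `decode_page` are made up for illustration, and it is written for Python 3.

```python
# -*- coding: utf-8 -*-
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def garble(text):
    """Reproduce the symptom: UTF-8 bytes wrongly decoded as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def decode_page(raw_bytes, content_type):
    """Decode raw response bytes using the charset declared by the server."""
    return raw_bytes.decode(charset_from_content_type(content_type))


original = u"różni się znacząco"
raw = original.encode("utf-8")  # stands in for the bytes a response body would contain

# The wrong codec yields the kind of characters shown in the question.
print(garble(u"łą") == u"Å‚Ä…")
# Decoding with the declared charset recovers the text.
print(decode_page(raw, "text/html; charset=utf-8") == original)
```

If the page declares one charset in the header and a different one in a meta tag, decoding with the header value may still be wrong, so treat the declared charset as a first guess rather than a guarantee.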
Jason Sperske