I want to print some strings scraped by BeautifulSoup, but they are not printed correctly. For "ü" I get "Ã¼", and so on.

Here is my code:

from bs4 import BeautifulSoup
import requests
import re

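# No encoding= is passed here, so the platform default is used (the cause of the problem).
with open('index.html') as html_file:
    # The parser name is an assumption; this line was missing from the post.
    soup = BeautifulSoup(html_file, 'html.parser')
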
for link in soup.find_all('a'):
    print(link.get('href'))

EDIT

Found the solution myself. You have to open the file with the right encoding as follows:

open('index.html', encoding='utf8')
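
To see why the explicit encoding matters, here is a minimal sketch using only the standard library. The sample markup and the temporary file are made up for illustration, and `cp1252` is just an assumption about what the platform default might have been (it is a common default on Windows). The same UTF-8 file is read twice, once with the wrong codec and once with the matching one:

```python
import os
import tempfile

# Hypothetical sample page; it stands in for the real index.html.
html = '<a href="/prüfen">Öl prüfen</a>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")

    # Save the page as UTF-8, the way most web pages are stored.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Wrong codec: each two-byte UTF-8 sequence is read as two cp1252 characters.
    with open(path, encoding="cp1252") as f:
        garbled = f.read()

    # Matching codec: the text comes back unchanged.
    with open(path, encoding="utf-8") as f:
        correct = f.read()

print(garbled)
print(correct)
```

Passing `encoding=` explicitly means the script no longer depends on whatever default the operating system happens to use.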
neolith
  • That is a syntax error, which really has nothing to do with what you are asking. I suspect you are using Python 3, which you should, in which case you need to use `print` as a function, not a statement. The good news is that Python 3 makes all of this much simpler. – juanpa.arrivillaga Oct 02 '19 at 12:57
  • How does that address the question of the error about the coding? I also tried to put it in parentheses. That does not solve it. – neolith Oct 02 '19 at 13:02
  • That answers the question about the **error you are showing**. If you are getting **another error** then edit the question and provide a [mcve]. – juanpa.arrivillaga Oct 02 '19 at 13:05
  • I edited the question to show you my actual code – neolith Oct 02 '19 at 13:06
  • That looks like an error related to your *source-file encoding*. You've declared it as `iso-8859-1` using `# -*- coding:`, but it likely is not. Have you tried using utf-8? Also, **please provide a [mcve]**: not your whole code, but a snippet that actually reproduces your problem, which is likely something that can be one or two lines. – juanpa.arrivillaga Oct 02 '19 at 13:09
  • utf-8 throws no error. Now only the last error remains – neolith Oct 02 '19 at 13:09
  • Just do `print(message)`. Whatever encoding this was, it is already a string, so no decoding is necessary. – L3viathan Oct 02 '19 at 13:09
  • ... why are you doing `print (message.decode("iso-8859-1").encode(stdout_encoding))` ??? It doesn't make sense to `.decode` a string. – juanpa.arrivillaga Oct 02 '19 at 13:10
  • @juanpa.arrivillaga because they copied some code from somewhere old that was assuming Python 2. – L3viathan Oct 02 '19 at 13:10
  • I am doing `print (message.decode("iso-8859-1").encode(stdout_encoding))` because that is the approach in the tutorial. If I just use `print(message)`, the ä, ö and ü are not printed correctly. I have hundreds of lines and would have to edit them by hand. – neolith Oct 02 '19 at 13:12
  • @neolith don't just blindly follow a tutorial, especially since it is clearly for Python 2, not Python 3; the two handle strings in fundamentally incompatible ways. What happens when you `print(message)` **exactly**? Are you sure the problem isn't simply that the terminal you are using doesn't support the encoding you are trying to use? Perhaps get a better terminal, or look for instructions on how to change your terminal settings. – juanpa.arrivillaga Oct 02 '19 at 13:14
  • Your tutorial is 13 years old, things have changed since then. – L3viathan Oct 02 '19 at 13:15
  • Yes, you are right. I haven't looked at the date that closely. How would you approach it for Spanish and ñ? – neolith Oct 02 '19 at 13:19
  • @neolith the *language* has no bearing. Again, what exactly is the behavior you see when you simply `print(message)`? – juanpa.arrivillaga Oct 02 '19 at 13:21
  • Words like "Öl" are printed as "Ã–l" and "prüfen" becomes "prÃ¼fen" – neolith Oct 02 '19 at 13:23
  • What happens if you remove the "encoding cookie"? If you run a script containing as the only line (!) `print("Öl")`? – L3viathan Oct 02 '19 at 13:23
  • Found the solution. I had to open it with the right encoding as follows: open('index.html', encoding='utf8') – neolith Oct 02 '19 at 13:24
  • @neolith *what?* So you were opening a file? – juanpa.arrivillaga Oct 02 '19 at 13:28
  • Yes, that is the problem with the minimal examples. I am a noob and don't know which lines might be relevant for such an example. I changed the question accordingly and provided the solution. Thank you a lot for your help tronco! – neolith Oct 02 '19 at 13:31
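
For anyone landing here with the same symptom, the garbling discussed in the comments can be reproduced without any file at all. The sketch below is only an illustration: `mojibake` and `repair` are hypothetical helper names, and `cp1252` is an assumed wrong codec. It also prints the encoding that `open()` typically falls back to when `encoding=` is omitted:

```python
import locale

# Encoding that open() typically uses when no encoding= is passed.
print(locale.getpreferredencoding(False))


def mojibake(text, wrong="cp1252"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong)


def repair(text, wrong="cp1252"):
    """Reverse the mistake: recover the original bytes, then decode them as UTF-8."""
    return text.encode(wrong).decode("utf-8")


for word in ["Öl", "prüfen", "ñ"]:
    broken = mojibake(word)
    print(word, "->", broken, "->", repair(broken))
```

The language of the text plays no role, as noted above: ä, ö, ü and ñ all become two-byte sequences in UTF-8, so they all break the same way when the file is read with a single-byte codec.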

0 Answers