0

I wrote a script to scrape the titles from a YouTube playlist page.

According to the print statements, everything works until I try to write the titles to a text file, at which point I get "UnicodeEncodeError: 'charmap' codec can't encode characters in position...".

I've tried adding "encoding='utf8'" when I open the file. That fixes the error, but all the Chinese characters are replaced by gibberish.

I also tried encoding the output string with 'replace' and then decoding it, but that just replaces all the special characters with question marks.

Here is my code:

from bs4 import BeautifulSoup as BS
import urllib.request
import re

playlist_url = input("gib nem: ")

with urllib.request.urlopen(playlist_url) as response:
  playlist = response.read().decode('utf-8')
  soup = BS(playlist, "lxml")

title_attrs = soup.find_all(attrs={"data-title":re.compile(r".*")})
titles = [tag["data-title"] for tag in title_attrs]

titles_str = '\n'.join(titles)#.encode('cp1252','replace').decode('cp1252')

print(titles_str)
with open("playListNames.txt", "a") as f:
    f.write(titles_str)

And here is the sample playlist I've been using to test: https://www.youtube.com/playlist?list=PL3oW2tjiIxvSk0WKXaEiDY78KKbKghOOo

  • Are you sure that the gibberish is not due to your editor/whatever you use to display the results? – Arne Feb 08 '18 at 09:17
  • I just copied this code into a file and ran it without errors using the URL you gave. I'd show you some output but I'm having copy-and-paste issues, sorry. Are you running this code under Python 2, by any chance? – holdenweb Feb 08 '18 at 09:22
  • @ArneRecknagel I don't think so; I'm using Sublime Text 2 – Honest Escape Feb 08 '18 at 09:31
  • @holdenweb That's strange; I'm running the code under Python 3 with the Spyder IDE – Honest Escape Feb 08 '18 at 09:32
  • You are appending to `playListNames.txt`. Are you sure the file's encoding is UTF-8? Did you try creating a new file with `open("playListNames_new.txt", "w")` (and perhaps setting the encoding)? – Dušan Maďar Feb 08 '18 at 10:27
  • @dm295 Oh, that worked!! Opening the original file with Windows Notepad, it seems that the original encoding was ANSI. Would I be correct to assume that Windows writes files with ANSI encoding by default, and that Python's "encoding=" cannot change an existing file's encoding? – Honest Escape Feb 08 '18 at 19:18
  • @HonestEscape: "*Would I be correct to assume that Windows writes files with ANSI encoding by default*" - no, there is no default encoding at the OS layer. NOTEPAD defaults to ANSI unless you specify otherwise. "*and that Python's "encoding=" cannot change an existing file's encoding?*" - correct, if you are not overwriting the file's existing content. – Remy Lebeau Feb 08 '18 at 23:18

2 Answers

1

Specifying an encoding will fix your problem. Windows defaults to an ANSI encoding, which on US Windows is Windows-1252, and it doesn't support Chinese. You should use utf8 or utf-8-sig as the encoding. Some Windows editors prefer the latter and assume ANSI otherwise.

with open('playListNames.txt','w',encoding='utf-8-sig') as f:
    f.write(titles_str)  # titles_str is the joined string from the question
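
To see why the encoding matters, here is a small sketch that does not depend on the playlist. The sample title and the variable names are made up for illustration; only the codec names come from the discussion above.

```python
# Hypothetical sample title standing in for a scraped playlist entry.
title = "周杰倫 - 晴天"

# cp1252 (the "ANSI" code page on US Windows) has no Chinese characters,
# so encoding with it raises the error from the question.
try:
    title.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode the title, but a viewer that decodes those bytes as
# cp1252 shows gibberish instead of the original characters.
utf8_bytes = title.encode("utf-8")
misread = utf8_bytes.decode("cp1252", errors="replace")

# utf-8-sig prepends a byte order mark, which some Windows editors use
# as a hint that the file is UTF-8 rather than ANSI.
bom_bytes = title.encode("utf-8-sig")

print(cp1252_ok)
print(ascii(misread))   # ascii() keeps the output printable on any console
print(bom_bytes[:3])
```

The file on disk is the same either way; what changes is whether the program reading it guesses the right encoding.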
Mark Tolonen
  • To be more accurate, *Windows* doesn't care about file encodings at all, it is *Text Editors*, like Notepad, that do. – Remy Lebeau Feb 08 '18 at 23:21
0

The documentation is clear about file encoding:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

To answer the questions from your last comment:

  1. You can find out the preferred encoding on Windows with

    import locale
    # Prints the encoding that open() uses when no encoding= is given
    print(locale.getpreferredencoding())
    

If playListNames.txt was created with open('playListNames.txt', 'w') then the value returned by locale.getpreferredencoding() was used for encoding.

If the file was created manually then the encoding depends on the editor's default/preferred encoding.

  2. To convert an existing file, refer to "How to convert a file to utf-8 in Python?" or "How do I convert an ANSI encoded file to UTF-8 with Notepad++? [closed]".
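
As a rough sketch of the Python route, assuming the existing file really is ANSI (cp1252 on US Windows): read it with its current encoding, then rewrite it as UTF-8. The helper name `convert_to_utf8` and the demo file name are hypothetical, and the demo runs on a throwaway file so nothing real is overwritten.

```python
import os
import tempfile

def convert_to_utf8(path, source_encoding="cp1252"):
    """Re-encode a text file in place (hypothetical helper).

    Assumes the file fits in memory and that source_encoding is the
    encoding the file was actually written with.
    """
    # newline="" leaves the existing line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

# Create a small cp1252 file to stand in for the old playListNames.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "playListNames_demo.txt")
with open(demo_path, "w", encoding="cp1252") as f:
    f.write("café\n")

convert_to_utf8(demo_path)

# Once the file is UTF-8, appending with encoding="utf-8" keeps it consistent.
with open(demo_path, "a", encoding="utf-8") as f:
    f.write("晴天\n")

with open(demo_path, encoding="utf-8") as f:
    result = f.read()
print(ascii(result))  # ascii() keeps the output printable on any console
```

Appending UTF-8 text to the file without converting it first would leave two encodings in one file, which no editor can display correctly.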
Dušan Maďar