
I'm trying to get the text from a webpage and write it to a file, but it raises this error:

    Traceback (most recent call last):
      File "C:\Users\username\Desktop\Python\parsing.py", line 21, in <module>
        textFile.write(str(results))
    UnicodeEncodeError: 'cp949' codec can't encode character '\xa9' in position 37971: illegal multibyte sequence

I've searched and tried `textFile.write(str(results).decode('utf-8'))`, but that fails with an AttributeError.

import requests
import os
from bs4 import BeautifulSoup

outputFolderName = "output"

currentPath = os.path.dirname(os.path.realpath(__file__))
outputDir = currentPath + "/" +outputFolderName

r = requests.get('https://yahoo.com/')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.findAll(text=True)

try :
    os.mkdir(outputDir)
    print("output directory generated")
except :
    print("using existing directory")

textFile = open(outputDir + '/output.txt', 'w')
textFile.write(str(results))
textFile.close()

Is there any way to convert the encoding of `str(results)` and save it properly?

The Python version is 3.7.3.

  • Possible duplicate of [How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?](https://stackoverflow.com/questions/20205455/how-to-correctly-parse-utf-8-encoded-html-to-unicode-strings-with-beautifulsoup) – walnut Sep 10 '19 at 10:41
  • Where does the cp949 codec come from? Can you post the full stacktrace? – Tom Dalton Sep 10 '19 at 10:47
  • @TomDalton Traceback (most recent call last): File "C:\Users\username\Desktop\Python\parsing.py", line 21, in textFile.write(str(results)) UnicodeEncodeError: 'cp949' codec can't encode character '\xa9' in position 37971: illegal multibyte sequence – FlippingFlop Sep 10 '19 at 11:34
  • Please include the traceback in the question body (use the "edit" link below the tags). Also: which Python version are you using? The meaning of `str()` has changed significantly from Python 2 to Python 3. – lenz Sep 10 '19 at 11:38
  • @lenz just included. and the version is 3.7.3. – FlippingFlop Sep 10 '19 at 11:41
  • I get no error with this code. Can you please provide the full code? – Pitto Sep 10 '19 at 11:48
  • @Pitto It is the whole code. Maybe it is because I had put the URL 'example.com'. Can you try this code again? I've just modified it. – FlippingFlop Sep 10 '19 at 11:58
  • Possible duplicate of [UnicodeEncodeError: 'cp949' codec can't encode character](https://stackoverflow.com/questions/43821262/unicodeencodeerror-cp949-codec-cant-encode-character) – Tom Dalton Sep 10 '19 at 12:18
  • I think this is related to your system default encoding being cp949 instead of e.g. utf-8. As the answer below suggests, explicitly setting the file's encoding to utf8 will probably solve the issue. See https://stackoverflow.com/a/43821283/2372812 for more info. – Tom Dalton Sep 10 '19 at 12:18
  • @TomDalton omg!! what a simple way to solve it !! thanks. and thanks to all others commented :) – FlippingFlop Sep 10 '19 at 12:28
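
The point made in the comments about the system default encoding can be checked directly. The following is a minimal sketch, not part of the original thread: it prints the encoding that `open()` falls back to when no `encoding=` argument is given, then tries to encode the character from the traceback with both codecs. The helper name `try_encode` is hypothetical and only there for illustration.

```python
import locale

# Encoding that open() falls back to in text mode when encoding= is omitted.
# On the asker's Windows machine this was cp949; other systems will differ.
default_encoding = locale.getpreferredencoding(False)
print("default text-file encoding:", default_encoding)

copyright_sign = "\xa9"  # the character named in the traceback


def try_encode(text, codec):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


cp949_result = try_encode(copyright_sign, "cp949")
utf8_result = try_encode(copyright_sign, "utf-8")
print("cp949:", cp949_result)
print("utf-8:", utf8_result)
```

If the first line printed is `cp949` (or another legacy code page), any `open(..., 'w')` call without an explicit encoding is at risk of the same error whenever the scraped page contains a character outside that code page.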

1 Answer

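Without an `encoding` argument, `open()` uses the system default encoding (cp949 on your machine), which cannot represent the character '\xa9'. Specify the encoding explicitly when opening the file, as in this example: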

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
import os
from bs4 import BeautifulSoup

outputFolderName = "output"

currentPath = os.path.dirname(os.path.realpath(__file__))
outputDir = currentPath + "/" +outputFolderName

r = requests.get('https://yahoo.com')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.findAll(text=True)

try:
    os.mkdir(outputDir)
    print("output directory generated")
except FileExistsError:
    print("using existing directory")

# Pass the encoding explicitly so the system default (cp949 here) is not used.
with open(outputDir + '/output.txt', mode='w', encoding='utf8') as textFile:
    textFile.write(str(results))
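
As a variation on the same fix, here is a small sketch that isolates the file-writing step so it can be tried without a network request. It is an assumption-laden illustration rather than the code from the question: `save_text` and `sample_pieces` are hypothetical names, and it writes one text piece per line instead of the `str(results)` list representation used above.

```python
import os
import tempfile


def save_text(pieces, output_dir, filename="output.txt"):
    """Hypothetical helper: write the text pieces to a UTF-8 file, one per line."""
    # exist_ok=True replaces the try/except around os.mkdir.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    # The explicit encoding is the actual fix; without it the system default is used.
    with open(path, mode="w", encoding="utf-8") as text_file:
        text_file.write("\n".join(pieces))
    return path


# Stand-in for the strings scraped from the page, including the character
# from the traceback.
sample_pieces = ["Yahoo", "\xa9 2019"]

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_text(sample_pieces, os.path.join(tmp_dir, "output"))
    # Read back with the same encoding to confirm the round trip.
    with open(saved_path, encoding="utf-8") as text_file:
        round_trip = text_file.read()

print(repr(round_trip))
```

Reading the file back requires the same `encoding="utf-8"` argument; a viewer or script that assumes cp949 will show the copyright sign as garbled text even though the file itself is correct.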
Pitto
  • Hi @FlippingFlop! If my answer was useful please don't forget to upvote and / or choose it as answer. Thanks! – Pitto Sep 15 '19 at 18:53