
I am new to Python programming. I am using the following code in my Python file:

import gethtml
import articletext
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
result = articletext.getArticle(url)
text_file = open("Output.txt", "w")

text_file.write(result)

text_file.close()

The file articletext.py contains the following code:

from bs4 import BeautifulSoup
import gethtml
def getArticleText(webtext):
    articletext = ""
    soup = BeautifulSoup(webtext)
    for tag in soup.findAll('p'):
        articletext += tag.contents[0]
    return articletext

def getArticle(url):
    htmltext = gethtml.getHtmlText(url)
    return getArticleText(htmltext)

But I am getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 473: ordinal not in range(128)
What code should I write to save the result to the output file?

The output `result` is text in the form of a paragraph.
  • possible duplicate of ['ascii' codec can't encode character at position \* ord not in range(128)](http://stackoverflow.com/questions/15364266/ascii-codec-cant-encode-character-at-position-ord-not-in-range128) – Martijn Pieters Nov 11 '13 at 17:58

2 Answers

To take care of the Unicode error, we need to encode the text explicitly (as UTF-8, to be precise) instead of letting Python fall back to ASCII. To ensure it doesn't throw an error if a character can't be encoded, we're going to ignore any characters that we don't have a mapping for. (You can also use "replace" or the other options accepted by str.encode. See the Python docs on Unicode.)
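
As a rough illustration of what the error handlers do, here is a minimal sketch. The sample string and variable names are made up for the example. Note that the handler mostly matters for limited codecs such as ASCII, since UTF-8 can represent the curly quotes directly.

```python
# Hypothetical sample text containing the curly quotes from the error message.
text = u"\u201cquoted\u201d caf\u00e9"

# UTF-8 can represent every character here, so nothing is dropped.
strict_utf8 = text.encode("utf-8")

# ASCII cannot; "ignore" silently drops the unencodable characters...
ascii_ignore = text.encode("ascii", "ignore")

# ...while "replace" substitutes a question mark for each of them.
ascii_replace = text.encode("ascii", "replace")

print(repr(strict_utf8))
print(repr(ascii_ignore))
print(repr(ascii_replace))
```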

Best practice when opening the file is to use a context manager (the with statement), which will close the file even if there's an error. I'm using slashes instead of backslashes in the path to make sure this works on either Windows or Unix/Linux.

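# Python 2: encode the unicode text to a UTF-8 byte string before writing.
text = text.encode('UTF-8', 'ignore')
with open('/temp/Out.txt', 'w') as f:  # 'f' avoids shadowing the built-in 'file'
    f.write(text)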

This is equivalent to

text = text.encode('UTF-8', 'ignore')
# Open before the try block: if open() itself fails, there is nothing to close.
f = open('/temp/Out.txt', 'w')
try:
    f.write(text)
finally:
    f.close()

But the context manager is much less verbose, and it makes it much less likely that an error leaves the file open and locked.
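
An alternative that avoids the manual encode step is io.open, which is available in both Python 2.6+ and Python 3 and takes an encoding argument. The following is only a sketch: write_text, out_dir, and the sample string are hypothetical names chosen for the example, and it writes to a temporary directory rather than /temp.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write unicode text to path, letting io.open handle the encoding."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Hypothetical output location inside a fresh temporary directory.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "Out.txt")

sample = u"\u201cHello\u201d"
write_text(out_path, sample)

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()
```

With this approach the text stays a unicode string all the way to the write call, so there is no separate encode step to forget.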

text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8')) 
text_filefixed.close()

This should work, give it a try.

Why? Because encoding the text to UTF-8 bytes yourself and writing those bytes means Python never has to fall back on the ASCII codec, so this kind of encoding error doesn't occur :D
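
To make the difference concrete, here is a small sketch that first reproduces the failure and then writes the same text as UTF-8 bytes. The sample string and the temporary path are made up for the example, and the ASCII encoding is forced explicitly so the error shows up regardless of platform defaults.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the scraped article text.
result = u"He said \u201chello\u201d"
path = os.path.join(tempfile.mkdtemp(), "Output.txt")

# Writing through an ASCII codec fails on the curly quotes.
try:
    with io.open(path, "w", encoding="ascii") as handle:
        handle.write(result)
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding to UTF-8 first and writing the bytes in binary mode succeeds.
with io.open(path, "wb") as handle:
    handle.write(result.encode("utf-8"))

with io.open(path, "rb") as handle:
    raw = handle.read()
```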

Edit: Make sure the file exists in the same folder; otherwise, put this code after the imports and it will create the file itself. (Opening with "wb" also creates the file if it is missing, so this step is optional.)

text_filefixed = open("Output.txt", "a")
text_filefixed.close()

It creates the file, saves nothing, and closes it, so the file is created automatically without human interaction.

Edit 2: Note that I have only run this in Python 3.3.2, but I know you can use this module to achieve the same thing in 2.7. One minor difference is that (I think) request is not needed in 2.7, but you should check that.

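from urllib import request
# Decode the downloaded bytes to text; str() on bytes would give the "b'...'" repr instead.
# Assumes the page is served as UTF-8.
result = request.urlopen("http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece").read().decode('UTF-8')
text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8'))
text_filefixed.close()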

Just as I thought, you will only hit this error in 2.7; see "urllib.request in Python 2.7".

  • I am getting the following error on trying it: `Traceback (most recent call last): File "C:/Python27/crawler/main.py", line 7, in <module> text_filefixed.write(bytes(result, 'UTF-8')) TypeError: str() takes at most 1 argument (2 given)` – casanova Nov 11 '13 at 18:06
  • Oh, you are using Python 2.7. My code was working in 3.3.2. It might need to be adapted and... I have no idea how, honestly. If you print it, is what you get a working string? Maybe try writing str(result). – Saelyth Nov 11 '13 at 18:07