I am having a problem regarding encoding that seems very similar to other problems here, but not quite the same, and I cannot figure this whole thing out.

I thought I had grasped the concept of encoding, but I have these special characters (æ, ø, å, ö, etc.) that look fine when printed but cannot be written to a file correctly (e.g. æ becomes Ê when I write to the file).

my code is as follows:

import codecs
import re

def sortWords(subject, articles, stopWordsFile):
    stopWords = []
    f = open(stopWordsFile)
    for lines in f:
        stopWords.append(lines.split(None, 1)[0].lower())

    for x in range(0,len(articles)):
        f = open(articles[x], 'r')
        article = f.read().lower()
        article = re.sub("[^a-zA-Z\æøåÆØÅöÖüÜ\ ]+", " ", article)
        article = [word for word in article.split() if word not in stopWords]
        print ' '.join(article)
        w = codecs.open(subject+str(x)+'.txt', 'w+')
        w.write(' '.join(article))



sortWords("hpv", ["vaccine_texts/hpv1.txt"], "stopwords.txt")

I have tried various encodings and opening the files with codecs.open(file, 'r', 'utf-8'), but to no avail. What am I missing here?

I'm on Ubuntu (I switched from Windows because its terminal wouldn't output the characters correctly).

LaughingMan
  • http://stackoverflow.com/questions/6048085/python-write-unicode-text-to-a-text-file – Joe Doherty Mar 04 '16 at 14:35
  • @JoeDoherty I've seen this one, I cannot use `.encode('utf8')` when I write as it gives me an error. No matter what I open the file with it shows strange symbols (sublime, gedit, vim, notepad). Why does this happen? – LaughingMan Mar 04 '16 at 14:48
  • There seemed to be a problem with the encoding of one of the files I opened. I tried with two separate files and one of them worked perfectly. I tried copying and pasting that file's content into a new text file and then it worked. Weird. – LaughingMan Mar 04 '16 at 15:15

2 Answers

When the text file shows something like Ã¦ in place of æ (more generally, two characters, the first of which is Ã), the file was most likely written correctly in UTF-8 and the editor (or the screen) is not interpreting it as UTF-8.

Let's look at æ. It is the Unicode character U+00E6. Encoding it in UTF-8 gives the two bytes b'\xc3\xa6', and when those bytes are decoded as Latin-1 they print as 'Ã¦'.
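
The round trip described above can be reproduced in a few lines. This is only a minimal sketch, written so that it runs on Python 3 (the question's code is Python 2); the variable names are made up for illustration.

```python
# Reproduce the mojibake: encode as UTF-8, then (wrongly) decode as Latin-1.
original = u"\u00e6"                      # the character ae-ligature, U+00E6

utf8_bytes = original.encode("utf-8")     # two bytes: 0xC3 0xA6
misread = utf8_bytes.decode("latin-1")    # each byte becomes one character

print(repr(utf8_bytes))                   # the raw bytes
print([hex(ord(c)) for c in misread])     # code points of the misread text

# Decoding with the matching codec recovers the original character.
restored = utf8_bytes.decode("utf-8")
```

The misread string has two characters, U+00C3 (Ã) and U+00A6 (¦), which is exactly the pattern an editor shows when it treats a UTF-8 file as Latin-1.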

What can you do to confirm? Use the excellent vim editor, which knows about multiple encodings, UTF-8 among them, at least when you use its graphical interface, gvim.

One more piece of general advice: never write non-ASCII characters in a Python source file unless you put a # -*- coding: ... -*- line as the first line (or as the second, if the first is a shebang line such as #! /usr/bin/env python).

And if you want to use Unicode under Windows with Python, use IDLE, which handles it natively.

TL;DR: If you are using Linux, your system is most likely configured to use UTF-8 natively, so you are writing your text files correctly in UTF-8 and your text editor is simply failing to display UTF-8 correctly.

Serge Ballesta
  • Thank you, I think this was the clear answer I needed. I did try vim at some point (as it is objectively the best editor), but alas! Still weird letters. Must have gotten screwed up even more from all my encoding/decoding experiments. It's all working now though, the problem was the source file I read from. Thanks a bunch everyone! – LaughingMan Mar 04 '16 at 16:41

Have you tried:

w.write( ' '.join(article).encode('utf8') )

Also, don't forget to close your files (better still, open them in a with block so the context manager closes them for you).
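
A minimal sketch of that advice, assuming the words are already Unicode text: it uses io.open, which accepts an explicit encoding on both Python 2 and Python 3. The helper names write_words and read_text and the demo path are hypothetical, not part of the question's code.

```python
import io
import os
import tempfile

def write_words(path, words):
    """Write the words to path as UTF-8, separated by spaces."""
    text = u" ".join(words)
    # The with block closes the file even if write() raises.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)

def read_text(path):
    """Read the whole file back, decoding it as UTF-8."""
    with io.open(path, "r", encoding="utf-8") as src:
        return src.read()

# Demo in a temporary directory so nothing in the working tree is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "hpv0.txt")
write_words(demo_path, [u"\u00e6ble", u"\u00f8l", u"\u00e5"])
round_trip = read_text(demo_path)
```

Letting the file object do the encoding also avoids a Python 2 trap: calling .encode('utf8') on a byte str first decodes it implicitly as ASCII, which raises UnicodeDecodeError as soon as the data contains a non-ASCII byte.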

YOBA
  • I've tried this, it gives me the `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 66: ordinal not in range(128) ` error – LaughingMan Mar 04 '16 at 14:50
  • What type is `type(' '.join(article))`, str or unicode? – YOBA Mar 04 '16 at 14:52
  • `type(' '.join(article))` returns _string_ – LaughingMan Mar 04 '16 at 14:59
  • For me it works just fine: `str_u = "æøåÆØÅöÖüÜ"`, then `with open("myFile.txt", "w") as myF: myF.write(str_u)`. Why do you need to use codecs? – YOBA Mar 04 '16 at 14:59
  • There seemed to be a problem with the encoding of one of the files I opened. I tried with two separate files and one of them worked flawlessly. I tried copying and pasting that file's content into a new text file and then it worked. Weird. Before that I even tried to open the bad file with different codecs, but it wouldn't let me. – LaughingMan Mar 04 '16 at 15:16