I am having an encoding problem that seems very similar to other questions here, but not quite the same, and I cannot figure it out.
I thought I had grasped the concept of encoding, but I have special characters (æ, ø, å, ö, etc.) that look fine when printed but are not written to a file correctly (e.g. æ becomes Ê when I write to the file).
My code is as follows:
import codecs
import re

def sortWords(subject, articles, stopWordsFile):
    # Collect the first word of every line in the stop-word file
    stopWords = []
    f = open(stopWordsFile)
    for lines in f:
        stopWords.append(lines.split(None, 1)[0].lower())
    for x in range(0,len(articles)):
        f = open(articles[x], 'r')
        article = f.read().lower()
        # Replace everything except letters and spaces with a single space
        article = re.sub("[^a-zA-Z\æøåÆØÅöÖüÜ\ ]+", " ", article)
        article = [word for word in article.split() if word not in stopWords]
        print ' '.join(article)
        # Write the cleaned article to <subject><index>.txt
        w = codecs.open(subject+str(x)+'.txt', 'w+')
        w.write(' '.join(article))

sortWords("hpv", ["vaccine_texts/hpv1.txt"], "stopwords.txt")
I have tried various encodings, including opening the files with codecs.open(file, 'r', 'utf-8'), but to no avail. What am I missing here?
I'm on Ubuntu (I switched from Windows because its terminal wouldn't output these characters correctly).
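
For reference, here is a minimal sketch of what a "decode on read, encode on write" version of the cleaning step might look like. It assumes that every file involved is UTF-8 encoded, which may not be true for the real data. The names NOT_ALLOWED, clean_article and clean_file are hypothetical helpers for illustration and are not part of the code above.

```python
# -*- coding: utf-8 -*-
import io
import re

# Assumption: the input files are UTF-8; adjust the encoding if they are not.
ENCODING = "utf-8"

# Unicode pattern matching runs of anything that is not a listed letter or a space.
NOT_ALLOWED = re.compile(u"[^a-zA-ZæøåÆØÅöÖüÜ ]+")


def clean_article(text, stop_words):
    """Lowercase unicode text, blank out disallowed characters, drop stop words."""
    text = NOT_ALLOWED.sub(u" ", text.lower())
    return u" ".join(word for word in text.split() if word not in stop_words)


def clean_file(in_path, out_path, stop_words):
    """Read in_path as text, clean it, and write the result to out_path."""
    # io.open decodes bytes to unicode text while reading ...
    with io.open(in_path, "r", encoding=ENCODING) as f:
        text = f.read()
    # ... and encodes unicode text back to bytes while writing.
    with io.open(out_path, "w", encoding=ENCODING) as w:
        w.write(clean_article(text, stop_words))
```

The idea in this sketch is that the text is unicode everywhere between the read and the write, so the regular expression and lower() operate on whole characters rather than on individual bytes. The stop words passed in would need to be unicode strings as well.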