I'm trying to build a corpus from the .txt file found at this link.
I believe the instances of \xad
are supposedly 'soft-hyphens', but do not appear to be read correctly under UTF-8 encoding. I've tried encoding the .txt file as iso8859-15
, using the code:
with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r',
encoding='iso8859-15') as myfile:
data=myfile.read().replace('\n', '')
data2 = data.split(' ')
This returns an array of 'words', but '\xad' remains attached to many entries in data2. I've tried
data_clean = data.replace('\\xad', '')
and
data_clean = data.replace('\\xad|\\xad\\xad','')
but this doesn't seem to remove the instances of '\xad'. Has anyone ran into a similar problem before? Ideally I'd like to encode this data as UTF-8 to avail of the nltk
library, but it won't read the file with UTF-8 encoding as I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 471: invalid start byte
Any help would be greatly appreciated!
Additional context: This is a recreational project with the aim of being able to generate stories based on the txt file. Everything I've generated thus far has been permeated with '\xad', which ruins the fun!