I'm trying to use a corpus for training an ML model, but I'm running into some encoding errors that were likely caused by someone else's conversion/annotation of the file. I can see the errors visually when I open the file in vim, but Python doesn't seem to notice them when reading. The corpus is fairly large, so I need a way to get Python to detect the errors and, ideally, a method to correct them.
Here's a sample line as viewed in vim:
...
# ::snt That<92>s what we<92>re with<85>You<92>re not sittin<92> there in a back alley and sayin<92> hey what do you say, five bucks?
The <92> should be an apostrophe and the <85> should probably be an ellipsis (three dots); a number of other similar values appear on other lines. From some googling, I'm thinking the original encoding was probably CP1252, but the file command under Linux currently identifies the file as UTF-8. I've tried a few ways to open it, but no luck so far.
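To pin down what's actually in the file, I figure something like this rough sketch (fn is just a placeholder for the corpus path) would show whether the <92> that vim displays is a bare 0x92 byte or the two-byte UTF-8 encoding of the C1 control character U+0092:

fn = 'corpus.txt'                # placeholder for the actual corpus path

with open(fn, 'rb') as f:        # read raw bytes, no decoding applied
    raw = f.read()

idx = raw.find(b'\x92')          # the byte vim displays as <92>
if idx != -1:
    print(raw[idx - 10:idx + 10])    # a preceding \xc2 here would mean UTF-8-encoded U+0092

As for the open attempts themselves: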
with open(fn) as f:
    text = f.read()
returns
# ::snt Thats what were withYoure not sittin there in a back alley and sayin hey what do you say, five bucks?
which appears to drop those characters entirely and run the words together, which is a problem.
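Possibly those characters aren't actually dropped but are just non-printing; printing the repr of an affected line should show what Python really read (again only a sketch, assuming fn is the corpus path):

with open(fn) as f:                  # same default text-mode read as above
    for line in f:
        if '\u0092' in line:         # U+0092 is what vim shows as <92>
            print(repr(line))        # repr makes invisible control characters visible
            break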
Opening with CP1252 instead:

with open(fn, encoding='CP1252') as f:
    text = f.read()
returns
# ::snt ThatÂ's what weÂ're withÂ...YouÂ're not sittinÂ' there in a back alley and sayinÂ' hey what do you say, five bucks?
which visually inserts an "Â" at each of those odd characters.
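My guess at what's happening there: if the file really is UTF-8 and the <92> is the two-byte sequence C2 92 (i.e. U+0092), then decoding those same bytes as CP1252 would produce exactly that kind of stray Â, e.g.:

# b'\xc2\x92' is how UTF-8 encodes the C1 control character U+0092
print(repr(b'\xc2\x92'.decode('utf-8')))     # '\x92'  -- an invisible control character
print(repr(b'\xc2\x92'.decode('cp1252')))    # the stray Â followed by a right single quote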
I also tried

import io
with io.open(fn, errors='strict') as f:
    text = f.read()
This doesn't raise any errors either, and neither does reading the file in as a byte stream and decoding it manually, so unfortunately at this point I can't even detect the errors, much less correct them.
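The only detection idea I've come up with so far is to scan each decoded line for characters in the C1 control range (U+0080 to U+009F), which is where these values would land if the original CP1252 bytes got reinterpreted as Latin-1 somewhere along the way, but I'm not sure that's the right approach. A rough sketch:

import re

c1_controls = re.compile('[\u0080-\u009f]')     # C1 control characters such as U+0092 and U+0085

with open(fn, encoding='utf-8') as f:
    for lineno, line in enumerate(f, 1):
        if c1_controls.search(line):
            print(lineno, repr(line))           # flag suspect lines for later correction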
Is there a way to read in this large file and detect the encoding errors within it? Even better, is there a way to correct them?