
I'm trying to use a corpus for training an ML model, but I'm running into some encoding errors that were likely caused by someone else's conversion/annotation of the file. I can see the errors when opening the file in vim, but Python doesn't seem to notice them when reading. The corpus is fairly large, so I need a way to get Python to detect them and, hopefully, a method to correct them.

Here's a sample line as viewed in vim...

# ::snt That<92>s what we<92>re with<85>You<92>re not sittin<92> there in a back alley and sayin<92> hey what do you say, five bucks?

The <92> should be an apostrophe and the <85> should probably be an ellipsis (three dots). A number of other values appear on other lines. After some googling, I think the original encoding was probably CP1252, but the file command under Linux currently lists this file as UTF-8. I've tried a few ways to open it, with no luck...

with open(fn) as f: returns

# ::snt Thats what were withYoure not sittin there in a back alley and sayin hey what do you say, five bucks?

which is skipping those tokens and concatenating words, which is a problem.

with open(fn, encoding='CP1252') as f: returns

# ::snt ThatA's what weA're withA...YouA're not sittinA' there in a back alley and sayinA' hey what do you say, five bucks?

which is visually inserting "A" for those odd characters.

with io.open(fn, errors='strict') doesn't raise any errors and neither does reading in a byte stream and decoding, so unfortunately at this point I can't even detect the errors much less correct for them.

Is there a way to read in this large file and detect encoding errors within it? Even better, is there a way to correct them?

bivouac0
  • You could try https://pypi.org/project/charset-normalizer/ – snakecharmerb Aug 24 '20 at 19:28
  • That lib didn't seem to work and `unidecode` doesn't work either, although `unidecode` does at least strip the offending characters. – bivouac0 Aug 24 '20 at 19:33
  • Are only *some* lines in cp1252? Is the whole file cp1252? – Max Aug 24 '20 at 19:40
  • I'm a little unclear what's going on. Reading the file (or only the example line above) as CP1252 doesn't work correctly so I'm guessing the file is encoded with UTF-8 but has invalid characters in it that `vim` is able to detect but python is ignoring? – bivouac0 Aug 24 '20 at 19:44
  • maybe you could put original file on Google Drive (or similar portal) so we could download original file and test it. – furas Aug 24 '20 at 20:31

2 Answers

1

Using the original data from your answer, you've got mojibake from a double-encode. You need a double-decode to translate it properly.

>>> s = b'# ::snt That\xc2\x92s what we\xc2\x92re with\xc2\x85You\xc2\x92re not sittin\xc2\x92 there in a back alley and sayin\xc2\x92 hey what do you say, five bucks?\n'
>>> s.decode('utf8').encode('latin1').decode('cp1252')
'# ::snt That’s what we’re with…You’re not sittin’ there in a back alley and sayin’ hey what do you say, five bucks?\n'

The data is actually valid UTF-8, but once decoded to Unicode, the problem code points (U+0092, U+0085) have the same numeric values as the Windows-1252 bytes for the intended characters. The .encode('latin1') converts those code points 1:1 back to bytes, since the latin1 encoding maps the first 256 code points of Unicode directly to byte values; the resulting bytes can then be decoded correctly as Windows-1252.
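One caveat when applying this to a whole corpus: the one-liner raises an exception if the text contains any character above U+00FF, or one of the five byte values that Python's cp1252 codec leaves undefined. A minimal sketch that only touches the C1 control range (U+0080 to U+009F) and also reports which lines are affected could look like the following. The names `find_suspect_lines` and `fix_c1_mojibake` are made up for illustration, and the sketch assumes the file has already been decoded cleanly as UTF-8.

```python
import re

# C1 control characters (U+0080 to U+009F) rarely occur in real text.
# Finding them in decoded text suggests Windows-1252 bytes that were
# treated as Latin-1 before being re-encoded as UTF-8.
C1_PATTERN = re.compile('[\x80-\x9f]')


def _redecode(match):
    """Map one C1 character to the character cp1252 assigns to that byte."""
    ch = match.group()
    try:
        return ch.encode('latin1').decode('cp1252')
    except UnicodeDecodeError:
        # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's cp1252
        # codec, so leave those characters unchanged.
        return ch


def find_suspect_lines(lines):
    """Return the 1-based numbers of lines containing C1 characters."""
    return [n for n, line in enumerate(lines, 1) if C1_PATTERN.search(line)]


def fix_c1_mojibake(text):
    """Replace only the C1 characters; everything else is left as-is."""
    return C1_PATTERN.sub(_redecode, text)
```

Reading the file with encoding='utf-8', passing the lines to `find_suspect_lines` for detection, and then passing each line through `fix_c1_mojibake` would cover both parts of the question without disturbing any legitimate non-ASCII text.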

Mark Tolonen
0

Here's a workable but not very elegant solution...

# Read in file as a raw byte-string
fn  = 'bad_chars.txt'
with open(fn, 'rb') as f:
    text = f.read()
print(text)

# Detect any byte outside the ASCII range (iterating over bytes yields ints)
has_bad = any(c >= 128 for c in text)
print('Had bad:', has_bad)

# Fix offending characters
text = text.replace(b'\xc2\x92', b"\x27")
text = text.replace(b'\xc2\x85', b"...")
text = text.decode('utf-8')
print(text)

Which produces the following output...

b'# ::snt That\xc2\x92s what we\xc2\x92re with\xc2\x85You\xc2\x92re not sittin\xc2\x92 there in a back alley and sayin\xc2\x92 hey what do you say, five bucks?\n'

Had bad: True

# ::snt That's what we're with...You're not sittin' there in a back alley and sayin' hey what do you say, five bucks?

The downside is that I need to find each offending character and code a replace command for it. There is a table of possible replacement codes in a similar question, "efficiently replace bad characters".
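As a rough sketch of how the hand-written replace calls could be avoided, the table can be generated from Python's cp1252 codec, and the same byte pattern can be used to count the offending sequences. This flags only the C1 range, whereas the `c >= 128` check above also fires on legitimate non-ASCII text. The function names here are hypothetical, and unlike the code above this version substitutes the real curly apostrophe and ellipsis characters rather than ASCII stand-ins.

```python
import re
from collections import Counter

# In UTF-8, U+0080..U+009F are encoded as 0xC2 followed by 0x80..0x9F.
C1_BYTES = re.compile(rb'\xc2[\x80-\x9f]')


def build_replacements():
    """Map each double-encoded sequence to the UTF-8 bytes of the
    character that cp1252 assigns to the underlying byte."""
    table = {}
    for b in range(0x80, 0xA0):
        try:
            table[bytes([0xC2, b])] = bytes([b]).decode('cp1252').encode('utf-8')
        except UnicodeDecodeError:
            # Byte is undefined in Python's cp1252 codec; skip it.
            continue
    return table


def count_suspects(data):
    """Count how often each suspicious byte sequence occurs."""
    return Counter(C1_BYTES.findall(data))


def repair(data, table):
    """Replace known sequences; unknown ones are left untouched."""
    return C1_BYTES.sub(lambda m: table.get(m.group(), m.group()), data)
```

Calling `count_suspects(text)` on the raw bytes gives a quick inventory of which values appear in the corpus, and `repair(text, build_replacements()).decode('utf-8')` produces the corrected string.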

bivouac0