I am cleaning the monolingual French Europarl corpus (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is a .gz file, which I downloaded using wget. I want to extract the text and see what it looks like in order to process the corpus further.
Using the following code to extract the text from the gzip file, I obtained data of class bytes.
import gzip

# Binary mode ('rb') yields raw bytes, not str
with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read()
    print('type(text)=', type(text))
The printed results for the first several lines are as follows:
type(f_in)= <class 'gzip.GzipFile'>
type(text)= <class 'bytes'>
b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 interrompue le vendredi 17 d\xc3\xa9cembre dernier et je vous renouvelle tous mes vux en esp\xc3\xa9rant que vous avez pass\xc3\xa9 de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit.\n
I tried to decode the binary data with utf8 and ascii, using the following code:
with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read().decode('utf8')
    print('type(text)=', type(text))
And it returned errors like this:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 26: ordinal not in range(128)
I also tried using the codecs and unicodedata packages to open the file, but they returned encoding errors as well.
Could you please explain what I should do to get the French text in the correct format, for example like this?
Reprise de la session\nJe déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit.\n
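One detail in the traceback may be relevant here: the error is a UnicodeEncodeError rather than a UnicodeDecodeError, and position 26 of the decoded text is exactly the first 'é' (in "déclare"). That suggests the decoding itself may succeed and the failure happens later, when the resulting str is printed to a stdout that uses the ascii codec. The following is only a minimal sketch under that assumption. The helper name read_gzip_text and the small temporary sample file are hypothetical stand-ins for the real europarl-v7.fr.gz; the sketch opens the gzip file in text mode so that it returns str directly.

```python
import gzip
import os
import tempfile


def read_gzip_text(path, encoding="utf-8"):
    """Return the decompressed contents of a gzip file as a str."""
    # Mode 'rt' makes gzip.open decode the bytes while reading.
    with gzip.open(path, "rt", encoding=encoding) as f_in:
        return f_in.read()


# Hypothetical stand-in for europarl-v7.fr.gz, written as UTF-8 bytes.
sample = "Reprise de la session\nJe déclare reprise la session du Parlement européen\n"
sample_path = os.path.join(tempfile.mkdtemp(), "sample.fr.gz")
with gzip.open(sample_path, "wb") as f_out:
    f_out.write(sample.encode("utf-8"))

text = read_gzip_text(sample_path)

# ascii() escapes non-ASCII characters, so this print cannot raise
# UnicodeEncodeError even when stdout only supports ASCII.
print(type(text), ascii(text.splitlines()[1]))
```

If printing the text directly still raises UnicodeEncodeError, the terminal encoding is the likely cause. Setting the PYTHONIOENCODING=utf-8 environment variable, or calling sys.stdout.reconfigure(encoding="utf-8") on Python 3.7+, are two options worth trying; writing the cleaned text to a file opened with encoding="utf-8" avoids the terminal altogether.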
Thank you a ton for your help!