I am cleaning the monolingual French Europarl corpus (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The raw data is in a .gz file, which I downloaded using wget. I want to extract the text and see what it looks like before processing the corpus further.

Using the following code to extract the text from the gzip file, I obtained data of type bytes.

import gzip

# file_path points to the downloaded europarl-v7.fr.gz
with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read()
    print('type(text)=', type(text))

The printed output, including the first few lines of the text, is as follows:

type(f_in)= <class 'gzip.GzipFile'>

type(text)= <class 'bytes'>

b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 interrompue le vendredi 17 d\xc3\xa9cembre dernier et je vous renouvelle tous mes vux en esp\xc3\xa9rant que vous avez pass\xc3\xa9 de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit.\n

I tried to decode the binary data as UTF-8 and as ASCII with the following code:

with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read().decode('utf8')
    print('type(text)=', type(text))

It raised an error like this:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 26: ordinal not in range(128)

I also tried using the codecs and unicodedata packages to open the file, but they raised encoding errors as well.

Could you please explain what I should do to get the French text in the correct format, like this for example?

Reprise de la session\nJe déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit.\n

Thank you a ton for your help!

snakecharmerb
Sophil
  • It would help if you posted the code you're using to decode to utf8, i.e. the code that's not working. Your example bytes decode fine for me. – Mark Jul 25 '19 at 08:10
  • Which version of Python are you using, 3.x or 2.x? – Rahul Jul 25 '19 at 08:11
  • `b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 interrompue le vendredi 17 d\xc3\xa9cembre dernier et je vous renouvelle tous mes vux en esp\xc3\xa9rant que vous avez pass\xc3\xa9 de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit.\n` This _is_ utf-8 encoded text. A `UnicodeEncodeError` suggests the problem lies in your environment. What OS and python version are you on? – snakecharmerb Jul 25 '19 at 08:13
  • @Rahul Thank you for your comment! I'm using Python 3.5.3 and Debian GNU/Linux 9.9. – Sophil Jul 25 '19 at 08:15
  • What is the output of `echo $LANG` in your terminal? – snakecharmerb Jul 25 '19 at 08:17
  • @snakecharmerb The output of `echo $LANG` in my terminal is `fr_FR.UTF-8`. – Sophil Jul 25 '19 at 08:18
  • @MarkMeyer Thank you for your feedback! I've added the part to decode to utf8 in the question for easier reading. – Sophil Jul 25 '19 at 08:21
  • `Unicode` **`Encode`** `Error` uhm... if the error was raised when *decoding* we should see a `UnicodeDecodeError` instead. Please can you provide 1) the *full* traceback (starting from `Traceback (most recent call last)` up to the `UnicodeEncodeError: ...`) and 2) the *whole* code you are using? – Giacomo Alzetta Jul 25 '19 at 08:23
  • @GiacomoAlzetta Thank you for your comment! I couldn't copy the whole code due to the character limit. It is basically just the part that I showed in the question. The full traceback is `Traceback (most recent call last): File "europarl_extractor.py", line 38, in <module> main() File "europarl_extractor.py", line 35, in main print(toto[:500]) UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 26: ordinal not in range(128)` – Sophil Jul 25 '19 at 08:27
  • Try setting the PYTHONIOENCODING environment variable when you run your script: `PYTHONIOENCODING:=UTF-8 python3 europarl_extractor.py` – snakecharmerb Jul 25 '19 at 08:31
  • @snakecharmerb Thank you for your help!! I tried your suggestion but the result format is the same: ```b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 ...``` :( – Sophil Jul 25 '19 at 08:36
  • @snakecharmerb Hello, although it worked for me when I converted the file to `txt` format, I ran into problems when I read the `txt` file. After trying a lot of things, I finally found out that your suggestion of setting `PYTHONIOENCODING` works, but not with your exact command. I tried your command and also the command `set PYTHONIOENCODING:=UTF-8` but they didn't work. It worked for me when I used the following command: `export PYTHONIOENCODING=utf8`. – Sophil Jul 26 '19 at 15:46
  • Sorry, that ':' was a typo. Glad it's working for you now. – snakecharmerb Jul 26 '19 at 15:48
  • @snakecharmerb Thank you so much!!! I'm so happy because I spent a lot of time (and frustration) on this. If you don't mind, could you please post this as an answer so that I can accept it? – Sophil Jul 26 '19 at 15:51

2 Answers

The UnicodeEncodeError occurs because print() has to encode the string to bytes before writing it to the terminal, and the encoding being used here, ASCII, cannot represent the character '\xe9' (é), so the error is raised.
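To illustrate the point, here is a minimal sketch that reproduces the error without a terminal. The sample string and the `fake_stdout` stream are assumptions made for the demonstration; `fake_stdout` merely stands in for a standard output that has been configured with the ASCII encoding.

```python
import io

# Sample text containing U+00E9 (e with acute accent), written as an escape.
text = 'Je d\u00e9clare reprise la session'

# UTF-8 can represent U+00E9: it becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode('utf-8')

# ASCII only covers code points 0-127, so encoding the same text fails.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

# print() performs the same encoding step through the output stream.
# fake_stdout is a hypothetical stand-in for an ASCII-configured stdout.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    print(text, file=fake_stdout)
    fake_stdout.flush()
    print_failed = False
except UnicodeEncodeError:
    print_failed = True
```

Decoding the bytes succeeds in both cases; it is only the final encoding step for output that fails.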

Setting the PYTHONIOENCODING environment variable makes Python use the encoding named in that variable for standard input and output instead. UTF-8 can encode any character, so calling the program like this solves the issue:

PYTHONIOENCODING=UTF-8 python3 europarl_extractor.py

assuming the code is something like this:

import gzip

if __name__ == '__main__':
    with gzip.open('europarl-v7.fr.gz', 'rb') as f_in:
        bs = f_in.read()
        txt = bs.decode('utf-8')
        print(txt[:100])

The environment variable may be set in other ways - via an export statement, in .bashrc, .profile etc.
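As an aside on the decoding step in the code above: `gzip.open` also accepts a text mode (`'rt'`) with an `encoding` argument, which decodes while reading. The sketch below assumes a small temporary file, `sample.fr.gz`, created only for the demonstration; it is not the real corpus. Printing the result still depends on the output encoding discussed here.

```python
import gzip
import os
import tempfile

# Hypothetical sample line standing in for the corpus content.
sample = 'Je d\u00e9clare reprise la session\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small gzip file so the example is self-contained.
    path = os.path.join(tmp_dir, 'sample.fr.gz')
    with gzip.open(path, 'wb') as f_out:
        f_out.write(sample.encode('utf-8'))

    # 'rt' opens the file in text mode, so reads return str, not bytes.
    with gzip.open(path, 'rt', encoding='utf-8') as f_in:
        first_line = f_in.readline()

# first_line is already decoded text at this point.
assert isinstance(first_line, str)
```

Reading line by line this way also avoids loading the whole corpus into memory at once.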

An interesting question is why Python is trying to encode output as ASCII. I had assumed that on *nix systems Python essentially looked at the $LANG environment variable to determine the encoding to use. But in this case the value of $LANG is fr_FR.UTF-8, and yet Python is using ASCII as the output encoding.

From looking at the source for the locale module, and this FAQ, these environment variables are checked, in order:

'LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE'

So it may be that one of LC_ALL or LC_CTYPE has been set to a value that mandates ASCII encoding in your environment (you can check by running the locale command in your terminal; also running locale charmap will tell you the encoding itself).
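The same check can be sketched from inside Python. The helper name `describe_output_encoding` is hypothetical; it only collects the environment variables named above together with the encodings Python reports, and makes no claim about how Python combines them internally.

```python
import locale
import os
import sys


def describe_output_encoding():
    """Collect the settings that can influence the output encoding."""
    names = ('PYTHONIOENCODING', 'LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE')
    # None means the variable is not set in this environment.
    report = {name: os.environ.get(name) for name in names}
    # Encoding derived from the current locale settings.
    report['locale.getpreferredencoding'] = locale.getpreferredencoding(False)
    # Encoding that print() will actually use for standard output.
    report['sys.stdout.encoding'] = getattr(sys.stdout, 'encoding', None)
    return report


for key, value in describe_output_encoding().items():
    print('{}: {!r}'.format(key, value))
```

If `sys.stdout.encoding` reports an ASCII variant while `LANG` names a UTF-8 locale, one of the higher-priority variables is the likely culprit.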

snakecharmerb
  • I checked using `locale charmap` and it returned `ANSI_X3.4-1968` instead of `UTF-8`. So that's the reason why there was an encoding error. Thank you so much for your detailed explanation! It's so good that this was solved at its root instead of with the workaround of converting to another format, which worked previously for reasons I didn't understand. – Sophil Jul 27 '19 at 08:52

Many thanks for all your help! I found a simple workaround. I'm not sure why it works; maybe the .txt format is supported somehow? If you know the mechanism, it would be extremely helpful to know.

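import gzip
import os

# Read the raw (still UTF-8 encoded) bytes from the gzip file
with gzip.open(file_path, 'rb') as f_in:
    text = f_in.read()

# Write the same bytes unchanged to a plain file
with open(os.path.join(out_dir, 'europarl.txt'), 'wb') as f_out:
    f_out.write(text)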

When I print out the text file in the terminal, it looks like this:

Reprise de la session Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.

Sophil
  • You are writing the raw binary bytes, so Python never attempts to understand them as text. Probably you should `open(os.path.join(out_dir, 'europarl.txt'), 'w', encoding='utf-8')` to specifically select text mode with the desired encoding. You'll obviously also need to `decode` the input then, like in your original question. – tripleee Aug 10 '21 at 05:28
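The suggestion in the comment above can be sketched as follows. This is only an illustration under stated assumptions: the temporary directory, `sample.fr.gz` and the sample line are hypothetical stand-ins for the real corpus and output directory.

```python
import gzip
import os
import tempfile

# Hypothetical sample content standing in for the corpus.
raw = 'Je d\u00e9clare reprise la session\n'.encode('utf-8')

with tempfile.TemporaryDirectory() as out_dir:
    # Create a small gzip file so the example is self-contained.
    gz_path = os.path.join(out_dir, 'sample.fr.gz')
    with gzip.open(gz_path, 'wb') as f:
        f.write(raw)

    # Decode the bytes explicitly so that text is a str.
    with gzip.open(gz_path, 'rb') as f_in:
        text = f_in.read().decode('utf-8')

    # Text mode with an explicit encoding: the file contents no longer
    # depend on the locale or on PYTHONIOENCODING.
    txt_path = os.path.join(out_dir, 'europarl.txt')
    with open(txt_path, 'w', encoding='utf-8') as f_out:
        f_out.write(text)

    # Read the file back as bytes to inspect what was written.
    with open(txt_path, 'rb') as f_check:
        written = f_check.read()
```

Writing never goes through standard output here, which is why neither this nor the binary-mode workaround triggers the UnicodeEncodeError.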