
I wrote a Python 3 script that does some web scraping and stores the results in a CSV file. The script works fine on my computer, but it fails when I run it in a Docker container. The error seems to come from this part of my code (simplified further for the purposes of this question).

# default CSV module
import csv

# this is what an ACTUAL row looks like in my program; I included it in case it matters
row = {'title': 'Electrochemical sensor for the determination of dopamine in presence of high concentration of ascorbic acid using a Fullerene-C60 coated gold electrode', 'url': 'https://onlinelibrary.wiley.com/doi/abs/10.1002/elan.200704073', 'author': 'Goyal, Rajendra Nath and Gupta, Vinod Kumar and Bachheti, Neeta and Sharma, Ram Avatar', 'abstract': 'A fullerene‐C60‐modified gold electrode is employed for the determination of dopamine in the excess of ascorbic acid using square‐wave voltammetry. Based on its strong catalytic function towards the oxidation of dopamine and ascorbic acid, the overlapping voltammetric …', 'eprint': 'http://www.academia.edu/download/3909892/Dopamene.pdf', 'publisher': 'Wiley Online Library', 'year': '2008', 'pages': '757--764', 'number': '7', 'volume': '20', 'journal': 'Electroanalysis: An International Journal Devoted to Fundamental and Practical Aspects of Electroanalysis', 'ENTRYTYPE': 'article', 'ID': 'goyal2008electrochemical'}

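# the CSV writer object; DictWriter needs an open file object rather than a path
# ('toMYSQL' is a custom dialect registered elsewhere in my program)
csv_file = open("file.csv", "w", newline="")
writer = csv.DictWriter(csv_file, fieldnames=list(row), dialect='toMYSQL')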

# this is the source of the problem!
writer.writerow(row)

I understand that containers have only the bare bones, which means the encoding the script uses may not be supported. So I added this to the start of my script (below the usual shebang):

# coding=utf-8

These are the locales available in my Docker container:

$ locale -a

C
C.UTF-8
POSIX
en_US.utf8
es_CR.utf8

I have way more on my PC, but that shouldn't change much since en_US.utf8 covers all English stuff and es_CR.utf8 covers all Spanish stuff. (most, if not all, of my results are in English.)

I'm using Python 3, so I know all strings are Unicode; maybe that's related to the problem?

$ python3 --version
Python 3.6.5
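
One way to check whether the default text encoding is involved is to ask the interpreter which encodings it falls back to when none is given. The following is only a diagnostic sketch; the helper name `describe_encodings` is made up for illustration and is not part of the original script.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python falls back to when none is given."""
    return {
        # Used by open() in text mode when no encoding= argument is passed
        "preferred": locale.getpreferredencoding(False),
        # Used by print() when writing to the console (None if stdout has no encoding)
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used for file names, not for file contents
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running this both on the PC and inside the container should show whether the two environments disagree. A value such as ANSI_X3.4-1968 (an alias for ASCII) instead of UTF-8 would point to the locale rather than to the csv module.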

Despite all that, when I run my program, I get the following error message as soon as the script tries to print the row on console:

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/Systematic-Mapping-Engine/sysmapengine/scraper.py", line 100, in build_csv
    writer.writerow(clean_row)
  File "/usr/lib/python3.6/csv.py", line 155, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2010' in position 262: ordinal not in range(128)
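
The traceback can probably be reproduced without Docker by forcing an ASCII text stream underneath the CSV writer. This is a sketch under assumptions: the row is a shortened stand-in for the real one, the in-memory stream replaces the real file, and the custom dialect is left out.

```python
import csv
import io

# A shortened row containing U+2010 (HYPHEN), the character named in the traceback
row = {"title": "A fullerene\u2010C60\u2010modified gold electrode", "year": "2008"}


def write_row(encoding):
    """Write `row` as CSV through a text stream that uses `encoding`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    writer = csv.DictWriter(stream, fieldnames=list(row))
    writer.writerow(row)
    stream.flush()
    return raw.getvalue()


# With an ASCII stream the write fails, as it does in the container
try:
    write_row("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# With a UTF-8 stream the same row is written without complaint
utf8_bytes = write_row("utf-8")
```

If `ascii_error` is set while the UTF-8 call succeeds, the data and the csv module are fine and only the stream's encoding differs between the two environments.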
Fabián Montero
  • 7
    This question doesn't seem to be a duplicate of the linked question, so the notice above is misleading. In the other question the problem isn't caused by the Docker environment, and its answers don't solve this issue. The problem here isn't that some random file has an improper encoding and requires special handling, but rather that ANY file in the Docker container is assumed to have the wrong encoding because of some image defaults. This question could be improved by including the Dockerfile or image name and by showing Python's raw `open` instead of the `csv` module. – pkubik Oct 13 '18 at 23:34
  • 2
    This question is absolutely NOT a duplicate; how can we get it unmarked? – Vincent Buscarello Sep 17 '20 at 15:16
  • 1
    Voted to reopen; this is definitely not a duplicate – Clément Dec 14 '20 at 16:09

1 Answer

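Most containers start with LANG=C set. Under that locale Python 3.6 falls back to ASCII as the default encoding for files and the console, which is why a character such as '\u2010' cannot be written. That can be really annoying if you're dealing with UTF-8.

To make sure your container starts with a UTF-8 locale, add -e LANG=C.UTF-8 when calling docker run.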
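
If changing the container's environment is not an option, a complementary approach (not part of the answer above) is to make the script independent of the locale by passing the encoding to open() explicitly. A minimal sketch, assuming an illustrative row and a temporary output path rather than the asker's real data:

```python
import csv
import os
import tempfile

# Illustrative row containing U+2010 (HYPHEN)
row = {"title": "fullerene\u2010C60", "year": "2008"}

# Hypothetical output path; a temporary directory keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "file.csv")

# encoding="utf-8" overrides the locale default; newline="" is what the csv docs recommend
with open(path, "w", encoding="utf-8", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)

# Read the file back with the same explicit encoding
with open(path, encoding="utf-8", newline="") as csv_file:
    rows_read = list(csv.DictReader(csv_file))
```

This only covers files the script opens itself. Output sent to the console with print() still follows the encoding of stdout, which the LANG setting above (or the PYTHONIOENCODING environment variable) controls.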

yorodm