-1

could you please help me with something?

I have very large .xlsx files (ranging from 200MB to 1.4GB) that I need to convert to .csv format.

I wanted to use a csvkit function in2csv:

(venv) C:\Users\Martin\csvkit_tutorial>in2csv first.xlsx > second.csv

However, when I run it, I get this error:

C:\Users\Martin\csvkit_tutorial\venv\Lib\site-packages\agate\utils.py:274: UnnamedColumnWarning: Column 0 has no name. Using "a".
UnicodeEncodeError: 'charmap' codec can't encode characters in position 137-146: character maps to <undefined>

I don't care if the first column is named "a". The main issue is the second error, UnicodeEncodeError.

The second.csv file is created, but only until the error has occured, i.e. only 133 lines out of thousands are converted into .csv.

I have looked online, and people have suggested that converting the encoding into, e.g. utf-8 standard would help, but I do not know how to do this within csvkit.

Thank you for your help.

TLDR: I tried to convert .xlsx file into .csv using csvkit function in2csv. It did not work properly, citing a UnicodeEncodeError, and only a small part of the file was converted.

Martin
  • 1
  • I don't know the tool, but a quick google showed it has an `--encoding` parameter, which you might want to try setting to UTF-8 for example…? – deceze Jul 07 '23 at 12:18
  • If that doesn't do anything, then you're probably simply dealing with Unicode on Windows problems: https://stackoverflow.com/q/5419/476 – deceze Jul 07 '23 at 12:19
  • 1
    Hmm... older versions of csvkit were known to badly handle non ascii characters and required quick and dirty fixes in the tool code... Hard to say more without a [mre]. Maybe you could try to find what are the offending characters by examining the first non converted line in order to show us the problematic input. – Serge Ballesta Jul 07 '23 at 12:31
  • @SergeBallesta I have looked at two files, both break when there are Chinese characters included, e.g. 亚得科技工程有限公司 and 帝斯曼食品专业有限公司 – Martin Jul 07 '23 at 15:18
  • @deceze: csvkit still has an awful non ascii support. It allows to declare an input encoding but not the output one. The only reliable way is to set the PYTHONIOENCODING environment variable. – Serge Ballesta Jul 07 '23 at 16:30

2 Answers2

1

csvkit always create the output csvfile with the default encoding. It has options for the input encoding (--encoding and --encoding-xls) but they are not used for output.

I could reproduce the error with a tiny file containing 2 cells with the chinese characters you have provided in your comment: 亚得科技工程有限公司 and 帝斯曼食品专业有限公司.

Setting the PYTHONIOENCODING environment variable to utf8 was enough to fix:

  • Windows syntax:

      set PYTHONIOENCODING=utf8
      in2csv first.xlsx > second.csv
    
  • Linux (or other Unix-like):

      PYTHONIOENCODING=utf8 in2csv first.xlsx > second.csv
    
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
0

There are 2 answers that come to mind quickly - they are easy to miss in the docs though.

in2csv has an encoding parameter: -e ENCODING, --encoding ENCODING

It also has --encoding-xls ENCODING_XLS - I am not sure what the difference is, but for a unicode error you probably want -e utf8

The second possibility is more universal (but I do not recommend) - Tips & Tricks explains:

Python standard output encoding errors

If, when running a command like csvlook dummy.csv | less you get an error like:

"ascii" codec can't encode character "\u0105" in position 2: ordinal not in range(128)

The simplest option is to set the encoding that Python uses for standard streams, using the PYTHONIOENCODING environment variable:

env PYTHONIOENCODING=utf8 csvlook dummy.csv | less

I don't recommend you set this universally, it is mentioned here for completeness.

Alan
  • 2,914
  • 2
  • 14
  • 26