
Question:

I'm facing an issue with a Python function that reads data from a CSV file and converts it to JSON. The CSV file contains special Slovenian letters such as "č", "š", "ž", etc. I'm using UTF-8 encoding for both reading and writing the files, which, to my knowledge, should support these characters. However, the function doesn't handle them the way I intended: they appear as Unicode escape sequences in the output JSON file.

Here is a simplified version of the function:

import csv
import json

def read_rail_nodes(_in_file: str, _out_file: str) -> None:
    json_objects = []
    with open(_in_file, 'r', encoding='utf-8') as file:
        csv_reader = csv.reader(file)
        next(csv_reader)
        for row in csv_reader:
            id, station_name, _, _ = row
            json_objects.append({'id': id, 'station_name': station_name})

    with open(_out_file, 'w', encoding='utf-8') as file:
        json.dump(json_objects, indent=4, fp=file)

    print("\n[rail_nodes.csv] data converted and saved to [", _out_file, "]\n")

read_rail_nodes('rail_nodes.csv', 'rail_nodes.json')

Sample Input:

43002,Laško,46.15453611,15.23225833

Sample Output:

    {
        "id": "43002",
        "station_name": "La\u0161ko"
    },

Desired Output:

    {
        "id": "43002",
        "station_name": "Laško"
    },

I am working in VS Code and I've double-checked my encoding: it is set to UTF-8, and opening the .json file with any other encoding only made it worse.

I have been contemplating one potential solution, although I am cautious about complications that may arise later: reading the .json file back with UTF-8 encoding and decoding the escape sequences there. However, when it comes to visually presenting the JSON data, I haven't been able to devise a suitable approach.

Am I missing something in my code, is there something specific I need to do to properly support special Slovenian letters when processing files with UTF-8 encoding, or are my hands tied?

Any help or guidance would be greatly appreciated!

Additional Information:

Python version: 3.11.3
Operating System: macOS

  • I have searched a bit. The file shows the Unicode escape number, but once it is read by Python again it appears as 'š'. I tried it inside the Python console, so I believe it should work as you want it to. – JohnyCapo Aug 15 '23 at 11:48
  • @JohnyCapo Yeah, I figured that. I'm leaning toward manually replacing the codes with the special letters one by one. The dataset isn't too big, so I reckon it should be doable. – underloaded_operator Aug 15 '23 at 11:51
  • The JSON standard does not allow raw UTF-8, so Python's library correctly escapes it. All proper programs will read in the string and convert the escape sequences to a single character (whatever that means). – MegaIng Aug 15 '23 at 11:52
  • Exactly, just as MegaIng said. – JohnyCapo Aug 15 '23 at 11:52
  • If you had raw UTF-8 in your JSON file, other programs might break and be unable to read in the JSON file (although unlikely, not impossible; for example, slightly older Python versions (below 3.7, I think?) would not default to UTF-8). – MegaIng Aug 15 '23 at 11:53
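The behavior described in the comments can be checked directly: Python's json module escapes non-ASCII characters to \uXXXX sequences by default, but any conforming JSON parser (including json.loads) turns the escapes back into the original characters. A minimal round-trip sketch:

```python
import json

# By default, json.dumps escapes non-ASCII characters to \uXXXX sequences.
escaped = json.dumps({"station_name": "Laško"})
print(escaped)  # {"station_name": "La\u0161ko"}

# Reading the escaped JSON back yields the original character again.
parsed = json.loads(escaped)
print(parsed["station_name"])  # Laško
```

So the escaped output file is not corrupted; it is just a different, ASCII-safe spelling of the same data.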

1 Answer


The issue you're facing is not with your code, but with how Python's json.dump() function escapes non-ASCII characters by default.

If you want the JSON written with the actual Unicode characters, use the ensure_ascii parameter of json.dump(): set it to False and the JSON will be written using the characters themselves instead of escape sequences, e.g.:

json.dump(json_objects, indent=4, fp=file, ensure_ascii=False)
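Applied to the writing step of the function from the question, only the dump call changes. A minimal sketch, using the sample row from the question as data:

```python
import json

# Sample data matching the question's CSV row.
json_objects = [{"id": "43002", "station_name": "Laško"}]

with open("rail_nodes.json", "w", encoding="utf-8") as file:
    # ensure_ascii=False writes the UTF-8 characters directly
    # instead of \uXXXX escape sequences.
    json.dump(json_objects, file, indent=4, ensure_ascii=False)
```

The resulting file then contains "Laško" literally, as in the desired output.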
Sarthak