
First and foremost - I'm a DBA with very little experience with Python (or any coding language), so any help will be greatly appreciated.

I need to parse a few hundred CSV files to spot any Unicode (non-ASCII) characters they may contain, and doing this by hand is not possible. The intent is to gauge how problematic the data will be when we attempt to import it into a non-Unicode database (it is a FAR more complex story, but this is the gist of the situation).

I've gotten as far as being able to open a CSV file and read its contents, including the Unicode characters - if I had unlimited time, I'd keep banging my head against the wall, but...I don't. Thanks in advance for any light you can shine.

1 Answer


To collect all the non-ASCII characters in a file into a list you can do this:

non_ascii_chars = []
# Specify the encoding explicitly; the platform default may not match
# these files (swap 'utf-8' for whatever encoding your CSVs actually use).
with open('myfile.csv', encoding='utf-8') as f:
    for line in f:
        for char in line:
            if ord(char) > 127:  # codepoints above 127 are outside ASCII
                non_ascii_chars.append(char)

The ord built-in function returns the Unicode codepoint of a character; ASCII characters have codepoints in the range 0 - 127.
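For example, a quick check in the interpreter (the sample characters are just illustrations):

print(ord('A'))   # 65   - plain ASCII, so the check above ignores it
print(ord('é'))   # 233  - outside the ASCII range, so it gets collected
print(ord('€'))   # 8364 - well outside the ASCII range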

A more succinct version, using a list comprehension:

with open('myfile.csv', encoding='utf-8') as f:
    non_ascii_chars = [char for line in f for char in line if ord(char) > 127]

To write the collected characters to a file:

with open('non_ascii_chars.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(non_ascii_chars))
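
Since the question mentions a few hundred files, the same check can be scaled up. Here is a minimal sketch, assuming the CSVs all sit under one directory (csv_dir is a hypothetical path here) and are UTF-8 encoded; collections.Counter tallies how often each non-ASCII character appears in each file, which helps gauge how problematic the data is:

from collections import Counter
from pathlib import Path

csv_dir = Path('csv_dir')  # hypothetical directory holding the CSV files

for path in sorted(csv_dir.glob('*.csv')):
    # errors='replace' keeps the scan going if a file isn't valid UTF-8;
    # undecodable bytes become U+FFFD, which is itself flagged as non-ASCII
    with open(path, encoding='utf-8', errors='replace') as f:
        counts = Counter(char for line in f for char in line if ord(char) > 127)
    if counts:
        print(path.name, dict(counts))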
snakecharmerb