
First and foremost - I'm a DBA with very little experience with Python (or any coding language), so any help will be greatly appreciated.

I need to parse a few hundred CSV files to spot any Unicode (non-ASCII) characters they may contain, and doing this by hand is not possible. The intent is to gauge how problematic the data will be when we attempt to import it into a non-Unicode database (it is a FAR more complex story, but this is the gist of the situation).

I've gotten as far as being able to open a CSV file and read its contents, including the Unicode characters - if I had unlimited time, I'd keep banging my head against the wall, but...I don't. Thanks in advance for any light you can shine.

1 Answer


To collect all the non-ASCII characters in a file into a list you can do this:

non_ascii_chars = []
# Specify the encoding explicitly; the platform default may not match
# these files (swap 'utf-8' for whatever encoding your CSVs actually use).
with open('myfile.csv', encoding='utf-8') as f:
    for line in f:
        for char in line:
            if ord(char) > 127:  # codepoints above 127 are outside ASCII
                non_ascii_chars.append(char)

The ord built-in function returns the Unicode codepoint of a character; ASCII characters have codepoints in the range 0 - 127.
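For example, a quick check in the interpreter (the sample characters are just illustrations):

print(ord('A'))   # 65   - plain ASCII, so the check above ignores it
print(ord('é'))   # 233  - outside the ASCII range, so it gets collected
print(ord('€'))   # 8364 - well outside the ASCII range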

A more succinct version, using a list comprehension:

with open('myfile.csv', encoding='utf-8') as f:
    non_ascii_chars = [char for line in f for char in line if ord(char) > 127]

To write the collected characters to a file:

with open('non_ascii_chars.txt', 'w', encoding='utf-8') as f:
    f.write(''.join(non_ascii_chars))
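
Since the question mentions a few hundred files, the same check can be scaled up. Here is a minimal sketch, assuming the CSVs all sit under one directory (csv_dir is a hypothetical path here) and are UTF-8 encoded; collections.Counter tallies how often each non-ASCII character appears in each file, which helps gauge how problematic the data is:

from collections import Counter
from pathlib import Path

csv_dir = Path('csv_dir')  # hypothetical directory holding the CSV files

for path in sorted(csv_dir.glob('*.csv')):
    # errors='replace' keeps the scan going if a file isn't valid UTF-8;
    # undecodable bytes become U+FFFD, which is itself flagged as non-ASCII
    with open(path, encoding='utf-8', errors='replace') as f:
        counts = Counter(char for line in f for char in line if ord(char) > 127)
    if counts:
        print(path.name, dict(counts))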
snakecharmerb