1

I have a text file which contains accented characters such as: 'č', 'š', 'ž'. When I read this file with a Python program and put the file content into a Python list the accented characters are lost, Python replaces them with other characters. For example: 'č' is replaced by '_'. Does anyone know how I can keep the accented characters in a Python program, when I read them from a file? My code:

import sqlite3 #to work with relational DB

conn = sqlite3.connect('contacts.sqlite') #connect to db 
cur = conn.cursor() #db connection handle

cur.execute("DROP TABLE IF EXISTS contacts")

cur.execute("CREATE TABLE contacts (id INTEGER, name TEXT, surname  TEXT, email TEXT)")

fname = "acos_ibm_notes_contacts - test.csv"
fh = open(fname) #file handle
print " "
print "Reading", fname
print " "

#--------------------------------------------------
#First build a Python list with new contacts data: name, surname and email address

lst = list() #temporary list to hold content of the file
new_contact_list = list() #this list will contain contatcs data: name, surname and email address
count = 0 # to count number of contacts
id = 1 #will be used to add contacts id into the DB
for line in fh: #for every line in the file handle
    new_contact = list()
    name = ''
    surname = ''
    mail = ''
    #split line into tokens at each '"' character and put tokens into  the temporary list
    lst = line.split('"')
    if lst[1] == ',': continue #if there is no first name, move to next line
    elif lst[1] != ',': #if 1st element of list is not empty
        name = lst[1] #this is the name
        if name[-1] == ',': #If last character in name is ','
        name = name[:-1] #delete it
        new_contact.append({'Name':name}) #add first name to new list of contacts
        if lst[5] != ',': #if there is a last name in the contact data
            surname = lst[5] #assign 5th element of the list to surname
            if surname[0] == ',': #If first character in surname is ','
                surname = surname[1:] #delete it
            if surname[-1] == ',': #If last character in surname is ','
                surname = surname[:-1] #delete it
            if ',' in surname: #if surname and mail are merged in same list element
                sur_mail = surname.split(',') #split them at the ','
                surname = sur_mail[0]
                mail = sur_mail[1]
            new_contact.append({'Surname':surname}) #add last name to new list of contacts
            new_contact.append({'Mail':mail}) #add mail address to new list of contacts
        new_contact_list.append(new_contact)
    count = count + 1

fh.close()
#--------------------------------------------------
# Second: populate the DB with data from the new_contact_list

row = cur.fetchone()
id = 1
for i in range(count):
    entry = new_contact_list[i] #every row in the list has data about 1 contact - put it into variable
    name_dict = entry[0] #First element is a dictionary with name data
    surname_dict = entry[1] #Second element is a dictionary with surname data
    mail_dict = entry[2] #Third element is a dictionary with mail data
    name = name_dict['Name']
    surname = surname_dict['Surname']
    mail = mail_dict['Mail']
    cur.execute("INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail))
    id = id + 1               

conn.commit() # Commit outstanding changes to disk 

-----------------------------------

This is simplified version of the program with no DB, just printing to screen

import io
fh = io.open("notes_contacts.csv", encoding="utf_16_le") #file handle

lst = list() #temporary list to hold content of the file
new_contact_list = list() #this list will contain the contact name,    surname and email address
count = 0 # to count number of contacts
id = 1 #will be used to add contacts id into the DB
for line in fh: #for every line in the file handle
    print "Line from file:\n", line # print it for debugging purposes
    new_contact = list()
    name = ''
    surname = ''
    mail = ''
    #split line into tokens at each '"' character and put tokens into  the temporary list
    lst = line.split('"')
    if lst[1] == ',': continue #if there is no first name, move to next line
    elif lst[1] != ',': #if 1st element of list is not empty
        name = lst[1] #this is the name
        print "Name in variable:", name # print it for debugging purposes
        if name[-1] == ',': #If last character in name is ','
            name = name[:-1] #delete it
            new_contact.append({'Name':name}) #add first name to new list of contacts
        if lst[5] != ',': #if there is a last name in the contact data
            surname = lst[5] #assign 5th element of the list to surname
            print "Surname in variable:", surname # print it for debugging purposes
            if surname[0] == ',': #If first character in surname is ','
                surname = surname[1:] #delete it
            if surname[-1] == ',': #If last character in surname is ','
                surname = surname[:-1] #delete it
            if ',' in surname: #if surname and mail are merged in same list element
                sur_mail = surname.split(',') #split them at the ','
                surname = sur_mail[0]
                mail = sur_mail[1]
            new_contact.append({'Surname':surname}) #add last name to new list of contacts
            new_contact.append({'Mail':mail}) #add mail address to new list of contacts
        new_contact_list.append(new_contact)
        print "New contact within the list:", new_contact # print it for debugging purposes

fh.close()

And this is the content of the file notes_contacts.csv, it has 1 line only:

Aco,"",Vidovič,aco.vidovic@si.ibm.com,+38613208872,"",+38640456872,"","","","","","","","",""
Alastair McCormack
  • 26,573
  • 8
  • 77
  • 100
Aco
  • 47
  • 5
  • 4
    Please show some code. – Ulrich Schwarz Dec 28 '16 at 06:37
  • 1
    Are you using `codecs` to read the file? – martianwars Dec 28 '16 at 06:39
  • Try to open the file with utf-8 encoding, `open(Filename, 'r', encoding='utf-8')` – parthiban Dec 28 '16 at 06:45
  • 1
    http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python – Bobby Dec 28 '16 at 06:45
  • To process your file correctly you need to know which encoding it uses. If UTF-8 doesn't work there's a good chance that it's using Latin2, aka [ISO 8859-2](https://en.wikipedia.org/wiki/ISO/IEC_8859-2). Another possibility is Microsoft's Codepage 1250. BTW, you should _always_ mention the Python version with Unicode questions, since Unicode handling is quite different in Python 2 vs Python 3. – PM 2Ring Dec 28 '16 at 09:43
  • My Python version is 2.7. My file is a CSV file which I got by exporting Contacts from Lotus Notes on Mac, using the Notes export option 'Unicode'. So, I know the CSV file is not in UTF-8, but I don't know which encoding it is and how to check it. It might be ISO8859-2... Here is the code: – Aco Dec 31 '16 at 07:58
  • fname = "acos_ibm_notes_contacts - test.csv" fh = open(fname) #file handle ... (many lines of code here) fh.close() ... (more lines of code) cur.execute("INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail)) conn.commit() – Aco Dec 31 '16 at 08:18
  • Sorry for the messy code above - everything in one line. I copy/pasted it from the program, thought it would include new lines, but it doesn't. Obviously it puts all lines into one line here in comments. – Aco Dec 31 '16 at 08:22
  • @martianwars I am not using codecs. The reason being when I looked at codecs tutorial, I found out that codecs can be used to convert between text, bin, hex etc, but did not see how codecs can be used for for different text codings such as ISO 5559-2-UTF 8. Do you know a good tutorial or examples for such cases? – Aco Dec 31 '16 at 08:36
  • @Ulrich Schwarz, I added the code. Can you please enlighten me how I can add a new line in comments on this page - so the code will be understandable. I thought the Enter key would insert a new line, but it saves the comments instead. And it seems that I am not authorised to edit/correct my own comments once I submit them. – Aco Dec 31 '16 at 08:40
  • 1
    @Aco, please add the code in the question and not in the comments. You can edit comments upto 5 minutes after posting it – martianwars Dec 31 '16 at 08:40
  • Thank you @martianwars! Code added. :) This is the first time I am writing at Stack Overflow. – Aco Dec 31 '16 at 16:42
  • The "many lines of code" bit you've redacted is the most important part. Please include it. – Alastair McCormack Dec 31 '16 at 17:01
  • Also, please work out the encoding of the file from Lotus Notes. If the setting was "Unicode", then it's likely to be UTF-8 or UTF-16 encoded. Load the file in a good text editor like Notepad++ or Sublime to see which encoding gives the correct display. Make sure not to save anything – Alastair McCormack Dec 31 '16 at 17:09
  • @Alastair McCormack, thanks for the advise! :) I opened the file with Sublime, it shows this encoding: u'UTF-16 LE with BOM'. I added the whole code into the question. Didn't want to bore everyone with lengthy code which I thought were irrelevant. – Aco Jan 01 '17 at 17:50
  • I just found out that the problem is probably not in my program but in my Python environment which is set to 'ASCII' encoding. Namely, even when I try to print letters 'č', 'ž' or 'Đ' in Python interactive mode, it does not work. I need to find out how to set Python environment to Unicode. – Aco Jan 02 '17 at 08:53
  • ok @Aco, let me answer your question. ASCII default encoding is fine – Alastair McCormack Jan 02 '17 at 09:33
  • ok @Aco, see my answer. If you still have a problem with your environment please ask a new question, remembering to state your OS and environment. Paste the question URL and I'll take a look. You may find these two answer useful depending on your OS: http://stackoverflow.com/questions/39725839/why-in-remote-server-cannot-input-chinese-character-in-mysql-command-line-but-lo/39758069#39758069 and http://stackoverflow.com/questions/39049294/processing-non-english-text/39063084#39063084. Please don't fall into the trap of calling `sys.setdefaultencoding` – Alastair McCormack Jan 02 '17 at 09:58

2 Answers2

0

In Python 2.7, the default file mode is binary. Instead, you need to open the file in a text mode and have the text decoded as it in Python 3. You don't have to decode text when reading a file but it saves you from having to worry about encodings later in your code.

Add to the top:

import io

Change:

 fh = io.open(fname, encoding='utf_16_le')

Note: You always need to pass in the encoding as Python can't natively guess the encoding.

Now, every time you read(), the text will be converted to a Unicode string.

The SQLite module accepts TEXT as Unicode or UTF-8 encoded str. As you've already decoded your text to Unicode you don't have to do anything else.

To ensure that SQLite doesn't try to encode the main body of your SQL command back to an ASCII string, change the SQL command to a Unicode string by appending a u to the string.

E.g.

cur.execute(u"INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail))

Python 3 will help you avoid some of these quirks and you'll simply need to do the following to make it work:

fh = io.open(fname, encoding='utf_16_le')

As your data looks like standard Excel dialect CSV, then you can use the CSV modules to split your data. The DictReader allows you to pass the column names, which makes it ultra easy to parse your fields. Unfortunately, Python's 2.7 CSV module is not Unicode-safe so you need to use the Py3 backport: https://github.com/ryanhiebert/backports.csv

Your code can be simplified to:

from backports import csv
import io

csv_fh = io.open('contacts.csv', encoding='utf_16_le')

field_names = [u'first_name', u'middle_name', u'surname', u'email',
               u'phone_office', u'fax', u'phone_mobile', u'inside_leg_measurement']

csv_reader = csv.DictReader(csv_fh, fieldnames=field_names)

for row in csv_reader:
    if not row['first_name']: continue

    print u"First Name: {first_name}, " \
          u"Surname: {surname} " \
          u"Email: {email}".format(first_name=row['first_name'],
                                   surname=row['surname'],
                                   email=row['email'])
Alastair McCormack
  • 26,573
  • 8
  • 77
  • 100
  • Thank you very much @Alastair McCormack. I did as you suggested, the result is still the same (accented characters scrambled), but it helped me identify the spot where the problem appears which is: when I use the list.append() method. Namely when a word with accented characters is in a variable, all accented characters are still there (ie. I can see them when I print the variable). But when I move this variable into a list using list.append(variable), the accented characters are scrambled. For example the letter 'č' becomes '\xc4\x8d'. Is there a way to overcome this, do you know? – Aco Jan 03 '17 at 23:29
  • There's something you're not telling me :) '\xc4\x8d' is the **UTF-8** encoding of `'č'`. If your data was UTF16 then you wouldn't see this unless you did an encode to UTF-8. I'm guessing that this happens when you **SELECT** data from SQLite? – Alastair McCormack Jan 04 '17 at 09:06
  • I don't think it is either of these 2. I simplified the code by deleting the DB part, now I only read from file and print to display: first print line from file ('č' is visible), then print variable ('č' still visible), then print list ('č' scrambled). There is no mention of UTF-8 in my code whatsoever, only UTF_16_le. I did add this statement to top of file: coding = "UTF-16-le". Thanks to this 'č' shows in line print and variable print. I can show you the new simplified code (if I find out where to put it). Could it be that the list.append() changes the encoding to UTF-8 by itself? – Aco Jan 04 '17 at 14:05
  • Appending a string to a list does not change its encoding. It sounds like your input data is not "UTF-16-le" but somehow changed to "UTF-8". Please remove `coding = "UTF-16-le"` as this only refer to the encoding of your source code. Add a section to your question at the bottom with your new code – Alastair McCormack Jan 04 '17 at 14:17
  • I added the new simplified code into my question. Many thanks for taking time for this!! I owe you a beer (or some other drink of your choice). :) I really wonder "who" changes the encoding. – Aco Jan 04 '17 at 15:20
  • You don't have `sys.setdefaultencoding('utf-8')` at the top of your code? Can you add `print sys.getdefaultencoding()` at the end of your code and tell me the result? – Alastair McCormack Jan 04 '17 at 16:13
  • Also, your code doesn't work with the sample data provided. Further, this looks like regular CSV, so you could parse it with the csv module rather than all this splitting on quotes – Alastair McCormack Jan 04 '17 at 16:34
  • @Aco I've updated my answer with a simplified parsing code. I hope you find it useful :) – Alastair McCormack Jan 05 '17 at 10:48
  • No, I don't have sys.setdefaultencoding('utf-8') at the top of my code. I added the sys.getdefaultencoding() at the end of my code, ran it, it prints: "ascii". Wow, where does this come from!? Must be something with list.append(). I do not mention "ascii" or "UTF-8" anywhere in my code. The simplified code which i published works for me (Python 2.7), maybe you missed to copy/paste the first line (import io)? Thanks for the "import csv" code, it looks really elegant. It does not work for me though, it throws: "ImportError: cannot import name csv". Tried it with Python 2.7 and 3.5. – Aco Jan 05 '17 at 15:55
  • Don't worry about the 'ascii' default encoding - that's correct. Your code doesn't work due to an error in the data you gave - I get a `IndexError: string index out of range`. For my code to work, you need to install the backports.csv module with: `pip install backports.csv`. – Alastair McCormack Jan 05 '17 at 16:11
-3

try to use # coding=utf-8 at the first line of code program

Tran Manh
  • 71
  • 1
  • 6
  • 2
    That won't help. The Python coding directive comments are to let the interpreter know what encoding the script file itself is encoded with. It has no bearing on how the script handles external text. – PM 2Ring Dec 28 '16 at 09:46