I have asked about this problem before, but after some testing I decided to post a new question with more specific information:
I am reading user accounts with python-ldap (and Python 2.7) from our Active Directory. This works well, but I have problems with special characters. They look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL database, but I can't get those strings into proper UTF-8 to begin with.
Example (fullentries is my array with all the AD entries):
fullentries[23][1].decode('utf-8', 'ignore')
print fullentries[23][1].encode('utf-8', 'ignore')
print fullentries[23][1].encode('latin1', 'ignore')
print repr(fullentries[23][1])
A second test with a string inserted by hand as follows:
testentry = "M\xc3\xbcller"
testentry.decode('utf-8', 'ignore')
print testentry.encode('utf-8', 'ignore')
print testentry.encode('latin1', 'ignore')
print repr(testentry)
The output of the first example is:
M\xc3\xbcller
M\xc3\xbcller
u'M\\xc3\\xbcller'
Edit: If I try to replace the double backslashes with .replace('\\\\','\\') the output remains the same.
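To illustrate what the repr above seems to imply: the value holds a single literal backslash followed by the characters xc3, not the two bytes 0xC3 0xBC. repr() only displays that backslash doubled, which would explain why replacing double backslashes finds nothing to replace. Below is a minimal sketch of undoing that escaping, assuming the value is pure ASCII apart from literal \xNN sequences. The helper name unescape_utf8 is hypothetical and not part of python-ldap.

```python
# -*- coding: utf-8 -*-
import codecs


def unescape_utf8(text):
    """Turn literal '\\xNN' sequences into the UTF-8 text they spell out.

    Hypothetical helper; assumes `text` is ASCII-only apart from the
    literal escape sequences.
    """
    # Step 1: interpret the literal escapes, so each \xNN becomes one code point.
    code_points = codecs.decode(text.encode('ascii'), 'unicode_escape')
    # Step 2: map those code points back to single bytes, then decode as UTF-8.
    return code_points.encode('latin-1').decode('utf-8')


escaped = u'M\\xc3\\xbcller'  # same repr as the AD value shown above
fixed = unescape_utf8(escaped)
```

This only treats the symptom; it does not explain where the escaping is introduced in the first place.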
The output of the second example:
Müller
M�ller
'M\xc3\xbcller'
Is there any way to get the AD output properly encoded? I already read a lot of documentation, but it all states that LDAPv3 gives you strictly UTF-8 encoded strings. Active Directory uses LDAPv3.
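For comparison, if the attribute values really did arrive as raw UTF-8 byte strings, a plain decode should be enough. The sketch below assumes entries shaped like python-ldap search results, that is (dn, {attribute: [values]}). The helper name decode_attrs and the sample data are made up for illustration.

```python
# -*- coding: utf-8 -*-


def decode_attrs(attrs, encoding='utf-8'):
    """Decode every byte-string value in an {attribute: [values]} dict.

    Hypothetical helper; binary attributes (for example objectGUID) are not
    valid UTF-8 and would need to be skipped before calling this.
    """
    decoded = {}
    for name, values in attrs.items():
        decoded[name] = [
            v.decode(encoding) if isinstance(v, bytes) else v
            for v in values
        ]
    return decoded


# Made-up sample in the assumed result shape, not real AD data.
sample = {'sn': [b'M\xc3\xbcller'], 'givenName': [b'Hans']}
result = decode_attrs(sample)
```

With real UTF-8 bytes this yields the unicode string u'M\xfcller' for sn, which is not what I get from the AD values above.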
My older question on this topic is here: Writing UTF-8 String to MySQL with Python
Edit: Added repr() output.