100

I'm trying to write a script that gets data out of an sqlite3 database, but I have run into a problem.

The field in the database is of type text and contains HTML-formatted text. See the text below:

<html>
<head>
<title>Yahoo!</title>
</head>
<body>
<style type="text/css">
html {}
.yshortcuts {border-bottom:none !important;}
.ReadMsgBody {width:100%;}
.ExternalClass{width:100%;}
</style>
<table cellpadding="0" cellspacing="0" bgcolor="#ffffff">    
<tr>
<td width="550" valign="top" align="left">

    <table cellpadding="0" cellspacing="0" width="500">
        <tr>
            <td colspan="3"><img        src="http://mail.yimg.com/nq/assets/sharedmessages/v1/us/logo.gif" width="292" height="51" style="display:block;" border="0" alt="Yahoo! Mail"></td>
        </tr>
        <tr>
            <td rowspan="3" width="1" bgcolor="#c7c4ca"></td>
            <td width="498" height="1" bgcolor="#c7c4ca"></td>
            <td rowspan="3" width="1" bgcolor="#c7c4ca"></td>
        </tr>
        <tr>
            <td width="498" valign="top" align="left">
            <table cellpadding="0" cellspacing="0">
                <tr>
                    <td width="498" bgcolor="#61399d" align="left" valign="top">
                    <table cellspacing="0" cellpadding="0"><tr><td height="24"></td></tr></table>
                    <div style="font-family:Arial, Helvetica, sans-serif;font-size:23px;line-height:27px;margin-bottom:10px;color:#ffffff;margin-left:15px;"><span style="color:#ffffff;text-decoration:none;font-weight:bold;line-height:27px;">Välkommen till Yahoo! Mail.</span></div>
                    <div style="font-family:Arial, Helvetica, sans-serif;font-size:22px;line-height:26px;margin-bottom:1px;color:#ffffff;margin-left:15px;margin-bottom:7px;margin-right:15px;">Ansluta och dela går snabbt och enkelt och är tillgängligt överallt.</div>
                    </td>
                </tr>
                <tr>
                    <td><img src="http://mail.yimg.com/nq/assets/sharedmessages/v1/all/b1.gif" width="498" height="18" style="display:block;" border="0"></td>
                </tr>
            </table>
            <table cellpadding="0" cellspacing="0" width="498">
                <tr>
                    <td width="292" valign="top">
                    <table cellpadding="0" cellspacing="0">
                        <tr>
                            <td><img src="http://mail.yimg.com/nq/assets/sharedmessages/v1/all/grad.gif" width="292" height="9" style="display:block;"></td>
                        </tr>
                        <tr>
                            <td width="292" bgcolor="#ffffff" align="left" valign="top">
                            <table cellspacing="0" cellpadding="0"><tr><td height="11"></td></tr></table>
                            <div style="margin-left:15px;">                  
                                <div style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:18px;color:#333333;margin-bottom:11px;font-weight:bold;">Det är lätt som en plätt att komma igång.</div>
                                <table cellpadding="0" cellspacing="0" width="267">
                                    <tr>
                                        <td width="16" align="left" valign="top"><div style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:16px;color:#61399d;margin-bottom:9px;font-weight:bold;">1. </div></td>
                                        <td align="left" valign="top"><div style="font-family:Arial, Helvetica, sans-serif;font-size:13px;line-height:16px;color:#61399d;margin-bottom:9px;"><a rel="nofollow" target="_blank" href="http://us-mg999.mail.yahoo.com/neo/launch?action=contacts" style="text-decoration:underline;color:#61399d;"><span>Lägg till alla dina kontakter på en plats</span></a>.</div></td>
                                    </tr>
                                    <tr>
                                        <td align="left" valign="top"><div style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:16px;color:#61399d;margin-bottom:9px;font-weight:bold;">2. </div></td>
                                        <td align="left" valign="top"><div style="font-family:Arial, Helvetica, sans-serif;font-size:13px;line-height:16px;color:#61399d;margin-bottom:9px;"><a rel="nofollow" target="_blank" href="http://mrd.mail.yahoo.com/themes" style="text-decoration:underline;color:#61399d;"><span>Anpassa din nya inkorg</span></a>.</div></td>
                                    </tr>
                                    <tr>
                                        <td align="left" valign="top"><div style="font-family:Arial, Helvetica, sans-serif;font-size:14px;line-height:16px;color:#61399d;margin-bottom:9px;font-weight:bold;">3. </div></td>
                                        <td align="left" valign="top"><div style="font-family:Arial, Helvetica, sans-serif;font-size:13px;line-height:16px;color:#61399d;"><a rel="nofollow" target="_blank" href="http://se.overview.mail.yahoo.com/mobile" style="text-decoration:underline;color:#61399d;"><span>Anslut överallt på dina mobila enheter</span></a>.</div></td>
                                    </tr>
                                </table>

                            </div>
                            </td>
                        </tr>
                        <tr><td height="13"></td></tr>
                    </table>
                    </td>
                    <td width="196" valign="top">
                    <table cellpadding="0" cellspacing="0">
                        <tr>
                            <td width="1" bgcolor="#fbfbfd" valign="top"><img src="http://mail.yimg.com/nq/assets/sharedmessages/v1/all/g1.gif" width="1" height="21" style="display:block;"></td>
                            <td width="1" bgcolor="#f5f6fa" valign="top"><img src="http://mail.yimg.com/nq/assets/sharedmessages/v1/all/g2.gif" width="1" height="21" style="display:block;"></td>
                            <td width="1" bgcolor="#e8eaf1" valign="top"><img src="http://mail.yimg.com/nq/assets/sharedmessages/v1/all/g3.gif" width="1" height="21" style="display:block;"></td>
                            <td width="1" bgcolor="#d4d4d4"></td>
                            <td width="186" bgcolor="#f0f0f0" align="left" valign="top">  
                            <table cellspacing="0" cellpadding="0"><tr><td height="3">   </td></tr></table>
                            <div style="margin-left:11px;">
                            <div style="font-family:Arial, Helvetica, sans-serif;font-size:13px;line-height:16px;color:#333333;margin-bottom:9px;"><b>Info för dig:</b></div>
                            <div style="font-family:Arial, Helvetica, sans-serif;font-size:12px;color:#43494e;line-height:18px;margin-bottom:10px;">
                            Yahoo!-ID och e-postadress:<br />
                            <div style="font-family:Arial, Helvetica, sans-serif;font-size:12px;color:#43494e;line-height:18px;">
                            Håll ditt konto och inställningar aktuella. <br><a rel="nofollow" target="_blank" href="https://edit.yahoo.com/config/eval_profile" style="text-decoration:underline;color:#61399d;"><span>Mitt konto</span></a> 
                            </div>
                            </div>
                            <table cellspacing="0" cellpadding="0"><tr><td height="20"></td></tr></table>
                            </td>
                            <td width="1" bgcolor="#dbdbdb"></td>
                            <td width="1" bgcolor="#ced2de"></td>
                            <td width="1" bgcolor="#dbdfed"></td>
                            <td width="1" bgcolor="#e8ebf3"></td>
                            <td width="1" bgcolor="#f3f4f9"></td>
                            <td width="1" bgcolor="#fafbfc"></td>
                        </tr>
                        <tr>
                            <td colspan="11"><img src="http://mail.yimg.com/nq/assets/sharedmessages/v1/all/b2.gif" width="196" height="8" style="display:block;" border="0"></td>
                        </tr>
                        <tr><td height="13"></td></tr>
                    </table>
                    </td>
                    <td width="10"></td>
                </tr>
            </table>
            </td>
        </tr>
        <tr>
            <td width="498" height="1" bgcolor="#c7c4ca"></td>
        </tr>
    </table>
    <table cellpadding="0" cellspacing="0" width="500">
        <tr>
            <td align="center" valign="top">
            <table cellspacing="0" cellpadding="0"><tr><td height="10"></td></tr></table>
                <div style="font-family:Arial, Helvetica, sans-serif;font-size:11px;line-height:18px;margin-bottom:10px;">
                <a rel="nofollow" target="_blank" href="http://info.yahoo.com/legal/se/yahoo/utos.html" style="text-decoration:underline;color:#61399d;">Yahoo! Villkor för användning</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a rel="nofollow" target="_blank" href="http://info.yahoo.com/legal/se/yahoo/mail/atos.html" style="text-decoration:underline;color:#61399d;">Yahoo! Mail –Villkor för användning</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a rel="nofollow" target="_blank" href="http://info.yahoo.com/privacy/se/yahoo/details.html" style="text-decoration:underline;color:#61399d;">Yahoo! Sekretesspolicy</a>
                </div>
            </td>
        </tr>
        <tr>
            <td align="left" valign="top">
                <div style="font-family:Arial, Helvetica, sans-serif;font-size:11px;line-height:14px;color:#545454;margin-left:16px;margin-right:14px;">Var god svara inte på detta meddelande. Detta är ett servicemeddelande som rör din användning av Yahoo! Mail. Om du vill veta mer om Yahoo!s användning av personlig information, inklusive användning av webb-beacons i HTML-baserad e-post, kan du läsa vår Yahoo! Sekretesspolicy. Yahoo!s adress är 701 First Avenue, Sunnyvale, CA 94089, USA.<br /><br />RefID: lp-1037111</div>
            </td>
        </tr>
    </table>





    </td>
</tr>
</table>
<img width="1" height="1" src="http://pclick.internal.yahoo.com/p/s=2143684696">
</body>
</html>

The Python code that tries to extract the data is as follows.

>>> import sqlite3
>>> conn = sqlite3.connect('C:/temp/Mobils/export/com.yahoo.mobile.client.android.mail/databases/mail.db')
>>> c = conn.cursor()
>>> conn.row_factory=sqlite3.Row
>>> c.execute('select body from messages_1 where _id=7')
<sqlite3.Cursor object at 0x0000000001FB78F0>
>>> r = c.fetchone()
>>> r.keys()
['body']
>>> print(r['body'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to <undefined>
>>>

Does anybody have any idea how to print this or write it to a file? Yes, I know that this is printed to stdout, but I get the same UnicodeEncodeError when I try to write to a file. I tried both the write method of a file object and print(r['body'], file=f).

Martijn Pieters
Mattias
  • 8
    This is a Windows console problem, see http://wiki.python.org/moin/PrintFails – Martijn Pieters May 02 '13 at 20:16
  • 3
    When writing to a file, pick a codec that can handle the codepoints in your data. Best read up on Python and Unicode: [Python Unicode HOWTO](http://docs.python.org/3/howto/unicode.html). – Martijn Pieters May 02 '13 at 20:17
  • 7
    A simple ```replace("\u2013", "-")``` before printing did the trick for me. P.S.: Isn't it funny? A question regarding problems with the character code \u2013 is asked in 2013. – smoothware Jan 14 '17 at 14:46

5 Answers

231

When you open the file you want to write to, open it with a specific encoding that can handle all the characters.

with open('filename', 'w', encoding='utf-8') as f:
    print(r['body'], file=f)
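
A small sketch of why the file case failed in the question and why naming the encoding fixes it. When no `encoding` is given, `open()` typically falls back to the locale's preferred encoding, which on the asker's machine was a legacy code page. The `body` string and the `try_write` helper are stand-ins made up for this illustration, and cp850 is used only because it is the codec named in the traceback.

```python
import locale
import os
import tempfile

# Stand-in for r['body']; U+2013 (en dash) is the character from the traceback.
body = "Yahoo! Mail \u2013 Villkor f\u00f6r anv\u00e4ndning"

def try_write(text, path, encoding):
    """Return True if text can be written to path with the given encoding."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError:
        return False
    return True

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "body.html")
    # cp850 has no en dash, UTF-8 can encode every Unicode character.
    results = {enc: try_write(body, path, enc) for enc in ("cp850", "utf-8")}

print("default used by open():", locale.getpreferredencoding(False))
print(results)
```

If the first line printed is a code page such as cp1252 or cp850, any `open(..., 'w')` without an explicit `encoding` is limited to that code page.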
Mark Ransom
  • 14
    This solves half the problem, printing to a file. It doesn't solve the other half, printing to `stdout`, which is connected to a cp850 Windows console. – abarnert May 02 '13 at 20:33
  • 1
    That worked sort of. It now prints the database content to a file but every character is on a new line and that results in a file that can't be interpreted by a browser. – Mattias May 03 '13 at 09:35
  • 3
    @abarnert this solution wasn't available when you wrote your comment in 2013, but now you can just use Python 3.6 or better. It bypasses the console I/O to support true Unicode. – Mark Ransom Feb 05 '19 at 05:15
139

Maybe a little late to reply, but I ran into the same problem today. On Windows you can change the console code page to UTF-8, or to another encoding that can represent your data, and then print to sys.stdout.

First, run the following commands in the console:

chcp 65001
set PYTHONIOENCODING=utf-8

Then start Python and do whatever you want.
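
For doing the Python half of this from inside a script rather than through an environment variable, here is a hedged sketch. It assumes Python 3.7 or newer, where text streams have a `reconfigure()` method; the helper name `force_utf8_stdio` is made up for this example. It only changes how Python encodes its output, so it is not a replacement for `chcp` on older consoles.

```python
import sys

def force_utf8_stdio():
    """Switch stdout/stderr to UTF-8 where the stream supports it (Python 3.7+).

    Returns the names of the streams that were reconfigured.
    """
    changed = []
    for name in ("stdout", "stderr"):
        stream = getattr(sys, name)
        # Replaced or captured streams may not offer reconfigure(); skip those.
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8")
            changed.append(name)
    return changed

changed_streams = force_utf8_stdio()
print("en dash: \u2013")
```

Call it once at the top of the script, before anything is printed.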

Kattern
  • 3
    Is there any way to set this at the start of a script? – dreyco676 Sep 10 '15 at 15:05
  • @dreyco676 I do not know. Check this out, may be helpful https://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python – Kattern Sep 11 '15 at 11:20
  • 11
    This solved my problem on Windows. Since I use a combination of ConEmu and git-bash, the solution was to add these commands to my ~/.bashrc in this form: `export ConEmuDefaultCp=65001` and `export PYTHONIOENCODING=utf-8`. In git-bash you will need to use `export` in place of `set`, and with ConEmu, the environment variable `ConEmuDefaultCP` functions like `chcp` in this answer. – Kyle Pittman Dec 21 '15 at 17:28
  • 1
    Had similar problems with Fabric and pip install -command. I was running fab deploy on Windows PowerShell, and this really did the trick. Thanks! – Niko Föhr Mar 08 '17 at 18:30
  • 1
    As of Python 3.6 it is no longer necessary to use the `chcp` command to get full Unicode support in the console, see [PEP 528](https://www.python.org/dev/peps/pep-0528/). The `PYTHONIOENCODING` may still be necessary if you want UTF-8 when the output is redirected to a file. – Mark Ransom Jul 09 '17 at 04:49
  • Am using Python 3.10.5. Does not work as of today's attempt. – Brian Lee Jun 21 '22 at 09:36
45

While Python 3 deals in Unicode, the Windows console or POSIX tty that you're running inside does not. So, whenever you print, or otherwise send Unicode strings to stdout, and it's attached to a console/tty, Python has to encode it.

The error message indirectly tells you what character set Python was trying to use:

  File "C:\Python32\lib\encodings\cp850.py", line 19, in encode

This means the charset is cp850.

You can test for yourself that this charset doesn't have the appropriate character just by doing '\u2013'.encode('cp850'). Or you can look up cp850 online (e.g., at Wikipedia).
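
To check a whole string instead of one character, a helper along these lines lists exactly which characters a codec cannot represent. The function name `unencodable` and the sample text (taken from the footer of the question's HTML) are choices made for this illustration.

```python
def unencodable(text, encoding):
    """Return the distinct characters in text that the given codec cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

# Contains an en dash (U+2013) plus the Swedish letters ö and ä.
sample = "Yahoo! Mail \u2013Villkor f\u00f6r anv\u00e4ndning"

for codec in ("cp850", "cp1252", "utf-8"):
    print(codec, [hex(ord(ch)) for ch in unencodable(sample, codec)])
```

Only cp850 reports a problem character here: it covers ö and ä but has no en dash, while cp1252 and UTF-8 can encode all three.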

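It's possible that Python is guessing wrong, and your console is really set for, say, UTF-8. (In that case, tell Python explicitly: set PYTHONIOENCODING=utf-8 before starting it or, on Python 3.7+, call sys.stdout.reconfigure(encoding='utf-8'). Assigning to sys.stdout.encoding does not work, because that attribute is read-only.) It's also possible that you intended your console to be set for UTF-8 but did something wrong. (In that case, you probably want to follow up at superuser.com.)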

But if nothing is wrong, you just can't print that character. You will have to manually encode it with one of the non-strict error-handlers. For example:

>>> '\u2013'.encode('cp850')
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 0: character maps to <undefined>
>>> '\u2013'.encode('cp850', errors='replace')
b'?'
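
'replace' is only one of the standard non-strict handlers. This short comparison shows what each built-in handler does with the en dash; the sample string is made up for the illustration.

```python
text = "Mail \u2013 Villkor"

# Encode the same string to cp850 with each non-strict error handler.
encoded = {
    handler: text.encode("cp850", errors=handler)
    for handler in ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")
}

for handler, data in encoded.items():
    print(handler, data)
```

Since the data in the question is HTML, 'xmlcharrefreplace' may be the most useful of these: it writes `&#8211;`, which a browser renders as the original en dash, so nothing is lost even in a cp850 file.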

So, how do you print a string that won't print on your console?

You can replace every print function with something like this:

>>> print(r['body'].encode('cp850', errors='replace').decode('cp850'))
?

… but that's going to get pretty tedious pretty fast.

The simple thing to do is to just set the error handler on sys.stdout:

>>> import sys
>>> sys.stdout.reconfigure(errors='replace')  # Python 3.7+; sys.stdout.errors itself is read-only
>>> print(r['body'])
?
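
On interpreters without a way to change the handler of the existing stream, one option is to build a new text stream over the same bytes with the handler set at construction time. The sketch below uses an in-memory `io.BytesIO` as a stand-in for a cp850 console so that it runs anywhere; `make_lossy_writer` is a name invented for this example.

```python
import io

def make_lossy_writer(binary_stream, encoding):
    """Wrap a binary stream so unencodable characters become '?' instead of raising."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors="replace", newline="\n")

raw = io.BytesIO()                       # stand-in for the console's byte stream
console = make_lossy_writer(raw, "cp850")

print("Mail \u2013 Villkor", file=console)
console.flush()

console_bytes = raw.getvalue()
print(console_bytes)
```

In a real session the equivalent would be `sys.stdout = make_lossy_writer(sys.stdout.buffer, sys.stdout.encoding)`, assuming `sys.stdout` has a `buffer` attribute, which is not the case for every replaced or captured stream.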

For printing to a file, things are pretty much the same, except that you don't have to set f.errors after the fact, you can set it at construction time. Instead of this:

with open('path', 'w', encoding='cp850') as f:
    print(r['body'], file=f)

Do this:

with open('path', 'w', encoding='cp850', errors='replace') as f:
    print(r['body'], file=f)

… Or, of course, if you can use UTF-8 files, just do that, as Mark Ransom's answer shows:

with open('path', 'w', encoding='utf-8') as f:
    print(r['body'], file=f)
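
Putting the pieces together, here is a self-contained sketch of the whole round trip from database to file. It uses an in-memory SQLite database with a one-row `messages_1` table as a stand-in for the asker's `mail.db`; the shortened HTML body and the file name are made up for the illustration.

```python
import os
import sqlite3
import tempfile

# In-memory stand-in for mail.db with a single message containing an en dash.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # set before creating cursors
conn.execute("CREATE TABLE messages_1 (_id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO messages_1 (_id, body) VALUES (?, ?)",
    (7, "<html><body>Yahoo! Mail \u2013 Villkor</body></html>"),
)

row = conn.execute("SELECT body FROM messages_1 WHERE _id = ?", (7,)).fetchone()
body = row["body"]               # a str, already decoded by sqlite3
conn.close()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "message_7.html")
    # UTF-8 can encode the en dash, so no error handler is needed.
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    with open(path, "rb") as f:
        written = f.read()

print(written)
```

The en dash ends up in the file as the three bytes `\xe2\x80\x93`, its UTF-8 encoding.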
abarnert
  • I tried your suggestion and the error is gone, but the text in the file is printed one character per line, which results in a file that the browser can't interpret correctly. Any ideas on how to correct that or why it happens? I saw that the cp1252.py file was used in my error, so I changed the encoding to cp1252 instead of cp850, but the principle should be the same, right? – Mattias May 03 '13 at 10:20
  • @Mattias: Yes, the principle is the same for any codec that doesn't have complete coverage of Unicode (basically, everything but utf-8 and utf-7). I'm curious how you ended up with cp850 in the error you pasted here, if you've got cp1252 in your real use. Where did you copy and paste that error from? – abarnert May 03 '13 at 17:50
  • @Mattias: Meanwhile… The most common reason for getting one character per line is treating a string as if it were a sequence of strings somewhere in your code. (A string _is_ a sequence of 1-character strings, so it works, but not the way you hoped.) But we'd have to actually see the relevant code to help you debug it. And it's almost certainly a completely separate problem from this one, so you should probably create a new question. – abarnert May 03 '13 at 17:54
  • @Mattias: To give you an example of ways you could get this one-char-per-line problem… If you do `bodies = [row['body'] for row in c.fetchall()]` and then `for body in bodies: print(body)`, that's correct. But if you do `bodies = c.fetchone()['body']` and then `for body in bodies: print(body)`, it will give you one character at a time. – abarnert May 03 '13 at 17:58
  • I got the error message from recreating the problem in smaller sample code that I wrote in the console. – Mattias May 03 '13 at 20:13
  • Thanks, then I think I know where the fault is. I have a print function that takes a file and something to print, and it assumes that the thing to print is a tuple. The for loop ends up like the last for loop in your example. – Mattias May 03 '13 at 20:22
  • Thanks, that solved the problem. Making a tuple of the `['body']` value removed the one-character-per-line problem. – Mattias May 03 '13 at 20:29
  • @Mattias: That sounds right. You've got code that assumes a tuple of strings, but only gets one string, it will end up treating the string as a tuple of one-char strings. Glad you solved it. – abarnert May 03 '13 at 21:27
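
The one-character-per-line symptom from this comment thread is easy to reproduce. The sketch below uses a hypothetical `write_items` helper, modelled on the asker's description of a print function that expects a tuple, and writes to in-memory streams instead of real files.

```python
import io

def write_items(items, out):
    """Hypothetical helper that assumes items is a tuple or list of strings."""
    for item in items:
        print(item, file=out)

body = "<b>Hi</b>"

wrong = io.StringIO()
write_items(body, wrong)       # a bare str is iterated character by character

right = io.StringIO()
write_items((body,), right)    # a one-element tuple yields the whole string once

print(repr(wrong.getvalue()))
print(repr(right.getvalue()))
```

The first call produces one line per character; wrapping the string in a tuple, as the asker ended up doing, produces a single line.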
13

For me, running export PYTHONIOENCODING=UTF-8 before executing the python command worked.

Samsul Islam
Raj
3

On Windows (git bash):

chcp.com 65001
export PYTHONIOENCODING=utf-8

Then you can run your script. In my case it was a Flask app, and with these settings it could finally handle the UTF-8 text that my website reads from a MySQL database.

Noone