
I retrieved a batch of text records from my PostgreSQL database and intend to preprocess these documents before analyzing them.

I want to tokenize the documents, but I ran into a problem during tokenization:

    # ...some other regex replacements above...
    # toTokens is the text string
    toTokens = self.regexClitics1.sub(" \\1", toTokens)
    toTokens = self.regexClitics2.sub(" \\1 \\2", toTokens)

    toTokens = str.strip(toTokens)  # <-- this line raises the error

The error is TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'. I'm curious: why does this error occur when the encoding of the database is UTF-8?
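For reference, here is a minimal Python 2 snippet that reproduces the error (the literal string just stands in for a value returned by the database driver):

    # -*- coding: utf-8 -*-
    # Minimal Python 2 reproduction; the literal stands in for a database value
    toTokens = u" some text from the database "

    # str.strip is the unbound method of the byte-string type (str),
    # so it rejects a unicode argument:
    toTokens = str.strip(toTokens)
    # TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'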

goh

1 Answer


Why don't you use toTokens.strip()? There's no need to call the unbound str.strip directly.
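A quick sketch (the input strings are made up) showing that the bound method works on either string type in Python 2:

    # Python 2: the bound .strip() method exists on both string types
    byte_text = " hello "      # str (a byte string)
    unicode_text = u" hello "  # unicode

    print repr(byte_text.strip())     # 'hello'
    print repr(unicode_text.strip())  # u'hello'

    # The unbound str.strip only accepts str, which is what raised your error:
    # str.strip(unicode_text)  ->  TypeError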

There are two string types in Python 2, str and unicode. Look at this for an explanation.
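For illustration, a small Python 2 sketch of the two types and how to convert between them (the byte string is just an example value):

    # -*- coding: utf-8 -*-
    raw = "caf\xc3\xa9"         # str: a sequence of bytes (UTF-8 encoded here)
    text = raw.decode("utf-8")  # unicode: a sequence of code points, u'caf\xe9'

    print type(raw)             # <type 'str'>
    print type(text)            # <type 'unicode'>

    # Going back to bytes requires choosing an encoding:
    print repr(text.encode("utf-8"))  # 'caf\xc3\xa9'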

Samuel
  • +1. A shorter explanation can be found on StackOverflow: http://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file/4546129#4546129 (shameless plug). :) – Eric O. Lebigot Jun 23 '11 at 07:20
  • does that mean that the strings I get from my queries are unicode? Why is that so? – goh Jun 24 '11 at 02:50
  • @amateur It seems so. It's strange, because AFAIK psycopg returns str objects unless instructed to do otherwise, but I can't know for sure without more information about your setup. – Samuel Jun 24 '11 at 08:54
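Regarding the last comment: a sketch, assuming psycopg2 on Python 2, of how the driver's type registration decides whether text columns come back as str or unicode (the connection parameters and query are made up):

    import psycopg2
    import psycopg2.extensions

    # With these registrations psycopg2 decodes text columns and returns
    # unicode objects; without them, text comes back as str (bytes).
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

    conn = psycopg2.connect("dbname=mydb user=me")          # hypothetical parameters
    cur = conn.cursor()
    cur.execute("SELECT document FROM documents LIMIT 1")   # hypothetical table
    print type(cur.fetchone()[0])  # <type 'unicode'> with the registrations above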