
I retrieved a batch of text records from my PostgreSQL database and intend to preprocess these documents before analyzing them.

I want to tokenize the documents, but I ran into a problem during tokenization:

    # ...some other regex replacements above...
    # toTokens is the text string
    toTokens = self.regexClitics1.sub(" \\1", toTokens)
    toTokens = self.regexClitics2.sub(" \\1 \\2", toTokens)

    toTokens = str.strip(toTokens)  # <-- this line raises the error

The error is TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'. I'm curious: why does this error occur when the encoding of the database is UTF-8?
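For reference, here is a minimal Python 2 snippet that reproduces the error (the literal string just stands in for a value returned by the database driver):

    # -*- coding: utf-8 -*-
    # Minimal Python 2 reproduction; the literal stands in for a database value
    toTokens = u" some text from the database "

    # str.strip is the unbound method of the byte-string type (str),
    # so it rejects a unicode argument:
    toTokens = str.strip(toTokens)
    # TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'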

goh

1 Answer


Why don't you use toTokens.strip()? There's no need to call the unbound str.strip directly.
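A quick sketch (the input strings are made up) showing that the bound method works on either string type in Python 2:

    # Python 2: the bound .strip() method exists on both string types
    byte_text = " hello "      # str (a byte string)
    unicode_text = u" hello "  # unicode

    print repr(byte_text.strip())     # 'hello'
    print repr(unicode_text.strip())  # u'hello'

    # The unbound str.strip only accepts str, which is what raised your error:
    # str.strip(unicode_text)  ->  TypeError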

There are two string types in Python 2, str and unicode. Look at this for an explanation.
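For illustration, a small Python 2 sketch of the two types and how to convert between them (the byte string is just an example value):

    # -*- coding: utf-8 -*-
    raw = "caf\xc3\xa9"         # str: a sequence of bytes (UTF-8 encoded here)
    text = raw.decode("utf-8")  # unicode: a sequence of code points, u'caf\xe9'

    print type(raw)             # <type 'str'>
    print type(text)            # <type 'unicode'>

    # Going back to bytes requires choosing an encoding:
    print repr(text.encode("utf-8"))  # 'caf\xc3\xa9'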

Samuel
  • +1. A shorter explanation can be found on StackOverflow: http://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file/4546129#4546129 (shameless plug). :) – Eric O. Lebigot Jun 23 '11 at 07:20
  • does that mean that the strings I get from my queries are unicode? Why is that so? – goh Jun 24 '11 at 02:50
  • @amateur It seems so. It's strange, because AFAIK psycopg returns str objects unless instructed to do otherwise, but I can't know for sure without more information about your setup. – Samuel Jun 24 '11 at 08:54
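Regarding the last comment: a sketch, assuming psycopg2 on Python 2, of how the driver's type registration decides whether text columns come back as str or unicode (the connection parameters and query are made up):

    import psycopg2
    import psycopg2.extensions

    # With these registrations psycopg2 decodes text columns and returns
    # unicode objects; without them, text comes back as str (bytes).
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

    conn = psycopg2.connect("dbname=mydb user=me")          # hypothetical parameters
    cur = conn.cursor()
    cur.execute("SELECT document FROM documents LIMIT 1")   # hypothetical table
    print type(cur.fetchone()[0])  # <type 'unicode'> with the registrations above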