u character appears within a regular expression in python

Question

I have some lines of code that extract email addresses from a PDF file.

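 import re

 # Compile the pattern once, outside the loop
 r = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
 for page in pdf.pages:
      # Use a separate name so the `pdf` reader object is not overwritten
      text = page.extractText()
      results = r.findall(text)
      Listemail.append(results)
      print(Listemail[0:])
 pdf.stream.close()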

Unfortunately, after running the code I noticed that the results are not quite right: a 'u' character appears in front of every match:

[[u'testuser1@training.local']]
[[u'testuser2@training.local']]

Does anybody know how to avoid that character appearing?

Thanks in advance

score 1 · Answer 1 · answered Apr 04 '13 at 22:56

That's not a problem. The u prefixing your strings just indicates that they are Python 2 unicode strings; it is part of how the string is displayed, not part of its contents. See this documentation. Unless you're doing anything crazy with them that for some reason requires your strings not be unicode, I don't see how this could be an issue.
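To illustrate the point, here is a minimal sketch. The sample text is a hypothetical stand-in for the output of page.extractText(); the pattern is the one from the question. Printing the list shows the repr() of each element (quotes, plus the u prefix on Python 2), while printing each element on its own shows only the address.

```python
import re

# Hypothetical sample text standing in for page.extractText() output
text = u"Contact: testuser1@training.local or testuser2@training.local"

pattern = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
matches = pattern.findall(text)

# Printing the list displays the repr() of each element:
# quotes, and on Python 2 a leading u for unicode strings
print(matches)

# Printing each element displays only its text
for address in matches:
    print(address)
```

If the goal is simply clean output, printing the elements one by one (or joining them with ", ".join(matches)) may be all that is needed.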

score 0 · Answer 2 · answered Apr 04 '13 at 22:55

These are unicode strings; you don't need to avoid them unless they cause you real problems.

wRAR

score 0 · Accepted Answer · edited May 23 '17 at 12:03

As others have noted, this is not a bug, but a feature.

If what you want are non-unicode (encoded) strings, you can convert the text from unicode to something more palatable. This StackOverflow Q/A covers the subject:

Convert a Unicode string to a string in Python (containing extra symbols)

I've run into this before and in some use cases, it can be problematic, as you will then encounter issues where a method expects a non-unicode string and breaks. :)

Example solutions from that link:

>>> a=u'aaa'
>>> a
u'aaa'
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('utf8','ignore')
'aaa'
>>> str(a)
'aaa'
>>>
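The session above uses only ASCII characters, so every conversion looks identical. As a rough sketch of what changes otherwise, the snippet below uses a made-up address containing a non-ASCII character (an assumption for illustration, not data from the question). With 'ignore', the ASCII codec silently drops characters it cannot represent, whereas UTF-8 keeps them and round-trips cleanly.

```python
# Hypothetical address containing a non-ASCII character (e-acute)
a = u'caf\xe9@training.local'

# 'ignore' silently drops anything the ASCII codec cannot represent
ascii_bytes = a.encode('ascii', 'ignore')

# UTF-8 can represent every character, so nothing is lost
utf8_bytes = a.encode('utf8')

print(ascii_bytes)
print(utf8_bytes.decode('utf8') == a)
```

Note that on Python 2, str(a) uses the ASCII codec without 'ignore', so it raises UnicodeEncodeError for a string like this one; encoding explicitly makes the choice visible.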
