
I have some lines of code which extract email addresses from a PDF file.

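    import re

    # Compile the pattern once, outside the loop.
    r = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

    for page in pdf.pages:
        # Use a separate name so the pdf reader object is not overwritten.
        text = page.extractText()
        results = r.findall(text)
        Listemail.append(results)
        print(Listemail[0:])
    pdf.stream.close()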

Unfortunately, after running the code I noticed that the results are not quite right: a 'u' character appears in front of every match:

[[u'testuser1@training.local']]
[[u'testuser2@training.local']]

Does anybody know how to keep that character from appearing?

Thanks in advance

Lesmana
jkpascual

3 Answers

1

That's not a problem. That u prefacing your strings just indicates that it's a Python unicode string. See this documentation. Unless you're doing anything crazy with them that for some reason requires your strings not be unicode, I don't see how this could be an issue.
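
To illustrate, the prefix belongs to the printed representation of the list, not to the address itself. A minimal sketch, assuming the nested-list shape from the question; `list_email` is made-up sample data and `flatten_matches` is a hypothetical helper name, not part of any library:

```python
# Hypothetical data in the same shape the question's loop builds:
# one list of matches per page, appended to an outer list.
list_email = [[u'testuser1@training.local'], [u'testuser2@training.local']]


def flatten_matches(pages):
    """Collapse the per-page lists into a single flat list of addresses."""
    return [address for page_matches in pages for address in page_matches]


# Printing each string directly shows only its characters; the u'' prefix
# appears on Python 2 only when a container prints the repr of its items.
for address in flatten_matches(list_email):
    print(address)
```

Printing the strings one by one, rather than printing the whole list, is usually enough if the goal is just clean output.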

Henry Keiter
0

These are unicode strings; you don't need to avoid them unless they cause a real problem.

wRAR
0

As others have noted, this is not a bug, but a feature.

If what you want are non-unicode strings, you can convert the text from unicode to something more palatable. This StackOverflow Q/A covers the subject:

Convert a Unicode string to a string in Python (containing extra symbols)

I've run into this before and in some use cases, it can be problematic, as you will then encounter issues where a method expects a non-unicode string and breaks. :)

Example solutions from that link:

>>> a=u'aaa'
>>> a
u'aaa'
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('utf8','ignore')
'aaa'
>>> str(a)
'aaa'
>>> 
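
The sample above only uses ASCII text, where all three calls give the same result. A short sketch with a non-ASCII character shows where the two codecs differ; the sample string is made up for illustration. On Python 2, calling `str()` on such a string raises `UnicodeEncodeError`, so it is left out here.

```python
# Made-up sample containing one non-ASCII character (e with acute accent).
name = u'caf\xe9@training.local'

# 'ignore' silently drops characters the ASCII codec cannot represent.
ascii_bytes = name.encode('ascii', 'ignore')

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = name.encode('utf8')

print(repr(ascii_bytes))
print(repr(utf8_bytes))
```

Because `'ignore'` discards data without warning, UTF-8 is probably the safer choice whenever the extracted text might contain accented characters.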
Wing Tang Wong