
I was trying to unify the lines in my file when I observed the following:

word1 word2
word1 word2

I did not understand why these lines were not combined, so I opened the file in vim and used `:set list` to check for special characters. I found this:

 word1 <feff>word2
 word1 word2

I am not sure how to clean this word in Python. Any suggestions on what this character might be and how it can be removed?

Legend

2 Answers


U+FEFF is the Byte Order Mark character, which should only occur at the start of a document. Anywhere else in a document, it should be treated as a ZERO WIDTH NO-BREAK SPACE. If this causes issues, you can remove it like any other character:

>>> s = u'word1 \ufeffword2'
>>> s = s.replace(u'\ufeff', '')
>>> s
u'word1 word2'

(In Python 3.1 or 3.2, drop the u in front of strings)
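As a rough sketch of how this applies to the original problem, assuming the goal is to drop duplicate lines and that the text has already been decoded to `str` (Python 3): strip U+FEFF from each line before comparing. The helper name `unique_lines` and the sample data are made up for illustration.

```python
BOM = "\ufeff"  # U+FEFF, invisible in most editors


def unique_lines(lines):
    """Remove U+FEFF from each line, then drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for line in lines:
        cleaned = line.replace(BOM, "")
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Hypothetical sample mirroring the two lines in the question.
sample = ["word1 \ufeffword2", "word1 word2"]
deduped = unique_lines(sample)
print(deduped)  # ['word1 word2']
```

Without the `replace` call the two sample lines compare as different strings, which matches the behavior described in the question.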

phihag
  • Thank you. This might sound silly, but it gives me this error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)` – Legend Jul 22 '11 at 07:01
  • @Legend That's because you're incorrectly dealing with non-ASCII characters. Use `s = sBytes.decode('UTF-8')` to decode a UTF-8 string. Test your code with inputs containing `ä` or `Σ`! – phihag Jul 22 '11 at 07:04
  • @Legend: you have to open your file using codecs: `lines = codecs.open('file.txt', 'r', 'utf-8')`, assuming your file is in utf-8. – Matt N. Jul 22 '11 at 07:05
  • @Legend On a related note, if you're not sure about the distinction between bytes and string, you may want to [read](http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html) [up](http://diveintopython3.org/strings.html#byte-arrays) [on](http://stackoverflow.com/questions/606191/convert-byte-array-to-python-string) that. – phihag Jul 22 '11 at 07:06
  • Actually, this solved the problem: `w = w.replace('\xef\xbb\xbf', '')` Thank you for the pointers. – Legend Jul 22 '11 at 07:08
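The byte sequence `\xef\xbb\xbf` in the last comment is the UTF-8 encoding of U+FEFF, which is why replacing it on an undecoded byte string worked. A minimal Python 3 sketch of the two equivalent routes, assuming the data is UTF-8 (the variable names are illustrative only):

```python
import codecs

# UTF-8 bytes as they would come from a file opened in binary mode.
raw = b"word1 \xef\xbb\xbfword2"

# Route 1: stay in bytes and remove the UTF-8 encoding of U+FEFF.
# codecs.BOM_UTF8 is the constant b'\xef\xbb\xbf'.
cleaned_bytes = raw.replace(codecs.BOM_UTF8, b"")

# Route 2: decode to text first, then remove the character itself.
text = raw.decode("utf-8")
cleaned_text = text.replace("\ufeff", "")

print(cleaned_bytes)  # b'word1 word2'
print(cleaned_text)   # word1 word2
```

Note that the `utf-8-sig` codec only strips a BOM at the very start of the input, so it would not remove one that sits in the middle of a line as in the question.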

Have you tried `mytext.split(string.whitespace)`?
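For anyone trying this, a quick Python 3 check (a sketch, with `mytext` set to the line from the question) suggests it will not remove the character: `str.split` treats its argument as one literal separator, and U+FEFF is not classified as whitespace in any case.

```python
import string

mytext = "word1 \ufeffword2"

# The separator is the whole six-character string ' \t\n\r\x0b\x0c',
# which does not occur in mytext, so nothing is split.
literal_split = mytext.split(string.whitespace)

# With no argument, split() breaks on runs of whitespace,
# but U+FEFF stays attached to the second word.
default_split = mytext.split()

print(literal_split == [mytext])  # True
print("\ufeff".isspace())         # False
print(default_split)              # ['word1', '\ufeffword2']
```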

Matt N.