
I'm trying to pull out a list of unique values from a data frame but I keep getting a value that I can't find anywhere in the original data frame. Has anyone run into something like this before?

I read in a text file:

import pandas as pd

tmpPandaObj = pd.read_csv(fn, sep='\t', header=None)
tmpPandaObj.columns = ['stockId', 'dt', 'hhmm', 'seq', 'ecalls']

Pull out the unique values:

uniqueStockIdVec = tmpPandaObj.stockId.unique()

Yet I keep getting '\ufeff19049' included in the unique vector. I've searched the text files and data frame as hard as I possibly can, with no luck finding any '\ufeff19049' value. The only unique values should be '19049', '24937', '139677'.
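Here is a minimal script that reproduces what I'm seeing, assuming the file starts with the bytes \xef\xbb\xbf (the file name and row values are made up for illustration):

import pandas as pd

# Write one tab-separated row, prefixed with the three bytes \xef\xbb\xbf
with open('repro.txt', 'wb') as f:
    f.write(b'\xef\xbb\xbf19049\t20180511\t0930\t1\t0.5\n')

# Read it back: the stray \ufeff shows up glued to the first field
df = pd.read_csv('repro.txt', sep='\t', header=None)
print(df[0].unique())  # ['\ufeff19049']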

Kyle Dixon
  • This looks like an encoding issue. What happens if you use `encoding='latin-1'` argument with `pd.read_csv`? – jpp May 11 '18 at 16:52
  • That's the Unicode value `feff`, the byte order mark (see https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string) – FHTMitchell May 11 '18 at 16:53
  • That `\ufeff` is a Unicode byte-order mark (BOM). It means you're reading a UTF-8, UTF-16, or UTF-32 file that starts with a BOM, but decoding it with an explicit encoding like UTF-8 or UTF-16-LE instead of letting the BOM select the encoding. Often this happens because a lot of Microsoft tools write UTF-8 with a BOM even though they're not supposed to, and a lot of Python tools default to plain UTF-8 if you don't specify anything. – abarnert May 11 '18 at 16:54
  • If that’s your problem, you can specify `encoding='utf-8-sig'`, which is Python’s name for Microsoft’s incorrect format. – abarnert May 11 '18 at 16:55

2 Answers


What you are doing is fine, but it looks like the situation described in u'\ufeff' in Python string. That \ufeff is a byte-order mark: an invisible character that pandas keeps as part of the value, which is why you can't find it anywhere when you search the file or the data frame.

If it is a problem for you, you could clean all the data in the column (for example with the .str methods or .encode(...)) or cast each row to int, or whatever type you need.
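A minimal sketch of that cleanup, assuming the column came in as strings (stripping the mark and then casting is just one option):

# Drop a leading byte-order mark if present, then cast back to integers
tmpPandaObj['stockId'] = (
    tmpPandaObj['stockId']
    .astype(str)             # ensure we are working with strings
    .str.lstrip('\ufeff')    # remove the invisible BOM character
    .astype(int)             # cast to the type you actually want
)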


First, the fix: Specify encoding='UTF-8-sig' when reading the file.
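Applied to the snippet in the question, that looks like:

tmpPandaObj = pd.read_csv(fn, sep='\t', header=None, encoding='utf-8-sig')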

Now, the explanation:

\ufeff is the Unicode BOM (Byte Order Mark) character. Whenever one tool writes a file with a BOM, and another tool reads the file using an explicit encoding like UTF-16-LE instead of a BOM-switching version like UTF-16, the BOM is treated as a normal character, so \ufeff shows up in your string. Outside of Microsoft-land, this specific issue (reading UTF-16 as UTF-16-LE) is by far the most common version of this problem.

But if one of the tools is from Microsoft, it's more commonly UTF-8. The Unicode standard recommends never using a BOM with UTF-8 (because a byte-oriented encoding has no byte order to mark), but doesn't quite forbid it, so many Microsoft tools keep doing it. And then every other tool, including Python (and Pandas), just reads it as UTF-8 without a BOM, causing an extra \ufeff to show up. (Older, non-Unicode-friendly tools will read the same three bytes \xef\xbb\xbf as something like ï»¿, which you may have seen a few times.)

But while Python (and Pandas) defaults to UTF-8, it does let you specify an encoding manually, and one of the encodings it comes with is called UTF-8-sig, which means UTF-8 with a useless BOM at the start.
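You can see the difference in isolation with plain Python (the byte string below is the three-byte UTF-8 BOM followed by one of your IDs):

bom_bytes = b'\xef\xbb\xbf19049'
print(repr(bom_bytes.decode('utf-8')))      # '\ufeff19049' -- BOM kept as a character
print(repr(bom_bytes.decode('utf-8-sig')))  # '19049'       -- BOM stripped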

abarnert
  • Brilliant. This certainly took care of the issue. Are there any pitfalls of using this encoding if I am reading multiple text files that are written with and without a BOM? In other words, do I need to make sure all my text files are consistent with the usage (or no usage) of a BOM? Lastly, my unique vector now contains the correct array of integers but is still an object type. Should I be concerned that it is not an int type? – Kyle Dixon May 11 '18 at 19:29
  • @KyleDixon Well, decoding a file without knowing the exact encoding is always a pitfall, and it’s always better to make sure all of your files are UTF-8 without a BOM. But if you definitely have nothing but UTF-8-sig and UTF-8, I’m pretty sure it’s safe to always use UTF-8-sig for both. The only real pitfall is that you won’t notice if you’re still creating UTF-8-sig files but didn’t want to be. (Sometimes getting a visible error is useful, after all.) – abarnert May 11 '18 at 19:50
  • As for the int-as-object problem: you should probably ask that as a separate question (or first search for duplicates, because all of the most common causes have been asked before). Are all of the objects of type int, despite the type being object, or are some of them strings or something else? (If they are all int, things will _work_, but it’ll take a lot more memory, and CPU time, if you have large arrays, so it’s still worth fixing, it’s just that the likely causes/useful debugging steps will be different.) – abarnert May 11 '18 at 19:53
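If all the values really are ints (or clean integer strings), one way to check what's stored and convert, sketched here with pd.to_numeric (which will raise if anything non-numeric remains):

print(tmpPandaObj['stockId'].dtype)               # object
print(tmpPandaObj['stockId'].map(type).unique())  # see what's actually stored

tmpPandaObj['stockId'] = pd.to_numeric(tmpPandaObj['stockId'])
print(tmpPandaObj['stockId'].dtype)               # int64, if conversion succeeds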