Why is pound sign (£) converted to Â£ in pyspark?

Question

I have a string N1 LTPO BABY FOOD 6 FOR £5 from which I want to extract 6 FOR £5 using a regex. I am using pyspark. Regex101 tells me that [0-9]*\sFOR\s£[0-9]* should work (https://regex101.com/r/OWAA2k/1); however, if I try to use it within pyspark I don't have any success. The following code returns zero rows:

import pyspark.sql.functions as funcs
print sc.version
mock_data = [('N1 LTPO BABY FOOD 6 FOR £5','b'),('foo','bar')]
schema = ['a','b']
mock_df = sqlContext.createDataFrame(data=mock_data, schema=schema)
mock_df = mock_df.filter(mock_df.a.rlike('[0-9]*\sFOR\s£[0-9]*'))
mock_df.show(truncate=False)

If I alter the regex slightly to [0-9]*\sFOR\s* then the data that I want is filtered in. Note, however, that the pound sign in the output is prefixed with Â.

Thus I can change my original regex to [0-9]*\sFOR\sÂ£[0-9]* and it works.

My question is this: why is this strange character Â appearing in the string? Why is pyspark putting it in there? I understand this will be something to do with the encoding of the data, but that's not something I know much about, so I am hoping someone can explain it to me and make me aware of any potential pitfalls.

Dan, sorry should have said that. Python 2. `import sys;print sys.version_info` returns `sys.version_info(major=2, minor=7, micro=10, releaselevel='final', serial=0)`. Eli, how would I know what encoding I am using? — jamiet, Nov 05 '16 at 23:43
@jamiet [Here](http://stackoverflow.com/a/4987414/5021321) is a way to tell. — Eli Sadoff, Nov 05 '16 at 23:46
Hi Eli. Looks like the data is encoded as a string, not unicode. — jamiet, Nov 05 '16 at 23:55
Further investigation...I can explicitly encode the value as unicode (simply putting `u` in front of the value) and when it comes out of the dataframe I call `collect()` and even though the value is still encoded as unicode it still has the strange `Â` character, though represented by its unicode encoding: `[Row(a=u'N1 LTPO BABY FOOD 6 FOR \xc2\xa35', b=u'b')]`. I'd just like to be able to treat my data as is and not have to worry about strange characters getting in there, but not sure how. — jamiet, Nov 06 '16 at 00:00
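The `\xc2\xa3` in that output matches the two bytes UTF-8 uses for the pound sign, which suggests (as an assumption, not something confirmed in this thread) that the UTF-8 bytes of £ were at some point read one byte per character, as a Latin-1 decoder would. The minimal sketch below reproduces that effect with only the standard `re` module and built-in codecs, not pyspark; the names `text_ok` and `text_bad` are made up for illustration.

```python
import re

pound = u"\u00a3"  # the pound sign as a single Unicode code point

# UTF-8 encodes U+00A3 as the two bytes 0xC2 0xA3.
utf8_bytes = pound.encode("utf-8")

# Decoding those two bytes as Latin-1 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A3 (pound sign).
mojibake = utf8_bytes.decode("latin-1")

# The string from the question, correctly decoded and mis-decoded.
text_ok = u"N1 LTPO BABY FOOD 6 FOR \u00a35"
text_bad = text_ok.encode("utf-8").decode("latin-1")

# The regex from the question, with the pound sign written as an escape.
pattern = re.compile(u"[0-9]*\\sFOR\\s\u00a3[0-9]*")

match_ok = pattern.search(text_ok)    # matches the offer text
match_bad = pattern.search(text_bad)  # None: an extra character sits before the pound sign

print(repr(utf8_bytes), repr(mojibake))
print(match_ok.group() if match_ok else None, match_bad)
```

If this is indeed what is happening, the pattern without Â fails because the stored string contains one more character than the one typed, and the usual remedy is to make sure both the data and the pattern are decoded to Unicode with the same encoding before they reach the dataframe.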
