I have a string `N1 LTPO BABY FOOD 6 FOR £5` from which I want to extract `6 FOR £5` using a regex. I am using PySpark. Regex101 tells me that `[0-9]*\sFOR\s£[0-9]*` should work (https://regex101.com/r/OWAA2k/1); however, when I use it within PySpark I have no success. The following code returns zero rows:

import pyspark.sql.functions as funcs
# Python 2; sc and sqlContext are assumed to be already defined by the session
print sc.version
mock_data = [('N1 LTPO BABY FOOD 6 FOR £5','b'),('foo','bar')]
schema = ['a','b']
mock_df = sqlContext.createDataFrame(data=mock_data, schema=schema)
# Keep only the rows where column a matches the regex
mock_df = mock_df.filter(mock_df.a.rlike('[0-9]*\sFOR\s£[0-9]*'))
mock_df.show(truncate=False)

[Screenshot: regex filters out]

If I alter the regex slightly to `[0-9]*\sFOR\s*`, the row that I want is kept. Note, however, that in the output the pound sign is prefixed with `Â`.

Thus I can change my original regex to `[0-9]*\sFOR\sÂ£[0-9]*` and it works.

My question is this: why is this strange character `Â` appearing in the string? Why is PySpark putting it there? I understand this will have something to do with the encoding of the data, but that is not something I know much about, so I am hoping someone can explain it to me and make me aware of any potential pitfalls.
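The matching behaviour described above can be reproduced outside Spark with Python's built-in `re` module. This is only a sketch: `clean`, `garbled`, and `first_match` are hypothetical names, and the garbled value assumes the string reached the DataFrame with `Â` (U+00C2) in front of the pound sign (U+00A3), as the output suggests. Spark's `rlike` uses Java regular expressions rather than `re`, but for a pattern this simple the two are expected to behave the same. Escapes are used instead of a literal `£` so the result does not depend on the encoding of the source file.

```python
import re

# The text as intended, and the text as it appears to arrive in the DataFrame
# (assumption: an extra U+00C2 sits in front of the pound sign U+00A3).
clean = u'N1 LTPO BABY FOOD 6 FOR \xa35'
garbled = u'N1 LTPO BABY FOOD 6 FOR \xc2\xa35'

# The original pattern, and the variant that includes the stray character.
original = re.compile(u'[0-9]*\\sFOR\\s\xa3[0-9]*')
with_prefix = re.compile(u'[0-9]*\\sFOR\\s\xc2\xa3[0-9]*')


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


found_clean = first_match(original, clean)            # the intended "6 FOR <pound>5"
found_garbled = first_match(original, garbled)        # None: U+00C2 breaks the match
found_with_prefix = first_match(with_prefix, garbled)  # matches, stray character included
```

The original pattern fails on the garbled text only because `\s` must be followed immediately by the pound sign, and the stray character sits between them.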

jamiet
  • What encoding are you using? – Eli Sadoff Nov 05 '16 at 23:38
  • Are you using python 2 or 3? – Dan Nov 05 '16 at 23:41
  • Dan, sorry should have said that. Python 2. `import sys;print sys.version_info` returns `sys.version_info(major=2, minor=7, micro=10, releaselevel='final', serial=0)`. Eli, how would I know what encoding I am using? – jamiet Nov 05 '16 at 23:43
  • @jamiet [Here](http://stackoverflow.com/a/4987414/5021321) is a way to tell. – Eli Sadoff Nov 05 '16 at 23:46
  • Hi Eli. Looks like the data is encoded as a string, not unicode. – jamiet Nov 05 '16 at 23:55
  • Further investigation...I can explicitly encode the value as unicode (simply putting `u` in front of the value) and when it comes out of the dataframe I call `collect()` and even though the value is still encoded as unicode it still has the strange `Â` character, though represented by its unicode encoding: `[Row(a=u'N1 LTPO BABY FOOD 6 FOR \xc2\xa35', b=u'b')]`. I'd just like to be able to treat my data as is and not have to worry about strange characters getting in there, but not sure how. – jamiet Nov 06 '16 at 00:00
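The `\xc2\xa3` in that `Row` is the pattern produced when the UTF-8 bytes of `£` are read back one byte per character, for example as Latin-1. Whether that is what happens in this particular session is an assumption, but the mechanism itself can be sketched in plain Python; the variable names are hypothetical.

```python
# The pound sign is the single code point U+00A3.
pound = u'\xa3'

# UTF-8 encodes it as two bytes: 0xC2 0xA3.
utf8_bytes = pound.encode('utf-8')

# Reading those two bytes as Latin-1 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A3 (pound sign).
misread = utf8_bytes.decode('latin-1')

# Reversing the wrong decode recovers the original character. This is only
# safe when the text is known to have been mis-decoded in exactly this way.
repaired = misread.encode('latin-1').decode('utf-8')
```

Here `misread` equals `u'\xc2\xa3'`, the same two code points that `collect()` returned after `FOR `.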

0 Answers