I have a string N1 LTPO BABY FOOD 6 FOR £5
from which I want to extract 6 FOR £5
using regex. I am using pyspark.
Regex101 tells me that [0-9]*\sFOR\s£[0-9]*
should work (https://regex101.com/r/OWAA2k/1) howecer if I try and use that within pyspark I don't have any success, the following code returns zero rows:
import pyspark.sql.functions as funcs
print sc.version
mock_data = [('N1 LTPO BABY FOOD 6 FOR £5','b'),('foo','bar')]
schema = ['a','b']
mock_df = sqlContext.createDataFrame(data=mock_data, schema=schema)
mock_df = mock_df.filter(mock_df.a.rlike('[0-9]*\sFOR\s£[0-9]*'))
mock_df.show(truncate=False)
If I alter the regex slightly to [0-9]*\sFOR\s*
then the data that I want is filtered in, note however that the pound sign is prefixed with Â
Thus I can change my original regex to [0-9]*\sFOR\s£[0-9]*
and it works:
My question this...why is this strange character Â
appearing in the string? Why is pyspark putting it in there? I understand this will be something to do wth the encoding of the data but that's not something I know much about so am hoping someone can explain it to me and make me aware of any potential pitfalls.