I have a column that contains free-form text: letters, digits, certain special characters, and non-printable, non-ASCII control characters. How can I clean this string by stripping the non-printable characters with a regex in Spark SQL 2.4?
To clarify: besides ASCII letters and digits, I also need to keep characters such as %-()|,<;:">?/[]#+=@!&. etc. Only the non-printable, non-ASCII characters need to be removed with the regex.
For example, something similar to:
select regexp_replace(col, "[^:print:][^:ctrl:]", '')
OR
select regexp_replace(col, "[^:alphanum:]", "")
But I can't get either of these to work in Spark SQL (with the SQL API). Can anyone please advise with a working example?
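To make the goal concrete, here is a rough sketch of the behaviour I'm after, written in plain Python rather than Spark SQL. The `clean` helper and the sample string are made up for illustration, and the character class `[^ -~]` (anything outside the range space through tilde) is only my assumption of what "printable" should mean here. Note that it would also drop tabs and newlines.

```python
import re

# Matches any character outside printable ASCII,
# i.e. outside the range space (0x20) through tilde (0x7E).
NON_PRINTABLE = re.compile(r"[^ -~]")


def clean(text: str) -> str:
    """Remove every character that is not printable ASCII (hypothetical helper)."""
    return NON_PRINTABLE.sub("", text)


# Made-up sample: control characters (NUL, US, DEL) and a non-ASCII letter
# mixed in with the punctuation that must be kept.
sample = 'abc\x00 12%\x1f-()|,<;:">?/[]#+=@!&.\x7f\u00e9'

print(clean(sample))
```

Letters, digits, spaces and the punctuation listed above should survive, while the control characters and the accented letter are removed. What I'm looking for is the Spark SQL equivalent of this.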
Any help is appreciated.
Thanks