0

I have a column which contains free-form text, i.e alphabets, digits and certain special characters and non-printable non-ascii control characters. How can I clean this text string by suppressing the non-printable characters using REGEX in Spark SQL 2.4 ?

Just to clarify further, besides ascii alphabets and digits, I also need to retain characters like %-()|,<;:">?/[]#+=@!&.. etc. Only the non-printable non-ascii characters need to be removed using regex.

Example - something similar to:

select regexp_replace(col, "[^:print:][^:ctrl:]", '')

OR

select regexp_replace(col, "[^:alphanum:]", "")

But I can't get it to work in Spark SQL (with the SQL API). Can anyone please advise with a working example.

Any help is appreciated.

Thanks

marie20
  • 723
  • 11
  • 30
  • 1
    Does this answer your question? [Match non printable/non ascii characters and remove from text](https://stackoverflow.com/questions/24229262/match-non-printable-non-ascii-characters-and-remove-from-text) –  Jul 07 '20 at 06:07
  • I need the regex pattern string that I can use in the regexp_replace function in Spark SQL using the SQL API.. something in the lines `regexp_replace(textcol, "[^:print:][^:ctrl:]", "rep_str");` – marie20 Jul 07 '20 at 08:13
  • You emphasize that "Only the non-printable **non-ascii** characters need to be removed using regex." Does that mean that you want to remove only non-ASCII characters that are non-printable? That is: non-printable ASCII characters should _not_ be removed? Or does that mean that you want to remove any character that is _either_ non-ASCII (e.g., any accented character, any Japanese/Chinese/Cyrillic/Arabic character, etc.) _or_ non-printable? – Ryan M Jul 07 '20 at 17:34
  • @Ryan - thanks for your reply.. I want to remove any character that is either non-ASCII (e.g., any accented character, any Japanese/Chinese/Cyrillic/Arabic character, etc.) or non-printable.. Pls note that I want to retain punctuation characters also (e.g. `,:/.<>` etc) – marie20 Jul 10 '20 at 01:25
  • In that case, Mandy8055's suggested duplicate seems correct to me. It suggests the pattern `[^ -~]+`. – Ryan M Jul 10 '20 at 01:29
  • @Ryan M - thanks for confirming. Besides that it would be really helpful if could you could also let me know how to specify the "alphanumeric" character class in Regex terms ? For e.g. `[[:alphanum:]]` etc which I could use in a `regexp_replace` function. – marie20 Jul 13 '20 at 07:24

0 Answers0