How can I replace non-ASCII characters with their ASCII counterparts in a SELECT query sent to Hive? That is, have accents removed (é, ê, è => e) and have other non-alphanumeric characters (``) removed.

I know I can use regexp_replace(), but I'd have to deal with every accented/non-accented pair there is. Surely there is something more practical?

François M.

1 Answer

It seems that you want to use java.text.Normalizer:

String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);

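As described in "Replace non ASCII character from string". Note that NFD only splits each accented letter into a base letter plus combining marks; the marks still have to be stripped afterwards (for example with regexp_replace) to get plain ASCII.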

I tried calling it through Hive's reflect() function, but couldn't make it work because of the Normalizer.Form enum parameter.

So, it seems that you have to define a one-line UDF:

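import java.text.Normalizer;
import org.apache.hadoop.hive.ql.exec.UDF;

public class NormalizerUDF extends UDF {
  // Hive calls evaluate() once per row; a NULL column value arrives as null.
  public String evaluate(String in) {
    if (in == null) return null;
    return Normalizer.normalize(in, Normalizer.Form.NFD);
  }
}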
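
For the full effect asked about in the question (accents dropped, other characters removed), the normalization step could be followed by two regex passes. The sketch below is plain Java with no Hive dependency, so it can be tried on its own. The class name AsciiFold and the method fold are hypothetical, and the choice to keep only ASCII letters, digits and spaces is an assumption about what "non-alphanumeric" should cover.

```java
import java.text.Normalizer;

public class AsciiFold {
    // Hypothetical helper, not part of the original answer.
    public static String fold(String in) {
        if (in == null) {
            return null;
        }
        // NFD decomposes U+00E9 into "e" followed by U+0301 (combining acute accent).
        String decomposed = Normalizer.normalize(in, Normalizer.Form.NFD);
        // Drop the combining marks, leaving only the base letters.
        String unaccented = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: anything other than ASCII letters, digits and spaces is removed.
        return unaccented.replaceAll("[^A-Za-z0-9 ]", "");
    }

    public static void main(String[] args) {
        // Input is "é, ê, è" written with escapes to avoid source-encoding issues.
        System.out.println(fold("\u00e9, \u00ea, \u00e8")); // prints "e e e"
    }
}
```

The body of fold could replace the single return line in the UDF's evaluate method. Once the class is packaged in a jar, Hive would typically load it with ADD JAR and expose it with CREATE TEMPORARY FUNCTION before it can be called in a SELECT.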
Alex Libov