5

In PSQL (which I believe Redshift is based), there are string similarity functions like levenshtein / levenshtein_less_equal [ http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html ]. These features don't seem to have made it into Redshift [ http://docs.aws.amazon.com/redshift/latest/dg/String_functions_header.html ]. Am I wrong, or has anyone made a creative query to work around this limitation?

gatoatigrado
  • 16,580
  • 18
  • 81
  • 143

1 Answers1

7

Redshift is based on Postgres 8.0, which is extremely old (and no longer supported), so they've diverged substantially over the years. The development version of Postgres is currently 9.4, and 8.x and 9.x versions of Postgres have some substantial differences and additions in 9.0 and up.

The levenshtein function is part of the fuzzystrmatch module that you linked above, and that module was introduced in Postgres 8.3, which is likely why it didn't make the cut for Redshift (and it apparently has not since been added).

Normally I would say your best bet would be to define a custom PL/pgSQL function to implement the Levenshtein Distance algorithm, but according to the Redshift doc, User-defined functions and stored procedures are one of the many major features of Postgres that Redshift does not support, so I think you're unfortunately out of luck on that front.

Depending on your requirements, you may be able to use LIKE to achieve similar results. See this SO answer for more info on that (note that some of the alternative suggestions in that answer, such as full text are also not supported in Redshift).

Update, 2016-04-25

It appears as though since I originally answered this question back in Oct. 2014, an ability to create Python-based User Defined Functions (UDFs) has been added. This is not as ideal as being able to create a custom Postgres function inline (the doc lists an assortment of caveats for the UDFs), but should allow the levenshtein distance algorithm to be implemented in Python and processed within the context of a Redshift query.

UDFs in Apache Hive, a data warehousing project used in the Hadoop ecosystem, allows for user-defined functions in a similar manner (Java or Python based).

Community
  • 1
  • 1
khampson
  • 14,700
  • 4
  • 41
  • 43