6

I'm trying to write a PLPGSQL function which obfuscates/censors/redacts text.

-- Obfuscate a body of text by replacing lowercase letters and numbers with # symbols.
CREATE OR REPLACE FUNCTION obfuscate(str text) RETURNS text AS $$
BEGIN
  str := replace(str, '\r', E'\r');
  str := replace(str, '\n', E'\n');
  str := translate(str, 'abcdefghijklmnopqrstuvwxyz0123456789', rpad('#',36,'#'));
  str := replace(str, E'\r', '\r');
  str := replace(str, E'\n', '\n');
  RETURN str;
END
$$ LANGUAGE plpgsql;

This works, but note the dance to convert escaped newlines and carriage returns to their respective byte, and then back again. This is because my dataset contains strings that have been escaped (data which has been serialized to JSON/YAML), and I don't want to clobber those values.

Is there another more convenient way to unescape a string? It would be great to handle other escaped values, like unicode escape sequences, too.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Brent Dillingham
  • 1,019
  • 1
  • 13
  • 22

1 Answers1

2

To "unescape" a string, you have to "execute" it - literally. Use the EXECUTE command in plpgsql.
You can wrap this into a function. Naive approach:

CREATE OR REPLACE FUNCTION f_unescape(text, OUT _t text)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   EXECUTE 'SELECT E''' || $1 || ''''
   INTO _t;
END
$func$;

Call:

SELECT f_unescape('\r\nabcdef\t0123\x123\n');

This naive function is vulnerable to single quotes in the original string, which need to be escaped. But that's a bit tricky. Single quotes can be escaped in two ways in a Posix escape string syntax: \' or ''. But we could also have \\' etc. Basics:

We could enclose the string in dollar quoting, but that does not work for Posix escape string syntax. E'\'' cannot be replaced with E$$\'$$. We could add SET standard_conforming_strings = off to the function, then we wouldn't have to prepend strings with E. But that would disable function inlining and interpret escapes everywhere in the function body.

Instead, escape all ' and all (optionally) leading \ with regexp_replace():

regexp_replace($1, '(\\*)(\''+)', '\1\1\2\2', 'g')

(\\*) .. 0 or more leading \
(\''+) .. capture 1 or more '
'\1\1\2\2' .. double up each match
'g' .. replace all occurrences, not just the first

Safe function

CREATE OR REPLACE FUNCTION f_unescape(IN text, OUT _t text)
  RETURNS text
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   EXECUTE $$SELECT E'$$ || regexp_replace($1, '(\\*)(\''+)', '\1\1\2\2', 'g') || $$'$$
   INTO _t;
END
$func$;

The operation cannot be reversed reliable. There is no way to tell which special character was escaped before and which wasn't. You can escape all or none. Or do it manually like before. But if the same character was included in literal and escape form, you cannot tell them apart any more.

Test case:

SELECT t, f_unescape(t)
FROM  (
   VALUES
     ($$'$$)
   , ($$''$$)
   , ($$'''$$)
   , ($$\'$$)
   , ($$\\'$$)
   , ($$\\\'$$)
   , ($$\\\'''$$)
   , ($$\r\\'nabcdef\\\t0123\x123\\\\\'''\n$$)
   ) v(t);
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228