2

Is there are straightforward way, perhaps using REGEXP_REPLACE or the like, to redact all but the last four digits of numbers (or varying length of 5 or above) appearing within free text (there may be multiple occurrences of separate numbers within the text)?

E.g.

Input = 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text'

Output = 'This is a test text with numbers ****5, *****3210 and separately number ************4321 all buried within the text'

With REGEX_REPLACE it's obviously straightforward to replace all numbers with the *, but it's maintaining the final four digits and replacing with the correct number of *s that's vexing me.

Any help would be much appreciated!

(Just for context, due to the usual kind of business limitations, this had to be done within the query retrieving the data rather than using actual Oracle DBMS redaction functionality).

Many thanks.

GMB
  • 216,147
  • 25
  • 84
  • 135
Lorielus
  • 81
  • 1
  • 6

2 Answers2

3

You could try the following regex:

regexp_replace(txt, '(\d{4})(\d+(\D|$))', '****\2')

This captures sequences of 4 digits followed by at least one digit, then by a non-digit character (or the end of string), and replaces them with 4 stars.

Demo on DB Fiddle:

with t as (select 'select This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' txt from dual)
select regexp_replace(txt, '(\d{4})(\d+\D)', '****\2') new_text from t
| NEW_TEXT                                                                                                                    |
| :-------------------------------------------------------------------------------------------------------------------------- |
| select This is a test text with numbers ****5, ****543210 and separately number ****567887654321 all buried within the text |

Edit

Here is a simplified version, suggested by Aleksej in the comments:

regexp_replace(txt, '(\d{4})(\d+)', '****\2')

This works because of the greadiness of the regexp engine, that will slurp as many '\d+' as possible.

GMB
  • 216,147
  • 25
  • 84
  • 135
  • Why the `\D` part? This could be an issue if the string ends with a number, am I right? – Aleksej Oct 31 '19 at 10:02
  • \D is a non-numeric character. It defines the end of the number. – GMB Oct 31 '19 at 10:03
  • @Aleksej: good point. I modified the regexp to properly manage numbers at end of string. [See this demo](https://dbfiddle.uk/?rdbms=oracle_18&fiddle=6e11e9e8fbc97060dd5032c6c6ad4de3). – GMB Oct 31 '19 at 10:07
  • This works well, but I believe you could even use `'(\d{4})(\d+)'`; am I missing something? – Aleksej Oct 31 '19 at 10:10
  • @Aleksej: I think that you are right. I edited my answer with this simpler solution. Thanks! – GMB Oct 31 '19 at 10:16
  • Thanks a lot - this looks great - though will only work in terms of replacing with the correct number of *s where it's the clusters of four can be added. Is there a way to tweak it so instead it looks for one digit that's followed by at least four digits and replaces each such digit with * instead? – Lorielus Oct 31 '19 at 11:31
2

If you really need to keep the length of the numbers, then (I think) there is not wayy todo it in one step. You'll have to split the string in numbers and not numbers and then replace the digits seperatly:

SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
                    THEN LPAD('*', LENGTH(txt) - 4,'*') || SUBSTR(txt, LENGTH(txt) -3) -- replace all digits but the last 4 with *
                    ELSE txt END)
       within GROUP (ORDER BY lvl)
  FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts 
          FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS  txt FROM dual)
       CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)

Result:

This is a test text with numbers *2345, ******3210 and separately number ************4321 all buried within the text

And as your example replaced the first for digits of your first number - you might also want to replace at least 4 digits:

SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
                    THEN LPAD('*', GREATEST(LENGTH(txt) - 4, 4),'*') || SUBSTR(txt, GREATEST(LENGTH(txt) -3, 5)) -- replace all digits but the last 4 with *
                    ELSE txt END)
       within GROUP (ORDER BY lvl)
  FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts 
          FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS  txt FROM dual)
       CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)

(Added GREATEST in the second line to replace at least 4 digits.)

Result:

This is a test text with numbers ****5, ******3210 and separately number ************4321 all buried within the text
Radagast81
  • 2,921
  • 1
  • 7
  • 21
  • This solution produces the exact expected result so +1. – GMB Oct 31 '19 at 18:31
  • 1
    Thx @GMB, also I personally would stay with your solution and allways print a fixed length of asterixes. The length of the numbers is an additional information that i don't want to give in situations where i hide the actual data... – Radagast81 Oct 31 '19 at 18:40