0

I'm using Oracle REGEXP_REPLACE to remove all non standard ascii (special) characters from a CLOB freetext field a client uses to store data, some users copy and paste from Excel which causes issues when reading the text from the CLOB. I've managed to get rid of all special characters and maintain the layout for paragraphs etc.

The issue I'm having is when I compare the cleansed field to the original, after converting to char string of course, is it is returning rows that have no difference between them, I've pulled out both values and there is no difference in most cases. It's returning the whole dataset instead of just those that have been cleansed.I've run out of ideas

FYI a big part of maintaining the layout means I need to keep the carriage return CHR(13)

Below is the query

select *
from (
select incident, entity,
trim(to_char(textarea)) textarea,
trim(to_char(regexp_replace(textarea,'[^' || CHR(13) || CHR(32) || '-' || CHR(126) || ']',''))) regexp_text
from response
) tab
where tab.regexp_text <> tab.textarea

I've tried a number of different combinations with trimming whitespace and substring smaller strings etc. with no luck

Kevin T
  • 132
  • 1
  • 1
  • 11
  • Can you please post some example data, showing the expected and the actual result? – Aleksej Feb 22 '16 at 12:10
  • @Aleksej The expected result would be strings that don't match i.e `éklzéiuhkn` and `klziuhkn` except it's also returning strings that match as well as ones that don't – Kevin T Feb 22 '16 at 12:23
  • You could use http://stackoverflow.com/questions/22533037/how-to-call-oracle-md5-hash-function to calculate the MD5 checksum for both result columns, to see if they are *really* identical. – Erich Kitzmueller Feb 22 '16 at 12:51

3 Answers3

0

Your pattern is: '[^' || CHR(13) || CHR(32) || '-' || CHR(126) || ']'.

The caret (^) is replacing everything that is not a special character with an empty string. With Oracle's default settings, I would expect this to return the empty string, which is treated as a NULL -- but that would mean nothing is returned.

In any case, try removing the caret:

trim(to_char(regexp_replace(textarea, '[CHR(13) || CHR(32) || '-' || CHR(126) || ']',' '))) as regexp_text
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0
WITH data ( value ) AS (
  SELECT 'éklzéiuhkn' FROM DUAL
)
SELECT REGEXP_REPLACE( value, '[^\r -~]', '' )
FROM   data;

Outputs:

REGEXP_R
--------
klziuhkn

So your query should be something like:

SELECT *
FROM   (
  SELECT incident,
         entity,
         TRIM( textarea ) AS textarea,
         TRIM( REGEXP_REPLACE( textarea, '[^\r -~]', '' ) ) AS regexp_text
  FROM   response
)
WHERE textarea <> regexp_text;
MT0
  • 143,790
  • 11
  • 59
  • 117
0

I finally figured it out, the issue was in formatting the comparisson strings in the where clause to include all Ascii characters and match against the regexp I planned to use, it was the only way to eliminate bogus and invisible encoded characters.

select incident, entity,
regexp_replace(textarea,'[^' || CHR(13) || ' -}‘-~À-ü]',''))) regexp_text
regexp_replace(textarea,'[^' || CHR(13) || CHR(32) || '-' || CHR(125) || CHR(145) || '-' || CHR(152) || CHR(192) || '-' || CHR(252) || ']','') regexp_text2
from response
where to_char(regexp_replace(textarea,'[^' || CHR(163) || CHR(1) || '-' || CHR(125) || CHR(145) || '-' || CHR(152) || CHR(192) || '-' || CHR(252) || ']','')) <> to_char(regexp_replace(textarea,'[^' || CHR(1) || '-' || CHR(255) || ']',''))

I included both lines to show the simple regexp of '[^' || CHR(13) || ' -}‘-~À-ü]' as well as the CHR(x) version. I also needed to include latin characters in the end.

For some reason using only CHR() when specifing ascii characters works 100% of the time, I guess it might have something to do with the environments NLS_LANG setting.

This should work for all those looking to exclude weird encoded characters from their strings

Kevin T
  • 132
  • 1
  • 1
  • 11