
I am trying to write a query that will find all non-UTF-8 encoded characters in a table, without being specific to a column name. My approach is to compare the character length of a column to its byte length. %1 is the table name I want to check, entered as a parameter. I am joining to user_tab_columns to get the COLUMN_NAME. I then want to take each COLUMN_NAME result and filter down to only the rows that have bad UTF-8 data (where the character length of a column is not equal to its byte length). Below is what I have come up with, but it isn't functioning. Can somebody help me tweak this query to get the desired results?

 SELECT
 user_tab_columns.TABLE_NAME,
 user_tab_columns.COLUMN_NAME AS ColumnName,
 a.*

 FROM %1 a

 JOIN user_tab_columns
 ON UPPER(user_tab_columns.TABLE_NAME) = UPPER('%1')

 WHERE (SELECT * FROM %1 WHERE LENGTH(a.ColumnName) != LENGTHB(a.ColumnName))
BvilleBullet
  • Why would `length != lengthb` show you data that is not properly encoded? That will show you the data that has at least one character whose encoding requires multiple bytes. The whole point of any Unicode encoding, though, is that most characters will require more than one byte of storage. Any non-US7ASCII character will require multiple bytes to be properly encoded in UTF-8. – Justin Cave Mar 04 '15 at 23:17
  • if there is non-utf 8 encoding the lengths don't match. I have this working if I hard code the table name and check a specific column for the length not matching. The issue is I don't want to go through each table a column at a time. I want to take the results from user_tab_columns.COLUMN_NAME and pass that into a sub query dynamically. – BvilleBullet Mar 04 '15 at 23:20
  • Are you sure you really mean 'non-UTF8'? Anyway, you can't use a value from one table as the column name for another like that; you'll need to use dynamic SQL. – Alex Poole Mar 04 '15 at 23:29
  • Our database underwent a UTF-8 conversion some time ago and there are a bunch of unrecognized corrupted characters left behind. Comparing the length to byte length picks out these bad characters. I just want to do it by table by table and not column by column. – BvilleBullet Mar 04 '15 at 23:37
  • They're only corrupt if they're outside the valid UTF-8 code point ranges. Otherwise they may be perfectly valid UTF-8 characters that your client can't render, or just that you don't want in your data. Your approach will find *any* multibyte character though. Essentially you're saying that you're only expecting single-byte characters, which may or may not be true depending on the character set you converted *from* - whether that was multibyte too, or say US7ASCII. So this *may* do what you actually want, but 'if there is non-utf 8 encoding the lengths don't match' is not really correct. – Alex Poole Mar 05 '15 at 10:25

1 Answer


In your query LENGTH(a.ColumnName) would represent the length of the column name, not the contents of that column. You can't use a value from one table as the column name in another table in static SQL.
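For contrast, here is a minimal sketch of the dynamic form (the table and column names MY_TABLE / MY_COLUMN are hypothetical stand-ins for values you would read from user_tab_columns at runtime): because the column name only exists as a string, the statement itself has to be assembled as a string and run with EXECUTE IMMEDIATE:

```sql
-- Hypothetical names: MY_TABLE / MY_COLUMN stand in for values
-- obtained from user_tab_columns at runtime.
declare
  cnt pls_integer;
begin
  execute immediate
    'select count(*) from "MY_TABLE" '
    || 'where length("MY_COLUMN") != lengthb("MY_COLUMN")'
    into cnt;
  dbms_output.put_line('multibyte rows: ' || cnt);
end;
/
```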

Here's a simple demonstration of using dynamic SQL in an anonymous block to report which columns contain any multibyte characters, which is what comparing length with lengthb will actually tell you (discussed in the comments, so not rehashing that here):

set serveroutput on size unlimited
declare
  sql_str varchar2(256);
  flag pls_integer;
begin
  for rec in (
    select utc.table_name, utc.column_name
    from user_tab_columns utc
    where utc.table_name = <your table name or argument>
    and utc.data_type in ('VARCHAR2', 'NVARCHAR2', 'CLOB', 'NCLOB')
    order by utc.column_id
  ) loop

    sql_str := 'select nvl(max(1), 0) from "' || rec.table_name || '" '
      || 'where length("' || rec.column_name || '") '
      || '!= lengthb("' || rec.column_name || '") and rownum = 1';

    -- just for debugging, to see the generated query
    dbms_output.put_line(sql_str);

    execute immediate sql_str into flag;

    -- also for debugging
    dbms_output.put_line (rec.table_name || '.' || rec.column_name
      || ' flag: ' || flag);

    if flag = 1 then
      dbms_output.put_line(rec.table_name || '.' || rec.column_name
        || ' contains multibyte characters');
    end if;
  end loop;
end;
/

This uses a cursor loop to get the column names - I've included the table name too in case you want to wild-card or remove the filter - and inside that loop constructs a dynamic SQL statement, executes it into a variable, and then checks that variable. I've left some debugging output in to see what's happening. With a dummy table created as:

create table t42 (x varchar2(20), y varchar2(20));
insert into t42 values ('single byte test', 'single byte');
insert into t42 values ('single byte test', 'single byte');
insert into t42 values ('single byte test', 'single byte');
insert into t42 values ('single byte test', 'single byte');
insert into t42 values ('single byte test', 'multibyte ' || unistr('\00FF'));

running that block gets the output:

anonymous block completed
select nvl(max(1), 0) from "T42" where length("X") != lengthb("X") and rownum = 1
T42.X flag: 0
select nvl(max(1), 0) from "T42" where length("Y") != lengthb("Y") and rownum = 1
T42.Y flag: 1
T42.Y contains multibyte characters

To display the actual multibyte-containing values you could use a dynamic loop over the selected values:

set serveroutput on size unlimited
declare
  sql_str varchar2(256);
  curs sys_refcursor;
  val_str varchar2(4000);
begin
  for rec in (
    select utc.table_name, utc.column_name
    from user_tab_columns utc
    where utc.table_name = 'T42'
    and utc.data_type in ('VARCHAR2', 'NVARCHAR2', 'CLOB', 'NCLOB')
    order by utc.column_id
  ) loop

    sql_str := 'select "' || rec.column_name || '" '
      || 'from "' || rec.table_name || '" '
      || 'where length("' || rec.column_name || '") '
      || '!= lengthb("' || rec.column_name || '")';

    -- just for debugging, to see the generated query
    dbms_output.put_line(sql_str);

    open curs for sql_str;
    loop
      fetch curs into val_str;
      exit when curs%notfound;

      dbms_output.put_line (rec.table_name || '.' || rec.column_name
        || ': ' || val_str);
    end loop;
    close curs;
  end loop;
end;
/

Which with the same table gets:

anonymous block completed
select "X" from "T42" where length("X") != lengthb("X")
select "Y" from "T42" where length("Y") != lengthb("Y")
T42.Y: multibyte ÿ

As a starting point, anyway; it would need some tweaking if you have CLOB, NVARCHAR2 or NCLOB values - for example, you could have one local variable of each type, include the data type in the outer cursor query, and fetch into the appropriate local variable.
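A sketch of that tweak, reusing the T42 example table from above and fetching into a variable chosen by data type. One caveat to note: lengthb() is not supported for CLOB/NCLOB columns in a multibyte database character set, so the generated predicate would need to change for those (e.g. testing a dbms_lob.substr() prefix instead); the block below only illustrates the fetch-by-type structure.

```sql
-- Sketch only: fetch each value into a variable matching its data type.
-- Caveat: lengthb() is not supported for CLOB/NCLOB in a multibyte
-- character set, so the predicate would need adjusting for LOB columns.
declare
  sql_str  varchar2(4000);
  curs     sys_refcursor;
  val_str  varchar2(4000);
  val_lob  clob;
begin
  for rec in (
    select utc.table_name, utc.column_name, utc.data_type
    from user_tab_columns utc
    where utc.table_name = 'T42'
    and utc.data_type in ('VARCHAR2', 'NVARCHAR2', 'CLOB', 'NCLOB')
    order by utc.column_id
  ) loop
    sql_str := 'select "' || rec.column_name || '" '
      || 'from "' || rec.table_name || '" '
      || 'where length("' || rec.column_name || '") '
      || '!= lengthb("' || rec.column_name || '")';

    open curs for sql_str;
    loop
      -- pick the fetch target based on the data type from the cursor
      if rec.data_type in ('CLOB', 'NCLOB') then
        fetch curs into val_lob;
        exit when curs%notfound;
        dbms_output.put_line(rec.table_name || '.' || rec.column_name
          || ': ' || dbms_lob.substr(val_lob, 200, 1));
      else
        fetch curs into val_str;
        exit when curs%notfound;
        dbms_output.put_line(rec.table_name || '.' || rec.column_name
          || ': ' || val_str);
      end if;
    end loop;
    close curs;
  end loop;
end;
/
```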

Alex Poole