
I have an Oracle database that stores documents originating from all over the world, in many different languages. The documents are kept in a table with a BLOB column that holds each document's content.

I want to find out the character set of each document from within an Oracle procedure. I don't want to use the CSSCAN utility, since it apparently has to be run in a separate session, outside of the procedure.

Thanks for your help!

Arye Rosenstein
  • This is an extremely difficult, if not impossible, request. The character set is set at the database level and is specific to each database the data came from, so on the upside you should only see one encoding per source. But there is overlap between these standards, so identifying one without being told what it is will be an extreme challenge. Better to have your source tell you. 1) http://programmers.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file 2) http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file – Jan 19 '16 at 20:05
  • @DanK It's impossible to do this with 100% accuracy, but Oracle includes tools to help guess the character set; see my answer below. Although the character set is defined at the database level, that configuration does not apply to BLOBs, each of which could contain a different character set. – Jon Heller Jan 24 '16 at 22:46

1 Answer


The Oracle Globalization Development Kit (GDK) can detect character sets.

The GDK ships with Oracle but is not installed in the database by default. To load the .jar files into the database, find the jlib directory in the Oracle home and run this operating system command:

loadjava -u USER_NAME@SID orai18n.jar orai18n-collation.jar orai18n-lcsd.jar orai18n-mapping.jar orai18n-net.jar orai18n-servlet.jar orai18n-tools.jar orai18n-translation.jar orai18n-utility.jar
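
To sanity-check the upload, you can query the data dictionary for the loaded classes. This is just a quick check I find convenient; the LIKE filter below is only a guess at how the GDK class names appear in USER_OBJECTS:

--Loaded GDK Java classes and their status (should be VALID).
select object_name, status
from user_objects
where object_type = 'JAVA CLASS'
and lower(object_name) like '%lcsd%';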

Some extra Java privileges are needed, even if your user has DBA. Run this command and then re-connect:

exec dbms_java.grant_permission( 'YOUR_USER_NAME', 'SYS:java.lang.RuntimePermission', 'getClassLoader', '' );
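
If you want to confirm the grant took effect, the USER_JAVA_POLICY view lists the Java permissions for your user (a quick check only, not required for the remaining steps):

--Java runtime permissions visible to the current user.
select kind, grantee, type_name, name, enabled
from user_java_policy
where type_name like '%RuntimePermission';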

Create a Java class to do the detection. Below is a very simple example that returns the best-guess character set for a BLOB:

create or replace and compile java source named "Character_Set_Detector"
as
import oracle.i18n.lcsd.*;
import java.sql.*;
import java.io.IOException;
public class Character_Set_Detector
{
    public static String detect(Blob some_blob) throws SQLException, IOException
    {
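        //Run the GDK detector over the BLOB's binary stream and return
        //the name of the best-guess Oracle character set.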
        LCSDetector detector = new LCSDetector();
        detector.detect(some_blob.getBinaryStream());
        LCSDResultSet detector_results = detector.getResult();
        return detector_results.getORACharacterSet();
    }
}
/

Wrap the Java class in a PL/SQL function:

create or replace function detect_character_set(some_blob blob)
return varchar2
as language java
name 'Character_Set_Detector.detect(java.sql.Blob) return java.lang.String';
/
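
With the function in place, finding the character set of every document is a single query. The table and column names below are only placeholders; substitute the real names of your documents table and its BLOB column:

--Best-guess character set for every stored document (placeholder names).
select document_id, detect_character_set(document_content) character_set
from documents;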

I simulated different character sets by translating a string into different languages, saving the text in different encodings with a text editor, opening each file with a hex editor, and converting the hex into a BLOB:

--UTF8
--The quick brown fox jumps over the lazy dog
select 1 id, detect_character_set(hextoraw('54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f67')) character_set from dual union all
--Western European (ISO-8859-1)
--El zorro marrón rápido salta sobre el perro perezoso
select 2 id, detect_character_set(hextoraw('456c207a6f72726f206d617272f36e2072e17069646f2073616c746120736f62726520656c20706572726f20706572657a6f736f')) from dual union all
--Chinese Simplified (GBK)
--敏捷的棕色狐狸跳过懒狗
select 3 id, detect_character_set(hextoraw('c3f4bdddb5c4d7d8c9abbafcc0eaccf8b9fdc0c1b9b7')) from dual union all
--Western European (Windows-1252)
--Der schnelle braune Fuchs springt über den faulen Hund
select 4 id, detect_character_set(hextoraw('446572207363686e656c6c6520627261756e6520467563687320737072696e677420fc6265722064656e206661756c656e2048756e64')) from dual union all
--Cyrillic (KOI8-R)
--Быстрая коричневая лиса прыгает через ленивую собаку
select 5 id, detect_character_set(hextoraw('e2d9d3d4d2c1d120cbcfd2c9decec5d7c1d120ccc9d3c120d0d2d9c7c1c5d420dec5d2c5da20ccc5cec9d7d5c020d3cfc2c1cbd5')) from dual;

ID  CHARACTER_SET
--  -------------
1   US7ASCII
2   WE8ISO8859P1
3   ZHS16CGB231280
4   WE8ISO8859P1
5   CL8KOI8R

That trivial example works well, but I don't know how well it will work with real-world files. The GDK has a lot of features; the above code is only a simple starting point. With only minor changes the code can also detect languages, as demonstrated in my answer here.
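
For reference, here is a sketch of that variation. It assumes the LCSDResultSet method getORALanguage returns the detected Oracle language name the same way getORACharacterSet returns the character set; check the GDK Javadoc for your version before relying on it:

create or replace and compile java source named "Language_Detector"
as
import oracle.i18n.lcsd.*;
import java.sql.*;
import java.io.IOException;
public class Language_Detector
{
    public static String detect(Blob some_blob) throws SQLException, IOException
    {
        //Same detection call as above, but return the detected language
        //instead of the character set.
        LCSDetector detector = new LCSDetector();
        detector.detect(some_blob.getBinaryStream());
        LCSDResultSet detector_results = detector.getResult();
        return detector_results.getORALanguage();
    }
}
/

--Wrap it in a PL/SQL function, the same way as detect_character_set:
create or replace function detect_language(some_blob blob)
return varchar2
as language java
name 'Language_Detector.detect(java.sql.Blob) return java.lang.String';
/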

Jon Heller
  • I'm impressed, Jon. I learned something today! :-) I'll have to play with the GDK one of these days. Thanks for this answer. –  Jan 25 '16 at 15:19