0

I have special characters inserted in MYSQL DB like below samples mostly in first name,last name columns of DB.

1.  BALPAI SAB

2. à¦à¦¿à¦•à§Âরমাদিতà§Âয

Valid case:

*Saurabh Shree

S.shree

T.M.Anthony

Charles Babbage Senior*

Length is variable.All are case insensitive with no trailing spaces.only spaces and period allowed between two consecutive words.

I have gone through the posts of regexp and also changed the collation of column as well as table to utfmb4_unicode_ci and applied regexp but with no luck.

I have to search even a single occurence of characters in around 7 million records.

SELECT FARMER_BRANCH_NAME, HEX(FARMER_BRANCH_NAME) FROM BSBY.PROPOSAL

OUTPUT

Farmer Branch Name      Hex(Farmer Branch Name)
SME œ•œBRANCH JASDANœ•œ   534D45209C959C4252414E4348204A415344414E9C959C
নলহাটি        E0A6A8E0A6B2E0A6B9E0A6BEE0A69FE0A6BF
নলহাটি        E0A6A8E0A6B2E0A6B9E0A6BEE0A69FE0A6BF
নলহাটি        E0A6A8E0A6B2E0A6B9E0A6BEE0A69FE0A6BF
SME œ•œBRANCH JASDANœ•œ   534D45209C959C4252414E4348204A415344414E9C959C
Mumbai - Chembur         4D756D626169202D204368656D627572C2A0
New Delhi - Connaught Place - II   4E65772044656C6869202D20436F6E6E617567687420506C616365202D204949C2A0
Mumbai - Malad            4D756D626169202D204D616C6164C2A0
Bangalore - Cantonment    42616E67616C6F7265202D2043616E746F6E6D656E74C2A0
Ahmedabad-BOPAL           41686D6564616261642D424F50414CC2A0
SME œ•œBRANCH JASDANœ•œ    534D45209C959C4252414E4348204A415344414E9C959C

SELECT FARMER_NAME,HEX(FARMER_NAME) FROM BSBY_UAT.PROPOSAL where FARMER_NAME NOT REGEXP '[A-Za-z0-9.() ]$' 

OUTPUT

FARMER NAME         HEX(FARMER NAME)
RAHIM BISWAS        524148494D2042495357415309
ESARUL GAZI         45534152554C2047415A4909
GOLAM NABI MANDAL   474F4C414D204E414249204D414E44414C09
LATIF MANDAL        4C41544946204D414E44414C09
NILKAMAL MANDAL     4E494C4B414D414C204D414E44414C09
SHUKUR ALI MONDAL   5348554B555220414C49204D4F4E44414C09
¦€ Â¦€° Â¦€º Â§Â Â¦€¢  Â¦€  Â¦Â² Â¦Â¿  A0C2A680A0C2A680B0A0C2A680BAA0C2A7C281A0C2A680A220A0C2A680A0A0C2A6C2B2A0C2A6C2BF
HASINA KHATUN       484153494E41204B484154554E09
KSHETRAGOPAL GHOSH  4B534845545241474F50414C2047484F534809
SUKUMAR DAS HALDAR  53554B554D4152204441532048414C44415209
Yasin Hossain       596173696E20486F737361696E09
SHAH HOSSAIN MOLLA  5348414820484F535341494E204D4F4C4C4109
RAMJAN SEKH         52414D4A414E2053454B4809
Nibaran Ch. Mahato  4E69626172616E2043682E204D616861746F09
PRAKASH KUMAR MONDAL 5052414B415348204B554D4152204D4F4E44414C2009
UNFERA BEWA          554E4645524120424557410909
BODRUL HOQUE        424F4452554C20484F5155450909
à¦à¦¾à¦¦à¦² চনà§à¦¦à§à¦° সরকার    E0A6E0A6BEE0A6A6E0A6B220E0A69AE0A6A8E0A78DE0A6A6E0A78DE0A6B020E0A6B8E0A6B0E0A695E0A6BEE0A6B0
à¦à¦¾à¦¦à¦² চনà§à¦¦à§à¦° সরকার    E0A6E0A6BEE0A6A6E0A6B220E0A69AE0A6A8E0A78DE0A6A6E0A78DE0A6B020E0A6B8E0A6B0E0A695E0A6BEE0A6B0
মিনতি সিংহ    E0A6AEE0A6BFE0A6A8E0A6A4E0A6BF20E0A6B8E0A6BFE0A682E0A6B9
রেখা সরকার    E0A6B0E0A787E0A696E0A6BE20E0A6B8E0A6B0E0A695E0A6BEE0A6B0
রেখা সরকার    E0A6B0E0A787E0A696E0A6BE20E0A6B8E0A6B0E0A695E0A6BEE0A6B0
SUKDEB SARKARপ    53554B444542205341524B4152E0A6AA
KEYAMUL SEKH            4B4559414D554C2053454B480909
घोष पारà¥à¤µà¤¤à¥€  E0A498E0A58BE0A4B720E0A4AAE0A4BEE0A4B0E0A58DE0A4B5E0A4A4E0A580
à¦à¦¨à§à¦Ÿà§ সরকার    E0A69DE0A6A8E0A78DE0A69FE0A78120E0A6B8E0A6B0E0A695E0A6BEE0A6B0
à¦à¦²à¦°à¦¾à¦® সরকার  E0A6E0A6B2E0A6B0E0A6BEE0A6AE20E0A6B8E0A6B0E0A695E0A6BEE0A6B0
মনোতোষ সরকার  E0A6AEE0A6A8E0A78BE0A6A4E0A78BE0A6B720E0A6B8E0A6B0E0A695E0A6BEE0A6B0

Here is my code:

SELECT distinct(FARMER_APPLICATION_ID) as FARMER_APPLICATION_ID,FARMER_AADHAR_NO,FARMER_EPIC_NO,FARMER_NAME,FARMER_GUARDIAN_NAME,FARMER_CROP_NAME,FARMER_L3_NAME,FARMER_L4_NAME,FARMER_L5_NAME,FARMER_L6_NAME,FARMER_BANK_NAME,FARMER_BANK_IFSC,PARTY_NAME,PARTY_CODE,FARMER_BRANCH_NAME
FROM BSBY_UAT.PROPOSAL 
where FARMER_AADHAR_NO NOT regexp '^[2-9]{1}[0-9]{3}[0-9]{4}[0-9]{4}$'
OR FARMER_BANK_IFSC not regexp '^[A-Z]{4}0[A-Z0-9]{6}$'
OR FARMER_NAME NOT REGEXP '[A-Za-z.() ]$'
OR FARMER_GUARDIAN_NAME NOT REGEXP '[A-Za-z.() ]$'
or FARMER_EPIC_NO NOT REGEXP'[A-Za-z0-9\\/]$'
or FARMER_BANK_NAME NOT REGEXP'[A-Za-z.\\-() ]$'
or FARMER_BRANCH_NAME NOT REGEXP'[A-Za-z0-9.,()\\[\\]\\-]$'
shree
  • 125
  • 1
  • 4
  • 15
  • You should provide an exact definition of what "special" character means here. – Tim Biegeleisen Aug 10 '22 at 08:27
  • @TimBiegeleisen characters other than alphabets including space only – shree Aug 10 '22 at 08:32
  • Show us an example of what the text _should_ look. – Rick James Aug 12 '22 at 01:24
  • @RickJames Saurabh shree, Rick James,R.James ,S.shree,.Only spaces and dots between two words allowed.No trailing spaces. – shree Aug 12 '22 at 09:37
  • Please provide `SHOW CREATE TABLE`, and `SHOW VARIABLES LIKE 'char%';`. Also `SELECT col, HEX(col) ...` for a row with a small amount of text. (I've seen a lot of mangled text, but your example is a new one to me.) – Rick James Aug 12 '22 at 17:04
  • @rick james SHOW VARIABLES LIKE 'char%';character_set_client utf8mb4 character_set_connection utf8mb4 character_set_database utf8mb4 character_set_filesystem binary character_set_results utf8mb4 character_set_server utf8mb4 character_set_system utf8mb3 character_sets_dir /rdsdbbin/mysql-8.0.28.R4/share/charsets/ – shree Aug 14 '22 at 07:48
  • @ Rick James SHOW CREATE TABLE CREATE TABLE `PROPOSAL` ( `FARMER_NAME` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `FARMER_GUARDIAN_NAME` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL, `FARMER_BRANCH_NAME` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `FARMER_BANK_IFSC` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci – shree Aug 14 '22 at 07:57
  • @RickJames I have updated the sql query.Now I am able to find the records with special characters but I am not sure of getting the correct count. – shree Aug 14 '22 at 08:46
  • @RickJames I am also observing horizontal tab character inserted at the end of legitimate names like RAHIM BISWAS,GOLAM NABI MANDAL – shree Aug 14 '22 at 09:08
  • Is this text reasonable? 'घोष पार्वती' or 'ঝন্টু সরকার' (Those Davangari and Bengali strings would, when double-encoded, map to some of the strings you provided. – Rick James Aug 14 '22 at 23:53
  • Thanks @RickJames घोष पार्वती' or 'ঝন্টু সরকার' are valid strings(DEVNAGRI and BANGLA). I would also like to inform that source of this data is a website from which csv files are downloaded and uploaded in system using Excel file using python. Initially we didn't have any clue regarding these mangled texts. Hence,Around 7 million records were inserted with default charset and collation settings of Mysql DB 8.0. – shree Aug 16 '22 at 10:20

1 Answers1

0

This is taken "2" in the Question; I wonder if it gives any clues:

à ¦
à ¦¿
à ¦•
à §Â
à ¦°
à ¦®
à ¦¾
à ¦¦
à ¦¿
à ¦¤
à §Â
à ¦Â

For one of the hex strings, I see that CONVERT(UNHEX( 'E0A69DE0A6A8E0A78DE0A69FE0A78120E0A6B8E0A6B0E0A695E0A6BEE0A6B0') USING utf8mb4) yields ঝন্টু সরকার

That does not necessarily lead to a solution, but it may give a clue that there was an encoding problem during Insertion.

As for a regexp, consider something like

HEX(column) REGEXP '^(..)*[89ABCDEF]'

That will discover whether any byte in the string has an 8-bit code that is not Ascii.

This has a mixture:

CONVERT(UNHEX('53554B444542205341524B4152E0A6AA') USING utf8mb4) --> 'SUKDEB SARKARপ'

That is, it is Ascii, but with a Bengali 'PA' pm the end. The fact that you are seeing 'SUKDEB SARKARপ' is a sign of Mojibake. See this for discussion of Mojibake (and other common messes): Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • 135,179
  • 13
  • 127
  • 222