Remove four byte UTF-8 characters in classic ASP/VBScript (MySQL related)

Question

I've spent about 18 hours of trying different things and searching around now, finally I give up and have to ask you guys.

Backstory: I am finally migrating a old MS Access database to MySQL (version 5.6.16-log).

Problem: Some Unicode text in the Access database contain four bytes (UTF-8).

MySQL still has a problem with inserting four bytes UTF-8 characters. This problem is getting old and I was surprised to discover it's not fixed yet: http://bugs.mysql.com/bug.php?id=67297

I'm using "MySQL ODBC 5.3 Unicode Driver" to transfer data between databases (the latest beta development release). No matter what I try the process ends up freezing when I try to insert the string with 4 byte UTF8 characters (the thread uses 100% CPU forever). Have tried all workarounds suggested everywhere on the Internet, nothing works.

Now I will just accept the limitations of MySQL: I can't store all Unicode characters.

So I want to remove all 4 byte UTF8 characters from the text before I insert it into the database. But I can't for the life of me find a way to do it in classic ASP.

Can anybody help?

(I can't not use ASP btw, there is way too much code to rewrite it in a different language. Just changing databases is a remarkable feat; there are several of them and it will take days to complete.)

Edit: A solution in JScript is also acceptable, since it can be run from ASP pages.

Why not stop the input of Unicode characters by using `@codepage=1252` and the `Response.CodePage = 1252` so you don't have to deal with the issue until the odbc driver can support it? — user692942, Mar 10 '14 at 18:44
18 hours!! Just go with MS SQL Server and save yourself some heartache. — user692942, Mar 10 '14 at 18:49
New day, new troubles with MySQL. Today I found out that you can't have case insensitive matching (collation) where MySQL doesn't think "a" and "á" is the same character. I need something like a utf8_bin_ci collation but it doesn't exist. It's because I need ti be able to insert both"abc" and "ábc" in an unique-indexed field, and then do a search for "ABC" and find "abc" but not "ábc". This was pretty much the death of MySQL for me, I will try MS SQL... Then I don't have to change a lot of standard SQL escaping either (for example ' need to be \' in MySQL while '' in most other databases). — user3402011, Mar 12 '14 at 07:39

score 1 · Answer 1 · edited Apr 04 '20 at 14:01

1

This should work:

Function UTF8Filter(strString)
    On Error Resume Next
    For i = 1 to Len(strString)

        charCode = AscW(Mid(strString, i, 1))
        If charCode > 32 AND charCode <= 127 then   ' here was OR 
            'Append valid character'
            strString = Mid(strString, i, 1)
        End If
    Next

    UTF8Filter = strString
    On Error Goto 0
End Function

Updated function:

Function Remove4ByteUFT8(strString)
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Global = True   
    objRegEx.IgnoreCase = True
    objRegEx.Pattern = "/[\xF0-\xF7].../s"

    Remove4ByteUFT8 = objRegEx.Replace(strString, "")
End Function

edited Apr 04 '20 at 14:01

Henry Vilinskiy

3
2

answered Mar 10 '14 at 16:49

Nathan Rice

3,091
1
20
30

I tried that before and it didn't work. The ASP page is working with strings in UTF8 and Asc always just return either 1 or 32 (I think it was 32) for UTF8 characters... – user3402011 Mar 10 '14 at 16:52
I switched ASC to AscW, give that a try. – Nathan Rice Mar 10 '14 at 16:57
1

I have finally worked out a workaround though. I use two ADODB.Stream objects and go through each character in the UTF8 string, converting it from charset UTF8 to charset Windows-1252 and then the reverse on the other stream object. Before the reverse I can measure the Len() of the Windows-1252 encoded string, if it is 4-6 it is OK (includes the BOM which is always three bytes). Pain in the neck having to do this on every string insert into the database but I can find no other way of preventing a possible thread freeze due to strange UTF8 characters that users can put in. – user3402011 Mar 10 '14 at 16:58
Yep that sounds like a paint. I think the code I posted above should be simpler. AscW should return the correct value for all 65000 characters. – Nathan Rice Mar 10 '14 at 17:00
2

You probably meant "charCode >= 32 and charCode < 127" in your if statement right? Else it found no characters acceptable. Anyways, tried it out and it removes all UTF8 characters which isn't acceptable. But thanks for trying to help. Wish it was that simple. – user3402011 Mar 10 '14 at 17:19
Oh I see you want to remove *JUST* the 4byte UTF8 characters. I wrote another function to do that for you. It should work. – Nathan Rice Mar 10 '14 at 17:34
1

That function reminds me of http://stackoverflow.com/questions/1176904/php-how-to-remove-all-non-printable-characters-in-a-string where everybody is agreeing on different solutions and everybody is pointing out faults about everybody's correct solutions. =P Anyway, tried your code, it didn't work. Process froze forever. Perhaps 4 len UTF8 characters was never my problem in the first place, I never did get that exact error after all. However, whatever was wrong I'm glad I found a workaround eventually. Even if it will drag down performance a bit. – user3402011 Mar 10 '14 at 18:02
@user3402011 Not really relevant when your talking about encoding in classic asp as this article is related to php. – user692942 Mar 10 '14 at 18:40
Just said that was what I was reminded of. Besides, they are doing the exact same thing as he suggested (removing unwanted UTF8 characters with regexp). Same method, different language. Very relevant as a similarity (which is why I was reminded of it). – user3402011 Mar 10 '14 at 19:16

Remove four byte UTF-8 characters in classic ASP/VBScript (MySQL related)

1 Answers1

Linked

Related