1

This question is similar to How to emulate MySQLs utf8_general_ci collation in PHP string comparisons but I want the function for vb.net rather than PhP.

Recently I make a lot of supposedly unique key.

Some of the keys are equivalent under UTF8 unicode collation.

For example, look at these 2 key:

byers-street-bistro__38.15_-79.07 byers-street-bistro‎__38.15_-79.07

If I paste that into front page, and look at the source code you'll see

byers-street-bistro__38.15_-79.07

byers-street-bistro‎__38.15_-79.07

Note: In stack overflow they still look different.

I know it's not the same. I guess even in stack exchange it doesn't show. Say I have 1 million such records and I want to test whether 2 string WILL be declared the same by MySQL UTF8 collation. I want to know that before uploading. How do I do that.

So vb.net think that those are different keys. When we created mysql query and upload that to database, the database complain it's the same key. Just one complain and the upload of 1 million databases will be stuck.

We don't even know what the hell is ‎? Where can we look that up anyway?

Anyway, I want a function that when given 2 strings will tell me whether they will count as the same or not.

If possible we want a function that convert strings into their most "standard" form.

For example, ‎ seems to encode nothing and the function would recoqnize all those nothing character and eliminate that.

Is there such thing?

So far this is what I do. I need something more comprehensive.

    Private Function StraightenQuotesReplacement() As Generic.Dictionary(Of String, String)
    Static replacement As Generic.Dictionary(Of String, String)
    If replacement Is Nothing Then
        replacement = New Generic.Dictionary(Of String, String)
        replacement.Add(ChrW(&H201C), """")
        replacement.Add(ChrW(&H201D), """")
        replacement.Add(ChrW(&H2018), "'")
        replacement.Add(ChrW(&H2019), "'")
    End If
    Return replacement
End Function

<Extension()>
Public Function straightenQuotes(ByVal somestring As String) As String
    For Each key In StraightenQuotesReplacement.Keys
        somestring = somestring.Replace(key, StraightenQuotesReplacement.Item(key))
    Next
    Return somestring
End Function

<Extension()>
Public Function germanCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ä", "ae")
    t = t.Replace("ö", "oe")
    t = t.Replace("ü", "ue")
    t = t.Replace("Ä", "Ae")
    t = t.Replace("Ö", "Oe")
    t = t.Replace("Ü", "Ue")
    t = t.Replace("ß", "ss")
    Return t
End Function
<Extension()>
Public Function japaneseCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ヶ", "ケ")
    Return t
End Function

<Extension()>
Public Function greekCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ς", "σ")
    t = t.Replace("ι", "ί")

    Return t
End Function
<Extension()>
Public Function franceCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("œ", "oe")
    Return t
End Function

<Extension()>
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

<Extension()>
Public Function badcharacters(ByVal s As String) As String
    Dim t = s
    t = t.Replace(ChrW(8206), "")
    Return t
End Function

<Extension()>
Public Function sanitizeUTF8_Unicode(ByVal str As String) As String
    Return str.ToLower.removeDoubleSpaces.SpacetoDash.EncodeUrlLimited.straightenQuotes.RemoveDiacritics.greekCharacter.germanCharacter
End Function
Community
  • 1
  • 1
user4951
  • 32,206
  • 53
  • 172
  • 282

2 Answers2

1

Probably using different unicode code points for characters that look similar, e.g. hyphen-minus (- U+002D), en-dash (– U+2013), and em-dash (— U+2014) are three different characters that all look similar: - – —

Use the AscW() function to check each character.

EDIT:

As discussed in the comments below, use the System.Text.NormalizationForm namespace to determine which Unicode code points are considered to be equivalent characters.

SSS
  • 4,807
  • 1
  • 23
  • 44
  • I need to be able to do this programatically. Without human interfension show that "ä" is the same with "ae" – user4951 Jun 05 '12 at 07:52
  • If you search the Visual Studio help on System.Text.NormalizationForm I think you might find the answer. Let us know what you find! – SSS Jun 06 '12 at 02:30
  • This will do. I think I need more. – user4951 Jun 12 '12 at 13:41
  • Please turn your comment into an answer. – user4951 Jun 12 '12 at 13:41
  • That works but not perfectly. Look at http://stackoverflow.com/questions/11354999/normalize-all-utf8-character-into-its-most-standard-format#comment14955996_11354999 – user4951 Jul 06 '12 at 03:59
0

I used the VBA code below to investigate strange strings.

I copied the "byers-street" line to cell D18 of an Excel worksheet and typed call DsplInHex(Range("D18")) into the Immediate window. The result was:

62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 5F 33 38 2E 31 35 2D 37 39 2E 30 37 20 62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 200E 5F 33 38 2E 31 35 2D 37 39 2E 30 37 

Adding a line break and some spaces gives:

62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F      5F 33 38 2E 31 35 2D 37 39 2E 30 37 20 
62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 200E 5F 33 38 2E 31 35 2D 37 39 2E 30 37 

According to my Unicode book 200E is a Left-To-Right Mark. I would be interested to know how you managed to add that character to your key.

VB.NET is correct; these keys are different. Either MySQL deletes such characters or your transfer process deleted it. Either way, you need check your source data for funny characters.

Option Explicit
Public Sub DsplInHex(Stg As String)

  Dim Pos As Long

  For Pos = 1 To Len(Stg)
    Debug.Print Hex(AscW(Mid(Stg, Pos, 1))) & " ";
  Next
  Debug.Print

End Sub
Tony Dallimore
  • 12,335
  • 7
  • 32
  • 61
  • I know it's not the same. I guess even in stack exchange it doesn't show. Say I have 1 million such records and I want to test whether 2 string WILL be declared the same by MySQL. I want to know that before uploading. How do I do that. – user4951 May 24 '12 at 05:19
  • `Left-To-Right Mark` is an invisible character; it will not show anywhere. Do you know how you got this character into a key? I do not know how to insert such a character except by program. Knowing might help determine which keys you need to examine. Failing that, I would check every record (including the key) for characters with an AscW value greater than 255 or perhaps 127 and delete them. – Tony Dallimore May 24 '12 at 09:02
  • BTW. Your example does not show an equivalence problem. This is not ordinary versus smart quotes or hyphen versus em-dash. This is a character that I assume should not be there. I have no experience with alphabets that are read right-to-left. I assume this character and its partner allow you to mix left-to-right and right-to-left alphabets. – Tony Dallimore May 24 '12 at 09:08
  • that character is ignored by utf-8 collation. So I think utf-8 collation must have maped those characters to nothing and then compare the string. I want those map. – user4951 May 24 '12 at 09:12
  • Perhaps I misunderstand “collation” but I will explain the issue as I see it. "There’s" is different from "There's" because the first contains a smart quote (201B) while the second contains an ordinary quote (27). Smart or ordinary quote is a collation issue. “byers-street-bistro_38.15-79.07” is different from “byers-street-bistro‎_38.15-79.07” because the second contains an invisible character (200E) and the first does not. An extra character is not a collation issue. **Why do you want a LTM character in the second key?** – Tony Dallimore May 24 '12 at 12:05
  • Basically I took business title and use it as ID after some standardization. Some of those business title happen to have LTM. I am getting rid the LTM. However, I need a more reliable function. – user4951 May 27 '12 at 16:33