I am using a hash table to store some names and ids in Greek characters.

    $hsNames = @{}
    $hsNameID = 1

    $name = "Νικος"

    $hsNames.Add($name, $hsNameID)
    $hsNameID++

    $name = "Νίκος"
    $hsNames.Add($name, $hsNameID)

    $hsNames

The output of the above is:

    Name                           Value
    ----                           -----
    Νικος                          1
    Νίκος                          2

This means that two keys were created for the same name because one of them carries a Greek accent. I do not want this to happen: I need only one key, with the first ID (1), which is how utf8_unicode_ci behaves in MySQL. I guess I need to somehow tell PowerShell to use the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/tr10-33.html) in string comparisons. But how?

pankal

1 Answer

Interesting question, even though one could argue that the two names are different because of the accents. You have to decide whether to store both the original and the "normalized" spelling, or just the normalized spelling, because the conversion is a one-way process.

I found two links that lead to a solution: "Ignoring accented letters in string comparison" and the PowerShell version of the same C# code.

Using the PowerShell script in the ISE, I was able to write the following:

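    # Remove-StringDiacritic is not built in; it comes from the linked
    # PowerShell script and must be loaded before running this.
    $hsNames = @{}
    $hsNameID = 1

    $name1 = "Νικος"

    $hsNames.Add($name1, $hsNameID)
    $hsNameID++

    $name2 = "Νίκος"
    $hsNames.Add($name2, $hsNameID)

    $hsNames

    # Strip the accents and compare the results
    $new1 = Remove-StringDiacritic $name1
    $new2 = Remove-StringDiacritic $name2

    "With Diacritic removed"
    $new1
    $new2
    $new1 -eq $new2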

and the output was:

    Name                           Value
    ----                           -----
    Νικος                          1
    Νίκος                          2
    With Diacritic removed
    Νικος
    Νικος
    True

Based on this, you can "normalize" your strings before inserting them into your hash table, and you will end up with the single key you want instead of two.
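As a rough illustration of that idea, here is a minimal sketch. It assumes the usual .NET approach of decomposing the string (FormD) and dropping the combining marks. The function name Remove-Diacritic is hypothetical and stands in for the linked Remove-StringDiacritic script, which may differ in detail. Because Add() throws when a key already exists, the sketch checks ContainsKey first so that the first ID is kept.

```powershell
# Hypothetical stand-in for the linked Remove-StringDiacritic script.
function Remove-Diacritic {
    param([Parameter(Mandatory)][string]$Text)

    # FormD splits each accented letter into a base letter plus combining marks.
    $decomposed = $Text.Normalize([System.Text.NormalizationForm]::FormD)
    $builder = New-Object System.Text.StringBuilder
    foreach ($char in $decomposed.ToCharArray()) {
        $category = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($char)
        # Keep everything except the combining (non-spacing) marks.
        if ($category -ne [System.Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$builder.Append($char)
        }
    }
    # Recompose whatever is left.
    $builder.ToString().Normalize([System.Text.NormalizationForm]::FormC)
}

$hsNames  = @{}
$hsNameID = 1

foreach ($name in "Νικος", "Νίκος") {
    $key = Remove-Diacritic $name
    # Add() throws on a duplicate key, so only add names not seen before.
    if (-not $hsNames.ContainsKey($key)) {
        $hsNames.Add($key, $hsNameID)
        $hsNameID++
    }
}

$hsNames
```

This strips accents only. It does not attempt to reproduce everything a MySQL collation such as utf8_unicode_ci treats as equal.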

Kory Gill
  • Yes, I found out about this "normalization" after I posted the question. A major problem is the performance of the Remove-StringDiacritic function. The csv data contains more than 100.000.000 rows and I have two string fields in each row. Some first tests showed that it slowed the import process down by 80%. So, I will not normalize the strings **before** entering them as I initially did, but use the normalized strings as values in the hashtable only if the original does not already exist... – pankal Jan 26 '16 at 17:40
  • Also, I am worried about whether this normalization will cover everything utf8_unicode_ci does, but I have to give it a try. – pankal Jan 26 '16 at 17:52