3

I have a VBA formula-function to split a string and add space between each character. It works fines only for an Ascii string. But I want to do the same for the Tamil Language. Since it is Unicode, the result is not readable. It splits even the auxiliary characters, Upper dots, Prefix, Suffix auxilary characters which should not be separated in Tamil/Hindi/Kanada/Malayalam/All India Languages. So, how to write a function to split a Tamil Word into readable characters.

Function AddSpace(Str As String) As String
    Dim i As Long
    For i = 1 To Len(Str)
        AddSpace = AddSpace & Mid(Str, i, 1) & " "
    Next i
    AddSpace = Trim(AddSpace)
End Function

Adding Space is not the important point of this question. Splitting the Unicode string into an array from any of those languages is the requirement.

For example, the word, "பார்த்து" should be separated as "பா ர் த் து", not as "ப ா ர ் த ் த ு". As you can see, the first two letters "பா" (ப + ா) are combined. If I try to manually put a space in between them, I can't do it in any word processor. If you want to test, please put it in Notepad and add space between each character. It won't allow you to separate as ("ப ா"). So "பார்த்து" should be separated as "பா ர் த் து". It is the correct separation in Tamil like languages. This is the one that I am struggling to achieve in VBA.

enter image description here

The Character Code table for Tamil is here. Tamil/Hindi/many Indian languages have (1)Consonants, (2)Independent vowels, (3)Dependent vowel signs, (4)Two-part dependent vowel signs. Among these 4 types, the first two are each one separate lettter, no issues with them. but the last 2 are dependent, they should not be separated from its joint character. For example, the letter, பா (ப + ் ), it contains one independent (ப) and one dependent (ா) letter.

If this info is not enough, please comment what should I post more.

(Note: It is possible in C#.Net using the code from the MS link by @Codo)

Mg Bhadurudeen
  • 1,140
  • 1
  • 12
  • 24
  • I have no way to test, but look into [StrConv](https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/strconv-function) and [THIS](https://stackoverflow.com/questions/13195583/split-string-into-array-of-characters). There are answers to that question that use StrConv to split the string into an array. Then you loop the array possibly. – Scott Craner Aug 13 '21 at 15:25
  • Please [edit] your question to share a [mcve]. How do you call the `AddSpace` function; in particular, where its argument supplied comes from? Afaik, VBA uses UCS-2 internally? – JosefZ Aug 13 '21 at 18:13
  • @ScottCraner I have gone through the links already, now again, they separate the word into bytes, not into readable characters from a string in Tamil or a similar Language. – Mg Bhadurudeen Aug 14 '21 at 05:59
  • @JosefZ I think, I have given the reproducible example. I put the Tamil word in Excel cell A1 and put my provided formula =AddSpace("A1"), it works well for Anscii, not for Tamil. I couldn't write a function that separates the Tamil word into readable letters, I have no idea. – Mg Bhadurudeen Aug 14 '21 at 06:02
  • 2
    Tamil has complex rules of combining "characters" into "graphical symbols" and the representation chosen by Unicode doesn't help either. To achieve your goal, you will need code that is aware of how Unicode codepoints are combined into grapheme clusters. This would be available in .NET (https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.gettextelementenumerator?view=net-5.0) but has to be implemented from scratch in VBA. – Codo Aug 14 '21 at 08:18
  • @Codo Thank you Codo. It works fine in .net. So I should find a way to bring the same to VBA. Thanks again. – Mg Bhadurudeen Aug 14 '21 at 09:40
  • Maybe this could help: [How to call .NET methods from Excel VBA?](https://stackoverflow.com/questions/37074533/) – JosefZ Aug 14 '21 at 13:23

4 Answers4

0

You can assign a string to a Byte array so the following might work

Dim myBytes as Byte
myBytes = "Tamilstring"

which generates two bytes for each character. You could then create a second byte array twice the size of the first by using space$ to crate a suitable string and then use a for loop (step 4) to copy two bytes at a time from the first to the second array. Finally, assign the byte array back to a string.

freeflow
  • 4,129
  • 3
  • 10
  • 18
  • I've added more info into my question. Tamil unicode may use 2 bytes. But a readable letter is not always a single 2 byte. It may be a combined one, I've added this info. Pl chk. – Mg Bhadurudeen Aug 14 '21 at 06:05
  • myBytes = StrConv(TextFromCellHavingTamilWord, vbFromUnicode, 1097) '1097 is the locale id for Tamil, the output is a strange symbol myBytes(x) = "�" – Mg Bhadurudeen Aug 14 '21 at 06:15
0

The problem you have is you are looking for what Unicode calls an extended grapheme cluster.

For a Unicode compatible regex engine that is simply /\X/

Not sure how you do that in VBA.

JGNI
  • 3,933
  • 11
  • 21
0

Referring the link mentioned by @ScottCraner in comments on the question and Character code for Tamil.

Check the result in cell A2 and highlighted in yellow are Dependent vowel signs which are used in DepVow string

Screenshot of the macro result

Sub Split_Unicode_String()
'https://stackoverflow.com/questions/68774781/how-to-split-an-unicode-string-to-readable-characters
Dim my_string As String
    'input string
Dim buff() As String
    'array of input string characters
Dim DepVow As String
    'Create string of Dependent vowel signs
Dim newStr As String
    'result string with spaces as desired
Dim i As Long

my_string = Range("A1").Value

ReDim buff(Len(my_string) - 1) 'array of my_string characters
For i = 1 To Len(my_string)
    buff(i - 1) = Mid$(my_string, i, 1)
    Cells(1, i + 2) = buff(i - 1)
    Cells(2, i + 2) = AscW(buff(i - 1)) 'used this for creating DepVow below
Next i

'Create string of Dependent vowel signs preceded and succeeded by comma
DepVow = "," & Join(Array(ChrW$(3006), ChrW$(3021), ChrW$(3009)), ",")
newStr = ""

For i = LBound(buff) To UBound(buff)
    If InStr(1, DepVow, ChrW$(AscW(buff(i + 1))), vbTextCompare) > 0 Then
        newStr = newStr & ChrW$(AscW(buff(i))) & ChrW$(AscW(buff(i + 1))) & " "
        i = i + 1
    Else
        newStr = newStr & ChrW$(AscW(buff(i))) & " "
    End If
Next i

'result string in range A2
Cells(2, 1) = Left(newStr, Len(newStr) - 1)

End Sub
Naresh
  • 2,984
  • 2
  • 9
  • 15
  • First loop is created to show the resultant array of my_string characters, Thus could be further improved to accommodate in one loop by checking `Mid$(my_string, i+1, 1)` in `DepVow` – Naresh Aug 30 '21 at 12:19
0

Try below algorithm. which will concat all the mark characters with letter characters.

redim letters(0) 
For i=1 To Len(Str) 
If ascW(Mid(Str,i,1)) >3005 And ascW(Mid(Str,i,1)) <3022 Then 
letters(UBound(letters)-1) = letters(UBound(letters)-1)+Mid(Str,i,1) 
Else REDIM PRESERVE
letters(UBound(letters) + 1) 
letters(UBound(letters)-1) = Mid(Str,i,1)
End If 
Next
 
MsgBox(join(letters, ", "))'return பா, ர், த், து,
Neechalkaran
  • 413
  • 4
  • 6