Identify double byte character in a string and convert that into a single byte character

Question

In my Go project, I am dealing with Asian languages, so there are double-byte characters. In my case, I have a string that contains two words with a space between them.

EG: こんにちは　世界

Now I need to check whether that space is a double-byte space and, if so, convert it into a single-byte space.

I have searched a lot but couldn't find a way to do this, so I'm sorry that I have no code sample to add here.

Do I need to loop through each character, pick out the double-byte space by its code, and replace it? What code should I use to identify the double-byte space?

blackgreen · Accepted Answer · 2021-10-02T13:10:23.037

2

Just replace?

package main

import (
    "fmt"
    "strings"
)

func main() {
    // The second argument is the ideographic space (U+3000), not an ASCII space.
    fmt.Println(strings.Replace("こんにちは　世界", "　", " ", -1))
}

Notice that the second argument to Replace is 　, copy-pasted from the string in your example. It is the ideographic (full-width) space, U+3000. With -1 as the last argument, Replace finds every occurrence of it in the original string and replaces each one with an ASCII space.
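Since the question also asks how to check whether the wide space is present, here is a minimal sketch of that check combined with the replacement. The helper name normalizeSpaces is hypothetical, and the wide space is written as the escape \u3000 so it cannot be confused with an ASCII space when reading the source.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSpaces is a hypothetical helper: it reports whether s contains
// U+3000 (ideographic space) and returns s with each one replaced by U+0020.
func normalizeSpaces(s string) (string, bool) {
	const wide = "\u3000"
	if !strings.Contains(s, wide) {
		return s, false
	}
	return strings.ReplaceAll(s, wide, " "), true
}

func main() {
	out, changed := normalizeSpaces("こんにちは\u3000世界")
	fmt.Printf("%q changed=%v\n", out, changed)
}
```

strings.ReplaceAll(s, old, new) behaves the same as strings.Replace(s, old, new, -1); it is available from Go 1.12 onward.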

edited Oct 02 '21 at 13:10

answered Oct 02 '21 at 09:59

blackgreen


This does not catch double-byte spaces. It returns the same thing, since it does not find that one. A double-byte space and a single-byte space look similar to the eye. As far as I know, it is not two spaces – vigamage Oct 02 '21 at 10:08
1

@vigamage, strings.Replace doesn't care what it replaces. It doesn't "catch" anything. It just replaces one byte sequence with another. As long as you use the particular space you're interested in as the second argument this will just work. – Peter Oct 02 '21 at 11:39
1

I was thinking of much more complex scenarios, like converting it into runes and then checking each character. Dumb me didn't think of this simple solution. The reason for my comment on your answer was that I didn't see the first space as a wider (double-byte) one. I saw it as a usual space – vigamage Oct 02 '21 at 12:20

score 2 · Answer 2 · answered Oct 02 '21 at 12:41

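Go has no notion of a double-byte character. Strings hold bytes, conventionally UTF-8 encoded text, and there is a special type, rune, which is an alias for int32 and represents a single Unicode code point.

Your special space is code point 12288 (U+3000, the ideographic space), and the normal space is code point 32 (U+0020).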
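To make the code point values concrete, the sketch below prints each space as a code point, as a decimal number, and with the number of bytes its UTF-8 encoding takes. The helper name describe is hypothetical. Note that in UTF-8 the wide space occupies three bytes rather than two.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that formats a rune as its code point,
// its decimal value, and the length of its UTF-8 encoding in bytes.
func describe(r rune) string {
	return fmt.Sprintf("U+%04X decimal=%d utf8 bytes=%d", r, r, utf8.RuneLen(r))
}

func main() {
	fmt.Println(describe('\u3000')) // U+3000 decimal=12288 utf8 bytes=3
	fmt.Println(describe(' '))      // U+0020 decimal=32 utf8 bytes=1
}
```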

To iterate over characters you can use range

for _, char := range chars {...} // char is rune type
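As a fuller sketch of the loop above: ranging over a string yields the byte offset at which each rune starts, together with the rune itself, so the wide spaces can be located as well as detected. The helper name wideSpaceOffsets is hypothetical.

```go
package main

import "fmt"

// wideSpaceOffsets is a hypothetical helper returning the byte offset of
// every U+3000 in s. Ranging over a string decodes it rune by rune.
func wideSpaceOffsets(s string) []int {
	var offsets []int
	for i, r := range s {
		if r == '\u3000' {
			offsets = append(offsets, i)
		}
	}
	return offsets
}

func main() {
	// The five hiragana before the space take 3 bytes each in UTF-8.
	fmt.Println(wideSpaceOffsets("こんにちは\u3000世界")) // [15]
}
```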

To replace this character you can use strings.Replace, or strings.Map with a function that replaces the unwanted characters.

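// converter maps the ideographic space (12288, U+3000) to the ASCII space (32)
// and leaves every other rune unchanged.
func converter(r rune) rune {
    if r == 12288 {
        return 32
    }
    return r
}

// Inside a function body:
result := strings.Map(converter, "こんにちは　世界")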

It is also possible to use character literals instead of numbers:

if r == '　' {
    return ' '
}
