Short version: The first line of the program below prints 3, which makes sense because a Go string is essentially a read-only slice of bytes, and it takes three bytes to encode this character in UTF-8. How can I get len and the regexp functions to work in terms of characters, not bytes?
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println(len("ウ"))                    // prints 3 (bytes)
	fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (characters)
}
Background:
I'm saving text into the GAE datastore using JDO (Java).
Then I'm processing the text using Go. Specifically, I'm using regexp.FindStringIndex and saving the index to the datastore.
Then, back in Java land, I send the unmodified text and the index to the GWT client via JSON.
Somewhere along the way the indexes are 'shifting', so by the time the text is on the client, they are off.
The issue seems to be character encoding: I'm assuming Java and Go interpret the text (and therefore the indexes) differently, as UTF-8 bytes versus characters. I see references to runes in the regexp package.
I think I can either make regexp.FindStringIndex return character indexes in Go, or make the GWT client understand the UTF-8 byte indexes.
Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?
Thanks
EDIT:
Also, when I was finding the index using Java on the server, things just worked.
On the client (GWT) I'm using text.substring(start,end).
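One detail that may matter here: Java strings, and the JavaScript strings that GWT compiles to, are indexed by UTF-16 code units, so substring(start,end) expects UTF-16 offsets. For characters in the Basic Multilingual Plane (which includes the Japanese text in this question) that is the same as the character count, but for characters outside it (such as many emoji) it is not. A minimal sketch of converting a Go byte offset into a UTF-16 offset follows; utf16Index is a hypothetical helper name, not part of any library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Index (hypothetical helper) converts a byte offset into s to an
// offset measured in UTF-16 code units, the unit that Java and JavaScript
// use for string indexes. It assumes byteOffset lies on a rune boundary.
func utf16Index(s string, byteOffset int) int {
	return len(utf16.Encode([]rune(s[:byteOffset])))
}

func main() {
	// U+1F600 takes 4 bytes in UTF-8 and 2 code units in UTF-16.
	s := "a\U0001F600b"
	b := strings.Index(s, "b")

	fmt.Println(b)                              // byte offset: 5
	fmt.Println(utf8.RuneCountInString(s[:b])) // rune offset: 2
	fmt.Println(utf16Index(s, b))               // UTF-16 offset: 3
}
```

If the text is known to contain only BMP characters, the rune offset and the UTF-16 offset coincide and either can be sent to the client.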
TEST:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// FindStringIndex returns [start, end) as byte offsets; [1] is the end.
	fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}
The code outputs 10, not 4.
The plan is to get a character index of 4 out of FindStringIndex instead of the byte index. Any ideas?
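One possible approach, sketched under the assumption that the offsets come straight from the regexp package (and therefore fall on character boundaries for valid UTF-8 input): leave FindStringIndex alone and convert each byte offset afterwards by counting the runes in the prefix that precedes it. runeIndex is a hypothetical helper name introduced only for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex (hypothetical helper) converts a byte offset into s to a
// character (rune) offset by counting the runes in s[:byteOffset].
// It assumes byteOffset lies on a rune boundary.
func runeIndex(s string, byteOffset int) int {
	return utf8.RuneCountInString(s[:byteOffset])
}

func main() {
	s := "ウィキa"
	loc := regexp.MustCompile(`a`).FindStringIndex(s)
	if loc == nil {
		fmt.Println("no match")
		return
	}
	fmt.Println(loc[0], loc[1])                             // byte offsets: 9 10
	fmt.Println(runeIndex(s, loc[0]), runeIndex(s, loc[1])) // rune offsets: 3 4
}
```

Each call rescans the prefix, so this is simplest for a handful of matches; for many matches in a long string, a single pass that records the character position of every byte offset avoids the repeated scanning.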
Update 2: Position Conversion
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	s := "ab日aba本語ba"
	byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
	fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]

	offset := 0
	// posMap[b] is the amount to subtract from byte position b to get its char position.
	posMap := make([]int, len(s))
	for pos, char := range s {
		fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
		posMap[pos] = offset
		offset += utf8.RuneLen(char) - 1 // each extra byte shifts later positions by one
	}
	fmt.Println("posMap =", posMap)

	// Convert each [start, end) pair in place. The end reuses the start's offset,
	// which is only correct when the matched text consists of single-byte characters.
	for pos, value := range byteIndex {
		fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
		value[1] -= posMap[value[0]]
		value[0] -= posMap[value[0]]
	}
	fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}
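Update 2, revised loop (replaces the for loop over s above; it derives the offset from the byte positions themselves instead of utf8.RuneLen):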
lastPos := -1
for pos, char := range s {
	// Add the number of bytes beyond the first that the previous character used.
	offset += pos - lastPos - 1
	fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
	posMap[pos] = offset
	lastPos = pos
}
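The position map above subtracts the start's offset from both ends of a match, so the end index comes out wrong when the match itself contains multi-byte characters (for example, a match on 本 at bytes [8 11] would become [6 9] rather than [6 7]). A minimal sketch of a variant that maps each end independently is shown here; toRuneIndexes and runeAt are hypothetical names, and the code assumes every offset lies on a character boundary, as offsets from FindAllStringIndex should for valid UTF-8 input.

```go
package main

import (
	"fmt"
	"regexp"
)

// toRuneIndexes (hypothetical helper) rewrites byte-offset pairs, such as
// those returned by FindAllStringIndex, into rune-offset pairs in place.
func toRuneIndexes(s string, locs [][]int) {
	// runeAt[b] is the number of runes that start before byte offset b.
	// Only entries at rune boundaries are filled in; it has len(s)+1
	// entries so that an end offset equal to len(s) can be looked up.
	runeAt := make([]int, len(s)+1)
	count := 0
	for pos := range s {
		runeAt[pos] = count
		count++
	}
	runeAt[len(s)] = count

	for _, loc := range locs {
		loc[0] = runeAt[loc[0]]
		loc[1] = runeAt[loc[1]]
	}
}

func main() {
	s := "ab日aba本語ba"
	locs := regexp.MustCompile(`a|本`).FindAllStringIndex(s, -1)
	fmt.Println(locs) // [[0 1] [5 6] [7 8] [8 11] [15 16]]
	toRuneIndexes(s, locs)
	fmt.Println(locs) // [[0 1] [3 4] [5 6] [6 7] [9 10]]
}
```

The string is scanned once regardless of how many matches there are, at the cost of one int per byte of input.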