Splitting a rune correctly in golang

Question

I'm wondering whether there is an easy way, such as well-known functions for handling code points/runes, to take a chunk out of the middle of a rune slice without corrupting it, or whether it all needs to be coded by hand, in order to get the result down to at most a maximum number of bytes.

Specifically, what I am looking to do is pass a string to a function, convert it to runes so that I can respect code points and, if the slice is longer than some maximum number of bytes, remove enough runes from the center to get the byte count down to what's necessary.

This is simple math if the strings contain only single-byte characters, and can be handled with something like:

func shortenStringIDToMaxLength(in string, maxLen int) string {
    if len(in) > maxLen {
        excess := len(in) - maxLen
        start := len(in)/2 - excess/2
        return in[:start] + in[start+excess:]
    }
    return in
}

but for a string with variable-width characters, either it's going to take a fair bit more code looping through the string, or there are nice functions that make this easy. Does anyone have a code sample of how best to handle such a thing with runes?

The idea here is that the DB field the string will go into has a fixed maximum length in bytes, not code points, so there needs to be some algorithm that goes from runes to a maximum number of bytes. The reason for taking the characters from the middle of the string is just the needs of this particular program.

Thanks!

EDIT:

Once I found out that the range operator respects runes when iterating over a string, this became easy to do with plain strings, which I discovered thanks to the great answers below. I shouldn't have to worry about whether the string is well-formed UTF-8 in this case, but if I do, I now know about the `unicode/utf8` package. Thanks!
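As a quick illustration of that behavior, here is a minimal sketch showing that `range` over a string yields the byte offset at which each rune starts, not every byte index. The helper name `runeStarts` is made up for this example.

```go
package main

import "fmt"

// runeStarts returns the byte offset at which each rune of s begins.
// (Hypothetical helper, used only for this illustration.)
func runeStarts(s string) []int {
	var starts []int
	for pos := range s {
		starts = append(starts, pos)
	}
	return starts
}

func main() {
	s := "a日本b" // 'a' is 1 byte, '日' and '本' are 3 bytes each, 'b' is 1 byte
	fmt.Println(runeStarts(s)) // [0 1 4 7]
	fmt.Println(len(s))        // 8 (len counts bytes, not runes)
}
```

Any offset returned this way is a safe place to slice the string without splitting a rune.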

Here's what I ended up with:

package main

import (
    "fmt"
)

func ShortenStringIDToMaxLength(in string, maxLen int) string {
    if maxLen < 1 {
        // Panic/log whatever is your error system of choice.
    }
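    bytes := len(in)
    if bytes > maxLen {
        excess := bytes - maxLen
        target := bytes/2 - excess/2
        // lPos: start of the last rune that begins at or before target.
        lPos := 0
        for pos := range in {
            if pos > target {
                break
            }
            lPos = pos
        }
        // rPos: offset, relative to lPos, of the first rune that starts at
        // or after excess. If there is none, drop everything up to the end.
        rPos := bytes - lPos
        for pos := range in[lPos:] {
            if pos >= excess {
                rPos = pos
                break
            }
        }
        return in[:lPos] + in[lPos+rPos:]
    }
    return in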
}

func main() {
    out := ShortenStringIDToMaxLength(`123456789 123456789`, 5)
    fmt.Println(out, len(out))
}

https://play.golang.org/p/YLGlj_17A-j

Why don't you just convert the string to a slice of runes before doing your logic? — Z. Kosanovic, Sep 26 '20 at 21:04
As @Kosanovic suggested, does this partly answer your question? https://stackoverflow.com/a/62739051/12817546 — , Sep 26 '20 at 22:51
Before returning the shortened string, call [strings.ToValidUTF8(...)](https://golang.org/pkg/strings/#ToValidUTF8) to remove any invalid UTF-8 bytes that may result if a cut goes through a multi-byte rune. — Mark, Sep 27 '20 at 03:48

LeGEC · Accepted Answer · 2020-09-27T15:45:37.500

Here is an adaptation of your algorithm, which removes incomplete runes from the end of your prefix and the beginning of your suffix:

import "unicode/utf8"

func TrimLastIncompleteRune(s string) string {
    l := len(s)

    for i := 1; i <= l; i++ {
        suff := s[l-i : l]
        // repeatedly try to decode a rune from the last bytes in string
        r, cnt := utf8.DecodeRuneInString(suff)
        // (RuneError, 1) signals invalid bytes; a genuine U+FFFD in the
        // input decodes with cnt == 3 and must be kept
        if r == utf8.RuneError && cnt == 1 {
            continue
        }

        // if success: return the substring which contains
        // this successfully decoded rune
        lgth := l - i + cnt
        return s[:lgth]
    }

    return ""
}

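func TrimFirstIncompleteRune(s string) string {
    // repeatedly try to decode a rune from the beginning
    for i := 0; i < len(s); i++ {
        // RuneError with size 1 means an invalid byte; a genuine U+FFFD
        // in the input has size 3 and counts as a success
        if r, size := utf8.DecodeRuneInString(s[i:]); r != utf8.RuneError || size > 1 {
            // if success: return
            return s[i:]
        }
    }
    return ""
}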

func shortenStringIDToMaxLength(in string, maxLen int) string {
    if len(in) > maxLen {
        firstHalf := maxLen / 2
        secondHalf := len(in) - (maxLen - firstHalf)

        prefix := TrimLastIncompleteRune(in[:firstHalf])
        suffix := TrimFirstIncompleteRune(in[secondHalf:])

        return prefix + suffix
    }
    return in
}

link on play.golang.org

This algorithm only tries to drop more bytes from the selected prefix and suffix.

If it turns out that you need to drop 3 bytes from the suffix to have a valid rune, for example, it does not try to see if it can add 3 more bytes to the prefix, to have an end result closer to maxLen bytes.

This is great because it drew my attention to the UTF-8 package (`unicode/utf8`), which I didn't know about; it got me looking around and ultimately coming up with a solution, so I'm giving you the credit. When I first started looking into this, I kept reading that strings are byte oriented, hence runes. But while exploring that package, I saw that even though that is true of strings, the range operator on a string returns the index of each successive rune rather than advancing byte by byte, which changed everything. — Reg, Sep 27 '20 at 21:55

score 0 · Answer 2 · answered Sep 27 '20 at 09:14

You can use simple arithmetic to find start and end such that the string s[:start] + s[end:] is no longer than your byte limit. But you need to make sure that start and end each point at the first byte of a UTF-8 sequence (or at the end of the string) to keep the result valid.

UTF-8 has the property that any given byte is the first byte of a sequence as long as its top two bits aren't 10.

So you can write code something like this (playground: https://play.golang.org/p/xk_Yo_1wTYc)

package main

import (
    "fmt"
)

func truncString(s string, maxLen int) string {
    if len(s) <= maxLen {
        return s
    }
    start := (maxLen + 1) / 2
    for start > 0 && s[start]>>6 == 0b10 {
        start--
    }
    end := len(s) - (maxLen - start)
    for end < len(s) && s[end]>>6 == 0b10 {
        end++
    }
    return s[:start] + s[end:]
}

func main() {
    fmt.Println(truncString("this is a test", 5))
    fmt.Println(truncString("日本語", 7))
}

This code has the desirable property that it takes O(maxLen) time, no matter how long the input string is (assuming it's valid UTF-8).

The `end` part works; for the `start` part, you also need to handle the first byte: either the whole rune fits in, and you can keep it, or it doesn't, and you have to discard that extra byte. — LeGEC, Sep 27 '20 at 21:30
