I would like to split a string by spaces but keep the spaces inside quotes (and the quotes themselves). The catch is that the quotes can be nested, and I need this to work for both single and double quotes. So, from the line `this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"` I would like to get `[this, "'"is a possible option"'", and, ""so is this"", and, '''this one too''', and, even, ""mismatched quotes"]`.

This question has been asked before, but not in exactly this form. The existing solutions include one that uses a matcher (it splits `"""x"""` into `[""", x"""]`, so it is not what I need) and one that uses Apache Commons (it works with `"""x"""` but not with `""x""`, since it pairs the first two double quotes and leaves the last two attached to `x`). There are also suggestions to write a function that does this manually, but that would be a last resort.

wouldnotliketo
  • I think there's no solution other than to build a custom parser. No regex will match arbitrary nesting like this. – markspace Feb 04 '19 at 19:51

1 Answer

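You can achieve that with the following regex: `["']+[^"']+?["']+`. It matches a run of quote characters, then the shortest possible run of non-quote characters, then another run of quote characters. Using that pattern, you can retrieve the indices at which to split like this: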

val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
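To see what the pattern picks up before doing any splitting, a small sketch like the following may help. The helper name `quotedRanges` and the sample string are made up for illustration; only standard `Regex` calls are used.

```kotlin
// Hypothetical helper: returns the index range of every quoted chunk the pattern finds.
fun quotedRanges(input: String): List<IntRange> =
    Regex("""["']+[^"']+?["']+""").findAll(input).map { it.range }.toList()

fun main() {
    val sample = "say \"hello world\" now"
    // Print each range together with the text it covers.
    for (r in quotedRanges(sample)) {
        println("$r -> ${sample.substring(r)}")
    }
}
```

Each range is inclusive on both ends, which is why the splitting code has to add 1 to the end index when it calls `substring`.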

The rest is building the list out of substrings. Here is the complete function:

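fun String.splitByPattern(pattern: String): List<String> {

    // Flat list of [start, endInclusive, start, endInclusive, ...] for every match.
    // Note: text after the last match is not covered by these indices, so it is not returned.
    val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()

    var lastIndex = 0
    return indices.mapIndexed { i, ele ->

        // Even entries are match starts, which close the preceding unquoted chunk;
        // odd entries are inclusive match ends, so add 1 to get an exclusive end.
        val end = if(i % 2 == 0) ele else ele + 1

        substring(lastIndex, end).apply {
            lastIndex = end
        }
    }
}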

Usage:

val str = """
this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
""".trim()

println(str.splitByPattern("""["']+[^"']+?["']+"""))

Output:

[this , "'"is a possible option"'", and , ""so is this"", and , '''this one too''', and even , ""mismatched quotes"]

Try it out on Kotlin's playground!
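If the unquoted text should also be split into individual words, and text after the last quoted chunk should be kept, one possible variation is sketched below. It is not part of the function above: the name `tokenizeKeepingQuotes` is hypothetical, and it assumes that quote characters never appear inside ordinary words (an apostrophe as in `don't` would confuse the pattern) and that quoted chunks are separated from other words by whitespace.

```kotlin
// Hypothetical variation: quoted chunks stay intact, everything else is split on whitespace.
fun tokenizeKeepingQuotes(input: String): List<String> {
    val quoted = Regex("""["']+[^"']+?["']+""")
    val whitespace = Regex("""\s+""")
    val tokens = mutableListOf<String>()
    var last = 0

    // Split unquoted text on whitespace and drop the empty pieces.
    fun addPlain(text: String) {
        tokens += text.split(whitespace).filter { it.isNotEmpty() }
    }

    for (m in quoted.findAll(input)) {
        addPlain(input.substring(last, m.range.first)) // text before this quoted chunk
        tokens += m.value                              // the quoted chunk, quotes included
        last = m.range.last + 1
    }
    addPlain(input.substring(last))                    // text after the last quoted chunk
    return tokens
}

fun main() {
    val str = """
this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
""".trim()
    println(tokenizeKeepingQuotes(str))
}
```

Because the quoted chunks are taken straight from the regex matches, this variation inherits the pattern's limits: it does not track which quote opened a chunk, so it only treats nesting as runs of quote characters on either side of the text.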

Willi Mentzel
    Thank you! This does work for fishing out the parts inside the quotes (with quotes included). However, two things: 1. I wanted to also tokenize everything outside the quotes. So, instead of `..., and even, ...` I wanted to get `..., and, even, ...`. 2. For a string like `text "text with spaces" and more text` I get `[text, 'text with spaces"]`, and the last part without the quotes doesn't make it to the final list. However, those are the things that I could just do manually, so, once again, thank you! – wouldnotliketo Feb 05 '19 at 05:44