How do I match secret_code_data in this string:

xeno://soundcloud/?code=secret_code_data#

I've tried

val regex = Regex("""xeno://soundcloud/?code=(.*?)#""")
field = regex.find(url)?.value ?: ""

without luck. I suspect the ? before code might be the problem. Should I escape it somehow? Can you help?

dsh
ssuukk
  • Why are you using `.*?`? I think you mean simply `.*` – yole Jan 04 '16 at 15:54
  • Try `field = regex.find(input)!!.groups[1]!!.value` and put the `?` into a character class: `[?]`. Or something like `val regex = "xeno://soundcloud/[?]code=(.*?)#".toRegex() // val input = "xeno://soundcloud/?code=secret_code_data#" // val result = regex.find(input)!!.groups[1]!!.value`. – Wiktor Stribiżew Jan 04 '16 at 16:09
  • I would suggest parsing the url and extracting the query parameter instead of using RegEx. – Kirill Rakhman Jan 04 '16 at 16:10
  • I addressed the regex, as well as using other safe parsing models below. Regex is not safe if the URL may have encoded characters, or you try to decode as one whole string. – Jayson Minard Jan 04 '16 at 17:13
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jayson Minard Jan 22 '16 at 12:00

2 Answers

Here are three options. The first provides a regex that does what you want. The other two parse the URL without regex and handle URL component encoding/decoding correctly.

Parsing using Regex

NOTE: The regex method is unsafe in most use cases because it does not parse the URL into components and then decode each component separately. Normally you cannot decode the whole URL into one string and then parse it safely, because decoded characters such as & or # can be mistaken for delimiters by the regex. This is similar to the problem of parsing XHTML using regex. See the alternatives to regex below.
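To make the note above concrete, here is a minimal sketch of how decoding the whole URL first can go wrong. The pattern `codePattern`, the two helper functions, and the sample value `a&b#c` are made up for illustration and are not part of the original question.

```kotlin
import java.net.URLDecoder

// Hypothetical pattern for illustration: capture the value of "code" up to '#' or '&'.
val codePattern = Regex("""[?&]code=([^#&]*)""")

// Unsafe: decodes the whole URL first, so encoded '&' and '#' become real delimiters.
fun codeDecodedFirst(url: String): String? =
    codePattern.find(URLDecoder.decode(url, "UTF-8"))?.groupValues?.get(1)

// Safer: isolates the raw component first, then decodes only that component.
fun codeDecodedAfter(url: String): String? =
    codePattern.find(url)?.groupValues?.get(1)?.let { URLDecoder.decode(it, "UTF-8") }

fun main() {
    // The intended value is "a&b#c"; '&' and '#' are percent-encoded as %26 and %23.
    val url = "xeno://soundcloud/?code=a%26b%23c"
    println(codeDecodedFirst(url)) // a
    println(codeDecodedAfter(url)) // a&b#c
}
```

The second helper is still regex-based, so it only illustrates the ordering problem; the parser-based options below remain the more robust choice.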

Here is a cleaned-up regex, wrapped in a function, that handles more URLs safely. At the end of this post is a unit test you can use for each method.

private val SECRET_CODE_REGEX = """xeno://soundcloud[/]?.*[\?&]code=([^#&]+).*""".toRegex()
fun findSecretCode(withinUrl: String): String? =
        SECRET_CODE_REGEX.matchEntire(withinUrl)?.groups?.get(1)?.value

This regex handles these cases:

  • with and without trailing / in path
  • with and without fragment
  • parameter as first, middle or last in list of parameters
  • parameter as only parameter

Note that the idiomatic way to make a regex in Kotlin is someString.toRegex(). It and other extension methods can be found in the Kotlin API Reference.

Parsing using UriBuilder or similar class

Here is an example using the UriBuilder from the Klutter library for Kotlin. This version handles encoding/decoding including more modern JavaScript unicode encodings not handled by the Java standard URI class (which has many issues). This is safe, easy, and you don't need to worry about any special cases.

Implementation:

fun findSecretCode(withinUrl: String): String? {
    fun isValidUri(uri: UriBuilder): Boolean = uri.scheme == "xeno"
                    && uri.host == "soundcloud"
                    && (uri.encodedPath == "/" || uri.encodedPath.isNullOrBlank())
    val parsed = buildUri(withinUrl)
    return if (isValidUri(parsed)) parsed.decodedQueryDeduped?.get("code") else null
}

The Klutter uy.klutter:klutter-core-jdk6:$klutter_version artifact is small, and includes some other extensions, including the modernized URL encoding/decoding. (For $klutter_version use the most current release.)

Parsing with JDK URI Class

This version is a little longer. It shows that you need to parse the raw query string yourself, decode each name and value after splitting, and then find the query parameter:

import java.net.URI
import java.net.URLDecoder

fun findSecretCode(withinUrl: String): String? {
    fun isValidUri(uri: URI): Boolean = uri.scheme == "xeno"
            && uri.host == "soundcloud"
            && (uri.rawPath == "/" || uri.rawPath.isNullOrBlank())

    val parsed = URI(withinUrl)
    return if (isValidUri(parsed)) {
        // rawQuery is null when the URL has no query string, so use safe calls
        parsed.rawQuery?.split('&')?.map {
            // split on the first '=' only, so a value containing '=' is kept whole
            val parts = it.split('=', limit = 2)
            val name = parts.firstOrNull() ?: ""
            val value = parts.drop(1).firstOrNull() ?: ""
            URLDecoder.decode(name, Charsets.UTF_8.name()) to URLDecoder.decode(value, Charsets.UTF_8.name())
        }?.firstOrNull { it.first == "code" }?.second
    } else null
}

This could be written as an extension on the URI class itself:

fun URI.findSecretCode(): String? { ... }

In the body, remove the parsed variable and use this instead, since the receiver already is the URI. Then call it using:

val secretCode = URI(myTestUrl).findSecretCode()
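For reference, one way the elided extension body could look is sketched below. It is an adaptation of the JDK version above and not the author's exact code; it assumes a URL with no query string should yield null, and that only the first '=' separates a name from its value.

```kotlin
import java.net.URI
import java.net.URLDecoder

// Sketch of the extension form: the receiver (this) replaces the parsed variable.
fun URI.findSecretCode(): String? {
    val validUri = scheme == "xeno" && host == "soundcloud" &&
            (rawPath == "/" || rawPath.isNullOrBlank())
    if (!validUri) return null
    // rawQuery is null when there is no query string; the safe calls propagate that.
    return rawQuery?.split('&')
        ?.map { pair ->
            // Assumption: split on the first '=' only, then decode each side separately.
            val parts = pair.split('=', limit = 2)
            val name = URLDecoder.decode(parts[0], "UTF-8")
            val value = URLDecoder.decode(parts.getOrElse(1) { "" }, "UTF-8")
            name to value
        }
        ?.firstOrNull { it.first == "code" }
        ?.second
}

fun main() {
    println(URI("xeno://soundcloud/?cat=hairless&code=secret_code_data").findSecretCode())
}
```

Note that the URI constructor throws URISyntaxException for malformed input, so callers handling untrusted URLs may want to catch it.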

Unit Tests

Given any of the functions above, run this test to prove it works:

import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertNotEquals

class TestSo34594605 {
    @Test fun testUriBuilderFindsCode() {
        // positive test cases

        val testUrls = listOf("xeno://soundcloud/?code=secret_code_data#",
                "xeno://soundcloud?code=secret_code_data#",
                "xeno://soundcloud/?code=secret_code_data",
                "xeno://soundcloud?code=secret_code_data",
                "xeno://soundcloud?code=secret_code_data&other=fish",
                "xeno://soundcloud?cat=hairless&code=secret_code_data&other=fish",
                "xeno://soundcloud/?cat=hairless&code=secret_code_data&other=fish",
                "xeno://soundcloud/?cat=hairless&code=secret_code_data",
                "xeno://soundcloud/?cat=hairless&code=secret_code_data&other=fish#fragment"
        )

        testUrls.forEach { test ->
            assertEquals("secret_code_data", findSecretCode(test), "source URL: $test")
        }

        // negative test cases, make sure nothing matches by accident

        val badUrls = listOf("xeno://soundcloud/code/secret_code_data#",
                "xeno://soundcloud?hiddencode=secret_code_data#",
                "http://www.soundcloud.com/?code=secret_code_data")

        badUrls.forEach { test ->
            assertNotEquals("secret_code_data", findSecretCode(test), "source URL: $test")
        }
    }
}
Jayson Minard
  • Very nice answer, thanks. Maybe `groups!![0]` looks more idiomatic than `groups?.get(0)`. – m0skit0 Apr 10 '17 at 15:39
  • @m0skit0 that suggested change does not propagate the null on missing matches. Nor does adding `?.let{ ... }` anywhere really help clean it up. – Jayson Minard Apr 16 '17 at 14:12

Add an escape before the first question mark, as it has a special meaning in a regex (it makes the preceding character optional):

? 

becomes

\?

You are also capturing the secret code in the first group. I am not sure the Kotlin code that follows is extracting that first group, though.
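A minimal sketch of both points, using the URL from the question: the question mark is escaped, and the first group is read explicitly. The function name `extractCode` is made up for this example.

```kotlin
// The escaped '?' matches a literal question mark instead of making '/' optional.
val escapedRegex = Regex("""xeno://soundcloud/\?code=(.*?)#""")

// Hypothetical helper: returns the first capture group, or null when nothing matches.
fun extractCode(url: String): String? =
    escapedRegex.find(url)?.groupValues?.get(1)

fun main() {
    val url = "xeno://soundcloud/?code=secret_code_data#"
    // .value is the whole match, not the captured group.
    println(escapedRegex.find(url)?.value) // xeno://soundcloud/?code=secret_code_data#
    println(extractCode(url))              // secret_code_data
}
```

So the code in the question needs two changes: the escape in the pattern, and reading group 1 rather than `.value`.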

buckley
  • `Regex("""xeno://soundcloud/\?code=(.*?)#""").find("xeno://soundcloud/?code=secret_code_data#")?.value` returns `"xeno://soundcloud/?code=secret_code_data#"` so escaping the question mark as @buckley suggests does do the trick. @ssuukk Did you try escaping the question-mark in the `url` and not the `Regex` `pattern` instead? – mfulton26 Jan 04 '16 at 16:24