2

I'd like to remove a 'destinationId' parameter from a batch of URLs.

If I have a URL like this:

https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub

How would I extract the 45 (destinationId=45)?

I attempted to use something like this, which I can't get to work:

destinationIdParameter <- sub("[^0-9].*","",sub("*?\\destinationId=","",url))
Jan
Tim496
  • Remove from url or extract from url? – s_baldur Apr 03 '18 at 11:20
  • Possible duplicate of [How to match the bundle id for android app?](https://stackoverflow.com/questions/49628728/how-to-match-the-bundle-id-for-android-app) – Lance Toth Apr 03 '18 at 11:46
  • Possible duplicate of [Extract URL parameters and values in R](https://stackoverflow.com/questions/34811595/extract-url-parameters-and-values-in-r) – Munim Munna Apr 03 '18 at 13:38

5 Answers

4

With stringr you can get it like this:

> library(stringr)
> address <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
> # [^&]* stops at the next "&", so this also works when destinationId is the last parameter
> str_match(address, "destinationId=([^&]*)")[,2]
[1] "45"

If (like me) you're not comfortable with regular expressions, use the qdapRegex package:

> library(qdapRegex)
> address <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
> ex_between(address, "destinationId=", "&")
[[1]]
[1] "45"
Stéphane Laurent
  • Thanks! I really like the qdapRegex approach as regular expressions are confusing. It's not as quick to compute as gsub solution tho :( – Tim496 Apr 03 '18 at 12:52
1

With base R you can extract the number in a few ways. If you are certain that URLs of this kind always contain exactly one number, you can simply delete everything that is not a digit:

> url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
> gsub("[^0-9]", "", url)
[1] "45"

Or, to be safer and get specifically the number that follows "destinationId=" rather than any other, you could do something like this:

# regexpr() returns the first match as a character vector (gregexpr() would return a list)
destId <- regmatches(url, regexpr("destinationId=\\d+", url))
gsub("[^0-9]", "", destId)
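Since the question mentions a batch of URLs, the same idea can be applied to a whole character vector at once. The following is a minimal base R sketch, not a definitive implementation: the function name `extract_destination_id` and the sample URLs are made up for illustration, and it assumes the ID consists of digits only.

```r
# Hypothetical helper: returns the destinationId of each URL, or NA if absent.
extract_destination_id <- function(urls) {
  # [?&] anchors the match to the start of a query parameter,
  # so a parameter such as "olddestinationId" is not picked up.
  matches <- regmatches(urls, regexec("[?&]destinationId=([0-9]+)", urls))
  vapply(
    matches,
    # Element 1 is the full match, element 2 the captured digits.
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

# Illustrative input (only the first URL comes from the question).
urls <- c(
  "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub",
  "https://urlaub.xxx.de/lastminute/?semcid=de.ub",
  "https://urlaub.xxx.de/lastminute/?semcid=de.ub&destinationId=7"
)
ids <- extract_destination_id(urls)
```

Returning `NA` for URLs without the parameter keeps the result the same length as the input, which makes it easy to attach as a column to a data frame.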
1

To extract the destinationId value from the URL, you could do:

sub(".*destinationId=(\\d+).*", "\\1", url)
  • Here \\1 refers to what is matched within ().
  • .* matches any character sequence, including an empty one, so this also works when destinationId is the last parameter.
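If the goal is instead to remove the parameter from the URLs, as the first sentence of the question puts it, a base R sketch could look like the following. The function name `remove_param` is hypothetical, and the code assumes the URLs carry no `#fragment` and that the parameter name contains no regex metacharacters.

```r
# Hypothetical helper: drops "name=value" from the query string of each URL.
remove_param <- function(urls, name = "destinationId") {
  # Match the separator before the parameter, the pair itself,
  # and an optional "&" after it; keep only the leading separator.
  pattern <- paste0("([?&])", name, "=[^&]*&?")
  out <- sub(pattern, "\\1", urls)
  # Drop a dangling "?" or "&" left at the end of the URL.
  sub("[?&]$", "", out)
}

url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
cleaned <- remove_param(url)
```

Both `sub()` calls are vectorized, so the same function works on a character vector of URLs.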
s_baldur
1

I think the best way is parameters() from the urltools package, which returns the query string of a URL:

library(urltools)
example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
parameters(example_url)
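To pull out a single named parameter such as destinationId, urltools also provides `param_get()`. The snippet below is a small sketch applied to the URL from the question, assuming the urltools package is installed; the variable names are illustrative.

```r
library(urltools)  # assumed to be installed

url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"

# param_get() returns a data.frame with one column per requested parameter
# and one row per input URL.
params <- param_get(url, "destinationId")
destination_id <- params$destinationId
```

Because the result has one row per URL, passing a character vector of URLs yields the IDs for the whole batch in one call.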
stevec
0

With base R, we can do:

url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"

extract <- function(url) {
  pattern <- "destinationId=\\K\\d+"
  (id <- regmatches(url, regexpr(pattern, url, perl = TRUE)))
}

print(extract(url))


Alternatively (no perl = TRUE):
vanilla_extract <- function(url) {
  pattern <- "destinationId=([^&]+)"
  (regmatches(url, regexec(pattern, url))[[1]][2])
}

Both yield

[1] "45"
Jan