Cleaning Data / Truncate short URL out of data

Question

I am currently cleaning some URL data from an e-commerce site because I want a better overview of which referrers the traffic came from.

I have tried the sub() function in R but I am running into difficulties in properly applying RegEx.

sub("*.com", "", q2$Session.First.Referrer)

I simply want to reduce a URL like "http://www.gazelle.com/main/home.jhtml" to its base, i.e. "www.gazelle.com".

Take a look through [tag:regex] questions on extracting parts of URLs. You can likely adapt regex used in another language to R — camille, Apr 03 '19 at 17:15
There are a lot of complications addressed in some other SO posts to think about: do you check for `http` and `https`? Do all URLs include `http://`, or do some start only with `www.`? Are there any subdomains, such as `http://stats.stackexchange.com/`, so that there won't be a `www`? What about `ww2.`? `.edu`? `.co.uk`? `.io`? It's actually a bigger task than it might seem at first. — camille, Apr 03 '19 at 17:47

score 1 · Accepted Answer · answered Apr 03 '19 at 17:19

Assuming that all your URLs have the same form, you can use gsub to remove the text that appears before "www" and after ".com", using the following as a guide:

# Example string
my.string = "http://www.gazelle.com/main/home.jhtml"
> my.string
[1] "http://www.gazelle.com/main/home.jhtml"

# remove everything after .com (the dot is escaped so it matches a literal ".")
output.string = gsub("\\.com.*", ".com", my.string)

# remove everything before www.
output.string = gsub(".*www\\.", "www.", output.string)

> output.string
[1] "www.gazelle.com"
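
The two steps can also be collapsed into a single call with a capture group, which is convenient when cleaning a whole column at once. This is only a minimal sketch under the same assumption as above (every URL contains "www." and ".com"); the `referrers` vector is made-up stand-in data for a column such as q2$Session.First.Referrer.

```r
# Hypothetical stand-in for a column such as q2$Session.First.Referrer
referrers <- c("http://www.gazelle.com/main/home.jhtml",
               "https://www.gazelle.com/")

# Keep only the part from "www." through the first ".com" after it.
# Dots are escaped so they match a literal "."; perl = TRUE enables the lazy ".*?".
cleaned <- sub(".*(www\\..*?\\.com).*", "\\1", referrers, perl = TRUE)
cleaned
```

Note that sub() returns a string unchanged when the pattern does not match, so URLs without "www." or ".com" would pass through uncleaned.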

score 1 · Answer 2 · answered Apr 03 '19 at 17:24

I used str_extract from the stringr package (a part of the tidyverse):

library(tidyverse)
library(stringr)

my_data <- tibble(addresses = c("https://www.fivethirtyeight.com/features/is-there-still-room-in-the-democratic-primary-for-biden/",
                                "https://www.docs.aws.amazon.com/sagemaker/latest/dg/sms.html",
                                "https://www.stackoverflow.com/questions/55500553/cleaning-data-truncate-short-url-out-of-data"))

# Escape the dots and stop at the first ".com" (lazy match) rather than the last "com"
str_extract(my_data$addresses, "www\\..+?\\.com")

Which returns:

[1] "www.fivethirtyeight.com" "www.docs.aws.amazon.com"
[3] "www.stackoverflow.com"
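
If the URLs are less uniform than in these examples (https, no "www", other top-level domains, as the comments on the question point out), matching on "www" and "com" breaks down. One possible alternative is to extract the host part instead. The sketch below uses base R only and made-up example URLs; it assumes an optional http:// or https:// prefix and does not handle every valid URL form (for example, credentials such as user@host).

```r
# Made-up example URLs covering a subdomain, a missing scheme and a non-.com domain
urls <- c("http://www.gazelle.com/main/home.jhtml",
          "https://stats.stackexchange.com/questions",
          "www.example.co.uk/page?id=1")

# Skip an optional "http://" or "https://", then capture everything
# up to the first "/", ":", "?" or "#" and drop the rest
hosts <- sub("^(?:https?://)?([^/:?#]+).*$", "\\1", urls, perl = TRUE)
hosts

# Count how often each host occurs, to get an overview of referrers
sort(table(hosts), decreasing = TRUE)
```

For the three inputs above, `hosts` contains "www.gazelle.com", "stats.stackexchange.com" and "www.example.co.uk".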
