This question is specifically about how R handles regex. I would like a regex to use with gsub that keeps only the text before the 3rd forward slash.

Here are some string examples:

/google.com/images/video 
/msn.com/bing/chat
/bbc.com/video

I would like to obtain the following strings only:

/google.com/images
/msn.com/bing
/bbc.com/video

In other words, everything from the 3rd forward slash onward should be dropped.

I cannot seem to get any regex working along with using gsub to solve this!

The closest I have got is:

gsub(pattern = "/[A-Za-z0-9_.-]/[A-Za-z0-9_.-]*$", replacement = "", x = the_data_above )

I think R has some issues regarding forward slashes and escaping them.

Beans On Toast

3 Answers

Anchored at the start of the string, match two repetitions of "a slash followed by any number of non-slash characters", then match whatever remains. Replace the whole match with the captured two repetitions.

paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
sub("^((/[^/]*){2}).*", "\\1", paths)
## [1] "/google.com/images" "/msn.com/bing"      "/bbc.com/video"   
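
The same pattern can be parameterised on the number of segments to keep. This is only a sketch: `keep_segments` and its `n` argument are hypothetical names introduced for illustration, not part of the original answer.

```r
# Hypothetical helper (an assumption, not from the answer above):
# keep the first n slash-delimited segments of each path.
keep_segments <- function(x, n = 2) {
  # Build "^((/[^/]*){n}).*" with the requested repetition count
  pattern <- sprintf("^((/[^/]*){%d}).*", n)
  sub(pattern, "\\1", x)
}

paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")
two_segments <- keep_segments(paths)     # same result as the sub() call above
one_segment  <- keep_segments(paths, 1)  # keeps only the host part
```

A path with fewer than `n` segments does not match the pattern, so `sub` returns it unchanged.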
G. Grothendieck

You can take advantage of lazy (vs greedy) matching by adding the ? after the quantifier (+ in this case) within your capture group:

gsub("(/.+?/.+?)/.*", "\\1", text)
[1] "/google.com/images" "/msn.com/bing"      "/bbc.com/video" 

Data:

text <- c("/google.com/images/video",
"/msn.com/bing/chat",
"/bbc.com/video")
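
With only three slashes in the sample paths, greedy and lazy quantifiers happen to give the same result, so the difference is easier to see on a deeper path. The sketch below uses a hypothetical four-segment string and passes `perl = TRUE`, which is an assumption made here so that the usual PCRE backtracking behaviour applies.

```r
# Hypothetical four-segment path, chosen so the two patterns disagree.
deep <- "/a/b/c/d"

# Greedy: each .+ grabs as much as it can, so the capture group ends up
# holding three segments before the final "/" is matched.
greedy <- sub("(/.+/.+)/.*", "\\1", deep, perl = TRUE)

# Lazy: each .+? stops at the first slash that lets the match continue,
# so the capture group holds exactly two segments.
lazy <- sub("(/.+?/.+?)/.*", "\\1", deep, perl = TRUE)

print(greedy)  # "/a/b/c"
print(lazy)    # "/a/b"
```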
Andrew

Try this out:

^\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+

As seen here: https://regex101.com/r/9ZYppe/1

Your problem arises from the fact that [A-Za-z0-9_.-] on its own matches exactly one character. Add the + quantifier so that the class matches one or more characters. The $ at the end is also unnecessary once ^ anchors the match at the start of the string.

Robo Mop
  • I can't seem to extract the strings using this code in R - it seemingly only extracts the information after the 3rd slash – Beans On Toast Feb 26 '20 at 14:20
  • @BeansOnToast It might be because the `.` hasn't been escaped. Try this: `^\/[A-Za-z0-9_\.-]+\/[A-Za-z0-9_\.-]+` – Robo Mop Feb 26 '20 at 14:23
  • Nope - when using this code `gsub(pattern = "^\\/[A-Za-z0-9_\\.-]+\\/[A-Za-z0-9_\\.-]+",x = t, replacement = "")` it should replace those things after the third slash with a blank character but it is not - is there an alternative way to do this instead? – Beans On Toast Feb 26 '20 at 14:25
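
A likely reason for the result described in these comments: the pattern matches the part that should be kept, so passing it to gsub with an empty replacement deletes exactly that part and leaves the rest. One way to use the pattern as written is to extract the match instead of replacing it. The sketch below does that with `regexpr` and `regmatches` from base R; the variable names are illustrative only. Note that a forward slash needs no escaping in an R regex, and a `.` inside square brackets is already literal.

```r
paths <- c("/google.com/images/video", "/msn.com/bing/chat", "/bbc.com/video")

# Matches the first two segments, i.e. the part to keep.
keep <- "^/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+"

# Replacing the match with "" removes the wanted part and leaves the tail.
removed <- gsub(keep, "", paths)

# Extracting the match returns the wanted part.
kept <- regmatches(paths, regexpr(keep, paths))

print(removed)  # "/video" "/chat"  ""
print(kept)     # "/google.com/images" "/msn.com/bing" "/bbc.com/video"
```

One caveat: `regmatches` drops elements that do not match at all, so the result can be shorter than the input when some paths have fewer than two segments.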