0

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need a regex filter that ignores the last character if it is a "/".

I tried the regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" (tested at regexr.com/4om61), but it doesn't work when I run it in R:

regex_exp_R   <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)

I need this to work in pure regex and grep function, without using any string R package. Thank you.

Simplified case: After the helpful contributions from all of you, one last issue remains. Because I will pass the regex as input to another function, the solution must work with pure regex and grep.

The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on suggestions I received, I tried

grep(".*[^//]" ,"a1bc/", perl = T, value = T), but still get "a1bc/" instead of "a1bc". Any hints? Thank you.

Jaap
Fabio Correa
  • Use this as regex: `(?:http://)?compras\.dados\.gov\.br.*\?[^/]*` There is no need to use lookbehind here. – anubhava Nov 12 '19 at 16:47
  • `gsub('/$', '', x)` will make a copy of `x` without the `/` at the end (if there is one for the given element of `x`) – IceCreamToucan Nov 12 '19 at 16:53
  • I am not completely clear on what you are looking for--what do you mean by ignore? Do you want it returned without the last `/` or do you want it to be an optional element of your search pattern. – Andrew Nov 12 '19 at 16:55
  • Dear Andrew, I want the string returned without the last "/". Thank you – Fabio Correa Nov 12 '19 at 17:01
  • @FabioCorrea: Did you try my suggested regex? – anubhava Nov 12 '19 at 17:04
  • @anubhava, yes, it did not work. Simplifying the problem and building on your proposed solution, when I try grep(".*[^//]" ,"abc/", perl = T, value = T), I get "abc/" instead of "abc". Thank you. – Fabio Correa Nov 12 '19 at 17:14
  • With `grep()`, even if you correctly match part of the string, it will return the original string regardless. E.g., `grep("a", "abc", value = T)` – Andrew Nov 12 '19 at 17:16
  • @FabioCorrea: Check this: https://stackoverflow.com/a/23901600/548225 – anubhava Nov 12 '19 at 17:20
  • Just `grep` the `gsub("/+$", "", x)` – Wiktor Stribiżew Nov 12 '19 at 17:25
  • I edited the original question for a simplified last issue, after all contributions. Thank you. – Fabio Correa Nov 12 '19 at 19:08
  • @FabioCorrea, it cannot work with only `grep()` because `grep()` is not designed to return partial matches. It is designed to return an index--`grep("c", letters)`--but it can return the value of the original string instead of the index--`grep("c", letters, value = T)`. I would suggest using another base function such as `gsub` (on its own, or with `grep`). Read the value header in `?grep` – Andrew Nov 12 '19 at 19:30

4 Answers

0

If you want to return the string without the last /, you can do this in several ways. Below are a couple of options using base R:

Using a back-reference in gsub() (sub() would work too here):

gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"

# or, adapting your original pattern
gsub("((http://)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"

By position, using ifelse() and substr() (this will probably be a little faster if scaling matters):

ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"

Data:

x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
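As a quick check of the position-based idea on the simplified inputs from the question, here is a minimal sketch. The helper name `strip_trailing_slash` is hypothetical (it is not from this answer), and `endsWith()` assumes R 3.3.0 or later.

```r
# Hypothetical helper: drop a single trailing "/" by position, no regex involved.
strip_trailing_slash <- function(x) {
  has_slash <- endsWith(x, "/")                      # TRUE where the last character is "/"
  ifelse(has_slash, substr(x, 1, nchar(x) - 1), x)   # cut one character only where needed
}

# "a1bc/" and "a1bc" come from the question; the URL is the original example.
x <- c("a1bc/", "a1bc",
       "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/")

res <- strip_trailing_slash(x)
print(res)
```

Strings that do not end in / pass through unchanged, so the same call can be applied to a mixed vector.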
Andrew
0

Use sub to remove a trailing /:

x <- c("a1bc/", "a2bc")
sub("/$", "", x)

This changes nothing on a string that does not end in /.

As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
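The difference between selecting and editing can be seen in a small sketch. The vector below is made up for illustration (only "a1bc/" and "a1bc" come from the question), and the `^a1bc` pattern is just an assumed filter.

```r
# Hypothetical sample vector; "zzz/" is added only to show the filtering step.
x <- c("a1bc/", "a1bc", "zzz/")

# grep() selects elements; it never edits them.
idx  <- grep("^a1bc", x)                 # positions of the matching elements
vals <- grep("^a1bc", x, value = TRUE)   # the same elements, returned unmodified

# sub() does the editing: drop one trailing "/" from the selected elements.
cleaned <- sub("/$", "", vals)

print(idx)
print(vals)
print(cleaned)
```

Here `vals` still contains "a1bc/" with its slash; only the `sub()` step removes it.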

ngwalton
0

You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:

.+(?<!\/)

You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match URLs, replace the .+ part at the beginning with your URL regex.
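In R, one way to get the matched part back (rather than the whole element, which is what grep returns) is to pair the pattern with base R's regexpr() and regmatches(). This is a sketch under that assumption, not part of the original answer; the "a1bc//" input is an extra, hypothetical case.

```r
# Inputs from the question, plus one hypothetical double-slash case.
x <- c("a1bc/", "a1bc", "a1bc//",
       "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/")

# perl = TRUE is required because look-behind is a PCRE feature.
m <- regexpr(".+(?<!/)", x, perl = TRUE)

# regmatches() returns the matched substrings, not the original elements.
matched <- regmatches(x, m)
print(matched)
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input when some strings do not match the pattern.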

David542
-1

How about trying gsub("(.*?)/+$","\\1",s)?

ThomasIsCoding