There are a few posts that cover something like this question, for example:

Remove square brackets from a string vector

... but regex is so damned hard that I can't get anything I try to work.

I've copied and pasted a large table from HTML and its structure is fine, but there are some trailing artefacts in one column.

Here's some example data:

df <- structure(
  list(
    From = c("3 February 2015[N 4]", "23 February 2017[N 3]",
             "17 March 2010[N 1]", "22 July 2016[N 2]", "14 May 1986",
             "22 February 1995", "8 June 1995", "12 August 1996"),
    Until = c("4 November 2015", "17 October 2017", "9 May 2010",
              "3 January 2017", "21 February 1995", "8 June 1995",
              "12 August 1996", "13 September 1996")
  ),
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -8L),
  spec = structure(
    list(
      cols = list(
        Name = structure(list(), class = c("collector_character", "collector")),
        Nat. = structure(list(), class = c("collector_logical", "collector")),
        Club = structure(list(), class = c("collector_character", "collector")),
        From = structure(list(), class = c("collector_character", "collector")),
        Until = structure(list(), class = c("collector_character", "collector")),
        `Duration
(days)` = structure(list(), class = c("collector_double", "collector")),
        `Years in
League` = structure(list(), class = c("collector_character", "collector")),
        Ref. = structure(list(), class = c("collector_character", "collector"))
      ),
      default = structure(list(), class = c("collector_guess", "collector")),
      skip = 1
    ),
    class = "col_spec"
  )
)

The artefacts are square brackets containing a letter and a number, e.g. [N 1].

When I parse the columns into dates, Until works just fine:

library(dplyr)      # provides %>% and mutate()
library(lubridate)  # provides dmy()
df %>%
  mutate(Until = dmy(Until))

But the column From with the odd artefact fails to parse for those entries:

df %>%
  mutate(From = dmy(From))

I tried gsub with plain text first, even though it would mean removing one artefact at a time:

gsub("[N1]", "", df$From)

... but text in the column beyond the artefact entries gets messed up, I'm assuming due to the square brackets.
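
A minimal sketch of why this happens, using only base R: in a regular expression, "[N1]" is a character class that matches a single N or a single 1 anywhere in the string, not the literal text [N1]. The vector x below is a shortened, illustrative stand-in for df$From. Passing fixed = TRUE makes gsub treat the pattern as literal text, although that still removes only one exact artefact per call.

```r
# Illustrative stand-in for df$From (two values taken from the example data)
x <- c("17 March 2010[N 1]", "8 June 1995")

# As a regex, "[N1]" is a character class: every "N" and every "1" is deleted
as_regex <- gsub("[N1]", "", x)
print(as_regex)

# fixed = TRUE matches the literal text "[N 1]" instead
as_literal <- gsub("[N 1]", "", x, fixed = TRUE)
print(as_literal)
```

The first call damages both dates, including the one that never had an artefact, which matches the behaviour described above.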

I then tried regex, but can't get it to work:

gsub("\\[.*?\]/", "", df$From)

gsub("\\[N\d\\]", "", df$From)

Both give the same kind of error, e.g.: Error: '\]' is an unrecognized escape in character string starting
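
For reference, a hedged sketch of how the two attempts look once every regex backslash is doubled inside the R string literal and the stray trailing / is dropped. R turns "\\[" into the two characters \[ before the regex engine sees them, whereas a single "\]" or "\d" is rejected by the string parser itself. The vector x is an illustrative stand-in for df$From, and the \\s* is an assumption added here to allow for the space in "[N 4]".

```r
# Illustrative stand-in for df$From
x <- c("3 February 2015[N 4]", "23 February 2017[N 3]", "14 May 1986")

# Lazy match: a "[", as few characters as possible, then a "]"
any_bracket <- gsub("\\[.*?\\]", "", x)

# Stricter: "[N", optional whitespace, one or more digits, "]"
n_footnote <- gsub("\\[N\\s*\\d+\\]", "", x)

print(any_bracket)
print(n_footnote)
```

The stricter pattern leaves any other bracketed text alone, which may or may not be what is wanted for the rest of the table.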

I don't really mind if the solution is gsub or str_replace_all from tidyverse, I just need to remove / replace [N 1], [N 2] and so forth.
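
A tidyverse version might look like the sketch below. It assumes the dplyr, stringr and lubridate packages are installed and that month names are parsed in an English locale. To stay self-contained it rebuilds a few From values in a small tibble named dates (a hypothetical name) rather than reusing df.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical stand-in for the From column of df
dates <- tibble(From = c("3 February 2015[N 4]", "17 March 2010[N 1]", "14 May 1986"))

cleaned <- dates %>%
  mutate(
    From = str_replace_all(From, "\\[.*?\\]", ""),  # strip any [...] artefact
    From = dmy(From)                                # then parse day-month-year
  )
print(cleaned)
```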

nycrefugee
  • Why `/` at the end of the first regex? Use `gsub("\\[.*?]", "", df$From)` or `gsub("\\[[^][]*]", "", df$From)` – Wiktor Stribiżew Mar 18 '19 at 20:21
  • In R, you need two backslashes to escape, not 1 like in some other languages. In your `"\\[.*?\]/"`, you've got only 1 to escape the second `]`, and in `"\\[N\d\\]"` you only have 1 to create `\\d`. Unless there's more to it, I'd consider this more or less a typo – camille Mar 18 '19 at 20:21
  • I checked out the referenced duplicate and found a working answer `gsub('\\[.*?\\]', '', text)` - for some reason it didn't appear in search. – nycrefugee Mar 18 '19 at 20:29
  • No need for regex if you use `as.Date`: `as.Date(c("3 February 2015[N 4]", "14 May 1986"), "%d %B %Y")`. "Character strings are processed as far as necessary for the format specified: any trailing characters are ignored." – Henrik Mar 18 '19 at 20:29
  • @Henrik - thank you, I wondered about that but hadn't tried. – nycrefugee Mar 18 '19 at 20:30
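
Henrik's suggestion can be sketched in base R as follows. It assumes an English locale, since %B matches full month names in the current locale. The vector x is an illustrative stand-in for df$From.

```r
# Illustrative stand-in for df$From
x <- c("3 February 2015[N 4]", "17 March 2010[N 1]", "14 May 1986")

# as.Date stops reading once the format is satisfied, so the trailing
# "[N 4]" is ignored and no regex is needed
parsed <- as.Date(x, format = "%d %B %Y")
print(parsed)
```

This avoids the cleaning step entirely, at the cost of silently ignoring any trailing text, not just the bracketed notes.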

0 Answers