There are a few posts that cover something like this question, for example:
Remove square brackets from a string vector
... but regex is so damned hard that nothing I try seems to work.
I've copied and pasted a large table from HTML and its structure is fine, but there are some trailing artefacts in one column.
Here's some example data:
df <- structure(list(From = c("3 February 2015[N 4]", "23 February 2017[N 3]",
"17 March 2010[N 1]", "22 July 2016[N 2]", "14 May 1986", "22 February 1995",
"8 June 1995", "12 August 1996"), Until = c("4 November 2015",
"17 October 2017", "9 May 2010", "3 January 2017", "21 February 1995",
"8 June 1995", "12 August 1996", "13 September 1996")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Nat. = structure(list(), class = c("collector_logical",
"collector")), Club = structure(list(), class = c("collector_character",
"collector")), From = structure(list(), class = c("collector_character",
"collector")), Until = structure(list(), class = c("collector_character",
"collector")), `Duration
(days)` = structure(list(), class = c("collector_double",
"collector")), `Years in
League` = structure(list(), class = c("collector_character",
"collector")), Ref. = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
The artefacts take the form of square brackets containing a letter and a number, e.g. [N1].
When I parse the columns as dates, Until works just fine:
library(dplyr)      # provides %>% and mutate()
library(lubridate)  # provides dmy()
df %>%
  mutate(Until = dmy(Until))
But the From column, with the odd artefact, fails to parse for those entries:
df %>%
  mutate(From = dmy(From))
I tried gsub with plain text first, even though that would mean handling one artefact at a time:
gsub("[N1]", "", df$From)
... but text in entries without the artefact gets messed up as well, I'm assuming because of the square brackets.
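A likely explanation, sketched here rather than stated as definitive: in a regular expression, `[N1]` is a character class that matches any single `N` or `1`, so `gsub` strips those characters wherever they occur and leaves the literal brackets behind. The vector `x` and the result names below are hypothetical, modelled on the example data.

```r
# Hypothetical sample values, modelled on the From column
x <- c("17 March 2010[N1]", "12 November 1996")

# Unescaped brackets form a character class: every "N" and every "1"
# is removed, and the literal "[" and "]" stay in place.
class_result <- gsub("[N1]", "", x)
print(class_result)

# fixed = TRUE treats the pattern as a literal string, not a regex.
fixed_result <- gsub("[N1]", "", x, fixed = TRUE)
print(fixed_result)
```

With `fixed = TRUE` only the exact text `[N1]` is removed, which is why it would still have to be repeated for each distinct artefact.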
I then tried regex, but can't get it to work:
gsub("\\[.*?\]/", "", df$From)
gsub("\\[N\d\\]", "", df$From)
Both fail with the same kind of error: Error: '\]' is an unrecognized escape in character string starting
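That error is raised by R's string parser before the regex engine ever sees the pattern: inside an ordinary R string each regex backslash has to be doubled, so `\]` is written `"\\]"` and `\d` is written `"\\d"`. A minimal sketch follows, assuming the artefacts look like the ones in the example data (which contain a space, as in `[N 4]`); the vector `from` and the result names are hypothetical.

```r
# Hypothetical sample values, taken from the example data
from <- c("3 February 2015[N 4]", "17 March 2010[N 1]", "14 May 1986")

# Remove any bracketed chunk: \\[ and \\] are literal brackets,
# .*? is a lazy match for whatever sits between them.
any_bracket <- gsub("\\[.*?\\]", "", from)
print(any_bracket)

# Stricter variant: one capital letter, optional whitespace, then digits.
letter_digit <- gsub("\\[[A-Z]\\s*\\d+\\]", "", from)
print(letter_digit)
```

The stricter pattern is a design choice: it leaves any other bracketed text untouched, at the cost of missing artefacts that do not follow the letter-then-number shape.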
I don't really mind whether the solution uses gsub or str_replace_all from the tidyverse; I just need to remove / replace [N1], [N2] and so forth.
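One way the cleaning and the date parsing might be combined in a single pipeline is sketched below. It assumes dplyr, stringr and lubridate are installed; `sample_df` and `cleaned` are hypothetical names, and the small table only mirrors the structure of `df` rather than reproducing it.

```r
library(dplyr)      # %>%, mutate(), tibble()
library(stringr)    # str_replace_all()
library(lubridate)  # dmy()

# Hypothetical two-row sample with the same columns as df
sample_df <- tibble(
  From  = c("3 February 2015[N 4]", "14 May 1986"),
  Until = c("4 November 2015", "21 February 1995")
)

cleaned <- sample_df %>%
  mutate(
    # Strip any bracketed artefact first, then parse day-month-year text
    From  = dmy(str_replace_all(From, "\\[.*?\\]", "")),
    Until = dmy(Until)
  )

print(cleaned)
```

Swapping `str_replace_all(From, "\\[.*?\\]", "")` for `gsub("\\[.*?\\]", "", From)` should behave the same way here, so either function fits.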