Remove a specific character after its first and before its last appearance

Question

Suppose the following data frame:

#  code                                            countries
#1 A001 [[Germany, China, Japan], [Chile, Mexico], [Poland]]
#2 A002     [[], [Japan], [Singapore, Indonesia, Micronesia]]
#3 A003       [[Tuvalu, Chile], [], [North Macedonia, Sweden]]

How could I remove every [ after its first appearance and every ] before its last appearance?

So that the data frame looks like this:

#  code                                      countries
#1 A001 [Germany, China, Japan, Chile, Mexico, Poland]
#2 A002      [Japan, Singapore, Indonesia, Micronesia]
#3 A003       [Tuvalu, Chile, North Macedonia, Sweden]

data

df <- data.frame(code=c('A001', 'A002', 'A003'),
                 countries=c('[[Germany, China, Japan], [Chile, Mexico], [Poland]]',
                             '[[], [Japan], [Singapore, Indonesia, Micronesia]]',
                             '[[Tuvalu, Chile], [], [North Macedonia, Sweden]]')
                )

Easy option is `map_chr(str_extract_all(df$countries, "\\w+"), ~ sprintf("[%s]", toString(.x)))` — akrun, Jun 22 '21 at 21:29

score 5 · Answer 1 · answered Jun 22 '21 at 21:41

Here is a method using regex in base R

df$countries <- gsub("(?<=\\[),\\s*|(?<=\\,)\\s+,", "", 
    gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", df$countries, perl = TRUE), perl = TRUE)
df$countries
#[1] "[Germany, China, Japan, Chile, Mexico, Poland]" 
#[2] "[Japan, Singapore, Indonesia, Micronesia]"    
#[3]  "[Tuvalu, Chile, North Macedonia, Sweden]"
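
To make the nested call easier to follow, here is the same idea split into two steps. This is only a sketch: the names `countries`, `step1` and `step2` are mine, and the input vector is copied from the question's data block.

```r
# Input copied from the question's data block
countries <- c("[[Germany, China, Japan], [Chile, Mexico], [Poland]]",
               "[[], [Japan], [Singapore, Indonesia, Micronesia]]",
               "[[Tuvalu, Chile], [], [North Macedonia, Sweden]]")

# Step 1: a "[" at the very start or a "]" at the very end is matched and then
# discarded via (*SKIP)(*FAIL), so it is kept; every other bracket is deleted.
# These control verbs are PCRE features, hence perl = TRUE.
step1 <- gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", countries, perl = TRUE)

# Step 2: the empty lists leave stray separators behind
# ("[, Japan" and "Chile, , North"); remove them with lookbehinds.
step2 <- gsub("(?<=\\[),\\s*|(?<=\\,)\\s+,", "", step1, perl = TRUE)

step2
```

Note that step 2 only handles an empty list at the start or in the middle, which is all the sample data contains; an empty list in last position would need one more alternative.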

Or another option is to extract the words and then paste them together

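library(stringr)
library(purrr)
# "\\w+" alone would split multi-word names such as "North Macedonia",
# so also allow further words separated by a single whitespace character
df$countries <- map_chr(str_extract_all(df$countries, "\\w+(?:\\s\\w+)*"), 
     ~ sprintf("[%s]", toString(.x)))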

score 2 · Answer 2 · answered Jun 22 '21 at 21:46

A base R option using regmatches + toString + sprintf

transform(
    df,
    countries = sprintf(
        "[%s]",
        sapply(
            regmatches(countries, gregexpr("(\\w+\\s?)+", countries)),
            toString
        )
    )
)

gives

  code                                      countries
1 A001 [Germany, China, Japan, Chile, Mexico, Poland]
2 A002      [Japan, Singapore, Indonesia, Micronesia]
3 A003       [Tuvalu, Chile, North Macedonia, Sweden]
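
For anyone unsure what the inner calls return, here is a small sketch of the intermediate values for a single row. The names `x`, `m`, `parts` and `result` are illustrative only; the string is the third row of the question's data.

```r
x <- "[[Tuvalu, Chile], [], [North Macedonia, Sweden]]"

# gregexpr() finds every run of words optionally separated by one whitespace
# character, so "North Macedonia" stays together as a single match
m <- gregexpr("(\\w+\\s?)+", x)

# regmatches() returns a list with one character vector per input string
parts <- regmatches(x, m)[[1]]

# toString() joins the pieces with ", " and sprintf() adds the outer brackets
result <- sprintf("[%s]", toString(parts))
result
```

The empty list `[]` contributes no match at all, which is why no stray commas need cleaning up afterwards.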

score 2 · Accepted Answer · answered Jun 22 '21 at 21:57

In dplyr:

Here, I wrapped gsub() in paste(). The first part of the regex, (^\\[|\\]$), matches a bracket [ at the start ^ of countries or a bracket ] at the end $. A great explanation of the *SKIP and *FAIL control verbs (and thereby the rest of the regex statement in gsub()) lives here and is clearer than I could probably articulate.

library(dplyr)  # provides %>% and mutate()

df %>% 
  mutate(countries = paste("[", gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", countries), "]", sep = ""))
