Suppose I have the following data frame:

#  code                                            countries
#1 A001 [[Germany, China, Japan], [Chile, Mexico], [Poland]]
#2 A002     [[], [Japan], [Singapore, Indonesia, Micronesia]]
#3 A003       [[Tuvalu, Chile], [], [North Macedonia, Sweden]]

How could I remove every [ after its first appearance and every ] before its last appearance?

So that the data frame looks like this:

   code countries
#1 A001 [Germany, China, Japan, Chile, Mexico, Poland]
#2 A002     [Japan, Singapore, Indonesia, Micronesia]
#3 A003       [Tuvalu, Chile, North Macedonia, Sweden]

data

df <- data.frame(code=c('A001', 'A002', 'A003'),
                 countries=c('[[Germany, China, Japan], [Chile, Mexico], [Poland]]',
                             '[[], [Japan], [Singapore, Indonesia, Micronesia]]',
                             '[[Tuvalu, Chile], [], [North Macedonia, Sweden]]')
                )
AlSub
    An easy option is `map_chr(str_extract_all(df$countries, "\\w+(?: \\w+)*"), ~ sprintf("[%s]", toString(.x)))` (the pattern allows spaces inside a name so that North Macedonia stays one item) – akrun Jun 22 '21 at 21:29

3 Answers

Here is a method using regex in base R

df$countries <- gsub("(?<=\\[),\\s*|(?<=\\,)\\s+,", "", 
    gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", df$countries, perl = TRUE), perl = TRUE)
df$countries
#[1] "[Germany, China, Japan, Chile, Mexico, Poland]"
#[2] "[Japan, Singapore, Indonesia, Micronesia]"
#[3] "[Tuvalu, Chile, North Macedonia, Sweden]"
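
To make the nested call easier to follow, here is a sketch that unpacks it into two passes. The variable names `x`, `inner_removed` and `cleaned` are introduced only for illustration; the regular expressions are the ones used above.

```r
# Same strings as the `countries` column in the question
x <- c("[[Germany, China, Japan], [Chile, Mexico], [Poland]]",
       "[[], [Japan], [Singapore, Indonesia, Micronesia]]",
       "[[Tuvalu, Chile], [], [North Macedonia, Sweden]]")

# Pass 1: the leading "[" and trailing "]" are matched and then skipped via
# (*SKIP)(*FAIL); every other bracket is matched by [][] and deleted.
inner_removed <- gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", x, perl = TRUE)

# Pass 2: empty inner lists leave stray commas behind, either right after
# the opening "[" or as ", ," in the middle; remove those.
cleaned <- gsub("(?<=\\[),\\s*|(?<=\\,)\\s+,", "", inner_removed, perl = TRUE)

print(inner_removed)
print(cleaned)
```

Printing `inner_removed` shows why the second pass is there: the rows that contained `[]` still carry a leftover comma at that point.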

Or another option is to extract the words and then paste them together

library(stringr)
library(purrr)
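# "\\w+(?: \\w+)*" keeps multi-word names such as "North Macedonia" together;
# a bare "\\w+" would split them into separate items
df$countries <- map_chr(str_extract_all(df$countries, "\\w+(?: \\w+)*"), 
     ~ sprintf("[%s]", toString(.x)))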
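
One thing to watch with word-extraction patterns is that a bare `\\w+` stops at spaces, so a multi-word name comes back as two items. The sketch below compares the two patterns on the third row of the example data. It uses base R's `regmatches()` and `gregexpr()` rather than stringr so that it runs without extra packages, and `extract_all` is a hypothetical helper name, not a function from any library.

```r
x <- "[[Tuvalu, Chile], [], [North Macedonia, Sweden]]"

# Hypothetical helper: return every match of `pattern` in a single string
extract_all <- function(pattern, s) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

single_words <- extract_all("\\w+", x)            # splits on the space
full_names   <- extract_all("\\w+(?: \\w+)*", x)  # allows spaces inside a name

print(sprintf("[%s]", toString(single_words)))
print(sprintf("[%s]", toString(full_names)))
```

With the first pattern, "North" and "Macedonia" end up as separate entries; the second keeps them together.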
akrun

A base R option using regmatches + toString + sprintf

transform(
    df,
    countries = sprintf(
        "[%s]",
        sapply(
            regmatches(countries, gregexpr("(\\w+\\s?)+", countries)),
            toString
        )
    )
)

gives

  code                                      countries
1 A001 [Germany, China, Japan, Chile, Mexico, Poland]
2 A002      [Japan, Singapore, Indonesia, Micronesia]
3 A003       [Tuvalu, Chile, North Macedonia, Sweden]
ThomasIsCoding

In dplyr:

Here, I wrapped gsub() in paste(). The regex matches a bracket at the start of the string (^\\[) or at the end (\\]$) and skips it; any other bracket is removed. A great explanation of the (*SKIP) and (*FAIL) control verbs (and thereby the rest of the regex in gsub()) lives here and is clearer than I can probably articulate.

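library(dplyr)

df %>% 
  # perl = TRUE is required: (*SKIP) and (*FAIL) are PCRE-only control verbs
  mutate(countries = paste("[", gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", countries, perl = TRUE), "]", sep = ""))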
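
One thing worth checking with the call above: the (*SKIP)(*FAIL) regex already leaves the outermost brackets in place, so wrapping the result in paste("[", ..., "]") appears to add a second pair. The sketch below, using variable names chosen only for illustration, shows that effect on the first row and an alternative that strips every bracket before re-wrapping. Either expression should drop into mutate() unchanged. Rows with an empty inner list would still need the stray-comma cleanup from the first answer.

```r
x <- "[[Germany, China, Japan], [Chile, Mexico], [Poland]]"

# The regex keeps the first "[" and the last "]" ...
kept_outer <- gsub("(^\\[|\\]$)(*SKIP)(*FAIL)|([][])", "", x, perl = TRUE)

# ... so pasting another pair around it doubles the outer brackets
wrapped <- paste("[", kept_outer, "]", sep = "")

# Alternative: remove every bracket, then add exactly one pair back
stripped  <- gsub("[][]", "", x)
rewrapped <- paste0("[", stripped, "]")

print(wrapped)
print(rewrapped)
```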


L. South