R strsplit ignore some text

Question

I'm working with survey data in which many answers list several categories separated by commas. I have successfully used strsplit() to separate them, like this:

sss6 <- stringr::str_trim(unlist(strsplit(aiprm$step_do_you_anticipate, split=",")))

This correctly separates strings like the following, so I can count each category and build visualizations:

Grammar, None of the above, Grammar, Subject matter expertise, Grammar, Subject matter expertise, Bias, Grammar, Subject matter expertise, Bias, Fact-checking

The problem now is that some of the text contains parentheses with commas inside them, and I would like the commas inside the parentheses "()" to be ignored. Here is an example:

Ad copy, JavaScript code, headlines, compelling copy, commercial ideas, Ad copy, Title & meta description, Idea generation (topics, headlines), Code, Idea generation (topics, headlines), Ad copy, Idea generation (topics, headlines)

Is there any way to tell strsplit() to ignore the commas inside the parentheses instead of splitting on them? The main problem is (topics, headlines).

Thanks!

Is the problematic string in parentheses *always* "(topics, headlines)" or will it change? Also, do you want to keep the problematic part in parentheses or do you want to remove it? — jpsmith, Aug 30 '23 at 21:06
@jpsmith Hello, it will never change, and I want to keep it. It's always the same, and it repeats a lot. — Humberto R, Aug 31 '23 at 00:51

score 0 · Answer 1 · answered Aug 30 '23 at 21:41

Horrible (and really slow) solution:

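s <- "Ad copy, Idea generation (topics, headlines), Code"  # example input

chrs   <- strsplit(s, "")[[1]]
commas <- chrs == ","
# Running count of "(" and ")": odd inside parentheses, even outside
# (assumes balanced, non-nested parentheses)
parens <- cumsum(chrs == "(" | chrs == ")")
ind    <- which(commas & parens %% 2 == 0)

# Piece boundaries, including the final piece after the last split comma
starts <- c(1, ind + 1)
ends   <- c(ind - 1, length(chrs))
trimws(substring(s, starts, ends))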

Best way to go about it is probably to use a regex. See this thread.

weakCoder


That link is a nice find - you should just steal the first regex from that thread and adapt it to R - `trimws(strsplit(x, ",(?![^(]*\\))", perl=TRUE)[[1]])` works really well from what I can see. – thelatemail Aug 30 '23 at 21:53
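For reference, here is a minimal sketch of the regex from the comment above applied to a shortened version of the question's example. The input string and the helper name `split_outside_parens` are made up for illustration. The negative lookahead rejects any comma that is followed by a ")" with no "(" in between, so it assumes balanced, non-nested parentheses.

```r
# Example input modelled on the question (one survey response).
x <- "Ad copy, Title & meta description, Idea generation (topics, headlines), Code"

# Split on a comma only if the text ahead does NOT reach a ")" before any "(".
# perl = TRUE is needed because lookaheads are a PCRE feature.
pattern <- ",(?![^(]*\\))"

# Hypothetical helper: split every response and trim surrounding whitespace.
split_outside_parens <- function(responses) {
  pieces <- strsplit(responses, pattern, perl = TRUE)
  trimws(unlist(pieces))
}

parts <- split_outside_parens(x)
print(parts)
```

Because the helper calls unlist(), it can be passed the whole column (for example aiprm$step_do_you_anticipate) and returns one flat character vector that is ready for table().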

jpsmith · Answer 2 · 2023-08-31T02:35:02.970

In this specific case, since you note that the problematic string within parentheses is always the same ("topics, headlines"), and if you're open to a slight modification, you can simply replace the comma within that phrase with a non-comma character, such as a hyphen, i.e.:

gsub("topics, headlines", "topics-headlines", aiprm$step_do_you_anticipate)

You then just need to replace aiprm$step_do_you_anticipate in your original code with the call above:

sss6 <- stringr::str_trim(unlist(strsplit(
  gsub("topics, headlines", "topics-headlines", aiprm$step_do_you_anticipate), 
  split=",")))

# [1] "Ad copy"                            "JavaScript code"                   
# [3] "headlines"                          "compelling copy"                   
# [5] "commercial ideas"                   "Ad copy"                           
# [7] "Title & meta description"           "Idea generation (topics-headlines)"
# [9] "Code"                               "Idea generation (topics-headlines)"
# [11] "Ad copy"                            "Idea generation (topics-headlines)"

If you really want the commas, you can quickly substitute them back in:

gsub("topics-headlines", "topics, headlines", sss6) 

# [1] "Ad copy"                             "JavaScript code"                    
# [3] "headlines"                           "compelling copy"                    
# [5] "commercial ideas"                    "Ad copy"                            
# [7] "Title & meta description"            "Idea generation (topics, headlines)"
# [9] "Code"                                "Idea generation (topics, headlines)"
# [11] "Ad copy"                             "Idea generation (topics, headlines)"
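If the text in parentheses were not always the same phrase, the same swap, split, restore idea could be generalised. The sketch below is one possible way, not something from the original post: the example string is invented, and it assumes the placeholder character never occurs in the survey text and that parentheses are balanced and not nested.

```r
# Invented example with two different parenthesised phrases.
x <- "Ad copy, Idea generation (topics, headlines), Code (R, Python)"

# Assumption: this control character never appears in the real data.
placeholder <- "\001"

# 1. Protect commas that are followed by ")" with no "(" in between.
protected <- gsub(",(?=[^(]*\\))", placeholder, x, perl = TRUE)

# 2. Split on the remaining (unprotected) commas and trim whitespace.
pieces <- trimws(unlist(strsplit(protected, ",", fixed = TRUE)))

# 3. Put the protected commas back.
restored <- gsub(placeholder, ",", pieces, fixed = TRUE)
print(restored)
```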

As an aside, you may also want to look into tidyr::separate_longer_delim():

aiprm$comma_replaced <- gsub("topics, headlines", "topics-headlines", aiprm$step_do_you_anticipate)

tidyr::separate_longer_delim(aiprm, comma_replaced, ",")

#                        comma_replaced
#1                              Ad copy
#2                      JavaScript code
#3                            headlines
#4                      compelling copy
#5                     commercial ideas
#6                              Ad copy
#7             Title & meta description
#8   Idea generation (topics-headlines)
#9                                 Code
#10  Idea generation (topics-headlines)
#11                             Ad copy
#12  Idea generation (topics-headlines)
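separate_longer_delim() treats delim as a fixed string by default, but its documentation says a stringr::regex() pattern can be passed instead. Under that assumption, the replacement step could be skipped entirely, as in the sketch below. The small aiprm data frame and its id column are stand-ins made up for illustration, and the pattern again assumes balanced, non-nested parentheses.

```r
library(tidyr)

# Hypothetical stand-in for the survey data frame in the question.
aiprm <- data.frame(
  id = 1:2,
  step_do_you_anticipate = c(
    "Ad copy, Idea generation (topics, headlines)",
    "Code, Idea generation (topics, headlines), Ad copy"
  )
)

# Split on a comma plus any following spaces, unless the comma sits
# inside parentheses (negative lookahead for ")" before any "(").
outside_comma <- stringr::regex(",\\s*(?![^(]*\\))")

long <- separate_longer_delim(aiprm, step_do_you_anticipate, delim = outside_comma)
print(long)
```

Including the optional whitespace in the delimiter means the resulting values need no separate trimming step.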
