Extracting text between parentheses in data frame columns into new columns

Question

I have a data frame called reasons in which some rows contain text with numbers in parentheses. The format looks like this:

concern                          notaware           scenery
(2) chat community (4) more      
(1) didn't know                  (1) beautiful      (3) stunning
(3) often                                           (1) always

Reproducible version:

structure(list(concern = c("(2) chat community (4) more", "(1) didn't know", 
"(3) often"), notaware = c("", "(1) beautiful", ""), scenery = c("", 
"(3) stunning", "(1) always")), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

I want a new data frame with just the parentheses and the numbers inside them:

concern                          notaware           scenery
(2) (4)
(1)                              (1)                (3)
(3)                                                 (1)

I realise there is a similar question here but the data is not in a column

Extracting data into new columns using R

and this but it doesn't seem to apply to a dataframe

Extract info inside all parenthesis in R

From the questions I've looked up I've tried to cobble a workaround. I tried

reasons %>% mutate(concern1 = str_match(concern, pattern = "\\(.*?\\)"))

Which resulted in an unchanged dataframe.

And this

reasons$concern1 <- sub(regmatches(reasons$concern, gregexpr(pat, reasons$concern, perl=TRUE)))

Which comes up with this

Error in sub(regmatches(UltraCodes$concern, gregexpr(pat, 
UltraCodes$concern,  : 
argument "x" is missing, with no default

I looked at this which I know is a duplicate of the second question but it made more sense to me.

Using R to parse and return text in parenthesis

And I used

pat <- "(?<=\\()([^()]*)(?=\\))"
concern1 <- regmatches(reasons$concern, gregexpr(pat, reasons$concern, 
perl=TRUE))

This gives me a list with a name, a type and a value. The values are what I want, even though each one is '2' rather than (2).
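The '2' rather than (2) follows from the lookarounds in pat, which match only the text between the parentheses. A minimal sketch of the difference, assuming a simpler alternative pattern that includes the parentheses themselves (pat_parens, inner and with_parens are illustrative names, not from the original post):

```r
concern <- c("(2) chat community (4) more", "(1) didn't know", "(3) often")

# Pattern from the question: the lookarounds leave the parentheses out of the match
pat <- "(?<=\\()([^()]*)(?=\\))"
inner <- regmatches(concern, gregexpr(pat, concern, perl = TRUE))

# Assumed alternative: match the parentheses together with the digits inside them
pat_parens <- "\\([0-9]+\\)"
with_parens <- regmatches(concern, gregexpr(pat_parens, concern))

inner[[1]]        # "2" "4"
with_parens[[1]]  # "(2)" "(4)"
```

Either way the result is a list with one character vector per row, and those vectors can have different lengths.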

So I figured I could make multiple lists and put them into a data frame, so I made a list notaware1 out of column notaware, and so on. I have a feeling that the blank values are throwing things off, as I try

reasons1 <-data.frame(concern1, notaware1)
reasons1 <-as.data.frame(concern1, notaware1)

Which gives me

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = 
TRUE,  : 
arguments imply differing number of rows: 0, 1, 2

Which I don't quite understand, as all my lists are the same length. I feel I'm misunderstanding some fundamentals here.
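One likely explanation, offered as a sketch rather than a diagnosis of the real data: data.frame() turns every element of a list into its own column, so what matters is the length of each element, not the length of the list. The lists below are hand-written stand-ins for the regmatches() results on the sample data, and collapse_matches is a hypothetical helper name:

```r
# Stand-ins for the regmatches() output on the sample data
concern1  <- list(c("2", "4"), "1", "3")
notaware1 <- list(character(0), "1", character(0))

length(concern1)    # 3: both lists have three elements...
lengths(concern1)   # 2 1 1: ...but the elements differ in length
lengths(notaware1)  # 0 1 0

# Each list element becomes a column, so the unequal lengths trigger the error
msg <- tryCatch(
  data.frame(concern1, notaware1),
  error = function(e) conditionMessage(e)
)

# Collapsing every element to a single string gives one value per row
collapse_matches <- function(m) vapply(m, paste, character(1), collapse = " ")
reasons1 <- data.frame(
  concern  = collapse_matches(concern1),
  notaware = collapse_matches(notaware1),
  stringsAsFactors = FALSE
)
```

Rows without a match end up as empty strings instead of character(0).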

Next I thought I could do a workaround by exporting the list to csv, but the answers I've found seem to want me to turn the list into a data frame first, which is my problem.

Then I find this

reasons$concern3 <-paste(concern1)

Which does add the list to my dataframe, and I can repeat this for all my lists.

However, it is a bit messy: blanks are now given as character(0), a cell with one bracketed number shows just that number, and a cell with two shows c("2", "9"), so my columns now look like this

concern                          adventure          scenery
c("2", "9")                      character(0)       character(0)
1                                1                  3
3                                1                  character(0)

But I have something that I can put into a csv file to tidy.

Is there a simpler way?

Onyambu · Accepted Answer · 2018-08-15T22:31:19.813

Are you looking for:

 data.frame(gsub("[^()0-9]","",as.matrix(dat)))

  concern notaware scenery
1  (2)(4)                 
2     (1)      (1)     (3)
3     (3)              (1)

EDIT

 data.frame(gsub("(?<!\\))(?:\\w+|[^()])(?!\\))","",as.matrix(dat),perl=TRUE))
   concern notaware scenery
1 (2) (4)                  
2     (1)      (1)     (3) 
3     (3)              (1)

edited Aug 15 '18 at 22:31

answered Aug 15 '18 at 22:08


Yes, thank you. That is much simpler! The only thing is I just realised there are numbers in my text that aren't in parentheses, which I'm not interested in. – Lan Aug 15 '18 at 22:15
@Lan are there any spaces in the parentheses? – Onyambu Aug 15 '18 at 22:28
There are no spaces in the parentheses – Lan Aug 15 '18 at 22:30
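To illustrate the point raised in these comments, here is a small sketch comparing the two patterns from this answer on a made-up cell that contains a number outside parentheses. The input strings are hypothetical, not from the asker's data:

```r
# Hypothetical cell with a bare number ("24") outside parentheses
x <- "(2) chat 24 hours (4) more"

# First pattern: deletes everything except digits and parentheses,
# so the bare number survives
first <- gsub("[^()0-9]", "", x)

# Edited pattern: deletes a run of word characters, or any single
# non-parenthesis character, unless it is immediately preceded by ")"
# or immediately followed by ")"
edited <- "(?<!\\))(?:\\w+|[^()])(?!\\))"
second <- gsub(edited, "", x, perl = TRUE)

first           # "(2)24(4)"
second          # "(2) (4) "
trimws(second)  # "(2) (4)"

# Caveat: with more than one digit inside the parentheses, the regex engine
# backtracks and removes every digit except the last one
multi <- gsub(edited, "", "(12) often", perl = TRUE)
multi           # "(2) "
```

For the single-digit codes in the question the edited pattern behaves as shown in the answer. If multi-digit codes are possible, a pattern that extracts "\\([0-9]+\\)" directly may be the safer choice.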

divibisan · Answer 2 · 2018-08-15T22:14:22.233

What we do here is loop over the data.frame by column and use str_extract_all from the stringr package to extract all numbers in parentheses.

Since there can be multiple values to extract from a single cell, we need str_extract_all with the simplify = TRUE argument, which returns a character matrix for each column (one row per row of df, one column per match found, padded with empty strings).

We then go through each of those matrices with apply to paste every row together into a single string (separated by a space here, but you can change that). Now we have one character vector per column, so the outer apply can stitch them together into a character matrix; wrap the result in as.data.frame() if you need a data.frame.

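apply(df, 2, function(x) {
    # One row per cell of the column, one column per "(digits)" match found
    temp <- stringr::str_extract_all(x, '\\([0-9]+\\)', simplify = TRUE)
    # Paste the matches of each row into one space-separated string
    apply(temp, 1, paste0, collapse = ' ')
})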

     concern   notaware scenery
[1,] "(2) (4)" ""       ""     
[2,] "(1) "    "(1)"    "(3)"  
[3,] "(3) "    ""       "(1)"
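The result above is a character matrix, and some cells carry a trailing space because the empty padding strings are pasted in as well. A minimal post-processing sketch in base R, assuming the matrix shown above (retyped here by hand as res so the example runs without stringr):

```r
# The matrix printed above, retyped by hand (filled column by column)
res <- matrix(
  c("(2) (4)", "(1) ", "(3) ",
    "",        "(1)",  "",
    "",        "(3)",  "(1)"),
  nrow = 3,
  dimnames = list(NULL, c("concern", "notaware", "scenery"))
)

# Strip leading/trailing whitespace while keeping the matrix shape
res[] <- trimws(res)

# Convert to a data.frame with character columns
out <- as.data.frame(res, stringsAsFactors = FALSE)
out
```

The `res[] <-` form assigns into the existing matrix, so the dimensions and column names are kept whatever trimws() returns.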

score 0 · Answer 3 · answered Aug 15 '18 at 22:12

Remove everything except digits and parentheses with gsub:

data <- cbind(concern = c("(2) chat community (4) more ", "(1) didn't know ", "(3) often  "),
              notaware = c("", "(2) chat community", ""))

# Inside a character class the parentheses do not need escaping
gsub("[^0-9()]", "", data)

Nar
