Split character column value into 4 new value columns using gsub and drop values of original column

Question

I have a column that holds values of arrays like this:

 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]]

I need the last two in 4 separate columns (in this case: [["5.1", "2"], ["90.2", "2"]]) but only their values:

5.1 2 90.2 and 2 (in separate columns)

I understand that I can achieve this with tidyR like described here: split character data into numbers and letters

    df %>%
  separate(mycol, 
           into = c("text", "num"), 
           sep = "(?<=[A-Za-z])(?=[0-9])"
           )

but every trial and every attempt failed so far. I cannot manage to access only the last 2 (or 4) items.

I'd be grateful for any idea.. Thanks

Try `stringr::str_extract_all(x, "\\d+(?:\\.\\d+)?(?=(?:[^,]*,){0,3}[^,]*$)")` — Wiktor Stribiżew, Mar 09 '21 at 16:08
Hi Wiktor, thank you! however, this does not produce 4 separate columns, just 1 with all 4 values in "" like this: "0.10", "35", "0.2", and "36". — IndiPsi, Mar 09 '21 at 17:19

akrun · Accepted Answer · 2021-03-09T17:57:49.393

We could group by rows (rowwise), then convert the 'mycol' element with fromJSON to a list of matrixes, unlist to vector, convert the vector to a data.frame with 4 columns using as.data.frame.list, wrap it in a list, then we ungroup and unnest the list column with unnest_wider (from tidyr) and finally, convert the column types based on its value with type.convert

library(dplyr)
library(jsonlite)
library(tidyr)
d %>%
  rowwise %>%
  mutate(newcol = list(setNames(as.data.frame.list(unlist(fromJSON(mycol, 
             simplifyVector  = FALSE)[[2]] )), paste0("X", 1:4)))) %>%
  ungroup %>%
  unnest_wider(c(newcol))   %>%
  type.convert(as.is = TRUE)

-output

# A tibble: 3 x 5
#  mycol                                                                                 X1    X2    X3    X4
#  <chr>                                                                              <dbl> <int> <dbl> <int>
#1 "[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]"   5.1     2  90.2     2
#2 "[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]"   5.1     2  90.2     2
#3 "[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]"   5.1     2  90.2     2

data

d <- structure(list(mycol = c("[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]", 
"[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]", 
"[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]"
)), class = "data.frame", row.names = c(NA, -3L))

Thank you for this suggestion, but this solution does not work for whatever reason. Does not even through an error, just keeps computing... — IndiPsi, Mar 09 '21 at 17:37
@IndiPsi sorry, a closing `)` was missing in the code. Can you test now. I added a reproducible example — akrun, Mar 09 '21 at 17:42
Hi, this worked. Amazing! Can you explain me a little what the function exactly does? I understand that JSON somehow can deal with the arrays, I guess? — IndiPsi, Mar 09 '21 at 17:55

Chris Ruehlemann · Answer 2 · 2021-03-09T19:44:49.170

Here's a base R solution based on regex and @akrun's data:

d1 <- sapply(strsplit(d$mycol, ","), function(x) gsub("(?!\\.)\\D", "", x, perl = T))

We first split d at the comma and pass the result to a gsubfunction which removes anything that is not a digit (\\D) and not a .. We transpose the resulting dataframe d1 to turn columns into rows and select the data of interest:

d2 <- as.data.frame(t(d1[5:8,]))
d2
   V1 V2   V3 V4
1 5.1  2 90.2  2
2 5.1  2 90.2  2
3 5.1  2 90.2  2

If you want to have the result together with the original data, then cbind and change the column names as suits you:

d3 <- cbind(d, d2)
names(d3) <- c("mycol", "x1", "x2", "x3", "x4")

Result:

d3
                                                             mycol  x1 x2   x3 x4
1 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]] 5.1  2 90.2  2
2 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]] 5.1  2 90.2  2
3 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]] 5.1  2 90.2  2

It does and it is a little more economical. If it helps plz consider upvoting/accepting it. — Chris Ruehlemann, Mar 09 '21 at 21:14
Hi, I upvoted it - may take some time until it is counted since I am new to Stackoverflow. I actually find the other solution a little practical in my case as I can concatenate it with more functions, for example I modified it to do the same thing with some more arrays. However, I am just grateful that I received these 2 perfect solutions for my problem that already cost my hours. So, thank you very much! — IndiPsi, Mar 09 '21 at 21:30

Split character column value into 4 new value columns using gsub and drop values of original column

2 Answers2

data