Regex for names_pattern while pivoting longer

Question

I'm trying to get the right regex (following this) to use inside names_pattern.

The strings are: CRIS_CLAU_ENG_O and LARI_CLAU_ENG_O
Desired output: CRIS_O and LARI_O

ID | CLAU_VALUE | RATER

attempt so far:

data1 %>% 
  select(ID, contains("CLAU")) %>% 
  pivot_longer(c(CRIS_CLAU_ENG_O, LARI_CLAU_ENG_O),
               names_to = c("RATER", ".value"),
               names_pattern = "^([^_]+)([^_]+)") # %>% 
  ## mutate(RATER = case_when(RATER == "CRI" ~ 'RATER1',
  ##                          RATER == "LAR" ~ 'RATER2')) %>% 
  ## mutate(RATER = factor(RATER, levels = c('RATER1', 'RATER2')))

If it's possible, ideally, the desired output should contain two value columns, like this:

ID | CLAU_VALUE | TUNITS_VALUE | RATER

In this case, though, the rater labels would differ: CRIS_WRI and LARI_WRI, to distinguish O from WRI. Or I could have another column called 'type' containing this information.

That is, I'd like to pivot the "TUNITS" columns at the same time as the "CLAU" columns.

I'm splitting the strings into the value columns, not into my factor column (I honestly don't know why). I'd like single value columns and a single 'RATER' column instead. I'm probably doing something silly, but thanks in advance, I'd really appreciate it.

data:

> dput(data1)
structure(list(ID = c("A", "B", "C", "D", "E", "F", "G", "H", 
"I", "J", "K", "L", "M", "N", "O", "P"), CRIS_CLAU_ENG_O = c(6, 
5, 6, 7, 6, 3, 5, 5, 6, 6, 7, 9, 8, 6, 6, 6), CRIS_TUNITS_WRI_O = c(5, 
5, 4, 5, 5, 3, 5, 5, 4, 4, 7, 7, 7, 6, 6, 5), LARI_CLAU_ENG_O = c(6, 
5, 5, 7, 7, 3, 5, 5, 6, 6, 9, 9, 8, 8, 6, 6), LARI_TUNITS_WRI_O = c(5, 
3, 4, 6, 5, 3, 2, 5, 4, 4, 7, 8, 7, 6, 6, 5)), row.names = c(NA, 
-16L), spec = structure(list(cols = list(ALUNO = structure(list(), class = c("collector_character", 
"collector")), CRIS_CLAU_ENG_O = structure(list(), class = c("collector_double", 
"collector")), CRIS_TUNITS_WRI_O = structure(list(), class = c("collector_double", 
"collector")), LARI_CLAU_ENG_O = structure(list(), class = c("collector_double", 
"collector")), LARI_TUNITS_WRI_O = structure(list(), class = c("collector_double", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"),  class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

Thanks for providing `dput`-data! Please remove the `problems=` of the output so that it can be used directly (removing it does nothing in this context, but we can never use it anyway). — r2evans, Jan 27 '23 at 15:49

r2evans · Accepted Answer · 2023-01-27T16:44:10.190

You're both selecting out the TUNITS variable and choosing to pivot only a couple of columns. If we keep all columns around, we can get closer. Also, your regex is incomplete: we need to add a literal _ between your two pattern groups, plus a trailing _.* to consume the rest of the column name.

library(dplyr)
library(tidyr)
data1 %>% 
  pivot_longer(-ID,
               names_to = c("RATER", ".value"),
               names_pattern = "^([^_]+)_([^_]+)_.*")
# # A tibble: 32 × 4
#    ID    RATER  CLAU TUNITS
#    <chr> <chr> <dbl>  <dbl>
#  1 A     CRIS      6      5
#  2 A     LARI      6      5
#  3 B     CRIS      5      5
#  4 B     LARI      5      3
#  5 C     CRIS      6      4
#  6 C     LARI      5      4
#  7 D     CRIS      7      5
#  8 D     LARI      7      6
#  9 E     CRIS      6      5
# 10 E     LARI      7      5
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows

Regex:

^([^_]+)_([^_]+)_.*
^                      beginning of the string
 '-----' '-----'       pattern groups
  [^_]                 character group of anything except '_'
      +                one or more ('*' is zero-or-more, '?' is 0-or-1)
        _              the literal underscore
                 .*    zero or more ('*') of anything ('.')
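
To see what the two pattern groups capture without running the pivot, here is a minimal sketch in base R. It assumes only the four column names from the question; the `cols`, `pattern`, and `parts` names and the "full"/"RATER"/".value" column labels are chosen here for illustration.

```r
# Column names from the question's data (ID excluded).
cols <- c("CRIS_CLAU_ENG_O", "CRIS_TUNITS_WRI_O",
          "LARI_CLAU_ENG_O", "LARI_TUNITS_WRI_O")

pattern <- "^([^_]+)_([^_]+)_.*"

# regexec() locates the full match and each capture group;
# regmatches() extracts them as one character vector per input string.
parts <- do.call(rbind, regmatches(cols, regexec(pattern, cols)))
colnames(parts) <- c("full", "RATER", ".value")  # labels for illustration only

parts[, c("RATER", ".value")]
# RATER  is CRIS, CRIS, LARI, LARI
# .value is CLAU, TUNITS, CLAU, TUNITS
```

The first group becomes the RATER column and the second group supplies the names of the value columns, which is why the pivot above ends up with CLAU and TUNITS side by side.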

edited Jan 27 '23 at 16:44

answered Jan 27 '23 at 15:29


Thank you, @r2Evans! Now I'd need either different rater names that show the type of test, or another column. Let me explain: "O" stands for oral and "WRI" for written. Hence, I'd need LARI|CRIS_O or LARI|CRIS_WRI, or another column indicating the type of test, like "TYPE": Oral or Written. – Larissa Cury Jan 27 '23 at 15:33
I believe the second option would be better. Maybe there's a way to extract either O or WRI from the strings and make a new column "type"? – Larissa Cury Jan 27 '23 at 15:42
In your sample data, the `TUNITS`/`CLAU` are perfectly correlated with `ENG`/`WRI`, so using `names_to = c("RATER", ".value"), names_pattern = "^([^_]+)_(.*)$"` keeps them together. If you have more variety than that, it'd help to update your sample data. – r2evans Jan 27 '23 at 15:47
yes, indeed! thank you! If you don't mind, would you explain what's happening behind the curtains in the regex? Whenever I think I'm finally learning that... – Larissa Cury Jan 27 '23 at 16:02
Thank you for the update! I really appreciate it! Something is off: I have to pivot wider each variable (CLAU and TUNITS) afterwards, but whenever I do `data1 %>% pivot_wider(names_from = RATER, values_from = CLAU_ENG_O)` I get some weird duplicated results. It's duplicating some names and leaving NAs, though there are no NAs in the data. – Larissa Cury Jan 27 '23 at 17:23
I don't know, it works with the data you provided. – r2evans Jan 27 '23 at 17:26
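
For reference, the alternative pattern suggested in the comments, `"^([^_]+)_(.*)$"`, can be tried on a small subset of the question's data. This is only a sketch: the `small` and `long` names are hypothetical, and the subset uses just the first two IDs from the `dput` output.

```r
library(tidyr)

# First two rows of the question's data, rebuilt by hand for a quick check.
small <- data.frame(
  ID                = c("A", "B"),
  CRIS_CLAU_ENG_O   = c(6, 5),
  CRIS_TUNITS_WRI_O = c(5, 5),
  LARI_CLAU_ENG_O   = c(6, 5),
  LARI_TUNITS_WRI_O = c(5, 3)
)

# Group 1 (up to the first underscore) goes to RATER; group 2 (everything
# after it) is kept whole and names the value columns.
long <- pivot_longer(
  small,
  cols = -ID,
  names_to = c("RATER", ".value"),
  names_pattern = "^([^_]+)_(.*)$"
)

names(long)
# "ID" "RATER" "CLAU_ENG_O" "TUNITS_WRI_O"
```

With this pattern the test type (ENG_O, WRI_O) stays attached to the value column names instead of being dropped, which matches the `CLAU_ENG_O` column referenced in the later `pivot_wider` comment.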
