I'm trying to get the right regex (following this) to use inside names_pattern.

The strings are: CRIS_CLAU_ENG_O and LARI_CLAU_ENG_O
Desired output: CRIS_O and LARI_O

ID | CLAU_VALUE | RATER

  • attempt so far:
data1 %>% 
  select(ID, contains("CLAU")) %>% 
  pivot_longer(c(CRIS_CLAU_ENG_O, LARI_CLAU_ENG_O),
               names_to = c("RATER", ".value"),
               names_pattern = "^([^_]+)([^_]+)")
 ## %>% mutate(RATER = case_when(RATER == "CRI" ~ 'RATER1',
 ##                              RATER == "LAR" ~ 'RATER2')) %>%
 ##     mutate(RATER = factor(RATER, levels = c('RATER1', 'RATER2')))
  • If it's possible, ideally, the desired output should contain two value columns, like this:

ID | CLAU_VALUE | TUNITS_VALUE | RATER

In this case, though, the rater labels would have to differ, e.g. CRIS_WRI and LARI_WRI, to distinguish O from WRI. Alternatively, I could have another column called 'type' containing this information.

pivoting the "TUNITS" columns at the same time as "CLAU" columns.

  • I'm splitting the strings into the value columns, not into my factor column (I honestly don't know why). I'd like single value columns and a single 'RATER' column instead. I'm probably doing something silly, but thanks in advance; I'd really appreciate any help.

  • data:

> dput(data1)
structure(list(ID = c("A", "B", "C", "D", "E", "F", "G", "H", 
"I", "J", "K", "L", "M", "N", "O", "P"), CRIS_CLAU_ENG_O = c(6, 
5, 6, 7, 6, 3, 5, 5, 6, 6, 7, 9, 8, 6, 6, 6), CRIS_TUNITS_WRI_O = c(5, 
5, 4, 5, 5, 3, 5, 5, 4, 4, 7, 7, 7, 6, 6, 5), LARI_CLAU_ENG_O = c(6, 
5, 5, 7, 7, 3, 5, 5, 6, 6, 9, 9, 8, 8, 6, 6), LARI_TUNITS_WRI_O = c(5, 
3, 4, 6, 5, 3, 2, 5, 4, 4, 7, 8, 7, 6, 6, 5)), row.names = c(NA, 
-16L), spec = structure(list(cols = list(ALUNO = structure(list(), class = c("collector_character", 
"collector")), CRIS_CLAU_ENG_O = structure(list(), class = c("collector_double", 
"collector")), CRIS_TUNITS_WRI_O = structure(list(), class = c("collector_double", 
"collector")), LARI_CLAU_ENG_O = structure(list(), class = c("collector_double", 
"collector")), LARI_TUNITS_WRI_O = structure(list(), class = c("collector_double", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"),  class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))
user438383
Larissa Cury
  • Thanks for providing `dput`-data! Please remove the `problems=` of the output so that it can be used directly (removing it does nothing in this context, but we can never use it anyway). – r2evans Jan 27 '23 at 15:49

1 Answer

You're both dropping the TUNITS columns with select() and choosing to pivot only a couple of columns. If we keep all columns around, we can get closer. Also, your regex is incomplete: we need to add a literal _ between your two pattern groups.

library(dplyr)
library(tidyr)
data1 %>% 
  pivot_longer(-ID,
               names_to = c("RATER", ".value"),
               names_pattern = "^([^_]+)_([^_]+)_.*")
# # A tibble: 32 × 4
#    ID    RATER  CLAU TUNITS
#    <chr> <chr> <dbl>  <dbl>
#  1 A     CRIS      6      5
#  2 A     LARI      6      5
#  3 B     CRIS      5      5
#  4 B     LARI      5      3
#  5 C     CRIS      6      4
#  6 C     LARI      5      4
#  7 D     CRIS      7      5
#  8 D     LARI      7      6
#  9 E     CRIS      6      5
# 10 E     LARI      7      5
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows

Regex:

^([^_]+)_([^_]+)_.*
^                      beginning of the string
 '-----' '-----'       pattern groups
  [^_]                 character group of anything except '_'
      +                one or more ('*' is zero-or-more, '?' is 0-or-1)
        _              the literal underscore
                 .*    zero or more ('*') of anything ('.')
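
To see what the two pattern groups capture, the same regex can be applied to the column names outside of pivot_longer. This is only an illustrative sketch in base R: the vector `cols` is typed in by hand from the question's data, and `regexec()`/`regmatches()` stand in for whatever tidyr does internally.

```r
# Column names copied by hand from the question's data (assumption: no others exist)
cols <- c("CRIS_CLAU_ENG_O", "CRIS_TUNITS_WRI_O",
          "LARI_CLAU_ENG_O", "LARI_TUNITS_WRI_O")
pattern <- "^([^_]+)_([^_]+)_.*"

# regexec() finds the full match plus one entry per capture group;
# regmatches() extracts them as character vectors
m <- regmatches(cols, regexec(pattern, cols))

# Keep only the two capture groups (elements 2 and 3 of each match)
groups <- do.call(rbind, lapply(m, function(x) x[2:3]))
colnames(groups) <- c("RATER", ".value")
groups
```

The first column holds what ends up in RATER, and the second holds the names of the new value columns.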
r2evans
  • Thank you, @r2Evans ! Now, I'd need to have different raters' names in order to show the type of test OR another column. Let me explain = "O" stands for oral, and "WRI" for written. Hence, I'd need to have LARI|CRIS_O or LARI|CRIS_WRI OR another column indicating the type of test like "TYPE" : Oral or Written – Larissa Cury Jan 27 '23 at 15:33
  • I believe the second option would be better, maybe there's a way to extract either O or WRI from the strings and make a new column "type"? – Larissa Cury Jan 27 '23 at 15:42
  • In your sample data, `CLAU`/`TUNITS` are perfectly correlated with `ENG`/`WRI`, so using `names_to = c("RATER", ".value"), names_pattern = "^([^_]+)_(.*)$"` keeps them together. If you have more variety than that, it'd help to update your sample data. – r2evans Jan 27 '23 at 15:47
  • yes, indeed! thank you! If you don't mind, would you explain what's happening behind the curtains in the regex? Whenever I think I'm finally learning that... – Larissa Cury Jan 27 '23 at 16:02
  • thank you for the update! I really appreciate that! Something is off: I have to pivot wider each variable (CLAU and TUNITS) afterwards, but whenever I do `data1 %>% pivot_wider(names_from = RATER, values_from = CLAU_ENG_O)` I get weird duplicated results. It duplicates some names and leaves NAs, though there are no NAs in the data – Larissa Cury Jan 27 '23 at 17:23
  • I don't know, it works with the data you provided. – r2evans Jan 27 '23 at 17:26
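
One possible explanation for the duplicated rows and NAs described in the last comments, offered as a guess rather than a confirmed diagnosis: if the long data still contains the other value column (here `TUNITS_WRI_O`), `pivot_wider()` treats it as an identifier alongside `ID`. Any ID where the two raters gave different TUNITS scores then splits into two rows. The sketch below uses a hand-made three-row subset of the question's data (`toy`); the names `long`, `split_rows` and `one_row` are made up for illustration.

```r
library(tidyr)

# Toy subset: first three rows of the question's data, typed in by hand
toy <- data.frame(
  ID = c("A", "B", "C"),
  CRIS_CLAU_ENG_O   = c(6, 5, 6),
  CRIS_TUNITS_WRI_O = c(5, 5, 4),
  LARI_CLAU_ENG_O   = c(6, 5, 5),
  LARI_TUNITS_WRI_O = c(5, 3, 4)
)

# Long format with the pattern from the comment above
long <- pivot_longer(
  toy, -ID,
  names_to = c("RATER", ".value"),
  names_pattern = "^([^_]+)_(.*)$"
)

# TUNITS_WRI_O is neither names_from nor values_from, so it is used
# as an identifier: ID "B" (TUNITS 5 vs 3) is split across two rows
split_rows <- pivot_wider(long, names_from = RATER, values_from = CLAU_ENG_O)

# Restricting the identifiers to ID keeps one row per ID
one_row <- pivot_wider(long, id_cols = ID,
                       names_from = RATER, values_from = CLAU_ENG_O)
```

If both variables are needed in wide form, passing `values_from = c(CLAU_ENG_O, TUNITS_WRI_O)` in a single call is another option to try.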