I am working with survey data whose variable names start with a "wave" label, either a number from 1 to 14 or the letter "A" or "E", followed by the variable name.

For example, I want to parse:

  • 3educ > wave: 3, variable: educ
  • Aage > wave: A, variable: age

I tried various patterns, such as

^([0-9]?|A|E)(\\w+)

to no effect. Please advise.

(Using stringr with R)

Gurmanjot Singh
Zhaochen He
    Based on your input strings, you can consider this regex [`\b([1-9]|1[0-4]|[AE])([a-z]+)\b`](https://regex101.com/r/QHKr6d/1) – Gurmanjot Singh Jan 04 '22 at 07:35

2 Answers

Never mind, I think I got it:

^([0-9][0-9]?|A|E)(\\w+)
Gurmanjot Singh
Zhaochen He

If you need to create a regex for a numeric range, consider using the automatic numeric range regex generator. The regex to match integer numbers from 1 to 14 is (?:[1-9]|1[0-4]).
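As a quick sanity check of that range pattern, here is a minimal base R sketch. It assumes the pattern is anchored on both sides with `^` and `$` so that the whole string has to be the number; the names `candidates`, `range_pattern` and `in_range` are only illustrative.

```r
# Candidate wave labels as strings, including values outside 1-14.
candidates <- as.character(0:20)

# Anchored on both sides, so the whole string must be a number from 1 to 14.
range_pattern <- "^(?:[1-9]|1[0-4])$"

# Keep only the candidates that the pattern accepts.
in_range <- candidates[grepl(range_pattern, candidates, perl = TRUE)]
print(in_range)
```

Without the trailing anchor, and with more pattern following the group, the order of the alternatives starts to matter, because the regex engine tries them left to right.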

So, you need to use

(?i)^(?P<wave>1[0-4]|[1-9AE])(?P<variable>\w+)

See the regex demo. (?i) sets the case insensitive mode on, 1[0-4] matches the numbers 10 to 14, and [1-9AE] matches either a non-zero digit or A or E chars. The two-digit alternative must come first because alternatives are tried from left to right: with [1-9AE] first, 14abc would be split into wave 1 and variable 4abc.

In R, you can use named capturing groups with the namedCapture library:

x <- c("3educ","Aage","14abc","Ekajshklasjf")
library(namedCapture)
str_match_all_named(x, "(?i)^(?<wave>1[0-4]|[1-9AE])(?<variable>\\w+)")

Output:

[[1]]
     wave variable
[1,] "3"  "educ"  

[[2]]
     wave variable
[1,] "A"  "age"   

[[3]]
     wave variable
[1,] "14" "abc"   

[[4]]
     wave variable     
[1,] "E"  "kajshklasjf"
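Since the question mentions stringr, the same parse can also be sketched with `str_match()`. This is a variant under stated assumptions rather than the namedCapture approach above: it uses plain (unnamed) groups, puts the two-digit alternative first so that `14abc` is not split as `1` and `4abc`, and the names `pattern`, `m` and `parsed` are only illustrative.

```r
library(stringr)

x <- c("3educ", "Aage", "14abc", "Ekajshklasjf")

# Two-digit waves (10-14) are tried before single characters;
# ignore_case = TRUE plays the role of the inline (?i) flag.
pattern <- regex("^(1[0-4]|[1-9AE])(\\w+)", ignore_case = TRUE)

# str_match() returns a character matrix:
# column 1 is the full match, column 2 the wave, column 3 the variable.
m <- str_match(x, pattern)

parsed <- data.frame(
  wave = m[, 2],
  variable = m[, 3],
  stringsAsFactors = FALSE
)
print(parsed)
```

Strings that do not match the pattern come back as `NA` in every column, which makes malformed variable names easy to spot.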
Wiktor Stribiżew