
I need to extract some information from a string.

Here is an example dataset:

data <- data.frame(id = c(1,2),
                  text = c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
                           "G1_Conciencia fonologica (FSL)_Form_Number_3.csv"))

> data
  id                                             text
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv

Basically, I need to extract the text inside the parentheses and the numerical value after Form_Number.

How can I get the desired output below?

  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3
amisos55
    Does this answer your question? [Extract info inside all parenthesis in R](https://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r) – jpsmith Jan 06 '23 at 01:11

2 Answers


Using str_extract (the group argument requires stringr 1.5.0 or later)

library(dplyr)
library(stringr)
data %>%
  mutate(cat = str_extract(text, "\\(([^)]+)", group = 1),
         form = as.integer(str_extract(text, "Number_(\\d+)", group = 1)))

Output:

  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3

Or with extract

library(tidyr)
extract(data, text, into = c("cat", "form"),
        ".*\\(([^)]+).*_Number_(\\d+)\\..*", remove = FALSE,
        convert = TRUE)
  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3

Or using base R

cbind(data, strcapture(".*\\(([^)]+)\\)_Form_Number_(\\d+)\\..*",
                       data$text, data.frame(cat = character(), form = integer())))
  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3
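
The two capture groups used in the patterns above can also be checked in isolation with base R's regexec() and regmatches(). This is only a minimal sketch on the sample file names from the question; the names x, pattern, m, cat_code and form are made up for illustration.

```r
# Sample file names copied from the question
x <- c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
       "G1_Conciencia fonologica (FSL)_Form_Number_3.csv")

# Group 1: text inside the parentheses
# Group 2: digits following "Form_Number_"
pattern <- "\\(([^)]+)\\).*_Form_Number_(\\d+)"

# regexec() finds the full match plus each group;
# regmatches() returns them as a list of character vectors
m <- regmatches(x, regexec(pattern, x))

# Element 1 of each list entry is the full match, 2 and 3 are the groups
cat_code <- vapply(m, function(p) p[2], character(1))
form <- as.integer(vapply(m, function(p) p[3], character(1)))
```

With this approach a string that does not match the pattern should yield NA for both values rather than an error, since indexing an empty match returns NA.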
akrun

A solution using gsub

library(dplyr)

data %>% 
  mutate(cat = gsub(".*\\(|\\).*", "", text), 
         form = gsub(".*Form_Number_|\\.csv$", "", text))
  id                                             text cat form
1  1 GK_Conciencia fonologica (FSS)_Form_Number_1.csv FSS    1
2  2 G1_Conciencia fonologica (FSL)_Form_Number_3.csv FSL    3
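
Two things may be worth keeping in mind with the gsub approach: gsub() only deletes what matches, so a string without parentheses comes back unchanged instead of as NA, and the extracted form number is a character string. The sketch below illustrates both points; the second file name is a hypothetical non-matching example, and the grepl() guard is just one possible way to handle it.

```r
x <- c("GK_Conciencia fonologica (FSS)_Form_Number_1.csv",
       "no_parentheses_here.txt")   # hypothetical non-matching name

# gsub() removes the matched parts and keeps the rest, so a string
# with no parentheses is returned as-is
cat_raw <- gsub(".*\\(|\\).*", "", x)

# Guard: keep the result only where parentheses were actually found
has_parens <- grepl("\\([^)]+\\)", x)
cat_code <- ifelse(has_parens, cat_raw, NA_character_)

# The form number is returned as character; convert it explicitly
form_chr <- gsub(".*Form_Number_|\\.csv$", "", x[1])
form <- as.integer(form_chr)
```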
Andre Wildberg