I have text that I am trying to organize for some text mining, and I am using the tidytext library. I have tried setting the token to a regex and supplying a custom pattern, but it ends up returning just the brackets (or nothing) and not the content of the brackets.

library(tidytext)
library(stringr)

df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))

un <- unnest_regex(df,elements,text,pattern = "\\[(.*?)\\]")

head(un)
  line                                                                  elements
1    1                                                                          
2    1                                                            mortgage loans
3    2                                                                          
4    2                                                                          
5    2                                                                          
6    2  please indicate the reason(s) you would not purchase this check package.

un2 <- unnest_regex(df,elements,text,pattern = "(?<=\\[).+?(?=\\])")

head(un2)
  line        elements
1    1               [
2    1             ] [
3    1              ][
4    1 ]mortgage loans
5    2               [
6    2             ] [

My ultimate goal is to get this:

  line             elements
1    1        [instruction]
2    1           [Mortgage]
3    1       [Show if Q1A5]
4    2         [checkboxes]
5    2              [min 1]
6    2            [max OFF]

Is this possible?

jay.sf
maijuli

2 Answers

This should work, though it is a bit hacky. The idea is to extract the whole run of bracketed text using stringr, and then "explode" it into one row per element. Since the elements aren't consistently space-delimited, split on the closing bracket and then add it back afterwards.

library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))

df <- df %>%
    dplyr::mutate(
        text_in_brackets = stringr::str_extract_all(text, "\\[[^()]+\\]")
    ) %>%
    tidyr::separate_rows(text_in_brackets, sep = "]") %>%
    dplyr::filter(text_in_brackets != "") %>%
    dplyr::mutate( # some cleaning
        text_in_brackets = paste0(text_in_brackets, "]"), # add back "]"
        text_in_brackets = stringr::str_trim(text_in_brackets) # remove leading/trailing spaces
    )

Output

# A tibble: 7 × 2
   line text_in_brackets
  <dbl> <chr>           
1     1 [instruction]   
2     1 [Mortgage]      
3     1 [Show if Q1A5]  
4     2 [checkboxes]    
5     2 [min 1]         
6     2 [max OFF]       
7     2 [Show if Q29A2] 
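
A possibly simpler variant, offered only as a sketch: as far as the tidytext documentation describes it, `unnest_regex()` treats `pattern` as the delimiter to split on, which would explain why the attempts in the question return the text around the brackets rather than the brackets themselves. Matching each bracketed element directly with `"\\[[^\\]]*\\]"` and unnesting the resulting list column avoids the split-and-re-add step. The column name `elements` and the object name `res` are chosen here for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  text = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans",
           "[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."),
  line = c(1, 2)
)

res <- df %>%
  # one match per "[...]" group: "[", then anything that is not "]", then "]"
  mutate(elements = str_extract_all(text, "\\[[^\\]]*\\]")) %>%
  select(line, elements) %>%
  # str_extract_all() returns a list column; unnest() gives one row per match
  unnest(elements)

res
```

Because each match already includes both brackets, no trimming or re-adding of `]` is needed, and the original case of the text is preserved.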
stressed

We could use gregexpr to pull the text out of the brackets and then put it back into brackets with the help of Map.

Map(\(x, y, ...) data.frame(line=x, elements=sprintf(y, fmt='[%s]')), 
    df$line, regmatches(df$text, gregexpr(r'{[^[\]]+(?=])}', df$text, perl=TRUE))) |>
  do.call(what=rbind)
#   line        elements
# 1    1   [instruction]
# 2    1      [Mortgage]
# 3    1  [Show if Q1A5]
# 4    2    [checkboxes]
# 5    2         [min 1]
# 6    2       [max OFF]
# 7    2 [Show if Q29A2]

Data:

df <- structure(list(text = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans", 
"[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."
), line = c(1, 2)), class = "data.frame", row.names = c(NA, -2L
))
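
For comparison, here is a minimal base R sketch of the same idea that matches each element with its brackets included, so nothing needs to be re-wrapped with `sprintf`. The names `m` and `res2` are illustrative only; `lengths(m)` is used to repeat each `line` value once per match.

```r
df <- data.frame(
  text = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans",
           "[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."),
  line = c(1, 2)
)

# "[", then any run of characters other than "]", then "]"
m <- regmatches(df$text, gregexpr("\\[[^]]*\\]", df$text))

# repeat each line number once per match found in that row
res2 <- data.frame(line = rep(df$line, lengths(m)), elements = unlist(m))
res2
```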
jay.sf