Here is the text:

  data$charge[1]
  [1] "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

I am currently trying to extract statutes from legal data. My code looks like this:

str_extract_all(data$charge[1:3], "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)") 

[[1]]
[1] "21 O.S. 645"      "21 O.S. 1541.1"

[[2]]
[1] "21 O.S. 1435"       "21 O.S. 1760(A)(1)"

[[3]]
[1]   "21 O.S. 1592"

And I'd like to add them as columns to a data frame like this:

id           name           statute1           statute2           statute3
1           BLACK, JOHN     21 O.S. 645        21 O.S. 1541.1     NA
2           DOE, JANE       21 O.S. 1435       21 O.S. 1760(A)(1) NA
3           ROSS, BOB       21 O.S. 1592       NA                 NA

Thank you! Does that make sense?

Anna Rouw

4 Answers

Since you haven't included a reproducible example of your data or expected output, I can't be sure, but I think what you're looking for is the simplify = TRUE argument for str_extract_all.

From the examples on ?str_extract_all:

shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")

# without simplify = TRUE
str_extract_all(shopping_list, "\\b[a-z]+\\b")
[[1]]
[1] "apples"

[[2]]
[1] "bag"   "of"    "flour"

[[3]]
[1] "bag"   "of"    "sugar"

[[4]]
[1] "milk"

# with simplify = TRUE
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
     [,1]     [,2] [,3]   
[1,] "apples" ""   ""     
[2,] "bag"    "of" "flour"
[3,] "bag"    "of" "sugar"
[4,] "milk"   ""   ""     

Using your added example:

dat <- "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

str_extract_all(dat, "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)",
                simplify = TRUE)

     [,1]             
[1,] " 21 O.S. 1541.1"
divibisan
  • Actually this worked! Thank you! Now would you know how to turn this output into data frame columns? – Anna Rouw Aug 08 '18 at 21:45
  • Are you sure there are no typos and that you're using `str_extract_all`? That error happens when you use an argument that the function doesn't recognize, usually because of a spelling error or a misplaced parenthesis that associates the argument with a different function than you intended. – divibisan Aug 08 '18 at 21:50
  • No you were right, the `simplify=TRUE` worked! I just need to turn the output into data frame columns – Anna Rouw Aug 08 '18 at 21:52
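
For the follow-up in the comments (turning the `simplify = TRUE` output into data frame columns), here is a minimal sketch using base R. It assumes the three sample rows from the question, with `id` and `name` values copied from the desired output, and it borrows the lazy pattern from another answer in this thread rather than the asker's original regex, which needs a trailing `;Count ` or `;Docket`.

```r
library(stringr)

# Sample rows assembled from the question; assumed to mirror the real data
data <- data.frame(
  id = 1:3,
  name = c("BLACK, JOHN", "DOE, JANE", "ROSS, BOB"),
  charge = c(
    "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1",
    "Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)",
    "Count #2 as Filed: In Violation of; 21 O.S. 1592"
  ),
  stringsAsFactors = FALSE
)

# One row per charge string, one column per extracted statute
m <- str_extract_all(data$charge, "(?<=Violation of;\\s).+?(?=(;|$))",
                     simplify = TRUE)

# simplify = TRUE pads short rows with "", so convert the padding to NA
m[m == ""] <- NA

# Name the columns statute1, statute2, ... and bind them to the id columns
colnames(m) <- paste0("statute", seq_len(ncol(m)))
result <- cbind(data[c("id", "name")],
                as.data.frame(m, stringsAsFactors = FALSE))

print(result)
```

The number of `statute` columns follows the row with the most matches, so this sample yields `statute1` and `statute2` only.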

This is far from the most efficient solution, but unlike the others, it is one I could understand:

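library(tidyverse)  # provides tribble(), mutate(), %>% and str_extract_all()

df <- tribble(
  ~foo,
  "1,2",
  "3,4"
)

# each column takes one column of the matrix returned by simplify = TRUE
df %>% mutate(
  col1 = str_extract_all(foo, "\\d+", simplify = TRUE)[, 1],
  col2 = str_extract_all(foo, "\\d+", simplify = TRUE)[, 2]
)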

Returns:

# A tibble: 2 x 3
  foo   col1  col2 
  <chr> <chr> <chr>
1 1,2   1     2    
2 3,4   3     4 
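
The approach above runs the regex once per new column. As a small variation (a sketch, not part of the original answer), the match matrix can be computed once and its columns assigned afterwards. The name `m` is introduced here for illustration, and the example assumes every row has exactly two matches, as in the sample.

```r
library(tibble)
library(stringr)

df <- tribble(
  ~foo,
  "1,2",
  "3,4"
)

# Run the extraction once; each row of m holds the matches for one string
m <- str_extract_all(df$foo, "\\d+", simplify = TRUE)

# Assign the matrix columns as new data frame columns
df$col1 <- m[, 1]
df$col2 <- m[, 2]

print(df)
```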
slhck

You can do this with the tidyverse package. The regex pattern from your sample doesn't work for some of the sample text provided because it always needs a trailing semicolon. The pattern used below should be simpler, but might need some tweaking depending on the actual text.

library(tidyverse)

df %>% 
  mutate(charges = str_extract_all(charge, "(?<=Violation of;\\s).+?(?=(;|$))")) %>% # extracts the different charges
  select(-charge) %>%  # dropping the raw text is optional
  unnest(charges) %>%  # separates the different charges for each name
  group_by(name) %>%   # this sample only has a name, but hopefully the real data has some sort of unique id, since there could be many Jane Does in it
  mutate(statute = paste0('statute', row_number())) %>% # adds a statute number to each charge
  spread(statute, charges) # reshapes the data from long to wide

# A tibble: 3 x 3
# Groups:   name [3]
  name       statute1     statute2          
  <chr>      <chr>        <chr>             
1 BLACK,JOHN 21 O.S. 645  21 O.S. 1541.1    
2 DOE, JANE  21 O.S. 1435 21 O.S. 1760(A)(1)
3 ROSS, BOB  21 O.S. 1592 NA                

Sample data:

# tibble() replaces the deprecated data_frame()
df <- tibble(name = c('BLACK,JOHN', 'DOE, JANE', 'ROSS, BOB'), 
             charge = c('Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1',
                        'Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)',
                        'Count #2 as Filed: In Violation of; 21 O.S. 1592'))
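
In current tidyr, `spread()` is superseded by `pivot_wider()`. The same pipeline can be sketched with the newer function as follows. This is an adaptation and not the original answer's code: it assumes tidyr 1.0 or later, and the name `wide` is introduced only for illustration.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Same sample data as above
df <- tibble(
  name = c("BLACK,JOHN", "DOE, JANE", "ROSS, BOB"),
  charge = c(
    "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1",
    "Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)",
    "Count #2 as Filed: In Violation of; 21 O.S. 1592"
  )
)

wide <- df %>%
  # list-column with one character vector of statutes per row
  mutate(charges = str_extract_all(charge, "(?<=Violation of;\\s).+?(?=(;|$))")) %>%
  select(-charge) %>%
  # one row per (name, statute) pair
  unnest(cols = charges) %>%
  group_by(name) %>%
  # number the statutes within each name
  mutate(statute = paste0("statute", row_number())) %>%
  ungroup() %>%
  # long to wide: one column per statute number, NA where a name has fewer
  pivot_wider(names_from = statute, values_from = charges)

print(wide)
```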
sbha

You can use the separate_wider_regex function:

data <- data.frame(
    charge = c("Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"))

library(tidyr)

separate_wider_regex(data, charge, too_few = "align_start", patterns = c(
  "Count #1 as Filed: In Violation of; ", statute1 = "[^;]+",
  "; Count #2 as Filed: In Violation of; ", statute2 = "[^;]+",
  "; Count #3 as Filed: In Violation of; ", statute3 = "[^;]+"))

# Output
# A tibble: 1 × 3
  statute1    statute2       statute3
  <chr>       <chr>          <chr>   
1 21 O.S. 645 21 O.S. 1541.1 NA      
Mark