I am munging some data that requires me to combine regex functions using strsplit. I have figured out how to split up my string, but am struggling to apply the guidance in this post around keeping delimiters.

Here's an example of a string that I'm scraping:

text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")

And, here is code that successfully splits the string, but trims the delimiter:

strsplit(as.character(text), "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[')'](?=[A-Z])", perl=TRUE)

As you will note, I'm looking for places where:

  • Lowercase letters are directly next to uppercase letters
  • Numbers next to uppercase letters
  • Close-parentheses symbols next to uppercase letters

Unfortunately, the output below shows the issue with my code:

[1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembl"
[2] "Material: Woo"
[3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L"
[4] "Weight: 6.0 pound"
[5] "Holds up to: 20.0 pound"
[6] "Intended Pet Type: Bir"
[7] "Care and Cleaning: Hand was"
[8] "Pet activity: Clim"
[9] "TCIN: 1670783"
[10] "UPC: 03017202559"
[11] "Item Number (DPCI): 083-01-024"
[12] "Report incorrect product information"

That is, the last character is trimmed from "assembly" in [1], "Wood" in [2], and so on. How can I keep the delimiter when splitting on a combination of patterns like mine?

roody
  • Why didn't you put the consuming parts into lookbehinds? [`"(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])"`](https://regex101.com/r/WFvy8v/2) – Wiktor Stribiżew Jul 17 '18 at 16:42

1 Answer

You may put the consuming patterns in your regex into lookbehinds:

> text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
> strsplit(text, "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])", perl=TRUE)
[[1]]
 [1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembly"
 [2] "Material: Wood"
 [3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)"
 [4] "Weight: 6.0 pounds"
 [5] "Holds up to: 20.0 pounds"
 [6] "Intended Pet Type: Bird"
 [7] "Care and Cleaning: Hand wash"
 [8] "Pet activity: Climb"
 [9] "TCIN: 16707835"
[10] "UPC: 030172025594"
[11] "Item Number (DPCI): 083-01-0246"
[12] "Report incorrect product information"

See the regex demo and the online R demo.

The [0-9] is converted to (?<=[0-9]), [a-z] is now (?<=[a-z]), and [')'] is now (?<=\)) (written as (?<=\\)) inside an R string literal, where the backslash itself must be escaped).

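Note that (?<=...) is a positive lookbehind: it matches a location in the string that is immediately preceded by the pattern inside the lookbehind, without consuming any characters. Since both the lookbehind and the lookahead are zero-width, strsplit splits at that position and removes nothing.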
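To make the difference concrete, here is a minimal sketch that runs both patterns on a short string. The sample string `x` is a made-up stand-in for the scraped text, not data from the question; the two patterns are the ones from the question and from this answer.

```r
# Hypothetical short sample (an assumption for illustration, not the real scraped text)
x <- "Material: WoodWeight: 6.0 poundsTCIN: 123UPC: 456"

# Consuming pattern from the question: the character before the uppercase
# letter is part of the match, so strsplit removes it.
consumed <- strsplit(x, "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[')'](?=[A-Z])", perl = TRUE)[[1]]

# Zero-width pattern from the answer: lookbehind plus lookahead match a
# position only, so nothing is removed.
kept <- strsplit(x, "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
# [1] "Material: Woo"     "Weight: 6.0 pound" "TCIN: 12"          "UPC: 456"
print(kept)
# [1] "Material: Wood"     "Weight: 6.0 pounds" "TCIN: 123"          "UPC: 456"
```

The only change between the two calls is where the preceding character is tested: inside the match (consumed) or inside a lookbehind (kept).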

Wiktor Stribiżew
  • Follow-up question (and I am more than happy to post a separate question, or modify my question above): does strsplit also allow me to send the substrings to columns when there is inconsistency in the content of the substrings? For example, I might be evaluating a second string that doesn't contain the `Material` substring. I would still want to have a `Material` column when casting to a dataframe, and the value would be NA for the string that doesn't have that substring. – roody Jul 17 '18 at 17:00
  • @roody It is definitely not possible with the splitting approach. You should consider matching, or even capturing, because the number of *captured* "fields" is always constant (the number of captures is defined by the number of capturing groups in the regex pattern). There is nice code written by someone else for this; see [this post](https://stackoverflow.com/questions/45802057/regex-named-groups-in-r) if you want to follow that approach. – Wiktor Stribiżew Jul 17 '18 at 17:04
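As a rough illustration of the matching approach mentioned in the last comment (not the code from the linked post), the sketch below extracts one capture per field with base R's `regexec` and `regmatches` and fills in NA where a field is absent. The helper `extract_field`, the sample vector `products`, and the `fields` list are all hypothetical names introduced here, and the helper assumes field names contain no regex metacharacters.

```r
# Hypothetical sample: the second string has no "Material" field
products <- c(
  "Material: WoodWeight: 6.0 poundsTCIN: 123",
  "Weight: 2.5 poundsTCIN: 456"
)

# Hypothetical helper: capture the text after "<field>: " up to the next
# lowercase/digit/")" -> uppercase boundary (or end of string); NA if absent.
# Assumes `field` contains no regex metacharacters.
extract_field <- function(x, field) {
  pattern <- paste0(field, ": (.*?)(?:(?<=[0-9a-z)])(?=[A-Z])|$)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full match, capture 1), or character(0) when there is no match
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

# One column per expected field, regardless of which fields each string has
fields <- c("Material", "Weight", "TCIN")
result <- as.data.frame(
  sapply(fields, function(f) extract_field(products, f), simplify = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Because the column set comes from `fields` rather than from whatever substrings happen to be present, every input row gets the same columns, which is the property the splitting approach cannot guarantee.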