Keep delimiter in Strsplit with regex combinations

Question

I am munging some data that requires me to combine regex functions using strsplit. I have figured out how to split up my string, but am struggling to apply the guidance in this post around keeping delimiters.

Here's an example of a string that I'm scraping:

text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")

And, here is code that successfully splits the string, but trims the delimiter:

strsplit(as.character(text), "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[')'](?=[A-Z])", perl=TRUE)

As you will note, I'm looking for places where:

Lowercase letters are directly followed by uppercase letters
Digits are directly followed by uppercase letters
Close-parenthesis symbols are directly followed by uppercase letters

Unfortunately, the output below shows the issue with my code:

 [1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembl"
 [2] "Material: Woo"
 [3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L"
 [4] "Weight: 6.0 pound"
 [5] "Holds up to: 20.0 pound"
 [6] "Intended Pet Type: Bir"
 [7] "Care and Cleaning: Hand was"
 [8] "Pet activity: Clim"
 [9] "TCIN: 1670783"
[10] "UPC: 03017202559"
[11] "Item Number (DPCI): 083-01-024"
[12] "Report incorrect product information"

That is, the last character is trimmed from "assembly" in [1], "Wood" in [2], and so on. How can I keep the delimiter when splitting on regex combinations like mine?

Why didn't you put the consuming parts into lookbehinds? [`"(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])"`](https://regex101.com/r/WFvy8v/2) — Wiktor Stribiżew, Jul 17 '18 at 16:42

Wiktor Stribiżew · Accepted Answer · 2018-07-17T16:47:36.817

You may put the consuming patterns in your regex into lookbehinds:

> text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
> strsplit(text, "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])", perl=TRUE)
[[1]]
 [1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembly"
 [2] "Material: Wood"                                                                                                                                                                                                                                                                                                                                                
 [3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)"                                                                                                                                                                                                                                                                                     
 [4] "Weight: 6.0 pounds"                                                                                                                                                                                                                                                                                                                                            
 [5] "Holds up to: 20.0 pounds"                                                                                                                                                                                                                                                                                                                                      
 [6] "Intended Pet Type: Bird"                                                                                                                                                                                                                                                                                                                                       
 [7] "Care and Cleaning: Hand wash"                                                                                                                                                                                                                                                                                                                                  
 [8] "Pet activity: Climb"                                                                                                                                                                                                                                                                                                                                           
 [9] "TCIN: 16707835"                                                                                                                                                                                                                                                                                                                                                
[10] "UPC: 030172025594"                                                                                                                                                                                                                                                                                                                                             
[11] "Item Number (DPCI): 083-01-0246"                                                                                                                                                                                                                                                                                                                               
[12] "Report incorrect product information"

See the regex demo and the online R demo.

The [0-9] is converted to (?<=[0-9]), [a-z] is now (?<=[a-z]), and [')'] is now (?<=\)) (written as (?<=\\)) inside an R string literal, where the backslash must be doubled).

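Note that (?<=...) is a positive lookbehind: it matches a location in the string that is immediately preceded by the pattern defined inside it. Like the (?=...) lookahead, it is zero-width and consumes no characters, so strsplit cuts at that location without removing anything.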
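
As a rough illustration of the difference, the sketch below runs a consuming pattern and a zero-width pattern on a shortened string. The string `x` and the merged character class `[0-9a-z)]` are assumptions made for brevity and do not appear in the question; the class is intended to behave like the three alternatives combined.

```r
# Shortened, made-up sample in the same style as the scraped text.
x <- "WoodDimensions (L)Weight: 6.0 poundsHolds"

# Consuming version: the character before the uppercase letter is part of
# the match, so strsplit() removes it.
consumed <- strsplit(x, "[0-9a-z)](?=[A-Z])", perl = TRUE)[[1]]
print(consumed)
# [1] "Woo"  "Dimensions (L"  "Weight: 6.0 pound"  "Holds"

# Zero-width version: lookbehind + lookahead match only a position,
# so nothing is removed at the split point.
kept <- strsplit(x, "(?<=[0-9a-z)])(?=[A-Z])", perl = TRUE)[[1]]
print(kept)
# [1] "Wood"  "Dimensions (L)"  "Weight: 6.0 pounds"  "Holds"
```

Pasting the pieces of the zero-width result back together reproduces the input exactly, which is a quick way to check that no characters were lost.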

Follow-up question (and I am more than happy to post a separate question, or modify my question above): does strsplit also allow me to send the substrings to columns when there is inconsistency in the content of the substrings? For example, I might be evaluating a second string that doesn't contain the `Material` substring. I would still want to have a `Material` column when casting to a dataframe, and the value would be NA for the string that doesn't have that substring. — roody, Jul 17 '18 at 17:00
@roody It is definitely not possible with the splitting approach. You should consider matching, or even capturing, because the number of *captured* "fields" is always constant (the number of captures is defined by the number of capturing groups in the regex pattern). There is some nice code for this, see [this post](https://stackoverflow.com/questions/45802057/regex-named-groups-in-r) if you want to follow that approach. — Wiktor Stribiżew, Jul 17 '18 at 17:04
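
One possible way to follow the capturing idea in base R is sketched below. It is not the code from the linked post. The helper `extract_field`, the sample vector `descriptions`, and the chosen field names are all hypothetical, and the sketch assumes field names contain no regex metacharacters (a name such as `Dimensions (Overall)` would need escaping first).

```r
# Hypothetical helper: capture the value of one "Field: value" entry from
# each string, returning NA where the field is absent.
extract_field <- function(x, field) {
  # The value is matched lazily up to the next boundary where a digit,
  # lowercase letter or ")" is followed by an uppercase letter, or up to
  # the end of the string.
  pattern <- paste0(field, ": (.*?)(?:(?<=[0-9a-z)])(?=[A-Z])|$)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# Made-up inputs: the second string has no "Material" entry.
descriptions <- c(
  "Easy assemblyMaterial: WoodWeight: 6.0 poundsTCIN: 16707835",
  "Simple setupWeight: 2.5 poundsTCIN: 12345678"
)

# One column per wanted field, so the column set is fixed up front.
fields <- c("Material", "Weight", "TCIN")
result <- as.data.frame(
  lapply(setNames(fields, fields), function(f) extract_field(descriptions, f)),
  stringsAsFactors = FALSE
)
print(result)
```

Because the columns come from `fields` rather than from whatever each string happens to contain, a string without `Material` still gets a `Material` column, filled with NA.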
