I am munging some data that requires me to combine several regex patterns using strsplit. I have figured out how to split up my string, but am struggling to apply the guidance in this post around keeping delimiters.
Here's an example of a string that I'm scraping:
text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
And here is code that successfully splits the string, but drops the character matched as the delimiter:
strsplit(as.character(text), "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[)](?=[A-Z])", perl=TRUE)
As you will note, I'm looking for places where:
- A lowercase letter is directly followed by an uppercase letter
- A digit is directly followed by an uppercase letter
- A close parenthesis is directly followed by an uppercase letter
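For a smaller reproduction, the same three alternatives can be run on a short made-up string. The names `small`, `pattern`, and `parts` are only for illustration and are not part of the scraped data; this is a sketch of the behavior I'm seeing, not a fix.

```r
# Hypothetical short input covering all three boundary types:
# lowercase|Uppercase ("dD"), close-paren|Uppercase (")W"), digit|Uppercase ("6H")
small <- "WoodDimensions (L)Weight: 6Holds"

# Same alternatives as in my real call; each one consumes a single character
# and only looks ahead at the uppercase letter that follows it
pattern <- "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[)](?=[A-Z])"

parts <- strsplit(small, pattern, perl = TRUE)[[1]]
print(parts)
# [1] "Woo"           "Dimensions (L" "Weight: "      "Holds"
```

In each piece, the character that matched before the lookahead ("d", ")", "6") is treated as the delimiter and dropped, which is exactly the trimming I want to avoid.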
Unfortunately, the output below shows the issue with my code:
[1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembl"
[2] "Material: Woo"
[3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L"
[4] "Weight: 6.0 pound"
[5] "Holds up to: 20.0 pound"
[6] "Intended Pet Type: Bir"
[7] "Care and Cleaning: Hand was"
[8] "Pet activity: Clim"
[9] "TCIN: 1670783"
[10] "UPC: 03017202559"
[11] "Item Number (DPCI): 083-01-024"
[12] "Report incorrect product information"
i.e., the last letter is trimmed from "assembly" in [1], "Wood" in [2], and so on. How can I keep the delimiter when splitting on a combination of regex patterns like mine?