R - split column based on text pattern

Question

I want to split a column in R that has description for Size, Color and customer for an apparel product.

The column looks like this

Size:AS - Adult Small, Colours: Black , Sold By: Elmwood Size:YS - Youth Small, Colours: Black, Sold By: Rosemary Brown PS Colours - Black, Sold By: Rosemary Brown PS

so, some columns has size, some starts with color

I need to make three columns - 1. Size 2. Colors 3. Sold By For size - I just need the two letter size code , For Color - Need Color Description For Sold By : I need to parse it until it find any 'comma'

Thanks for your help

This could use some clarification on what the input looks like. Right now I would interpret this as a single string, that includes multiple entries, and that mostly, but not always, has fields separated by commas. Please clarify: Do you reasonably expect all the fields to be comma-separated? [Is "Rosemary Brown PS Colours" a typo missing the comma, or is that correct?] Are multiple apparel items included in one input entry (e.g., multiple on a line or in a cell)? Can you expect the order to always be Size, Colours, Sold By, even if some might be missing? — Darlingtonia, Jul 01 '22 at 18:51
Hello Sorry for not explaining Clearly. So the columns looks like this - Column 1 : Size: AS - Adult Small, Colours: Black , Sold By: Elmwood Column 2 : Size:YS - Youth Small, Colours: Black, Sold By: Rosemary Brown Column 3 : Colours - Black, Sold By: Rosemary Brown PS Thanks, I want to split each column into three column by Size, Color and Sold by. But Some columns do not have sizes, it only has colors and sold by. In that case, the size field for that record cound be NA or empty. Hope I made it clear. Thanks — Prasenjit Datta, Jul 01 '22 at 21:12
Please save your and our time and read our [tutorial](https://stackoverflow.com/a/5963610/6574038) and use the [edit](https://stackoverflow.com/posts/72833410/edit) functionalities, cheers! — jay.sf, Jul 01 '22 at 21:16
Please provide enough code so others can better understand or reproduce the problem. — Community, Jul 02 '22 at 08:39

Level-Headed Services · Answer 1 · 2022-07-02T00:12:26.000

This should do the trick for the specific example you gave:

library(tidyverse)

df <- tribble(
  ~col1, 
  'Size: AS - Adult Small, Colours: Black , Sold By: Elmwood , Some Other Column: Other Data ',   
  'Size:YS - Youth Small, Colours: Black, Sold By: Rosemary Brown, Some Other Column: Other Data ',   
  'Colours: Black, Sold By: Rosemary Brown, Some Other Column: Other Data '
) %>% 
  mutate(
    size = str_trim(str_extract(col1, "(?<=Size:)[^,]*(?= -)")),
    colours = str_trim(str_extract(col1, "(?<=Colours:)[^,]*(?=,)")),
    sold_by = str_trim(str_extract(col1, "(?<=Sold By:)[^,]*(?=,)"))
  ) %>% 
  select(-col1)

OUTPUT:

# A tibble: 3 × 3
  size  colours sold_by       
  <chr> <chr>   <chr>         
1 AS    Black   Elmwood       
2 YS    Black   Rosemary Brown
3 NA    Black   Rosemary Brown

You can alter the [regular expressions](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) to deal with particular changes you'd like to make in the pattern matching. — Level-Headed Services, Jul 01 '22 at 23:45

R - split column based on text pattern

1 Answers1