1

[Edit: The responses are great; thank you. Unfortunately, I oversimplified my reprex. My actual data have dashes inside the parentheses sometimes. Also, as was noted, sometimes there are more than two variants. I am editing my reprex accordingly.]

I am tidying a table of prices for products from a given manufacturer. The models come in variants, with each variant of the same model being the same price, so the original table creator listed the price only once, and distinguished the models with an internal parenthesized alternative scheme. E.g. "1000-a" and "1000-b" are variants of the same model, indicated on the price list as "1000-a(b)". I would like to list each model out on its own row.

Reprex:

prices <- data.frame(
  model = c("1000-a(-b)", "2000-a(b)", "150-L(R)", "TX-a(-b)-J", "350-LX(-S)(-AB)"),
  price = seq(1999,5999,1000)
)
> prices
            model price
1      1000-a(-b)  1999
2       2000-a(b)  2999
3        150-L(R)  3999
4      TX-a(-b)-J  4999
5 350-LX(-S)(-AB)  5999

Sometimes the differences are single characters, sometimes multiple. Sometimes they are at the end of the model number, sometimes in the middle. I would like code which produces this:

> desired_prices
    model price
1  1000-a  1999
2  1000-b  1999
3  2000-a  2999
4  2000-b  2999
5   150-L  3999
6   150-R  3999
7  TX-a-J  4999
8  TX-b-J  4999
9  350-LX  5999
10  350-S  5999
11 350-AB  5999

This seems related to the question Split comma-separated strings in a column into separate rows, but in my case the values aren't comma delimited.

How may I produce this split/expansion? A tidyverse solution is preferred but not required.

Rob Creel
  • 563
  • 2
  • 13
  • Is there only ever a single alternative in parentheses, or could there be multiple? Like `150-lx(sn)(av)(px)` ? – Allan Cameron Aug 26 '21 at 13:19
  • @AllanCameron, yes, there are some multiples, although there are few enough that I was willing to handle those separately. – Rob Creel Aug 26 '21 at 13:27
  • You are transforming `TXa(-b)-J` into `TX-a-J` or `TX-b-J`. It this correct? It's hard to recognise a pattern... Should it be `TX-a(-b)-J`? – Martin Gal Aug 26 '21 at 16:30

5 Answers5

0

Updated to reflect OPs revised data frame. Key regex steps:

  1. Separate all variations i.e. parentheses and content
  2. Clean reference model
  3. Separate row for each of the additional model ids
  4. Remove parentheses and dash from additional model ids
  5. Replace reference id with additional model id
library(dplyr)
library(tidyr)
library(stringr)

  prices %>% 
  mutate(var = str_extract_all(model, "\\([\\-A-z]{1,3}\\)"),
         mod_1 = str_remove_all(model, "\\(([\\-A-z]{1,3})\\)")) %>% 
  unnest(var) %>% 
  mutate(var = str_remove_all(var, "[(\\-)]"),
         mod_2 = str_replace(mod_1, "(?<=\\-)[A-z]{1,3}", var)) %>% 
  select(price, mod_1, mod_2) %>% 
  pivot_longer(-price) %>% 
  select(-name) %>% 
  distinct()

#> # A tibble: 11 x 2
#>    price value 
#>    <dbl> <chr> 
#>  1  1999 1000-a
#>  2  1999 1000-b
#>  3  2999 2000-a
#>  4  2999 2000-b
#>  5  3999 150-L 
#>  6  3999 150-R 
#>  7  4999 TX-a-J
#>  8  4999 TX-b-J
#>  9  5999 350-LX
#> 10  5999 350-S 
#> 11  5999 350-AB

Created on 2021-08-26 by the reprex package (v2.0.0)

data

prices <- data.frame(
  model = c("1000-a(-b)", "2000-a(b)", "150-L(R)", "TX-a(-b)-J", "350-LX(-S)(-AB)"),
  price = seq(1999,5999,1000))

Initial dataset from OP:

library(dplyr)
library(tidyr)
library(stringr)


prices <- data.frame(
  model = c("1000-a(b)", "2000-a(b)", "150-lx(sn)", "350-lx(sn)", "TXa(b)-J"),
  price = seq(1999,5999,1000)
)


prices1 <- 
  prices %>% 
  mutate(model1 = str_remove(model, "\\(([a-z]{1,2})\\)"),
         model2 = str_remove(model, "[a-z]{1,2}(?=\\()"),
         model2 = str_replace_all(model2, "[()]", "")) %>% 
  select(-model)%>% 
  pivot_longer(-price, values_to = "model") %>% 
  select(-name)


prices1       
#> # A tibble: 10 x 2
#>    price model 
#>    <dbl> <chr> 
#>  1  1999 1000-a
#>  2  1999 1000-b
#>  3  2999 2000-a
#>  4  2999 2000-b
#>  5  3999 150-lx
#>  6  3999 150-sn
#>  7  4999 350-lx
#>  8  4999 350-sn
#>  9  5999 TXa-J 
#> 10  5999 TXb-J

Created on 2021-08-26 by the reprex package (v2.0.0)

Peter
  • 11,500
  • 5
  • 21
  • 31
0

This is a quick amendment to Peter's answer in case the letters preceding the variant aren't lower case. This just adds a step to remove the letters of equal length of those within the bracket.

prices %>%
  mutate(
    model1 = str_remove_all(model, '\\s*\\([^)]*\\)' ), #remove everything between bracket
    size = str_extract(prices$model, '(?<=\\().+?(?=\\))') %>% nchar(),
    model2 = str_remove_all(model, sprintf("[A-Za-z]{%s}(?=\\()|\\(|\\)", size))
  ) %>% 
  select(-model, -size)%>% 
  pivot_longer(-price, values_to = "model") %>% 
  select(-name)
Quixotic22
  • 2,894
  • 1
  • 6
  • 14
0

I think that it can be simplified a bit based on what both @Peter and @Quixotic22 wrote. This assumes that the variants contain only letters:

df <- prices %>%
    mutate(
        model1 = str_replace( model, "(.+)\\((\\w+)\\)", "\\1"),
        length = nchar(sub("(.+)\\((\\w+)\\)(.+)?", "\\2", model)),
        model2 = str_replace(
            model, 
            paste("(.+)\\w{", length, "}\\((\\w+)\\)", sep=""), 
            "\\1\\2"
            )
        ) %>%
    select(-model, -length) %>%
    pivot_longer(-price, values_to = "model") %>%
    select(-name)
df
# A tibble: 10 x 2
   price model 
   <dbl> <chr> 
 1  1999 1000-a
 2  1999 1000-b
 3  2999 2000-a
 4  2999 2000-b
 5  3999 150-lx
 6  3999 150-sn
 7  4999 350-lx
 8  4999 350-sn
 9  5999 TXa-J 
10  5999 TXb-J 
mikebader
  • 1,075
  • 3
  • 12
0

This also can be done as follows using the stringr package:

prices %>%
mutate(model_x = str_replace_all(model,"\\([^()]+\\)","")) %>%
mutate(extract = str_extract(model,"\\([^()]+\\)")) %>% 
mutate(model_y = str_remove_all(
                                 str_remove_all(model, extract),"[()]")) %>% 
select(-model, -extract)%>% 
pivot_longer(-price, values_to = "model") %>% 
select(-name)

First line removes extract of anything between parenthesis i.e. () as shown in the extract column.

Second line is a mere repetition to just show what has been extracted.

Third line removes extraction from the original field along with the parenthesis.

Finally we tidy the data using pivot_longer. You may use gather as well but it has been deprecated.

M Daaboul
  • 214
  • 2
  • 4
0

I don't know if this will work in all cases:

library(dplyr)
library(tidyr)
library(stringr)

prices %>% 
  mutate(mod_1 = str_extract_all(model, "((?<=-)\\w*?(?=\\(-?\\w*?\\))|(?<=\\(-?)\\w*?(?=\\)))")) %>% 
  unnest(mod_1) %>%
  mutate(model = str_replace(model, "(?<=-)\\w*?\\(-?\\w*\\)", mod_1), .keep = "unused")

returns

# A tibble: 11 x 2
   model       price
   <chr>       <dbl>
 1 1000-a       1999
 2 1000-b       1999
 3 2000-a       2999
 4 2000-b       2999
 5 150-L        3999
 6 150-R        3999
 7 TX-a-J       4999
 8 TX-b-J       4999
 9 350-LX(-AB)  5999
10 350-S(-AB)   5999
11 350-AB(-AB)  5999
Martin Gal
  • 16,640
  • 5
  • 21
  • 39