1

I have an excel file I need to read that has multiple observations and values per row, with complicated names. It looks something like this when you load in:

library(tidyverse)
library(janitor)

# An input table read from xlsx, with a format similar to this
# An input table read from xlsx, with a format similar to this
input_table <- tribble(~"product" , 
                       ~"Price Store 1 ($1000/unit)",
                       ~"Quantity in Store 1 (units)",
                       ~"Price Store 2 ($1000/unit)",
                       ~"Quantity in Store 2 (units)",
                       'product a', 10, 100, 20, 70,
                       'product b', 30, 10, 35, 10)

I want to use some form of gather/pivot_longer to make it tidy, and have an output that looks like this:

# Desired output
output_table <- tribble(~'product',~'store',~'price',~'quantity',
                        'product a', 1, 10, 100,
                        'product a', 2, 20, 70,
                        'product b', 1, 30, 10,
                        'product b', 2, 35, 10)

Is there an easy way to get there using pivot_longer? Extracting the key number (in this case, store) would probably need some complex regex that I don't know how to create.

Leo Barlach
  • 480
  • 3
  • 13

2 Answers2

3

Yes, we can do

tidyr::pivot_longer(input_table, 
                   cols = -product, 
                   names_to = c(".value", "Store"),
                   names_pattern =  "(\\w+).*?(\\d)")

#  product   Store Price Quantity
#  <chr>     <chr> <dbl>    <dbl>
#1 product a 1        10      100
#2 product a 2        20       70
#3 product b 1        30       10
#4 product b 2        35       10

We get the column names (Price or Quantity) along with store number using names_pattern. The first word (\\w+) is the column name whereas first digit coming after it (\\d) is considered as store number.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

We can use the names_pattern in pivot_longer to match one or more letters followed by characters that are not a digit and capture the digit

library(tidyr)
pivot_longer(input_table, cols = -product, 
               names_to = c(".value", "Store"),
                names_pattern =  "([A-Za-z]+)[^0-9]+([0-9])")
# A tibble: 4 x 4
#  product   Store Price Quantity
#  <chr>     <chr> <dbl>    <dbl>
#1 product a 1        10      100
#2 product a 2        20       70
#3 product b 1        30       10
#4 product b 2        35       10
akrun
  • 874,273
  • 37
  • 540
  • 662