How to load and reshape from wide to long format when columns have repeated names and no identifier

Question

I have an Excel spreadsheet that I would like to change from wide to long format in R. However, the original layout of the excelsheet makes this problematic.

Spreadsheet:

Date	Unit	Day	Treatment	Value	Unit	Day	Treatment	Value
01-03-2023	1	1	treatment1	8.5	2	1	treatment2	9.4
01-03-2023	1	2	treatment1	10.2	2	2	treatment2	10.5

This goes on for more than 200 different units, i.e. it is a very wide spreadsheet and there are more columns than added in this example

# I would like to change the format to the following:

Date	Unit	Day	Treatment	Value
01-03-2023	1	1	treatment1	8.5
01-03-2023	2	1	treatment2	9.4
02-03-2023	1	2	treatment1	10.2
02-03-2023	2	2	treatment2	10.5

The first problem arises when trying to import the sheet, because R creates unique names for each column by adding numbers sequentially to the names of each column, e.g.

Date	Unit...2	day...3	treatment...4	value...5	Unit...6	day...7	treatment...8	value...9

The second problem is how to obtain the desired long-format table using pivot_longer (or other alternatives), with grouped/repeated data like this?

The original layout of the spreadsheet cannot be modified since many persons use it and add data to it everyday.

I hope to find a solution using the pivot_longer function, since I'm familiar with this - but I will accept all suggestions with graditude

I found some inspiration from other posts, but these all have the "unit identifiers" added to each column name, whereas I have random unique identifiers for each column generated by R, e.g. https://stackoverflow.com/questions/59891956/converting-data-from-wide-to-long-format-when-id-variables-are-encoded-in-column

or

https://stackoverflow.com/questions/73765965/reshape-from-wide-to-long-with-multiple-columns-that-have-different-naming-patte

Welcome to Stack Overflow! You need to provide a [minimal, reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your data. Taking the [Tour](https://stackoverflow.com/tour) and reading [How to Ask](https://stackoverflow.com/help/how-to-ask) can be helpful as well. — M--, Mar 01 '23 at 21:40
Where did you gate `02-03-2023` from? The original data does not have that — Onyambu, Mar 01 '23 at 21:47
Read the data without the first row. Set up a loop to select the columns, 4 at a time, (looks like from Unit, Day, Treatment, Value, the columns continue in that order?) Something like `df = mydata[, i:i+3]`. Use `rbind()` in the loop to bind them into a table, using the dates as the first column. — Eric Krantz, Mar 01 '23 at 23:33

score 1 · Answer 1 · answered Mar 01 '23 at 22:49

You can do it with pivot_longer() by just not taking the numbers in the pivot and then creating the column group number after the fact. Here's an example. First, I make the data in a way that reproduces the format, but without having to read in an external file.

library(dplyr)
library(tidyverse)
dat <- read.table(textConnection('
Date    Unit    Day Treatment   Value   Unit    Day Treatment   Value
01-03-2023  1   1   treatment1  8.5 2   1   treatment2  9.4
01-03-2023  1   2   treatment1  10.2    2   2   treatment2  10.5
'), header=TRUE) 
names(dat) <- c("Date", "Unit", "Day", "Treatment", "Value", 
                "Unit", "Day", "Treatment", "Value")
dat <- as_tibble(dat, .name_repair = "unique" )
dat
#> # A tibble: 2 × 9
#>   Date       Unit...2 Day...3 Treatmen…¹ Value…² Unit.…³ Day...7 Treat…⁴ Value…⁵
#>   <chr>         <int>   <int> <chr>        <dbl>   <int>   <int> <chr>     <dbl>
#> 1 01-03-2023        1       1 treatment1     8.5       2       1 treatm…     9.4
#> 2 01-03-2023        1       2 treatment1    10.2       2       2 treatm…    10.5
#> # … with abbreviated variable names ¹Treatment...4, ²Value...5, ³Unit...6,
#> #   ⁴Treatment...8, ⁵Value...9

Now, do the pivot. The number column identifies the column group (e.g., what would be Unit_1, Day_1, Treatment_1, value_1, and then Unit_2, Day_2, Treatment_2, value_2, in a more conventional setup.

dat %>% 
  mutate(obs = 1:n()) %>% 
  pivot_longer(-c(Date, obs), names_pattern="(.*)\\.\\.\\..*", 
               names_to = ".value") %>% 
  group_by(obs) %>% 
  mutate(number = 1:n())
#> # A tibble: 4 × 7
#> # Groups:   obs [2]
#>   Date         obs  Unit   Day Treatment  Value number
#>   <chr>      <int> <int> <int> <chr>      <dbl>  <int>
#> 1 01-03-2023     1     1     1 treatment1   8.5      1
#> 2 01-03-2023     1     2     1 treatment2   9.4      2
#> 3 01-03-2023     2     1     2 treatment1  10.2      1
#> 4 01-03-2023     2     2     2 treatment2  10.5      2

^{Created on 2023-03-01 with reprex v2.0.2}

score 0 · Answer 2 · answered Mar 01 '23 at 23:39

I'm not sure I fully understand the structure of your data based on the example, but it sounds like one thing that will work is to extract each treatment unit as a sub-df and then bind_rows them together. For example if data is read using as_tibble's universal column names, you could do:

library(tidyverse)
dat <- read.table(textConnection('
Date    Unit    Day Treatment   Value   Unit    Day Treatment   Value   Unit    Day Treatment   Value
01-03-2023  1   1   treatment1  8.5 2   1   treatment2  9.4 3   1   treatment1 3.3
01-03-2023  1   2   treatment1  10.2    2   2   treatment2  10.5    3   1   treatment2  5.5
01-04-2023  1   1   treatment1  10.2    2   1   treatment2  10.5    3   1   treatment2  5.5
'), header=TRUE)

(dat <- as_tibble(dat, .name_repair = "unique")
# A tibble: 3 × 13
#  Date        Unit   Day Treatment  Value Unit.1 Day.1 Treatment.1 Value.1 Unit.2 Day.2 Treatment.2 Value.2
#  <chr>      <int> <int> <chr>      <dbl>  <int> <int> <chr>         <dbl>  <int> <int> <chr>         <dbl>
#1 01-03-2023     1     1 treatment1   8.5      2     1 treatment2      9.4      3     1 treatment1      3.3
#2 01-03-2023     1     2 treatment1  10.2      2     2 treatment2     10.5      3     1 treatment2      5.5
#3 01-04-2023     1     1 treatment1  10.2      2     1 treatment2     10.5      3     1 treatment2      5.5

dat |> 
    select(!contains(".")) |> 
    bind_rows(map(1:2, ~{
        d <- select(dat, Date, contains(paste0(".", .x)))
        colnames(d) <- str_remove(colnames(d), "\\.\\d+")
        d
    }))
# A tibble: 9 × 5
#  Date        Unit   Day Treatment  Value
#  <chr>      <int> <int> <chr>      <dbl>
#1 01-03-2023     1     1 treatment1   8.5
#2 01-03-2023     1     2 treatment1  10.2
#3 01-04-2023     1     1 treatment1  10.2
#4 01-03-2023     2     1 treatment2   9.4
#5 01-03-2023     2     2 treatment2  10.5
#6 01-04-2023     2     1 treatment2  10.5
#7 01-03-2023     3     1 treatment1   3.3
#8 01-03-2023     3     1 treatment2   5.5
#9 01-04-2023     3     1 treatment2   5.5

For a fully general solution, replace the 1:2 in map with e.g. max(as.double(str_extract(colnames(dat), "\\d+")), na.rm=TRUE)

score 0 · Answer 3 · answered Mar 01 '23 at 23:46

Read in the data with readxls readxlsx and give each block of Unit to Value its own id.

Assuming the structure: Date [(Unit Day Treatment Value) x-times]

rename function

nms[1] = Date, leave as is
nblock = change to amount of blocks
nelem = elements per block (i.e Unit, Day, Treatment, Value)

library(dplyr)
library(readxl)

nblock <- 3
nelem <- 4

rnms <- function(nms) c(nms[1], paste0(nms[-1], rep(seq(nblock), each = nelem)))

Use rnms with .name_repair to read data

df <- read_xlsx("file.xlsx", .name_repair = rnms)

df                                                                          
# A tibble: 3 × 13
  Date       Unit1  Day1 Treatme…¹ Value1 Unit2  Day2 Treat…² Value2 Unit3  Day3
  <chr>      <dbl> <dbl> <chr>      <dbl> <dbl> <dbl> <chr>    <dbl> <dbl> <dbl>
1 01-03-2023     1     1 treatmen…    8.5     2     1 treatm…    9.4     3     1
2 02-03-2023     1     2 treatmen…   10.2     2     2 treatm…   10.5     3     2
3 03-03-2023     1     3 treatmen…   11.2     2     3 treatm…   11.5     3     3
# … with 2 more variables: Treatment3 <chr>, Value3 <dbl>, and abbreviated
#   variable names ¹Treatment1, ²Treatment2
# ℹ Use `colnames()` to see all variable names

Getting the long format with bind_rows

bind_rows(lapply(1:nblock, function(x) 
  setNames(bind_cols(df[1], df[grepl(x, colnames(df))]), 
    unique(gsub("\\d+$", "", colnames(df))))))
# A tibble: 9 × 5
  Date        Unit   Day Treatment  Value
  <chr>      <dbl> <dbl> <chr>      <dbl>
1 01-03-2023     1     1 treatment1   8.5
2 02-03-2023     1     2 treatment1  10.2
3 03-03-2023     1     3 treatment1  11.2
4 01-03-2023     2     1 treatment2   9.4
5 02-03-2023     2     2 treatment2  10.5
6 03-03-2023     2     3 treatment2  11.5
7 01-03-2023     3     1 treatment3   8.5
8 02-03-2023     3     2 treatment3  10.2
9 03-03-2023     3     3 treatment3  11.2

Data

Exported (from .xlsx) as .csv

% cat file.csv
Date,Unit,Day,Treatment,Value,Unit,Day,Treatment,Value,Unit,Day,Treatment,Value
01-03-2023,1,1,treatment1,8.5,2,1,treatment2,9.4,3,1,treatment3,8.5
02-03-2023,1,2,treatment1,10.2,2,2,treatment2,10.5,3,2,treatment3,10.2
03-03-2023,1,3,treatment1,11.2,2,3,treatment2,11.5,3,3,treatment3,11.2

How to load and reshape from wide to long format when columns have repeated names and no identifier

3 Answers3

Data