
I'm trying to gather a dataset that has multiple header rows:

Sample Date Site Code   Arsenic Barium  Boron   Filterable Antimony
.   .   ug/L    ug/L    ug/L    ug/L
.   .   TM66TW  TM66TW  TM66TW  TM66FW
21/07/2016  1   0.7 88.6    24  <0.3
21/07/2016  A   0.7 110 19  <0.3
15/09/2016  1   0.5 67  32  <0.3
15/09/2016  A   0.4 96.5    22  <0.3
15/09/2016  2   0.6 122 26  <0.3

I would like to gather my dataset so that the three header rows of columns 3:6 become variables, i.e. "Metal", "Unit of Detection", and "Test Method Code".

Is there an easy way to achieve this in tidyverse?

tm95
  • Use `dput()` function to re-create your dataset and edit your post. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – xwhitelight Sep 16 '20 at 04:02
  • While I generally agree with the aversion to images-of-data, @xwhitelight, it seems unlikely that this is an R object, so `dput` is not likely to be of any use. tm95, it seems that this is either xlsx or csv (or some text-delimited file); preferably the latter, in which case can you please [edit] your question and paste the top 6 (or so) lines of your text file into a [code block](https://stackoverflow.com/editing-help). Thanks! – r2evans Sep 16 '20 at 04:35
  • tm95, your last column (Antimony) is non-numeric. Any method in R that hopes to pivot the data in a way that combines (say) Antimony with Boron will either (a) silently convert all numbers to strings; or (b) complain about it and fail. You have three options, I think: (1) remove the `<` and convert those values to the number `0.3`, though this is likely wrong from a scientific-process perspective; (2) convert it to 0, if your interpretation of this is that `<0.3` effectively means zero; or (3) convert to `NA`, to differentiate between `0` and *"not a value we can use/measure"*. – r2evans Sep 16 '20 at 05:22
  • @r2evans, eventually I have a method to treat the non-detect values (by removing the "<" and halving the number). It's much easier to apply these transformations once the measured values are all in a single column. For the data gathering and manipulating stage, I've left these values as characters – tm95 Sep 16 '20 at 07:11

2 Answers


Note: Find the original dataset (i.e. df0) at the bottom.


This can be done in the tidyverse by creating a character vector in which the metal name, unit, and method are pasted together. We then use that vector to rename the columns. After reshaping the data into long format, all three pieces sit together in one column (the name column), which can be separated into three columns to get the desired output. See below:

library(dplyr)
library(tidyr)

df1 <- df0
## creating a character vector with the desired column names
df1 %>% 
  select(-c(1,2)) %>% 
  slice(1:2) %>% 
  mutate_if(is.factor, as.character) %>%
  rbind(names(.), .) %>% 
  summarise_all(toString) %>% 
  as.character -> comp.header

## renaming columns and then converting to long format  
df1 %>% 
  slice(-c(1,2)) %>% 
  rename_at(3:ncol(.), list(~comp.header)) %>% 
  pivot_longer(-c(Sample_Date, Site_Code)) %>% 
  separate(name, sep = ", ", into = c("Metal", "Detection", "Method"))

#> # A tibble: 20 x 6
#>   Sample_Date Site_Code Metal    Detection Method value
#>   <fct>       <fct>     <chr>    <chr>     <chr>  <fct>
#> 1 21/07/2016  1         Arsenic  ug/L      TM66TW 0.7  
#> 2 21/07/2016  1         Barium   ug/L      TM66TW 88.6 
#> 3 21/07/2016  1         Boron    ug/L      TM66TW 24  
#> 4 21/07/2016  1         Antimony ug/L      TM66FW <0.3 
#> 5 21/07/2016  A         Arsenic  ug/L      TM66TW 0.7  
#> 6 21/07/2016  A         Barium   ug/L      TM66TW 110
#> # ... with 14 more rows 
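
As tm95 mentions in the comments, once the measured values are all in one column the non-detects can be handled in a single pass. A minimal sketch, not part of the original answer: it assumes the pivoted result above has been saved as `long` (a name the answer itself does not use) and applies the asker's own rule of stripping the "<" and halving the number.

long %>% 
  mutate(value = as.character(value),              ## factor -> character
         is_nd = grepl("^<", value),               ## flag non-detect entries such as "<0.3"
         value = as.numeric(sub("^<", "", value)), ## drop the "<" and convert to numeric
         value = ifelse(is_nd, value / 2, value))  ## halve the value where it was a non-detect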


In base R we can achieve this by appending the unit and test-method code to the header names, then converting the data to long format, and finally separating the column holding the combined names into the three desired variables ("Metal", "Detection", and "Code"). See below:

df1 <- df0
## making sure that everything is character and not factor
df1[] <- lapply(df1, as.character)

## appending unit and method info to the header names
names(df1)[3:ncol(df1)] <- paste(names(df1)[3:ncol(df1)], 
                                 df1[1,3:ncol(df1)], 
                                 df1[2,3:ncol(df1)], sep = ";")

## removing the first two rows (unit and method rows)
df1 <- df1[3:nrow(df1),]

## converting wide to long
df2 <- cbind(df1[1:2], stack(lapply(df1[-c(1, 2)], as.character)))

## separating the ind column into the three desired variables and appending to the data frame
df2 <- data.frame(subset(df2, select = -ind), 
                  setNames(as.data.frame(do.call(rbind, strsplit(as.character(df2$ind), ';'))), 
                           c("Metal", "Detection", "Code")))

## rearranging the columns
df2 <- df2[c(1,2,4:ncol(df2), 3)]

## result
head(df2)
#>   Sample_Date Site_Code   Metal Detection   Code values
#> 1  21/07/2016         1 Arsenic      ug/L TM66TW    0.7
#> 2  21/07/2016         A Arsenic      ug/L TM66TW    0.7
#> 3  15/09/2016         1 Arsenic      ug/L TM66TW    0.5
#> 4  15/09/2016         A Arsenic      ug/L TM66TW    0.4
#> 5  15/09/2016         2 Arsenic      ug/L TM66TW    0.6
#> 6  21/07/2016         1  Barium      ug/L TM66TW   88.6
#> ...



Data:

## reading data as df0 to preserve it for both solutions
read.table(text="Sample_Date Site_Code   Arsenic Barium  Boron   Antimony
.   .   ug/L    ug/L    ug/L    ug/L
.   .   TM66TW  TM66TW  TM66TW  TM66FW
21/07/2016  1   0.7 88.6    24  <0.3
21/07/2016  A   0.7 110 19  <0.3
15/09/2016  1   0.5 67  32  <0.3
15/09/2016  A   0.4 96.5    22  <0.3
15/09/2016  2   0.6 122 26  <0.3", header= T) -> df0
M--

Step 1: extract the header info.

hdr <- readr::read_csv("~/StackOverflow/14285118/Book1.csv", n_max = 3, col_names = FALSE)
# Parsed with column specification:
# cols(
#   X1 = col_character(),
#   X2 = col_character(),
#   X3 = col_character(),
#   X4 = col_character(),
#   X5 = col_character()
# )
colnms <- unlist(hdr[1,], use.names = FALSE)
colnms
# [1] "Sample Date" "Site Code"   "Arsenic"     "Barium"      "Antimony"   

library(dplyr)
dat_units <- hdr %>%
  select(-(1:2)) %>%
  t() %>%
  as.data.frame() %>%
  setNames(., c("Metal", "Units", "TestCode"))
dat_units
#       Metal Units TestCode
# X3  Arsenic  ug/L   TM66TW
# X4   Barium  ug/L   TM66TW
# X5 Antimony  ug/L   TM66FW

(That last part is a little inelegant in the tidyverse ecosystem, but it's fast and does what we need.)
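
If you would rather keep that step inside tidyr instead of using t(), the same lookup table can be built with a longer/wider round trip. This is only a sketch of an alternative, not part of the original answer; it gives the same three columns as a tibble (without the X3/X4/X5 row names):

dat_units <- hdr %>%
  select(-(1:2)) %>%                                             ## drop the two id columns
  mutate(what = c("Metal", "Units", "TestCode")) %>%             ## label the three header rows
  tidyr::pivot_longer(-what, names_to = "col") %>%               ## one row per (header row, column)
  tidyr::pivot_wider(names_from = what, values_from = value) %>% 
  select(-col)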

Step 2: read all data, skipping the first three rows.

dat <- readr::read_csv("~/StackOverflow/14285118/Book1.csv", skip = 3, col_names = colnms)
# Parsed with column specification:
# cols(
#   `Sample Date` = col_character(),
#   `Site Code` = col_character(),
#   Arsenic = col_double(),
#   Barium = col_double(),
#   Antimony = col_character()
# )
dat
# # A tibble: 4 x 5
#   `Sample Date` `Site Code` Arsenic Barium Antimony
#   <chr>         <chr>         <dbl>  <dbl> <chr>   
# 1 21/07/2016    1               0.7   88.6 <0.3    
# 2 21/07/2016    A               0.7  110   <0.3    
# 3 15/09/2016    1               0.5   67   <0.3    
# 4 15/09/2016    A               0.4   96.5 <0.3    

(Base R's read.csv would work too, e.g. read.csv(..., skip = 3, header = FALSE, col.names = colnms, check.names = FALSE).)

Step 3: adjust <0.3 to 0.15

I'll write a function named reduce_lessthan which allows you to determine the scaling:

  • if the parameter is already numeric, it passes through untouched;
  • if a string and it is just a number, then it converts to numeric unmodified;
  • if a string and it starts with <, then it removes the <, converts to a numeric, and multiplies by a factor (mult); and
  • otherwise, it will error.

reduce_lessthan <- function(x, mult = 1) {
  if (is.numeric(x)) return(x)
  out <- numeric(length(x))
  islessthan <- grepl("^\\s*<", x)
  nolessthan <- grepl("^-?[0-9]+(\\.[0-9]*)?", x)
  out[islessthan] <- as.numeric(gsub("^\\s*<\\s*", "", x[islessthan])) * mult
  out[nolessthan] <- as.numeric(x[nolessthan])
  if (!all(islessthan | nolessthan)) {
    nm <- deparse(substitute(x))
    stop("non-number/lessthan in ", sQuote(nm), ": ",
         sQuote(x[ which(!islessthan & !nolessthan)[1] ]), " (may be more data)")
  }
  out
}
Arsenic <- c("<0.3", "0.3", "-0.3")
reduce_lessthan(Arsenic, mult = 0.5)
# [1]  0.15  0.30 -0.30
Arsenic <- ">0.5"
reduce_lessthan(Arsenic)
# Error in reduce_lessthan(Arsenic) : 
#   non-number/lessthan in 'Arsenic': '>0.5' (may be more data)

Step 4: use that last function on the data, and rejoin with the original header information.

library(dplyr)
library(tidyr) # pivot_longer
combined <- dat %>%
  mutate(`Sample Date` = as.Date(`Sample Date`, "%d/%m/%Y")) %>%
  mutate_at(vars(-one_of(colnms[1:2])), ~ reduce_lessthan(., mult = 0.5)) %>%
  pivot_longer(-one_of(colnms[1:2]), names_to = "Metal") %>%
  left_join(., dat_units, by = "Metal")
combined
# # A tibble: 12 x 6
#    `Sample Date` `Site Code` Metal     value Units TestCode
#    <date>        <chr>       <chr>     <dbl> <chr> <chr>   
#  1 2016-07-21    1           Arsenic    0.7  ug/L  TM66TW  
#  2 2016-07-21    1           Barium    88.6  ug/L  TM66TW  
#  3 2016-07-21    1           Antimony   0.15 ug/L  TM66FW  
#  4 2016-07-21    A           Arsenic    0.7  ug/L  TM66TW  
#  5 2016-07-21    A           Barium   110    ug/L  TM66TW  
#  6 2016-07-21    A           Antimony   0.15 ug/L  TM66FW  
#  7 2016-09-15    1           Arsenic    0.5  ug/L  TM66TW  
#  8 2016-09-15    1           Barium    67    ug/L  TM66TW  
#  9 2016-09-15    1           Antimony   0.15 ug/L  TM66FW  
# 10 2016-09-15    A           Arsenic    0.4  ug/L  TM66TW  
# 11 2016-09-15    A           Barium    96.5  ug/L  TM66TW  
# 12 2016-09-15    A           Antimony   0.15 ug/L  TM66FW  
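
Because the values are now numeric, the long table can be reshaped again without further cleanup. A small usage sketch (not part of the original answer), going back to one column per metal:

combined %>% 
  select(-Units, -TestCode) %>%   ## drop unit/method so each date-site-metal maps to one value
  tidyr::pivot_wider(names_from = Metal, values_from = value)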

Book1.csv:

Sample Date,Site Code,Arsenic,Barium,Antimony
,,ug/L,ug/L,ug/L
,,TM66TW,TM66TW,TM66FW
21/07/2016,1,0.7,88.6,<0.3
21/07/2016,A,0.7,110,<0.3
15/09/2016,1,0.5,67,<0.3
15/09/2016,A,0.4,96.5,<0.3
r2evans
  • Isn't this a bit too complicated? – M-- Sep 17 '20 at 00:50
  • Which part: calc of `dat_units` or calc of `combined`? The hardest part (and main difference between our answers) is that I read it as a number and kept them numbers, whereas your data is shaped well and then needs post-processing. (Come to think of it, I think your approach of discarding `numeric` from the start is commendable, I hadn't considered it.) – r2evans Sep 17 '20 at 00:53