I am trying to use the pls package to analyse my data in R. My data is similar to the gasoline data set: it contains many columns of UV data (at different wavelengths) and one column of alum data. The gasoline data contains a numeric vector (octane) and a matrix with 401 columns (NIR), and the NIR data appears to be treated as a single group. I want to format my data just like the gasoline data and use code similar to the following.

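library(pls)
data("gasoline")
# gasTrain is the training subset used in the pls vignette (first 50 samples)
gasTrain <- gasoline[1:50, ]
gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")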

A small subset of my data is shown in the dput() output further down.

I have tried

library(readxl)
Data <- read_excel("test.xlsx")
x = as.matrix(Data[,1:6])
y = Data[,7] 
df1 <- data.frame(x,y)

but this did not produce a data frame structured like the gasoline data.

Please help me format my data like the gasoline data, so that I can use the pls code to process it and use the UV data to predict alum. Any suggestions are welcome. Many thanks. :)

The gasoline data comes from the pls package in R.

I used the dput() function to show my data below.

dput(head(Data))

structure(list(`UV. 200 nm` = c(35.0310061349693, 34.5507472222222, 
34.3612970711297, 33.942698457223, 33.7440041666667, 33.5717955493741
), `UV. 222.5 nm` = c(34.3149110429448, 33.8141833333333, 33.6073877266388, 
33.181190743338, 32.9606347222222, 32.7796870653686), `UV. 225 nm` = c(33.4781748466258, 
32.9576319444444, 32.7334881450488, 32.2993730715287, 32.0620333333333, 
31.870173852573), `UV. 227.5 nm` = c(32.7270429447853, 32.1803916666667, 
31.9470181311018, 31.5060967741936, 31.2553597222222, 31.0520792767733
), `UV. 230 nm` = c(32.0851104294479, 31.5236361111111, 31.2877782426778, 
30.8468849929874, 30.586125, 30.3832002781641), `UV. 232.5 nm` = c(31.1708558282209, 
30.6077847222222, 30.3719414225941, 29.9375497896213, 29.6742291666667, 
29.4762865090403), Alum = c(76.000324025669, 75.95384102484, 
75.9992186218653, 75.9955211469609, 75.9996022222152, 76.0093745773557
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
UseR10085
Linda
  • We would need a minimal reproducible example (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to help you. – Desmond Jun 15 '20 at 02:45
  • Hello @Linda, please run `dput(head(gasoline))` in the console and paste the output into your question, so we can understand the data frame you need. – Alexis Jun 15 '20 at 02:58
  • The gasoline data comes from the pls package in R. – Linda Jun 15 '20 at 03:33
  • I want to apply plsr to my data (UV data and alum data) by following the example code that applies plsr to the gasoline data. I am still new to R and don't know other ways to do it. – Linda Jun 15 '20 at 03:43
  • I tried dput(head(gasoline)), but the result is very long, so I removed some numbers and wavelengths. the output is: structure(list(octane = c(85.3, 85.25, 88.45, 83.4, 87.9, 85.5 ), NIR = structure(c(-0.050193, -0.044227, -0.0468........ 1.221135, 1.198851, 1.208742, 1.206696, 1.202926, 1.207576), .Dim = c(6L, 401L), .Dimnames = list(c("1", "2", "3", "4", "5", "6"), c("900 nm", "902 nm", "904 nm", "906 nm", "908 nm", ......"1690 nm", "1692 nm", "1694 nm", "1696 nm", "1698 nm", "1700 nm" )), class = "AsIs")), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame") – Linda Jun 15 '20 at 03:48
  • @RonakShah, @Desmond, @Alexis, many thanks for the suggestions. I have edited the question; hopefully it is clearer now. I used the dput function on my data and showed the output. – Linda Jun 15 '20 at 03:56

3 Answers

In addition to @Ronak Shah's answer, you can fit the PLS model with a more general formula, for example:

library(pls)
data("gasoline")
gas1 <- plsr(octane ~ ., ncomp = 10, data = gasoline, validation = "LOO") 

where octane is the dependent variable (y) and all other variables are used as independent variables. You can then select the optimum number of components (ncomp) using the following functions:

selectNcomp(gas1, "onesigma", plot = TRUE)
selectNcomp(gas1, "randomization", plot = TRUE)

Likewise, for your dataset you can use the following code

pls.fit <- plsr(Alum ~ ., ncomp = 2, data = df1, validation = "LOO")

where Alum is the dependent variable and df1 is the data frame built in your question. Note that ncomp cannot exceed the number of independent variables, and it is also limited by the number of samples available for fitting. When you have many independent variables (>100), which is typically the case for spectroscopic data, you should ideally start with a sufficiently large value for ncomp (e.g. 40 or 50). You can then select the optimum number of components, or latent variables (LVs), using

selectNcomp(pls.fit, "onesigma", plot = TRUE)
selectNcomp(pls.fit, "randomization", plot = TRUE)
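
As a rough sketch of how the selected number of components might then be used, the following fits the gasoline model, inspects the cross-validated RMSEP and predicts held-out samples. The 50/10 train/test split and the names gasTrain, gasTest, n_opt and test_rmse are assumptions made for illustration; they are not part of the code above.

```r
library(pls)
data("gasoline")

# Assumed split: first 50 samples for training, last 10 held out for testing
gasTrain <- gasoline[1:50, ]
gasTest  <- gasoline[51:60, ]

fit <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")

# Cross-validated RMSEP for 0 to 10 components
cv_rmsep <- RMSEP(fit, estimate = "CV")
print(cv_rmsep)

# One-sigma heuristic; selectNcomp() can return 0, so keep at least 1 component
n_opt <- max(1, selectNcomp(fit, method = "onesigma", plot = FALSE))

# Predict the held-out samples with the chosen number of components
pred <- predict(fit, ncomp = n_opt, newdata = gasTest)
test_rmse <- sqrt(mean((gasTest$octane - drop(pred))^2))
print(test_rmse)
```

The same steps should carry over to your own data once it is in a data frame, with Alum in place of octane.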

Hope this helps you out.

UseR10085

The pls help file recommends creating a structure like gasoline (Section 4.2, "Data Frames"). If you want to do this, put the first lines of your example matrix in a tab-separated text file and then use the following sample code. The multi-column matrix should be protected with the "protect" function I():

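# dataFile: path to the tab-separated text file
Data <- read.csv(dataFile, header = TRUE, sep = "\t", check.names = FALSE)
UV <- as.matrix(Data[, 1:6])    # spectra as a single matrix
Alum <- Data[, 7]               # response as a plain vector
df1 <- data.frame(I(UV), Alum)  # I() keeps UV as one matrix column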

The resulting data frame has the same structure as the gasoline object.
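
To check the structure without a file, here is a minimal sketch that rebuilds the six rows from the question's dput() output inline (values rounded to four decimals for brevity, which is my simplification) and confirms that UV ends up as a single matrix column. Putting Alum first, as in gasoline, and using ncomp = 2 are illustrative choices only.

```r
library(pls)

# Six rows from the question's dput() output, rounded to 4 decimals
Data <- data.frame(
  `UV. 200 nm`   = c(35.0310, 34.5507, 34.3613, 33.9427, 33.7440, 33.5718),
  `UV. 222.5 nm` = c(34.3149, 33.8142, 33.6074, 33.1812, 32.9606, 32.7797),
  `UV. 225 nm`   = c(33.4782, 32.9576, 32.7335, 32.2994, 32.0620, 31.8702),
  `UV. 227.5 nm` = c(32.7270, 32.1804, 31.9470, 31.5061, 31.2554, 31.0521),
  `UV. 230 nm`   = c(32.0851, 31.5236, 31.2878, 30.8469, 30.5861, 30.3832),
  `UV. 232.5 nm` = c(31.1709, 30.6078, 30.3719, 29.9375, 29.6742, 29.4763),
  Alum           = c(76.0003, 75.9538, 75.9992, 75.9955, 75.9996, 76.0094),
  check.names = FALSE
)

# Group the six UV columns into one matrix and protect it with I()
UV  <- as.matrix(Data[, 1:6])
df1 <- data.frame(Alum = Data$Alum, UV = I(UV))

str(df1)  # two columns: Alum (numeric) and UV (a 6 x 6 "AsIs" matrix)

# The gasoline-style formula now applies; with only 6 samples, keep ncomp small
pls.fit <- plsr(Alum ~ UV, ncomp = 2, data = df1, validation = "LOO")
summary(pls.fit)
```

Six samples are far too few for a meaningful model; the fit is included only to show that the formula accepts the grouped column.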

Jennifer B

You can pass the predictors to plsr as a matrix directly; there is no need to restructure the data like gasoline.

For example, for the data you shared you can use something like this:

library(pls)
# Data is the data frame read in the question; column 7 is Alum
gas1 <- plsr(Alum ~ as.matrix(Data[-7]), data = Data)
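
To illustrate that the two layouts should lead to the same model, here is a small sketch using the gasoline data, where the spectra are first spread into ordinary columns and then passed back as a matrix inside the formula. The names wide, x_cols, fit_wide and fit_grouped, and the choice of ncomp = 3, are assumptions for this example.

```r
library(pls)
data("gasoline")

# Hypothetical "wide" layout: one ordinary column per wavelength plus octane
wide <- as.data.frame(unclass(gasoline$NIR))
wide$octane <- gasoline$octane
x_cols <- setdiff(names(wide), "octane")

# Predictor matrix built inside the formula
fit_wide <- plsr(octane ~ as.matrix(wide[x_cols]), ncomp = 3, data = wide)

# Predictor matrix stored as a single column, as in gasoline
fit_grouped <- plsr(octane ~ NIR, ncomp = 3, data = gasoline)

# Compare fitted values, ignoring dimnames
same_fit <- all.equal(fitted(fit_wide), fitted(fit_grouped),
                      check.attributes = FALSE)
print(same_fit)
```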
Ronak Shah