Transform data to meet normality assumptions

Question

I have a dataframe (reported below, together with the code I used) with fish biomass as my variable of interest, measured at three different stations in different years.

I want to check whether biomass differs significantly between stations.

The problem is that many values are missing: over the 18 years of measurements, Station1 has 10 measurements, Station2 only 4, and Station3 9.

So I ran shapiro.test to check normality within each group; the assumption was met only for Station3.

I then log10-transformed the data and re-ran shapiro.test. This time Station1 also met the normality assumption, but Station2 still had a p-value below 0.05 (0.03492, to be precise).

At this point I thought I would try transforming the data with log(base = 100), but when I ran shapiro.test I got this error:

Error in shapiro.test(Station2$Biomass_log100) : is.numeric(x) is not TRUE

Running the histogram gave a similar error:

Error in hist.default(Station2$Biomass_log100) : 'x' must be numeric

I checked the type of Biomass_log100, and it is reported as numeric:

> class(df$Biomass_log100)
[1] "numeric"
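One thing worth checking here: the class() call above inspects df, whereas the failing calls use Station2. In the code posted below, Station2 is created with subset() before Biomass_log100 is added to df, so the subset would not contain that column and Station2$Biomass_log100 would be NULL. The following is a minimal sketch of that behaviour on toy data; the names toy, toy_B_before and toy_B_after and the values are hypothetical and not part of the real dataset.

```r
# Toy data frame (hypothetical values, not the real measurements)
toy <- data.frame(Station = c("A", "A", "B", "B"),
                  Biomass = c(1.2, 3.4, 5.6, 7.8))

# Subset taken BEFORE the new column exists
toy_B_before <- subset(toy, Station == "B")

# New column added to the parent data frame only
toy$Biomass_log100 <- log(toy$Biomass + 1, base = 100)

class(toy$Biomass_log100)             # "numeric": the parent has the column
is.null(toy_B_before$Biomass_log100)  # TRUE: the earlier subset does not

# Re-creating the subset after adding the column picks it up
toy_B_after <- subset(toy, Station == "B")
is.numeric(toy_B_after$Biomass_log100)  # TRUE
```

If this is what happened, re-running the three subset() calls after creating Biomass_log100 should make both errors disappear.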

What would you recommend doing in this case? Should I try a different type of transformation? Or should I proceed to check for homoscedasticity and then decide which test to run, Kruskal-Wallis or ANOVA? Thank you.

Here is my whole dataframe:

> dput(df[1:54, 1:3])
structure(list(Year = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 
13L, 14L, 15L, 16L, 17L, 18L), .Label = c("1998", "1999", "2000", 
"2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", 
"2009", "2010", "2011", "2012", "2013", "2014", "2015"), class = "factor"), 
    Biomass = c(57.544, NA, NA, 12.34, 9.03, 12.67, 19.31, NA, 
    29.69, NA, 26.93, 42.023, NA, NA, NA, 12.36, 10.15, NA, NA, 
    NA, NA, NA, 3.05, NA, NA, NA, NA, NA, 11.204, NA, NA, NA, 
    NA, 2.273, 2.35, NA, NA, NA, NA, 4.49, 0.43, 2.31, 2.21, 
    NA, 2.412, NA, 10.38, NA, NA, NA, NA, 4.35, 8.71, 4.58), 
    Station = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
    ), .Label = c("Perarolo", "Termine di Cadore", "Cadola"), class = "factor")), row.names = c(NA, 
54L), class = "data.frame")
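For reference, the second option mentioned in the question (check homoscedasticity, then choose between Kruskal-Wallis and ANOVA) could be sketched as follows, using only the non-missing values from the dput() output above. This is an illustrative sketch, not a recommendation: the object names bio, fl, kw and fit are made up for the example, fligner.test() is just one possible variance test, and the sketch ignores the fact that the measurements come from repeated years.

```r
# Non-missing biomass values copied from the dput() output above
bio <- data.frame(
  Biomass = c(57.544, 12.34, 9.03, 12.67, 19.31, 29.69, 26.93, 42.023, 12.36, 10.15,
              3.05, 11.204, 2.273, 2.35,
              4.49, 0.43, 2.31, 2.21, 2.412, 10.38, 4.35, 8.71, 4.58),
  Station = factor(rep(c("Perarolo", "Termine di Cadore", "Cadola"),
                       times = c(10, 4, 9)),
                   levels = c("Perarolo", "Termine di Cadore", "Cadola"))
)

# Group sizes should match the 10 / 4 / 9 reported in the question
table(bio$Station)

# Homogeneity of variances with a rank-based test (assumed choice)
fl <- fligner.test(Biomass ~ Station, data = bio)

# Non-parametric comparison of the three stations
kw <- kruskal.test(Biomass ~ Station, data = bio)

# Parametric alternative on the log10(x + 1) scale, for comparison
fit <- aov(log10(Biomass + 1) ~ Station, data = bio)

print(fl)
print(kw)
print(summary(fit))
```

All three functions are in base R's stats package, so no extra packages are needed for this sketch.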

Here is the code I used:

pacman::p_load(pacman, rio, tidyverse)

#import dataframe

df <- import("C:/Users/soneg/OneDrive/Desktop/articolo-con-picco/dati-censimenti-pesci/excel_per_R/R_part_1.xlsx",
             sheet = "Sheet1",
             range = "A1:D55",
             col_names = TRUE,
             na = "**")

#Check if all fixed factors have been correctly identified as factor

str(df)
df$Year <- as.factor(df$Year)
df$Station <- as.factor(df$Station)
df$Discharge <- as.factor(df$Discharge)
str(df)

#Label stations

df$Station = factor(df$Station, labels = c("Perarolo", "Termine di Cadore", "Cadola"))
class(df$Station)
view(df$Station)
view(df)

#Log transform data

df$Biomass_log10<-log10(df$Biomass + 1)
hist(df$Biomass_log10, breaks = 20)

#Check for normality in each group

Station1 <- subset(df, Station == "Perarolo")
Station2 <- subset(df, Station == "Termine di Cadore")
Station3 <- subset(df, Station == "Cadola")
shapiro.test(Station1$Biomass_log10)
shapiro.test(Station2$Biomass_log10)
shapiro.test(Station3$Biomass_log10)
hist(Station2$Biomass_log10)

#Station2 is still not normally distributed, try a new transformation

df$Biomass_log100<-log(df$Biomass + 1, base = 100)
shapiro.test(df$Biomass_log100)
summary(df$Biomass_log100)
shapiro.test(Station2$Biomass_log100)
hist(Station2$Biomass_log100)
class(df$Biomass_log100)
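A side note on the base-100 transformation, offered as a sketch: log(x, base = 100) equals log10(x) / 2, so it is only a rescaling of the log10 values. The Shapiro-Wilk statistic does not depend on the scale of the data, so the base-100 version is expected to give the same result as the log10 version. The example below checks this on the four Station2 values from the dput() output; the names x, l10 and l100 are introduced here for illustration only.

```r
# The four non-missing Station2 (Termine di Cadore) values from the dput() output
x <- c(3.05, 11.204, 2.273, 2.35)

l10  <- log10(x + 1)
l100 <- log(x + 1, base = 100)

# Base-100 logs are exactly half the base-10 logs (up to floating point)
all.equal(l100, l10 / 2)

# Shapiro-Wilk is unaffected by rescaling, so both calls should agree
sw10  <- shapiro.test(l10)
sw100 <- shapiro.test(l100)
all.equal(unname(sw10$statistic), unname(sw100$statistic))
all.equal(sw10$p.value, sw100$p.value)
```

In other words, changing the base of the logarithm cannot change the outcome of the normality test; only a different family of transformation could.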

My RStudio version is: 1.4.1103

Can you provide a reproducible example? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — william3031, Mar 01 '21 at 05:14
Dear @william3031, I edited my question. I hope it is more understandable now. — user15295151, Mar 02 '21 at 15:02
