I have a data frame with fish biomass as my variable of interest, measured at three different stations in different years (the data frame and the code I used are reported below).
I want to test whether biomass differs significantly between stations.
The problem is that many values are missing: over an 18-year range of measurements, Station1 has 10 measurements, Station2 only 4, and Station3 9.
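For reference, these counts can be reproduced from the dput() output further down. A minimal sketch, assuming the Biomass values and station order shown there (biomass and n_obs are just illustrative names):

```r
# Biomass values copied from the dput() output in this post (18 years x 3 stations)
biomass <- c(
  57.544, NA, NA, 12.34, 9.03, 12.67, 19.31, NA, 29.69,
  NA, 26.93, 42.023, NA, NA, NA, 12.36, 10.15, NA,          # Perarolo
  NA, NA, NA, NA, 3.05, NA, NA, NA, NA,
  NA, 11.204, NA, NA, NA, NA, 2.273, 2.35, NA,              # Termine di Cadore
  NA, NA, NA, 4.49, 0.43, 2.31, 2.21, NA, 2.412,
  NA, 10.38, NA, NA, NA, NA, 4.35, 8.71, 4.58               # Cadola
)
stations <- c("Perarolo", "Termine di Cadore", "Cadola")

df <- data.frame(
  Year    = factor(rep(1998:2015, times = 3)),
  Biomass = biomass,
  Station = factor(rep(stations, each = 18), levels = stations)
)

# Number of non-missing biomass measurements per station
n_obs <- tapply(df$Biomass, df$Station, function(x) sum(!is.na(x)))
print(n_obs)
```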
I first ran shapiro.test to check normality within each group; the assumption was met only for Station3.
I then log10-transformed the data (as log10(Biomass + 1)) and re-ran shapiro.test. This time Station1 also met the normality assumption, but Station2 still had a p-value below 0.05 (0.03492, to be precise).
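The per-station check can also be written in one call instead of three. A minimal sketch, assuming only the non-missing values from the dput() output further down (biomass_by_station and p_values are illustrative names, not part of my original script):

```r
# Non-missing biomass values per station, taken from the dput() output
biomass_by_station <- list(
  "Perarolo"          = c(57.544, 12.34, 9.03, 12.67, 19.31,
                          29.69, 26.93, 42.023, 12.36, 10.15),
  "Termine di Cadore" = c(3.05, 11.204, 2.273, 2.35),
  "Cadola"            = c(4.49, 0.43, 2.31, 2.21, 2.412,
                          10.38, 4.35, 8.71, 4.58)
)

# Shapiro-Wilk p-value for each station after the log10(x + 1) transformation
p_values <- sapply(
  biomass_by_station,
  function(x) shapiro.test(log10(x + 1))$p.value
)
print(p_values)
```

With only 4 observations at Station2, the test has very little information to work with, so I am unsure how much weight to give its result.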
At this point I thought of trying a base-100 log transformation (log(x, base = 100)) instead, but running shapiro.test on it gave this error:
Error in shapiro.test(Station2$Biomass_log100) : is.numeric(x) is not TRUE
Plotting the histogram gave a similar error:
Error in hist.default(Station2$Biomass_log100) : 'x' must be numeric
I checked the class of Biomass_log100, and it is reported as numeric:
> class(df$Biomass_log100)
[1] "numeric"
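One thing I have not yet ruled out: the check above is on df, while the errors come from Station2. In my script the Station1 to Station3 subsets are created before Biomass_log100 is added to df, so the subsets may simply not contain the new column (a missing column is returned as NULL, which is not numeric). A minimal sketch of that check on a toy data frame (toy, sub_before and sub_after are hypothetical names, and the values are made up, not my data):

```r
# Toy data frame; values are made up for illustration only
toy <- data.frame(
  Station = c("A", "A", "A", "B", "B", "B"),
  Biomass = c(1.2, 3.4, 2.2, 5.1, 4.8, 6.3)
)

# Subset taken BEFORE the new column exists
sub_before <- subset(toy, Station == "B")

# New column added to the full data frame only
toy$Biomass_log100 <- log(toy$Biomass + 1, base = 100)

# The full data frame has the column, the earlier subset does not
in_full   <- "Biomass_log100" %in% names(toy)
in_subset <- "Biomass_log100" %in% names(sub_before)
print(c(full = in_full, subset = in_subset))

# Subset taken AFTER the column is created does contain it
sub_after <- subset(toy, Station == "B")
print(class(sub_after$Biomass_log100))
```

If this is what happens in my script, re-creating the subsets after the transformation would be the thing to try.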
What would you recommend in this case? Should I try a different type of transformation? Or should I proceed to check for homoscedasticity and then decide which test to run, Kruskal-Wallis or ANOVA? Thank you.
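For concreteness, this is roughly the workflow I have in mind, sketched in base R on the non-missing values from the dput() output below. The choice of bartlett.test for the homoscedasticity check is only an assumption on my part (other tests exist), and dat, bart, kw and aov_fit are illustrative names:

```r
# Non-missing biomass values in long format, taken from the dput() output
stations <- c("Perarolo", "Termine di Cadore", "Cadola")
dat <- data.frame(
  Biomass = c(57.544, 12.34, 9.03, 12.67, 19.31,
              29.69, 26.93, 42.023, 12.36, 10.15,   # Perarolo (n = 10)
              3.05, 11.204, 2.273, 2.35,            # Termine di Cadore (n = 4)
              4.49, 0.43, 2.31, 2.21, 2.412,
              10.38, 4.35, 8.71, 4.58),             # Cadola (n = 9)
  Station = factor(rep(stations, times = c(10, 4, 9)), levels = stations)
)
dat$Biomass_log10 <- log10(dat$Biomass + 1)

# Homogeneity of variances across stations on the transformed scale
bart <- bartlett.test(Biomass_log10 ~ Station, data = dat)

# Non-parametric option: Kruskal-Wallis on the untransformed biomass
kw <- kruskal.test(Biomass ~ Station, data = dat)

# Parametric option: one-way ANOVA on the log10-transformed biomass
aov_fit <- aov(Biomass_log10 ~ Station, data = dat)

print(bart)
print(kw)
print(summary(aov_fit))
```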
Here is my whole dataframe:
> dput(df[1:54, 1:3])
structure(list(Year = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L), .Label = c("1998", "1999", "2000",
"2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008",
"2009", "2010", "2011", "2012", "2013", "2014", "2015"), class = "factor"),
Biomass = c(57.544, NA, NA, 12.34, 9.03, 12.67, 19.31, NA,
29.69, NA, 26.93, 42.023, NA, NA, NA, 12.36, 10.15, NA, NA,
NA, NA, NA, 3.05, NA, NA, NA, NA, NA, 11.204, NA, NA, NA,
NA, 2.273, 2.35, NA, NA, NA, NA, 4.49, 0.43, 2.31, 2.21,
NA, 2.412, NA, 10.38, NA, NA, NA, NA, 4.35, 8.71, 4.58),
Station = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L
), .Label = c("Perarolo", "Termine di Cadore", "Cadola"), class = "factor")), row.names = c(NA,
54L), class = "data.frame")
Here is the code I used:
pacman::p_load(pacman, rio, tidyverse)
#import dataframe
df <- import("C:/Users/soneg/OneDrive/Desktop/articolo-con-picco/dati-censimenti-pesci/excel_per_R/R_part_1.xlsx",
sheet = "Sheet1",
range = "A1:D55",
col_names = TRUE,
na = "**")
#Check if all fixed factors have been correctly identified as factor
str(df)
df$Year <- as.factor(df$Year)
df$Station <- as.factor(df$Station)
df$Discharge <- as.factor(df$Discharge)
str(df)
#Label stations
df$Station <- factor(df$Station, labels = c("Perarolo", "Termine di Cadore", "Cadola"))
class(df$Station)
view(df$Station)
view(df)
#Log transform data
df$Biomass_log10 <- log10(df$Biomass + 1)
hist(df$Biomass_log10, breaks = 20)
#Check for normality in each group
Station1 <- subset(df, Station == "Perarolo")
Station2 <- subset(df, Station == "Termine di Cadore")
Station3 <- subset(df, Station == "Cadola")
shapiro.test(Station1$Biomass_log10)
shapiro.test(Station2$Biomass_log10)
shapiro.test(Station3$Biomass_log10)
hist(Station2$Biomass_log10)
#Station2 is still not normally distributed, try a new transformation
df$Biomass_log100 <- log(df$Biomass + 1, base = 100)
shapiro.test(df$Biomass_log100)
summary(df$Biomass_log100)
shapiro.test(Station2$Biomass_log100)
hist(Station2$Biomass_log100)
class(df$Biomass_log100)
My RStudio version is: 1.4.1103