
I have simulated some data in R to represent disease prevalence curves.

I would like the data to plot as a smoothed curve, so in ggplot2 I have used the function geom_smooth. The problem is that I would like each of the lines to begin at x = y = 0, but geom_smooth has them start at varying points from 0.1 to 0.15.

**Note:** I am not asking how to make the axis begin at 0. I would like the lines on the graph to begin around zero, so this is not a duplicate.

[Image: the current plot, with the smoothed lines starting between 0.1 and 0.15]

Is there a way to constrain how the smoothing is done, so that the lines begin at the same point (0,0), or close to it, similar to this plot?

[Image: an example plot in which the lines begin at or close to (0,0)]

The script I have used to simulate the data and plot it is:

library(ggplot2)
library(tidyverse)
library(colorspace)

# Set the random seed for reproducibility
set.seed(42)

n <- 80 # Number of data points to simulate
age <- seq(0, 80, length.out = n) # Create the "age" variable ranging from 0 to 80

# Calculate "seroprev" as a natural logarithmic curve that plateaus at 0.7 from age 40 onwards
max_seroprev <- 0.7
seroprev <- pmin(max_seroprev, log(age + 1) / log(40 + 1) * max_seroprev)

# Create a data frame to store the simulated data for year 1988
simulated_data.1988 <- data.frame(seroprev, age)
simulated_data.1988$year <- "1988"

#### 1990 ####
max_seroprev <- 0.65
seroprev <- pmin(max_seroprev, log(age + 1) / log(40 + 1) * max_seroprev)
simulated_data.1990 <- data.frame(seroprev, age)
simulated_data.1990$year <- "1990"

#### 2003 ####
max_seroprev <- 0.53
seroprev <- pmin(max_seroprev, log(age + 1) / log(40 + 1) * max_seroprev)
simulated_data.2003 <- data.frame(seroprev, age)
simulated_data.2003$year <- "2003"

#### 2008 ####
max_seroprev <- 0.45
seroprev <- pmin(max_seroprev, log(age + 1) / log(40 + 1) * max_seroprev)
simulated_data.2008 <- data.frame(seroprev, age)
simulated_data.2008$year <- "2008"

#### 2011 ####
# Initialize "seroprev" with zeros
seroprev <- rep(0, n)
# Fill in a natural logarithmic curve from the 5th observation to the second-to-last one
# (n - start_age values for the n - start_age positions start_age:(n - 1));
# assigning to start_age:n instead recycles the first value into the last position
start_age <- 5
seroprev[start_age:(n - 1)] <- log(1:(n - start_age) + 1) / log((n - start_age) + 1) * 0.4
seroprev[n] <- 0.4 # hold the final observation (age 80) at the 0.4 plateau

# Round the "seroprev" values to two decimal places
seroprev <- round(seroprev, 2)
simulated_data.2011 <- data.frame(seroprev, age)
simulated_data.2011$year <- "2011"

sim.full <- rbind(simulated_data.1988, simulated_data.1990,
                  simulated_data.2003, simulated_data.2008, 
                  simulated_data.2011) #bind each simulated year
sim.full$year <- as.factor(sim.full$year)

ggplot(sim.full, aes(x = age, y = seroprev, colour = year, group = year)) + 
  geom_smooth(se = FALSE) + 
  xlab("Age (years)") + 
  ylab("Seroprevalence") + 
  scale_color_manual(values = c("1988" = "#990000",
                                "1990" = "red",
                                "2003" = "orange",
                                "2008" = "#FFCC00",
                                "2011" = "yellow")) +
  ylim(0, 1) +
  theme_classic() +
  theme(legend.title = element_blank(),
        legend.position=c(0.1,0.9), # Position legend top left
        legend.text = element_text( size = 15), 
        axis.title.x = element_text(face = "bold", size = 15), 
        axis.title.y = element_text(face = "bold", size = 15), 
        axis.text = element_text(face = "bold", size = 12)) 

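As a minimal sketch of where the offset comes from, assuming the default settings apply: for groups with fewer than 1,000 observations, geom_smooth() falls back to stats::loess() with span = 0.75, so each fitted value is a local regression over a wide window of ages. The base R snippet below rebuilds only the 1988 curve and compares the loess fit at age 0 for the default span and for a narrower one. The value span = 0.2 is an arbitrary illustrative choice, and the names d, fit_wide and fit_narrow are hypothetical.

```r
# Rebuild the 1988 curve from the script above (noise-free by construction)
n <- 80
age <- seq(0, 80, length.out = n)
seroprev <- pmin(0.7, log(age + 1) / log(40 + 1) * 0.7)
d <- data.frame(age, seroprev)

# loess with the default span (0.75) and with a narrower, illustrative span
fit_wide <- loess(seroprev ~ age, data = d, span = 0.75)
fit_narrow <- loess(seroprev ~ age, data = d, span = 0.2)

# Fitted value at age 0 for each span, next to the simulated value there
at_zero <- data.frame(age = 0)
start_wide <- unname(predict(fit_wide, newdata = at_zero))
start_narrow <- unname(predict(fit_narrow, newdata = at_zero))
print(c(data = d$seroprev[1], wide = start_wide, narrow = start_narrow))
```

A narrower span can be passed straight through, for example geom_smooth(span = 0.2, se = FALSE). It pulls the start of each line closer to the data but does not force it through (0,0). Because the simulated values contain no noise, geom_line() would trace them exactly.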

bellbyrne

  • So what type of smooth model are you trying to fit? What's the parameterization you want? The default smoother in this case is "loess" which does not have any means of forcing a line to go through a certain point. – MrFlick Jul 18 '23 at 13:43
  • I would just like to smooth the line of the simulated data. With the default geom_smooth, it is smoothed but the beginning of the line gets plotted far from 0 @MrFlick – bellbyrne Jul 18 '23 at 14:38
  • There's no smoothing method that I know of that allows for such a constraint. That's why I was asking if you knew of one. It's not clear mathematically how you would create such a line or what properties that line should have or you need it to have. Maybe you should get some statistical advice from [stats.se] first. – MrFlick Jul 18 '23 at 14:51
  • It sounds like you need to identify a function family that has the properties you’re looking for, then you can fit to the data to find representative curves. – Jon Spring Jul 18 '23 at 16:33
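One way to read the last comment, sketched here under an assumption that is not in the question: take a saturating exponential, plateau * (1 - exp(-rate * age)), as the function family. It equals 0 at age 0 for any parameter values and levels off at plateau. The snippet fits it to the 1988 curve with stats::nls(). The family, the starting values and the names d, fit and grid are illustrative choices only.

```r
# Rebuild the 1988 curve from the question's script
n <- 80
age <- seq(0, 80, length.out = n)
seroprev <- pmin(0.7, log(age + 1) / log(40 + 1) * 0.7)
d <- data.frame(age, seroprev)

# Assumed family: plateau * (1 - exp(-rate * age)); exactly 0 at age = 0
fit <- nls(seroprev ~ plateau * (1 - exp(-rate * age)),
           data = d,
           start = list(plateau = 0.7, rate = 0.1)) # illustrative starting values

# Evaluate the fitted curve on a fine grid of ages for plotting
grid <- data.frame(age = seq(0, 80, by = 0.5))
grid$fitted <- predict(fit, newdata = grid)

print(coef(fit))
print(head(grid, 3))
```

In ggplot2 the same idea can be tried per year with geom_smooth(method = "nls", formula = y ~ plateau * (1 - exp(-rate * x)), method.args = list(start = list(plateau = 0.5, rate = 0.1)), se = FALSE). Here se = FALSE is needed because predict() for nls fits does not return standard errors, and a single set of starting values may not converge for every year, the 2011 curve in particular.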

0 Answers