1

I have a data frame "DF" of 2020 observations and 79066 variables. The first column is the year spanning continuously from 1 to 2020, the other columns (variables) are numbers.

For reproducibility, I created I fake data frame with 20 years from 2000 to 2020, and only 100 variables. E.g.:

set.seed(123)

i <- 100

DF <- data.frame(year=c(2000:2020), 
                 setNames(
                   as.data.frame(lapply(1:i, function(k) c(rnorm(21)))), 
                   paste("Var_", 1:i, sep = "")))

I then created a Mean by row

DF$Aver <- apply(DF[, 2:101], 1, mean, na.rm=TRUE)

I then plotted the average as a line and added the points

plot(DF$year, DF$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4)) 
for (i in 2:101) {
  points(DF$year, DF[, i], pch=20, cex=1, col='gray')
}

However, what I would like to have is a scatterplot where the points close to the mean are dark grey and the grey colour goes shading (light grey) towards the tail values.

jay.sf
  • 60,139
  • 8
  • 53
  • 110

1 Answers1

0

You could normalize the variable in iteration and use it as alpha value in the rgb(red, gree, blue, alpha) function, where values are between [0, 1]. Since smallest values get zero alpha, we simply make points a second time with a very light gray.

normalize <- function(x, na.rm=FALSE) {
  na <- is.na(x)
  x[na] <- min(x, na.rm=TRUE)
  num <- x - min(x)
  denom <- max(x) - min(x)
  return(replace(num/denom, na, NA_real_))
}

plot(DF$year, DF$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4)) 
for (i in 2:101) {
  # points(DF$year, DF[, i], pch=20, cex=1, col=rgb(.8, .8, .8, 1 - normalize(abs(DF[, i]))))
  points(DF$year, DF[, i], pch=20, cex=1, col=rgb(.8, .8, .8, 1 - {nz <- normalize(abs(DF[, i])); replace(nz, is.na(nz), 0)}))
  points(DF$year, DF[, i], pch=20, cex=1, col=rgb(.1, .1, .1, .05))
  # points(DF$year, DF[, i], pch=20, cex=1, col='gray')
}

enter image description here

Update

It is not feasible to plot a huge number of points like 159,713,320 in your case. The human eye cannot resolve this, and a nasty surprise could be waiting for us in the copy store.

A common way of solving the issue is to use a smaller random sample s of the data with a fraction of the columns in the size= argument. (No need to care for integers, sample will round reals.) It will represent your data adequately.

set.seed(42)  ## for sake of reproducibility
s <- sample(2:ncol(DF_huge), size=(ncol(DF_huge) - 1)*.0005)  ## see data below

We might initialize an empty plot using type=n, since the line will be overlaid from the points` anyway.

Next, in the for loop for the points we iterate exactly over this subset s. Here with that many points, we could simplify to a single color '#RRGGBBaa' defined in hexadecimal format 1-F, where R=red, G=green, B=blue and a=alpha (opacity). I commented on the approach from above, but you might want to try both.

Finally, using lines we draw the average line as the top layer, where we may use the whole dataset.

plot(DF_huge$year, DF_huge$Aver, type="n", las=1, xlab="", ylab="", ylim=c(-4, 4)) 
for (i in s) {
  v <- .5
  # points(DF_huge$year, DF_huge[, i], pch='.', cex=3,
  #        col=rgb(v, v, v, 1 - {nz <- normalize(abs(DF_huge[, i])); replace(nz, is.na(nz), 0)}))
  points(DF_huge$year, DF_huge[, i], pch='.', cex=3, col="#88888833")
}
lines(DF_huge$year, DF_huge$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4)) 

enter image description here

See also these older questions dealing with plotting of large data: 1, 2, 3.


Data:

n <- 2020
m <- 79066
set.seed(42)
Var <- replicate(m, rnorm(n))
Var[sample(seq_along(Var), 2000)] <- NA_real_  ## produce 2000 random missings
DF_huge <- data.frame(year=seq_len(n), Var=Var)

DF_huge$Aver <- rowMeans(DF_huge[-1], na.rm=TRUE)  ## much faster than `apply(., mean)`!
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Thank you for your answer however I got this error message because I have NA values `Error in rgb(red = 0.8, green = 0.8, blue = 0.8, alpha = 1 - normalize(abs(DF[, : alpha level NA, not in [0,1]` – Alma Piermattei Jul 03 '22 at 19:46
  • @AlmaPiermattei I see, I enhanced `normalize()` a little, see if it works now. – jay.sf Jul 03 '22 at 20:08
  • Thank you, I used your modified `normalize()` function but unfortunately, I got the same error message – Alma Piermattei Jul 04 '22 at 13:52
  • @AlmaPiermattei Ok, I implemented a hack, that `replace`s NAs with 0 (i.e. no alpha thus blank in this case which makes sense), please review! – jay.sf Jul 04 '22 at 14:08
  • Thank you, I ran the script for the 2020 rows and 79066 columns, and I got this error: `Error in dev.off() : internal compression error -5 in PDF_endpage`. Additionally, I'm not so sure it works with all values. I ran the script for 1000 columns and I don't see the gradient – Alma Piermattei Jul 05 '22 at 20:50
  • @AlmaPiermattei Great, so the code works. Do I understand you right and you are attempting to plot 2020*79,066=159,713,320 points? This is quite a load, and your error might be related to that. – jay.sf Jul 06 '22 at 05:42
  • Yes, I'm trying to plot 159,713,320 points but it doesn't work – Alma Piermattei Jul 06 '22 at 20:29
  • 1
    @AlmaPiermattei You could try `png` but I'm sure, the number is not feasible for plotting—technically, and also the human eye wouldn't be able to resolve it anyway. Draw an appropriate random sample, see updated answer. – jay.sf Jul 07 '22 at 07:22