
I have been trying to plot 100 variables with millions of data points in RStudio. I used a scatter plot matrix, but RStudio stops responding, although the same code works for a few variables and data points. I would really appreciate any input on how to tackle this problem.

    daten_h19_149_small <- daten_h19_149[1:500, ]
    dim(daten_h19_149_small)
    library(lattice)
    splom(daten_h19_149_small[, 3:7])

The above works, but the following does not:

    dim(daten_h19_149_kl)  # 1811313
    splom(daten_h19_149_kl[, 1:4])

Surendra
  • Sometimes you may just have to wait. It may look like nothing is happening when the code is complete, but the `plot` panel is still processing the image. – Lime Oct 12 '20 at 14:07
  • Using a subset, say 1000 points, will likely produce a similar result but much faster. – Oliver Oct 12 '20 at 14:25
  • In a 100x100 matrix of plots, each plot will be very small and therefore not very useful. Instead, try using a subset of the variables - possibly "important" variables, or ones that are somehow related. – G5W Oct 12 '20 at 14:51
  • https://stackoverflow.com/questions/10945707/speed-up-plot-function-for-large-dataset/10946907#10946907 – Ben Bolker Oct 12 '20 at 17:46

1 Answer

This is a pretty standard problem in any language.

First step: ask yourself what story you want your graphics to tell (see also Edward Tufte's books on quality graphing). In almost all cases, plotting all of your millions of points will not enhance that story.

One option: plot every N-th point (perhaps plotting 10% of the total). This will be a lot faster and, depending on how quickly your data changes, may suffice.
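A minimal sketch of the every-N-th-point idea, assuming the data is a plain data frame of numeric columns. The data frame `dat` is a simulated stand-in for the real data, and the helper name `thin_rows` is made up for illustration:

```r
library(lattice)

# Hypothetical stand-in for the real data: 100,000 rows, 4 numeric columns.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(4e5), ncol = 4))

# Keep every n-th row of a data frame (illustrative helper, not a library function).
thin_rows <- function(df, n) {
  df[seq(1, nrow(df), by = n), , drop = FALSE]
}

dat_thin <- thin_rows(dat, 10)  # keeps rows 1, 11, 21, ... (about 10% of the data)

# pch = "." draws one small dot per point, which is cheaper to render.
splom(dat_thin[, 1:4], pch = ".")
```

If the rows are not in any meaningful order, a random sample such as `dat[sample(nrow(dat), 10000), ]` may serve equally well; taking every N-th row is mostly useful when the row order (time, for example) matters.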

Another option: use clustering or binning to reduce your dataset to some "averaged" or "centroided" collection of points. The hexbin package, which bins points into hexagonal cells and plots the count per cell, is a good thing to try here.
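A rough sketch of the binning approach with hexbin, assuming the package is installed. The vectors `x` and `y` are simulated placeholders for one pair of columns from the real data, and `xbins = 50` is an arbitrary choice:

```r
library(hexbin)

# Simulated placeholder data: one million correlated (x, y) pairs.
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- x + rnorm(n)

# Bin the points into hexagonal cells; xbins sets the number of bins along x.
hb <- hexbin(x, y, xbins = 50)

# Draws one shaded hexagon per non-empty cell instead of one symbol per point.
plot(hb)
```

The rendering cost then depends on the number of non-empty cells rather than on the number of observations, which is why this tends to stay responsive where a raw scatter plot does not.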

Carl Witthoft