How to create a scatter plot of a really big dataset

Question

I have a really big dataset, more than 3 GB in size (only the X and Y variables). The data frame IBD has 300 million rows. Is there a better and faster way to plot this?

First, I read the dataframe:

library(data.table)  # provides fread()
IBD <- fread("/40/AD/LL_Cohorts_MERGED-IBD.genome", select = c("X", "Y"))

and tried to plot, but it's been over 12 hours and I am not getting any output.

ggplot(IBD, aes(x=X, y=Y))+ 
  geom_point() + 
  ggtitle("ADGC EOAD") + 
  scale_x_continuous(limits=c(0,1)) + 
  scale_y_continuous(limits=c(0,1))

Calculate the clusters and make a scatter plot of the centroids. Otherwise, I think you will end up with a solid, completely filled rectangle with no empty space. See [this](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans). — maydin, Oct 13 '20 at 14:18
I tried with a third of the data and it gives me what I expect, so if I can plot all of it, I think it should be fine. I just need to see the outliers. — Yamuna_dhungana, Oct 13 '20 at 14:22

Dylan_Gomes · Answer 1 · 2020-10-13T14:47:44.193

One way to get the run time down is to play with smaller datasets and different code to see which would be faster.

Then you can use system.time() to see how long something takes and compare. See also: Measuring function execution time in R.

For example:

# Simulate a smaller fake dataset with X and Y between 0 and 1
size <- 100000
IBD <- data.frame(X = rbeta(n = size, shape1 = 2, shape2 = 2),
                  Y = rbeta(n = size, shape1 = 2, shape2 = 2))

Using your code on this fake dataset:

system.time(
ggplot(IBD, aes(x=X, y=Y))+ geom_point() + ggtitle("ADGC EOAD") + scale_x_continuous(limits=c(0,1)) + scale_y_continuous(limits=c(0,1))
)

   user  system elapsed 
   0.01    0.00    0.01

Using base plot as a comparison point:

system.time(
plot(Y~X, data=IBD)
)

   user  system elapsed 
   2.13    2.34    4.56

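In these timings, base plot() takes a lot longer. Note, though, that a ggplot object is only drawn when it is printed, so the ggplot timing above measures building the plot object rather than rendering it. I realize this isn't a solution to making your code faster, but it is a tool that you can use to figure out what would be faster on such a large dataset.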
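To compare rendering cost on equal footing, the ggplot call can be wrapped in print() and sent to a file device. This is a minimal sketch, assuming ggplot2 is installed; the names p, out, t_build and t_render, the seed, and the temporary PDF file are my own choices for illustration and not part of the original code:

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the fake data is reproducible
size <- 100000
IBD <- data.frame(X = rbeta(size, 2, 2), Y = rbeta(size, 2, 2))

# Build the plot object once; nothing is drawn at this point
p <- ggplot(IBD, aes(x = X, y = Y)) + geom_point()

out <- tempfile(fileext = ".pdf")  # hypothetical output file
pdf(out)                           # draw to a file so no screen device is needed

t_build  <- system.time(p)         # only evaluates the object, no drawing
t_render <- system.time(print(p))  # print() is what actually renders the plot

dev.off()

t_build
t_render
```

On most machines t_render should be noticeably larger than t_build, and it is t_render that is comparable to the base plot() timing, since plot() draws immediately.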

Edit:

Adding in the methods from comments by @maydin:

# Reduce the data to 1000 k-means centroids and plot those instead of every point
cluster <- kmeans(x = IBD, centers = 1000)
Clus <- data.frame(cluster$centers)

system.time(
  ggplot(Clus, aes(x=X, y=Y))+ geom_point() + ggtitle("ADGC EOAD") + scale_x_continuous(limits=c(0,1)) + scale_y_continuous(limits=c(0,1))
)

   user  system elapsed 
      0       0       0
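The timing above covers only plotting the centroids, not computing them, and the clustering step is likely where most of the time goes on a large dataset. A rough sketch of timing kmeans() itself follows. The smaller size and number of centers, the seed, the raised iter.max, and the extra column n are my own assumptions to keep the example quick; they are not from the original answer:

```r
set.seed(42)  # assumption: fixed seed for reproducibility
size <- 20000  # smaller than the answer's 100000 so the sketch runs quickly
IBD <- data.frame(X = rbeta(size, 2, 2), Y = rbeta(size, 2, 2))

# Time the clustering step; the assignment inside system.time()
# still creates `cluster` in the calling environment
t_cluster <- system.time(
  cluster <- kmeans(x = IBD, centers = 200, iter.max = 50)
)

# Centroid coordinates plus the number of points in each cluster
Clus <- data.frame(cluster$centers)
Clus$n <- cluster$size

t_cluster
head(Clus[order(Clus$n), ])  # the sparsest clusters first
```

Since centroids average points together, they can hide the outliers the question asks about. Keeping the cluster sizes alongside the centers, as in the n column here, is one possible way to spot sparsely populated regions.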

Absolutely it is. You can wrap that within `system.time()` to see whether it takes longer or shorter. — Dylan_Gomes, Oct 13 '20 at 18:15
