
Background: I have a large data frame, data_2014, containing ~1,000,000 rows like this:

library(tidyverse)

tibble(
  date_time = "4/1/2014 0:11:00",
  Lat = 40.7690,
  Lon = -73.9549,
  Base = "B02512"
)

Problem: I want to create a plot like this (linked image omitted: a point map of NYC drawn from the ride pickup coordinates).

This is what I've attempted to do:

library(tidyverse)
library(ggthemes)
library(scales)

min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004

ggplot(data_2014, aes(Lon, Lat)) +
  geom_point(size = 1, color = "chocolate") +
  scale_x_continuous(limits = c(min_long, max_long)) +
  scale_y_continuous(limits = c(min_lat, max_lat)) +
  theme_map() +
  ggtitle("NYC Map Based on Uber Rides Data (April-September 2014)")

However, when I ran this code, RStudio crashed. I'm not sure how to fix or improve it. Are there any suggestions?

S10000
    Possibly useful information in this question: https://stackoverflow.com/questions/10945707/speed-up-plot-function-for-large-dataset – Ben Bolker Dec 28 '20 at 02:11
  • What operating system are you using and how much RAM is installed on the machine? I am able to produce the chart on a MacBook Pro 15 with 16 GB of RAM, using R 4.0.3 and the latest versions of the packages listed in your question. The plot took about 2 minutes to run, and another 4 minutes to render in the RStudio graphics viewer. – Len Greski Dec 28 '20 at 02:49

1 Answer


A million points is a lot for ggplot2, but doable if your computer is powerful enough; yours may or may not be. Short of getting a bigger computer, here's what you can do.

  1. This is spatial data, so use the sf package.
library(sf)

# Convert the Lon/Lat columns into point geometries.
# EPSG:4326 is plain WGS 84 longitude/latitude, matching the data.
data_2014 <- st_as_sf(data_2014, coords = c('Lon', 'Lat')) %>%
  st_set_crs(4326)
  2. If you're only plotting the points, get rid of the columns of data you don't need. I'm guessing they might include trip distance, time, borough, etc. Use dplyr's select, or whatever other method you're familiar with; a minimal sketch follows.
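For instance, since only the point locations are needed for the plot, you could keep just the geometry column that st_as_sf() creates (a minimal sketch; sf keeps the geometry column attached through a select):

# Drop everything except the point geometry (date_time, Base, etc.).
data_2014 <- data_2014 %>%
  select(geometry)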
  3. Try plotting some of the data, then a little more, and see where your computer slows down; stop there. You can plot the data from rows 1:n, or sample a number of rows.
# Try starting with 100,000 rows and go up from there.
n <- 100000
ggplot(data_2014[1:n, ]) +
  geom_sf()

# Alternatively, sample a fraction of the data.
# Start with ~10% and go up until R crashes again.
data_2014 %>%
  sample_frac(0.1) %>%
  ggplot() +
  geom_sf()
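If the plot itself succeeds but the RStudio graphics viewer is the bottleneck (see the comment above about render time), you can write the plot straight to a file and skip the viewer. A sketch, assuming a 10% sample; the file name and dimensions are arbitrary:

# Build the plot object, then render it to disk instead of the viewer.
p <- data_2014 %>%
  sample_frac(0.1) %>%
  ggplot() +
  geom_sf(size = 0.5)

ggsave("nyc_uber_2014.png", p, width = 8, height = 8, dpi = 150)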

mrhellmann