I'm working with a large dataset and trying to offload the processing to Amazon EC2 to speed it up.
The data starts as two tables: 6.5M x 6 and 11K x 15. I then merge them into a single 6.5M x 20 table.
Here's my R code:
library(data.table)
library(dplyr)

# Download and unpack the NEI PM2.5 dataset
download.file("http://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip", "data.zip")
unzip("data.zip")

# Emissions data (6.5M x 6) and the source classification key (11K x 15)
data <- readRDS("summarySCC_PM25.rds")
scckey <- readRDS("Source_Classification_Code.rds")

# Join the two tables and time it
system.time(data <- data %>% inner_join(scckey))
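(The shared column the join auto-detects is SCC, per the merge() calls further down; the same call with the key spelled out explicitly would be:)

# same join, with the join column named explicitly
system.time(joined <- data %>% inner_join(scckey, by = "SCC"))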
On my home laptop (i7 @ 1.9 GHz, 8 GB RAM, SSD), here is my output:
user system elapsed
226.91 0.36 228.39
On an Amazon EC2 c4.8xlarge (36 vCPU, 132 ECU, 60 GB RAM, EBS storage):
user system elapsed
302.016 0.396 302.422
On an Amazon EC2 c3.8xlarge (32 vCPU, 108 ECU, 60 GB RAM, SSD storage):
user system elapsed
374.839 0.367 375.178
How can both EC2 systems be slower than my own laptop? The c4.8xlarge in particular seems to be the most powerful compute instance Amazon offers.
Am I doing something wrong?
EDIT:
I've checked the instance monitoring: the join runs at about 3-5% CPU utilization, which seems very low; on my laptop it runs at around 30-40%.
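That low utilization would be consistent with the join keeping roughly one core busy on a 32-36 vCPU machine (about 3%). A quick way to check how much parallelism R actually has available (getDTthreads() assumes a reasonably recent data.table built with OpenMP support):

parallel::detectCores()        # logical cores visible to R
data.table::getDTthreads()     # threads data.table is configured to use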
EDIT:
Per suggestion, I tried data.table's merge().
c3.8xlarge @ ~1% CPU utilization:
system.time(datamerge <- merge(data, scckey, by = "SCC"))
user system elapsed
193.012 0.658 193.654
c4.8xlarge @ ~2% CPU utilization:
system.time(datamerge <- merge(data, scckey, by = "SCC"))
user system elapsed
162.829 0.822 163.638
Laptop:
It was initially taking 5+ minutes, so I restarted R first.
system.time(datamerge <- merge(data, scckey, by = "SCC"))
user system elapsed
133.45 1.34 135.81
This is obviously a more efficient function, but I'm still beating the best Amazon EC2 machines by a decent margin.
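(One thing I haven't ruled out: if data and scckey are still plain data.frames as returned by readRDS(), merge() dispatches to base R's merge.data.frame rather than data.table's method. A minimal sketch of forcing the data.table method, assuming that's the issue:)

library(data.table)
setDT(data)       # convert to data.table in place, without copying
setDT(scckey)
system.time(datamerge <- merge(data, scckey, by = "SCC"))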
EDIT:
scckey[data] reduces the time for this operation to less than 1 second on my laptop. I'm still curious as to how I could make better use of EC2.
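For reference, a minimal version of the setup behind that X[Y] join, assuming both tables are data.tables keyed on SCC:

library(data.table)
setDT(data)
setDT(scckey)
setkey(data, SCC)        # sort/index both tables on the join column
setkey(scckey, SCC)
system.time(result <- scckey[data])   # for each row of data, look up its SCC in scckey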