
I'm working with a large dataset and trying to offload the processing to Amazon EC2 to speed it up.

The data starts as two tables, one 6.5M x 6 and one 11K x 15, which I then merge into a single 6.5M x 20 table.

Here's my R code:

library(data.table)
library(dplyr)

# download and unzip the NEI PM2.5 dataset
download.file("http://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip", "data.zip")
unzip("data.zip")

# 6.5M x 6 emissions summary and 11K x 15 source classification key
data <- readRDS("summarySCC_PM25.rds")
scckey <- readRDS("Source_Classification_Code.rds")

# join the two tables on their shared column(s) and time it
system.time(data <- data %>% inner_join(scckey))

On my home laptop (i7 @ 1.9 GHz, 8 GB RAM, SSD), here is the output:

   user  system elapsed 
 226.91    0.36  228.39 

On an Amazon EC2 c4.8xlarge (36 vCPUs, 132 ECU, 60 GB RAM, EBS storage):

   user  system elapsed 
302.016   0.396 302.422 

On an Amazon EC2 c3.8xlarge (32 vCPUs, 108 ECU, 60 GB RAM, SSD storage):

   user  system elapsed 
374.839   0.367 375.178

How can it be that both EC2 systems are slower than my own laptop? The c4.8xlarge in particular seems to be the most powerful compute instance Amazon offers.

Am I doing something wrong?

EDIT:

I've checked EC2's monitoring: the join appears to run at 3-5% CPU utilization, which seems very low; on my laptop it runs at around 30-40%.

EDIT:

Per the suggestion in the comments, I tried data.table's merge():

c3.8xlarge @ ~1% CPU utilization:

system.time(datamerge <- merge(data, scckey, by = "SCC"))
   user  system elapsed 
193.012   0.658 193.654

c4.8xlarge @ ~2% CPU utilization:

system.time(datamerge <- merge(data, scckey, by = "SCC"))
   user  system elapsed 
162.829   0.822 163.638 

Laptop:

It was initially taking 5+ minutes, so I restarted R.

system.time(datamerge <- merge(data, scckey, by = "SCC"))
   user  system elapsed 
133.45    1.34  135.81 

This is clearly a more efficient function, but my laptop is still beating the most powerful Amazon EC2 instances by a decent margin.

EDIT:

The keyed join `scckey[data]` reduces this operation to less than 1 second on my laptop. I'm still curious how I could make better use of EC2.
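
For reference, a minimal sketch of that keyed (binary-search) join, assuming SCC is the join column as in the merge() calls above:

library(data.table)

# readRDS() returns data.frames, so convert and key both tables on SCC
data   <- as.data.table(data)
scckey <- as.data.table(scckey)
setkey(data, SCC)
setkey(scckey, SCC)

# keyed join: for each row of data, look up its SCC in scckey by binary search
system.time(datamerge <- scckey[data])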

Chris
  • Unrelated, but if you're already using `data.table`, why not use its binary join capabilities? It's specially designed for big-data joins. – David Arenburg Apr 25 '15 at 21:29
  • Sorry, I'm still new to R - do you mean the `merge()` function within `data.table`? If not, can you give me an example? – Chris Apr 25 '15 at 21:36
  • No, take a look [here](http://stackoverflow.com/questions/12773822/why-does-xy-join-of-data-tables-not-allow-a-full-outer-join-or-a-left-join) for some examples – David Arenburg Apr 25 '15 at 22:14
  • Wow. You know a function is efficient when merging 130M data points is 13x as fast as running `View()` on the resulting dataset. Thank you for the help. – Chris Apr 25 '15 at 22:27
  • When you run it locally, how well does it take advantage of multiple cores? There is likely somewhat of a virtualization tax here, but it shouldn't be an issue if your workload can effectively use all cores. – datasage Apr 25 '15 at 23:17
  • `scckey[data]`, which is the binary join, finishes in less than a second. So it's hard to know how well it uses the cores. – Chris Apr 26 '15 at 00:51
  • To answer @Chris Bail - I'm using US Virginia, which is the closest server to NYC. – Chris Apr 26 '15 at 00:52
  • @Chris The largest benefit of using EC2 for data processing is when your workload requires more memory than your local machine has, or when it can be split and run on multiple cores/instances. Otherwise, the virtualization tax will reduce single-core performance compared to a bare-metal machine. – datasage Apr 26 '15 at 00:58
  • @datasage Thanks, this helps clarify. – Chris Apr 26 '15 at 01:01

1 Answer


Not that I'm an expert on Amazon EC2, but it probably uses commodity servers as its base hardware platform. "Commodity" in this context means x86 CPUs, which have the same basic architecture as your laptop. Depending on how powerful your laptop is, it may even have a higher clock speed than the cores in your EC2 instance.

What EC2 gets you is scalability, meaning more cores and memory than you have locally. But your code has to be written to take advantage of those cores, meaning it has to be parallelised in execution. I'm pretty sure data.table is single-threaded, like nearly all R packages, so getting more cores won't make things faster on its own. Also, if your data already fits in memory, getting more isn't going to produce a significant gain.
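
For illustration only (this is not the question's join, and `some_per_chunk_work` is a hypothetical placeholder for a task that can actually be split), explicitly using those cores in R typically looks like chunking the work and farming it out with the base parallel package:

library(parallel)

n_cores <- detectCores()   # e.g. 36 vCPUs on a c4.8xlarge

# split row indices into one chunk per core and process the chunks in parallel
# (mclapply forks processes on Linux/macOS; on Windows use parLapply with a cluster)
chunks  <- split(seq_len(nrow(data)), cut(seq_len(nrow(data)), n_cores))
results <- mclapply(chunks,
                    function(idx) some_per_chunk_work(data[idx, ]),
                    mc.cores = n_cores)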

Hong Ooi
  • I'm not an expert on R yet either, but from what I can find out about dplyr, which provides inner_join(), most procedures are done in parallel, e.g. http://blog.revolutionanalytics.com/2014/01/fast-and-easy-data-munging-with-dplyr.html – Chris Apr 26 '15 at 00:57
  • @Chris, I think that's a mistake; there's no parallelism implemented in dplyr yet, although it's in the works. – Arun Apr 26 '15 at 22:36