
I have a CSV file consisting of around 200,000 rows of transactions. Here is the import and a little preprocessing of the data:

data <- read.csv("bitfinex_data/trades.csv", header = TRUE)
data$date <- as.character(data$date)   # raw timestamps appear to be Unix epoch values
data$date <- substr(data$date, 1, 10)  # keep the first 10 digits, i.e. whole seconds
data$date <- as.numeric(data$date)
data$date <- as.POSIXct(data$date, origin = "1970-01-01", tz = "GMT")

head(data)

        id exchange symbol                date price     amount  sell
1 24892563       bf btcusd 2018-01-02 00:00:00 13375 0.05743154 False
2 24892564       bf btcusd 2018-01-02 00:00:01 13374 0.12226129 False
3 24892565       bf btcusd 2018-01-02 00:00:02 13373 0.00489140 False
4 24892566       bf btcusd 2018-01-02 00:00:02 13373 0.07510860 False
5 24892567       bf btcusd 2018-01-02 00:00:02 13373 0.11606086 False
6 24892568       bf btcusd 2018-01-02 00:00:03 13373 0.47000000 False

My goal is to obtain hourly sums of the amount of tokens traded. For this I need to split my data by hour, which I did in the following way:

tmp <- split(data, cut(data$date, "hour"))

However, this takes far too long (up to an hour), and I wonder whether this is normal behaviour for functions such as split() and cut(). Is there an alternative to using these two functions?

UPDATE:

After applying the great suggestion by @Maurits Evers, my output table looks like this:

# A tibble: 25 x 2
   date_hour     amount.sum
   <chr>              <dbl>
 1 1970-01-01 00       48.2
 2 2018-01-02 00     2746. 
 3 2018-01-02 01     1552. 
 4 2018-01-02 02     2010. 
 5 2018-01-02 03     2171. 
 6 2018-01-02 04     3640. 
 7 2018-01-02 05     1399. 
 8 2018-01-02 06      836. 
 9 2018-01-02 07      856. 
10 2018-01-02 08      819. 
# ... with 15 more rows

This is exactly what I wanted, except for the first row, where the date is from the year 1970. Any suggestions on what might be causing this? I tried changing the origin parameter of the as.POSIXct() function, but that did not solve the problem.
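
For reference, one way to track down the offending rows would be to filter the parsed data for timestamps before the file's known start (a minimal sketch against the data frame above; the 2018-01-01 cutoff is an assumption based on the head() output):

bad <- data[data$date < as.POSIXct("2018-01-01", tz = "GMT"), ]  # rows that parsed to pre-2018 dates
head(bad)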

dpel
  • Share example data or... (reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – s_baldur Jun 18 '18 at 10:09
  • R's `split` function will create a list of data frames, split by whatever column you tell it to use. Is this what you really want? Or do you want something else? What is your expected output? – Tim Biegeleisen Jun 18 '18 at 10:10
  • "My goal is to obtain hourly sums of amount of goods being traded." Don't split the data. Use package data.table or package dplyr for their "apply-by-group" functionality. – Roland Jun 18 '18 at 10:11
  • You might get a speed up by using `trunc` rather than `cut` (see the sketch after these comments) – James Jun 18 '18 at 10:27
  • Relevant answer on different split-apply alternatives: https://stackoverflow.com/a/34167477/1851712 – Henrik Jun 18 '18 at 10:51
  • Regarding the 1970 timestamp: perhaps you have a `0` in your file where there should be a timestamp? `.POSIXct(0)` is `1970-01-01 00:00:00`. – Enrico Schumann Jun 18 '18 at 12:58
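
Following James's trunc suggestion above, a base R version of that idea might look like this (a hypothetical sketch against the question's data frame, not part of the original thread):

data$date_hour <- as.POSIXct(trunc(data$date, units = "hours"))  # floor each timestamp to the hour
hourly <- aggregate(amount ~ date_hour, data = data, FUN = sum)  # sum amount within each hour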

2 Answers


I agree with @Roland's comment. To illustrate, here is an example.

  1. Let's generate some data with 200,000 entries at one-minute intervals.

    set.seed(2018)
    df <- data.frame(
        date = seq(from = as.POSIXct("2018-01-01 00:00"), by = "min", length.out = 200000),
        amount = runif(200000))
    head(df)
    #                 date     amount
    #1 2018-01-01 00:00:00 0.33615347
    #2 2018-01-01 00:01:00 0.46372327
    #3 2018-01-01 00:02:00 0.06058539
    #4 2018-01-01 00:03:00 0.19743361
    #5 2018-01-01 00:04:00 0.47431419
    #6 2018-01-01 00:05:00 0.30104860
    
  2. We now (1) create a new column date_hour that contains the date and hour parts of the full timestamp, (2) group_by column date_hour, and (3) sum the entries in column amount to give amount.sum.

    library(dplyr)  # needed for the pipe and the verbs below

    df %>%
        mutate(date_hour = format(date, "%Y-%m-%d %H")) %>%
        group_by(date_hour) %>%
        summarise(amount.sum = sum(amount))
    ## A tibble: 3,333 x 2
    #   date_hour     amount.sum
    #   <chr>              <dbl>
    # 1 2018-01-01 00       28.9
    # 2 2018-01-01 01       26.4
    # 3 2018-01-01 02       32.7
    # 4 2018-01-01 03       29.9
    # 5 2018-01-01 04       29.7
    # 6 2018-01-01 05       28.5
    # 7 2018-01-01 06       34.2
    # 8 2018-01-01 07       33.8
    # 9 2018-01-01 08       30.7
    #10 2018-01-01 09       27.7
    ## ... with 3,323 more rows
    

This is very fast (it takes around 0.3 seconds on my 2012 MacBook Air), and you should be able to easily adjust this example to your particular case.
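
Since @Roland's comment also mentions data.table, the same aggregation might look like this there (just a sketch using the simulated df from above, not benchmarked):

library(data.table)
dt <- as.data.table(df)            # convert the simulated data to a data.table
dt[, .(amount.sum = sum(amount)),  # sum amount within each group
   by = .(date_hour = format(date, "%Y-%m-%d %H"))]  # group on the date-and-hour string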

Maurits Evers
  • @MetodJazbec ?? Your comment seems to be incomplete (there is a limit on the number of characters, and posting extended code or code output in comments is not a good idea anyway); to store the output in a new variable just do `df.new <- df %>% ...` where you include the `magrittr` chain from above. The output table will only include `date_hour` and `amount.sum`, as you've summarised the data by calculating the sum of `amount` (which means that in this case values in the other columns don't make sense anymore). – Maurits Evers Jun 18 '18 at 10:30
  • Sorry for confusion, I added an update to my post so that I could include the output table as well. – Metod Jazbec Jun 18 '18 at 10:34
  • @MetodJazbec It's difficult to debug without your actual data. Check which source entries give rise to the `"1970-01-01 00"` row. – Maurits Evers Jun 18 '18 at 12:53

You can compute hourly sums without any packages by using tapply. I use the random data suggested by Maurits Evers:

set.seed(2018)
df <- data.frame(
    date = seq(from = as.POSIXct("2018-01-01 00:00"),
               by = "min", length.out = 200000),
    amount = runif(200000))
head(df)
##                  date     amount
## 1 2018-01-01 00:00:00 0.33615347
## 2 2018-01-01 00:01:00 0.46372327
## 3 2018-01-01 00:02:00 0.06058539
## 4 2018-01-01 00:03:00 0.19743361
## 5 2018-01-01 00:04:00 0.47431419
## 6 2018-01-01 00:05:00 0.30104860


tapply(df$amount,
       format(df$date, "%Y-%m-%d %H"),
       sum)
## 2018-01-01 00 2018-01-01 01 2018-01-01 02 
##      28.85825      26.39385      32.73600 
## 2018-01-01 03 2018-01-01 04 2018-01-01 05 
##      29.88545      29.74048      28.46781 
## ...
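
tapply returns a named numeric vector; if you prefer a two-column data frame like in the other answer, a small conversion sketch (the variable names here are my own):

res <- tapply(df$amount, format(df$date, "%Y-%m-%d %H"), sum)
hourly <- data.frame(date_hour = names(res),        # group labels become a column
                     amount.sum = as.numeric(res))  # drop the names from the sums
head(hourly)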
Enrico Schumann