Populate a large data frame with calculated values

Question

I've got a rather large data frame (1.65 MM rows) that looks like this:

  date        dayofyear time humidity temp
1 2008-01-01         1  300       99  38.2
2 2008-01-01         1  302       99  38.5
3 2008-01-01         1  304       99  38.5
4 2008-01-01         1  306       99  38.6
5 2008-01-01         1  308       99  38.9
6 2008-01-01         1  310       99  39.1
....

I want to calculate solar noon for each row and add it as a new column to the data frame. The suncalc::getSunlightTimes(date) function gives me the solar noon for any given date.

If I do this:

solarNoons <- suncalc::getSunlightTimes(date = as.Date(dataFrame$date))

then (obviously) even after an hour, the command doesn't finish.

One observation -- even though there are 1.65 MM rows in the data frame, there are only around 3000 unique dates. What's the best way to approach this? Can one call getSunlightTimes() only 3000 times and still populate each of the 1.65 MM rows of the data frame with the values in a new column?

It's my first time posting, but long time reading and learning. Much appreciated.

use the `unique` function to isolate these unique dates to their own `data.frame`. calculate the solar noons of this minimal set. merge the result back to the original data. without a reproducible example, I can only offer this advice to help nudge you along (as well as provide some "buzzwords" to Google further) — MichaelChirico, Aug 18 '18 at 03:28
Don't you also need to pass `lat` and `lon` co-ordinates to get "solarNoon" time? Moreover, use `keep` parameter to calculate only "solarNoon" and avoid other unwanted calculations. — Ronak Shah, Aug 18 '18 at 03:32
Assign `df$date <- as.Date(df$date)` as early as possible, ideally right after you read it in. Strings are more unwieldy than Date objects. That way you avoid creating a temporary with 1.65m strings, 99+% of which are duplicates anyway. — smci, Aug 18 '18 at 04:11
You totally won't need this, but [R supports caching function calls](https://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r), so again calling it 1.65m on only 3000 unique values will only result in 3000 calls. But just use dplyr::group_by(date) — smci, Aug 18 '18 at 04:19
Related: [Speedup conversion of 2 million rows of date strings to POSIX.ct](https://stackoverflow.com/questions/13638545/speedup-conversion-of-2-million-rows-of-date-strings-to-posix-ct) using `fasttime` library — smci, Aug 18 '18 at 04:21
This is all great feedback. I've implemented it. Many thanks. — C. N., Aug 18 '18 at 22:33

smci · Answer 1 · 2018-10-05T21:22:21.417

0

Only call the expensive function once on each distinct value of df$date. Now you will only call it 3000 times instead of 1.65m times. Should be 550x faster. Also apply whatever function-specific speedup tips @RonakShah is suggesting.

# Assign df$date <- as.Date(df$date) as early as possible, ideally right after you read it in    
df$date <- as.Date(df$date)

library(dplyr)

df <- df %>% group_by(date) %>%
             mutate(solarNoon = suncalc::getSunlightTimes(df$date, ...))

Two ways to ensure you only call suncalc::getSunlightTimes() once on each date:

group_by(date) %>% ... pipe into your suncalc::getSunlightTimes(df$date, ...) call
use caching

I'd favor way 1. since it's easy to code and teaches you good decomposition in general. Always try to structure your code to avoid calling expensive functions a million times unnecessarily.

edited Oct 05 '18 at 21:22

answered Aug 18 '18 at 04:14

smci

32,567
20
113
146

In the severely unlikely event this is still too slow/hangs, iterate over your df in chunks of say 10000 lines or whatever. – smci Aug 18 '18 at 04:17
For some reason, this didn't work on my machine. The command wouldn't finish. I too believe this should have worked. – C. N. Aug 18 '18 at 22:30
@C.Mitty: did you follow my suggestion "In the severely unlikely event this is still too slow/hangs, iterate over your df in chunks of say 10000 lines or whatever." Try `df[1:10000,] %>% group_by(date) ...` – smci Aug 19 '18 at 00:35

score 0 · Accepted Answer · answered Aug 18 '18 at 04:44

The following should work. Suppose we generate a data frame of 2 million rows:

> N <- 2e6
> R <- data.frame(year = sample(2000:2009,N,TRUE),
+                 dayofyear = sample(365,N,TRUE),
+                 time = floor(runif(N,0,12))*100+floor(runif(N,0,60)),
+                 humidity = 99,
+                 temp = floor(runif(N,15,40)))
> R$date <- as.Date(with(R,strptime(paste(year,dayofyear),
+                                   "%Y %j", tz="GMT")))
> nrow(R)
[1] 2000000
> head(R)
  year dayofyear time humidity temp       date
1 2000       206  307       99   39 2000-07-24
2 2009       101 1019       99   16 2009-04-11
3 2004       307  547       99   21 2004-11-02
4 2003       270 1158       99   33 2003-09-27
5 2006        21  330       99   22 2006-01-21
6 2005       154  516       99   21 2005-06-03
>

In this case, date is already a Date column, but if yours is a character column, then:

> R$date <- as.Date(R$date)

should only take a few seconds.

Now, get a list of all the unique date values. This should be quite fast:

> dates <- unique(R$date)
> print(length(dates))
[1] 3650
>

Now, run getSunlightTimes on this vector. This only took a couple of seconds on my machine using suncalc version 0.4 and R version 3.4.4:

> times <- suncalc::getSunlightTimes(dates, lat=0, lon=0)

Now, generate an index vector giving the index of each date in R$date within the vector of unique dates dates:

> i <- match(R$date, dates)

Now, select rows of the times dataframe by this same index:

> solarNoons <- times[i,]
> nrow(solarNoons)
[1] 2000000
>

If we pick a row of R:

> R[1234567,]
        year dayofyear time humidity temp       date
1234567 2002        24  535       99   17 2002-01-24

you'll see that the corresponding row of solarNoons is the result for that date:

> solarNoons[1234567,]
                        date lat lon           solarNoon               nadir
2616.352 2002-01-24 12:00:00   0   0 2002-01-24 12:13:14 2002-01-24 00:13:14
                     sunrise              sunset          sunriseEnd
2616.352 2002-01-24 06:09:42 2002-01-24 18:16:46 2002-01-24 06:11:58
                 sunsetStart                dawn                dusk
2616.352 2002-01-24 18:14:30 2002-01-24 05:47:49 2002-01-24 18:38:39
                nauticalDawn        nauticalDusk            nightEnd
2616.352 2002-01-24 05:22:22 2002-01-24 19:04:06 2002-01-24 04:56:50
                       night       goldenHourEnd          goldenHour
2616.352 2002-01-24 19:29:38 2002-01-24 06:38:39 2002-01-24 17:47:49
>

If you want, you can merge the two data frames together:

> R2 <- cbind(R, solarNoons)

This all assumes that "1.65 MM" meant 1.65 million. If you meant 1.65 million million (i.e., an American trillion), then you're going to need a bigger computer.

Thanks for taking the time! `unique`, `match`, `[i, ]`, and `cbind` got me home! — C. N., Aug 18 '18 at 22:33

Populate a large data frame with calculated values

2 Answers2