
I have code here: https://github.com/thistleknot/FredAPIR/blob/master/SemanticFilter.R

I iterate through parsedList:

a = 1
for (i in parsedList)
{
  # pull the current series and the next one from FRED
  test1 <- fred$series.observations(series_id = parsedList[a], observation_start = "2000-01-01", observation_end = "2018-03-01")
  test2 <- fred$series.observations(series_id = parsedList[a+1], observation_start = "2000-01-01", observation_end = "2018-03-01")

  test1 %>>%
    select(
      date,
      value
    ) %>>%
    mutate(
      date = as.Date(date),
      value = as.numeric(value)
    ) ->
    dt1

  if (a < length(parsedList))
  {
    test2 %>>%
      select(
        date,
        value
      ) %>>%
      mutate(
        date = as.Date(date),
        value = as.numeric(value)
      ) ->
      dt2

    # join the two series on date
    dt2[dt1, on = c('date')]
  }

  a = a + 1
}

What I would like to do is merge (join?) all these parsedList series by date, so that all the datasets (which currently consist of "date" and "value") are merged on date.

I would like to use the merge function (from data.table?), but I'd like to iterate through all of parsedList and end up with one dataset containing just date and a slew of values (one from each parsedList dataset).

Note: a is a counter variable. That is the tricky part here. How do I join all iterations of test, which are the individual datasets pulled for each parsedList[a], into a single dataset? E.g. parsedList[1] & parsedList[2] & parsedList[3] and so on, until the last element, parsedList[length(parsedList)], is processed. Each parsedList[] element has its own date/value pair, so I need each value saved, with date as the joining variable.
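For concreteness, here is a toy sketch (made-up dates and values) of the shape I'm after:

d1 <- data.frame(date = as.Date(c("2000-01-01", "2000-02-01")), value = c(1.1, 1.2))
d2 <- data.frame(date = as.Date(c("2000-01-01", "2000-02-01")), value = c(2.1, 2.2))
d3 <- data.frame(date = as.Date(c("2000-01-01", "2000-02-01")), value = c(3.1, 3.2))

# desired result: one row per date, one (renamed) value column per series
#         date value1 value2 value3
# 1 2000-01-01    1.1    2.1    3.1
# 2 2000-02-01    1.2    2.2    3.2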

Note, this is as much a logic question as a function question.

  • Essentially `Reduce(merge, parsedList)`. You may need to pass additional arguments to specify what columns to join on, and what to keep if you don't want the default inner join. It will go smoothly without much effort if the columns to join on all have the same name (and are keyed, if you have data tables), and the other columns all have different names (and are not keyed). If you need more help, please share a reproducible example. – Gregor Thomas May 01 '18 at 14:59
  • thank you, I can now see the use in marking duplicates rather than deleting them. – thistleknot May 01 '18 at 15:32
  • I guess this is more of a logic question. I fail to see how Reduce iterates through an unknown list size? Note the parsedList[x]; a is a counter variable (hence why I linked the code). The examples provided only seem to join each list as provided to the reduce function. However, I have a single parsedList[a] that has many iterations. So there is a parsedList[1] and a parsedList[2] and a parsedList[3] and so on. What I need is however many parsedList[a]'s I have, without having to manually type each iteration. (see linked github code for further code details) – thistleknot May 02 '18 at 06:46
  • That's what `Reduce` does. That's the whole point of `Reduce`. – Gregor Thomas May 02 '18 at 13:49
  • I'm not sure what you mean by "I fail to see how reduce iterates through an unknown list size". Did you try it and it didn't work? If so, provide a reproducible example showing the problem and we can work on it. Did you look at the `?Reduce` documentation, run the `Reduce` examples at the bottom and not understand what's happening? If so, ask a new question specifically about your confusion. Did you look at the source code for `Reduce` and not understand what's happening? If so, ask a new question focused on that. – Gregor Thomas May 02 '18 at 13:53
  • `Reduce` will work on a `list` object (or a vector). In this case, you want to make a `list` of the data frames you want to join, and then run `Reduce(merge, my_list_of_dataframes)`. This will be equivalent to `result = merge(my_list_of_dataframes[[1]], my_list_of_dataframes[[2]]); result = merge(result, my_list_of_dataframes[[3]]); result = merge(result, my_list_of_dataframes[[4]])...` (see the sketch after these comments). – Gregor Thomas May 02 '18 at 14:00
  • Maybe you need something like `my_list_of_dataframes = lapply(parsedList, function(x) fred$series.observations(series_id = x, observation_start = "2000-01-01", observation_end = "2018-03-01"))`? I'm confused by the difference between `i` in your for loop and `a` in `parsedList[a]`, what `fred` is, that you seem to be using `fred$series.observations` as a function, with no information about what packages any of these objects are in.... If you need additional help, make a minimal reproducible example following the guidelines [in this answer](https://stackoverflow.com/q/5963269/903061) – Gregor Thomas May 02 '18 at 14:02
  • I did do all of that; tbh I couldn't make heads or tails of how to apply it to my situation. I think the disconnect is over parsedList, which isn't where my data resides: it's merely a list of Federal Reserve dataset names. Each of these names is loaded into a FRED object called test, which is a two-dimensional list of date and value. That may be where the disconnect is. – thistleknot May 02 '18 at 16:15
  • updated code to show in more detail how I imagined the merge would go, but in this case it merely rewrites dt2 with the last 2 datasets loaded – thistleknot May 02 '18 at 16:21
  • It seems like you are trying to pull data and merge it with existing data all at once. I'd strongly recommend doing the operations in simple steps, one after the other: first, pull all the data into a list, then merge all the data in the list. – Gregor Thomas May 02 '18 at 16:41
  • I'd also recommend using `for(a in seq_along(parsed_list))` instead of manually incrementing `a` and completely ignoring `i`. – Gregor Thomas May 02 '18 at 16:46
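To make the `Reduce` mechanics concrete, here is a minimal, self-contained sketch (toy data frames with made-up values standing in for the FRED series). It shows that `Reduce(merge, ...)` folds merge over a list of any length, so the list size never needs to be known in advance:

# toy data frames standing in for the FRED series (made-up values)
dfs <- list(
  data.frame(date = as.Date(c("2000-01-01", "2000-02-01")), a = 1:2),
  data.frame(date = as.Date(c("2000-01-01", "2000-02-01")), b = 3:4),
  data.frame(date = as.Date(c("2000-01-01", "2000-02-01")), c = 5:6)
)

# Reduce folds merge over the whole list, however long it is;
# equivalent to merge(merge(dfs[[1]], dfs[[2]]), dfs[[3]])
combined <- Reduce(merge, dfs)
#         date a b c
# 1 2000-01-01 1 3 5
# 2 2000-02-01 2 4 6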

1 Answer


Here's my recommendation. I still don't know what fred is or where it comes from, nor what parsedList contains, so I can't do any testing. But hopefully this gets the idea across.

# first download all the data
data_list = lapply(parsedList, function(a)
    fred$series.observations(
        series_id = a,
        observation_start = "2000-01-01",
        observation_end = "2018-03-01"
    )
)

# define function to process the data
# we plan on re-naming the "value" column so each one is distinct
process_data = function(d, value_name) {
    d = d[, c("date", "value")]
    d$date = as.Date(d$date)
    d$value = as.numeric(d$value)
    names(d)[2] = value_name
    return(d)
}

# process the data
data_list_processed = list()
for (i in seq_along(data_list)) {
    data_list_processed[[i]] = process_data(data_list[[i]], value_name = paste0("value", i))
}

# merge the data
combined_data = Reduce(merge, data_list_processed)

Breaking up your code into steps like this makes it more modular and gives you the chance to debug each step separately. It also lets you easily swap in other bits of code, like the Reduce line that joins all the datasets by date.
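For example, here is a quick sanity check of each intermediate object (just a sketch, assuming the pipeline above has run):

# inspect the raw download for the first series
str(data_list[[1]])

# confirm the processing step renamed the value column and fixed the types
head(data_list_processed[[1]])

# check that every processed series actually has rows
sapply(data_list_processed, nrow)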

The only reason I can think of to do all the steps at once is if the data is so big that you can't fit it all in memory.

  • everything seems to work right up to the Reduce, which produces just the header (date, value1, value2, and so on) but no actual values: print(combined_data) gives Empty data.table (0 rows) of 78 cols: date,value1,value2,value3,value4,value5... NULL – thistleknot May 03 '18 at 02:37
  • Are there common dates between all your data? Maybe you want `Reduce(function(x, y) merge(x, y, all = TRUE), data_list_processed)` (see the sketch below). – Gregor Thomas May 03 '18 at 03:01
  • I think that did it! Thank you so much! https://github.com/thistleknot/FredAPIR/commit/b40f7608eb63f8b8ada8943b12a0dc3eb2bbfcc3#diff-82400596dd0f6ffd802eec7518e0db21 I gave you credit (this is for a school project, I intend on interpolating, and finally doing PCA over the dataset) – thistleknot May 03 '18 at 03:07
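To see why the default inner join can empty the result, here is a minimal sketch with made-up dates that only partially overlap (no single date appears in every series):

library(data.table)

# made-up series whose dates only partially overlap
x <- data.table(date = as.Date(c("2000-01-01", "2000-02-01")), value1 = c(1, 2))
y <- data.table(date = as.Date(c("2000-02-01", "2000-03-01")), value2 = c(3, 4))
z <- data.table(date = as.Date("2000-04-01"), value3 = 5)

# default inner join: only dates present in every series survive; here, none
Reduce(merge, list(x, y, z))
# Empty data.table (0 rows) of 4 cols: date,value1,value2,value3

# full outer join: keep every date, with NA where a series has no observation
Reduce(function(a, b) merge(a, b, all = TRUE), list(x, y, z))
#          date value1 value2 value3
# 1: 2000-01-01      1     NA     NA
# 2: 2000-02-01      2      3     NA
# 3: 2000-03-01     NA      4     NA
# 4: 2000-04-01     NA     NA      5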