0

I searched, but I couldn't find a similar question, so I apologize if I may have missed it. My problem is actually pretty simple. I have two lists, a large one and a smaller one.

The smaller one consists of the averages of the data in the large list (ten lines have been aggregated to form the small list -> it has one tenth the size of the larger one). All I want now, is to add a new column in the large list (which is no problem) and showing the averages next to the original data. I am aware that I will see the average ten times, but that's fine.

I tried to solve this "problem" with simple list comparisons, e.g. (the relevant averages, as well as the original data have identical identifiers in the first column):

Large_List$Average_column[ Large_List$identifier == Small_List$identifier ] <- Small_List$Average[ Large_List$identifier == Small_List$identifier ];

Yet for some reason, it doesn't work. Probably because the target vector is larger than the source vector. I really tried a lot, and the only thing that seems to work is a loop structure. But that is no option because my list is way too large... I am sure there must be a smart solution to this simple issue.

UPDATE & SPECIFICATION Thank you for your suggestions. But it seems I need to be more specific. The problem is that in most, but not in all cases, the average is formed out of ten consecutive datapoints. It may occur that less is used because of holes in the sample. Therefore, a replication will unfortunately not do the job.

Here’s an example (1_Ident is the minute identifier, 10_Ident being the ten minute identifier) :

Original_List:

1_Ident | 10_Ident|Minute_value|
 July1-0| July1-0d|       1
 July1-2| July1-0d|       1
   (..)
July1-10| July1-0d|       1
July1-11| July1-1d|       1
July1-12| July1-1d|       2
July1-21| July1-21|       3
July1-31| July1-31|       2

Resulting Small_list:

10_Ident|Minute_average|
July1-0d|       1
July1-1d|      1.5
July1-2d|       3
July1-3d|       2

Desired outcome: Large_List:

1_Ident |10_Ident|Minute_value|Minute_average|
 July1-0| July1-0d|       1          1
 July1-2| July1-0d|       1          1
   (..)
July1-10| July1-0d|       1          1
July1-11| July1-1d|       1         1.5
July1-12| July1-1d|       2         1.5
July1-21| July1-21|       3          3
July1-31| July1-31|       2          2

I think the main problem is that the Small_list$Minute_average vector is not the same size as the Large_list$Minute_value vector. As said, one could compare the two lists line by line, doing a loop, but the size of the tables is >1M lines, so that won't work.

What I want to do is basically the following: 1) Look in the Large_List$10_Ident and compare it Small_List$10_Ident
2) Where the values match, transfer the corresponding Small_List$Minute_average value to Large_List$Minute_average Thanks!

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    Welcome on SO! Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Look for `?match`, `%in%` and/or `?rep`. – sgibb Aug 20 '14 at 17:09
  • So long as it's always average of 10 consecutive elements, how about creating the vector `avg <- rep(Small_List$Average,each=10)` and making that be your `Large_list$Average_column` ? – Carl Witthoft Aug 20 '14 at 18:53

1 Answers1

1

You could use match or merge to do that but why not just calculate the averages off the groupings?

Large_List$Average_column <- ave(Large_List$col_to_be_avgd, 
                                 Large_List$group_var, 
                                 FUN=mean, na.rm=TRUE)

The merge code might look like

merge( Large_List, Small_List[c('identifier', "Average"], by='identifier' , all.x=TRUE)
IRTFM
  • 258,963
  • 21
  • 364
  • 487