How to create a for loop that allows me to check the same values between two datasets in an interval of datas

Question

I have two DFs. The first one contains ALL the match played with a unique ID associated. The second one contains only the match played by "BC" when is at home, the goals scored by the away team and the name of the away team.

df <- data.frame(ID=10:22,
                 casa=c("Juve","Juve","Hoff","Arsenal",
                     "Juve","BC","Ajax","BC",
                     "BC","Hoff","Chelsea","Juve","Juve"),
                 golfatti=c(1,1,3,4,5,2,3,4,2,1,0,1,2),
                 away=c("Inter","Inter","Inter","Milan","Milan",
                          "Inter","Genoa","Milan","Inter", "Spezia","Samp",
                          "Newcastle","Milan"))
df2 <- df[df$casa=="BC", c("ID","golfatti","away")]

What I would like to create is a for loop that takes the name of the away team of the second df and his ID match(i.e. "Inter" and 15 as ID), so for every "i" in df2$away, find the same row in the first DF and calculate the sum of the "golfatti" in the two rows before when appear "Inter".

So in this case for i=Inter and ID=15, my loop has to go in DF, find the row with df$away=df2$away=="Inter" and df$ID==df2$ID=="15" and search the two times before Inter appears and sum the goal scored. So it will go behind in the df and the first time Inter appears beofre ID=15, is in ID=12 and sum to an empty vector the three goal scored. Then, it search again Inter a find that played in ID=11 and sum the goal scored in the vector created before. Now the loop has to stop and has to go to the second "i" in df2$away, which is milan and do basically the same.

What can I do? If you need more infos about what i need, don't mind and ask. I really need help. Basically my problem is about the match of df2$away in the df and then tell the loop to select from ID to the two IDs before when the team appear.

AndrewGB · Answer 1 · 2022-02-27T08:31:14.863

0

Updated

To get the mean (or sum) of previous matches, then we need to take the lag of the output derived from rollapply. Additionally, if we want to return a named vector, then we can use set_names prior to returning the vector of the mean values. You can easily run other scenarios with this approach (e.g., the mean of the previous 50 matches). You would just change the 2 in rollapplyr to 50.

library(tidyverse)
library(zoo)
  
df %>%
  group_by(away) %>%
  mutate(mean = set_names(lag(rollapplyr(golfatti, 2, mean, partial = TRUE)), nm = away)) %>%
  pull(mean)

Output

Inter     Inter     Inter     Milan     Milan     Inter     Genoa 
   NA       1.0       1.0        NA       4.0       2.0        NA 
Milan     Inter    Spezia      Samp Newcastle     Milan 
  4.5       2.5        NA        NA        NA       4.5

Original

With tidyverse, we can use lag to get the previous values for each group. First, I group by the away column, then we can get the sum of golfatti for the two previous scores. Then, we can keep only the IDs that match in df2, then use pull to return of vector of the sums.

library(tidyverse)

df %>%
  group_by(away) %>%
  mutate(result = lag(golfatti) + lag(golfatti, 2)) %>%
  ungroup %>%
  filter(ID %in% df2$ID) %>% 
  pull(result)

# [1] 4 9 5

If you want to get the mean of the scores, then we can just divide by two.

df %>%
  group_by(away) %>%
  mutate(result = (lag(golfatti) + lag(golfatti, 2))/2) %>%
  ungroup %>%
  filter(ID %in% df2$ID) %>% 
  pull(result)

# [1] 2.0 4.5 2.5

If you want to return all IDs, return mean for 1 or more previous matches, then we can caluclate the lag, then use rowMeans.

df %>%
  group_by(away) %>%
  mutate(lag1 = lag(golfatti),
         lag2 = lag(golfatti, 2)) %>% 
  ungroup() %>% 
  mutate(results = rowMeans(select(., starts_with("lag")), na.rm = TRUE),
         results = ifelse(is.nan(results), NA, results)) %>% 
  pull(results)

# [1]  NA 1.0 1.0  NA 4.0 2.0  NA 4.5 2.5  NA  NA  NA 4.5

edited Feb 27 '22 at 08:31

answered Feb 25 '22 at 19:36

AndrewGB

16,126
5
18
49

That sounds great, but I have a couple of questions if you don't mind. What i wanted was to calculate sum of the two previous match, not taking in consideration the daily match but i solved it just removing "golfatti" in the third row.1) What if I'd like to calculate the mean and not the sum? 2) This is an example of a dataset, in the real one I have to do this for the 56 matches before that. Is there any way of doing that in a faster way than always write "lag(golfatti, i)" manually? 3) I don't want to add a new column with the sum/mean, but create a new vector. Is it possible? Thank you. – Ricter Feb 25 '22 at 20:32
@GabrielePillitteri I understand now; I just misread "empty vector the three goal scored", as I was thinking you meant the current row and the 2 before. You don't need to make a for loop to do it. I've added the results for both `sum` and `mean` and just return a vector. – AndrewGB Feb 25 '22 at 22:34
Basically i'm going to explain you another problem. If a team hasn't played two matches, it will return to me NA. So how can i fix this? I would like an algorithm that calcultes the two matches before, but if there are no two matches it calculate the mean by the amount of matches that it has found. I hope it's clear what i'm saying – Ricter Feb 25 '22 at 22:45
@GabrielePillitteri You do not need to write `lag` for every row. In the `mutate` statement, it is already going to be pulling the previous 2 matches from a given row for the away team. I have added how to get the mean for every row, which returns NA if no previous matches were played; if 1 match was previously played, then it returns that number; if 2 matches were previously played, then it returns the mean of those 2 matches. For writing a [great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), you should add the expected output – AndrewGB Feb 25 '22 at 23:18
Now it works perfectly. I tried adding an another lag(,3) and for Milan, since there are onyl two matches before, it calculates the mean of the only two matches. What I meant with writing lag for every row was, if I'd like to calculate the mean of the 50 matches before I have to write 50 times "lag(golfatti, 1) .... lag(golfatti,50)". Is there any better way? Another question, how can I save the result into a vector and assign that vector a name? P.s is really possibile to offer you a coffee? – Ricter Feb 26 '22 at 09:04
@GabrielePillitteri I can see now that you need something a little more flexible. In that case, `rollapply` is a much better way to go. I've updated my post with that approach at the top. So, you can now simply change 1 number (i.e., `2`) to the last `n` matches. We can also set the names before pulling the vector to get a named vector. These could also be set to something else (like ID + away team). P.S. Yes, you can; you can just use the buymeacoffee link in my profile. – AndrewGB Feb 27 '22 at 08:35

How to create a for loop that allows me to check the same values between two datasets in an interval of datas

1 Answers1