9

Alright, I'm waving my white flag.

I'm trying to compute a loess regression on my dataset.

I want loess to compute a different set of points that plots as a smooth line for each group.

The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.

Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.

I just can't figure out how to make this work on a per-group basis.

Here are some examples of my failed attempts.

test2 <- test %>% 
  group_by(CpG) %>% 
  dplyr::arrange(AVGMOrder) %>% 
  do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))

> test2
# A tibble: 136 x 2
# Groups:   CpG [4]
   CpG            x
   <chr>      <dbl>
 1 cg01003813 0.781
 2 cg01003813 0.793
 3 cg01003813 0.805
 4 cg01003813 0.816
 5 cg01003813 0.829
 6 cg01003813 0.841
 7 cg01003813 0.854
 8 cg01003813 0.866
 9 cg01003813 0.878
10 cg01003813 0.893

This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.

test2 <- test %>% 
  group_by(CpG) %>% 
  dplyr::arrange(AVGMOrder) %>% 
  dplyr::do({
    predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
  })

This one simply fails with the following error.

"Error: Results 1, 2, 3, 4 must be data frames, not numeric"

Also it still isn't applied as a new column with dplyr::mutate

fems <- fems %>% 
  group_by(CpG) %>% 
  dplyr::arrange(AVGMOrder) %>% 
  dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))

This was my fist attempt and mostly resembles what I want to do. Problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.

I am really stuck here. I read online that the purr package might help, but I'm having trouble figuring it out.

data looks like this:

> head(test)
    X geneID        CpG                                        CellLine       Meth AVGMOrder neworder Group SmoothMeth
1  40     XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210         1        1     5  0.7808767
2  94     XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120         1        1     5  0.7927130
3 148     XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448         1        1     5  0.8045080
4 202     XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890         1        1     5  0.8163997
5  27     XG cg25296477  iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880         2        2     3  0.8285259
6  81     XG cg01003813  iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240         2        2     3  0.8409501

unique(test$CpG) [1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"

So, to be clear, I want to do a loess regression on each unique CpG in my dataframe, apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).

My actual dataset has a few thousand of those CpG's, not just the four.

https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing

OTStats
  • 1,820
  • 1
  • 13
  • 22
Alex Nesta
  • 393
  • 2
  • 13
  • Have you looked at the [Many Models](http://r4ds.had.co.nz/many-models.html) chapter of R for Data Science? It walks through a very similar exercise – camille May 03 '18 at 20:18
  • I will take a look. Thanks. – Alex Nesta May 03 '18 at 20:21
  • So you want the loess predicted values as an additional column in your dataset? I think you can just change `do(broom::tidy...)` in your first example to `do(x = broom::tidy...)`, or use `broom::augment`. WIll test when i can make some data up or if you supply some – Calum You May 03 '18 at 20:22
  • trying now. also added google sheets link to test dataframe – Alex Nesta May 03 '18 at 20:27
  • 1
    This might by helpful for you: https://stackoverflow.com/a/49616753/8583393 – markus May 03 '18 at 20:53
  • hi Markus, Yes that post did help. I was able to modify my command to this: test2 <- test %>% nest(-CpG) %>% mutate(fit = map(data, ~ loess(Meth ~ AVGMOrder, .))) %>% unnest(map2(fit, data, augment)) It returned my original dataframe with the column .fitted. That generated a gorgeous line. Thank You. Add your answer and I will choose it as best. – Alex Nesta May 03 '18 at 21:02

3 Answers3

9

This is a neat Tidyverse way to make it work:

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

models <- fems %>%
        tidyr::nest(-CpG) %>%
        dplyr::mutate(
                # Perform loess calculation on each CpG group
                m = purrr::map(data, loess,
                               formula = Meth ~ AVGMOrder, span = .5),
                # Retrieve the fitted values from each model
                fitted = purrr::map(m, `[[`, "fitted")
        )

# Apply fitted y's as a new column
results <- models %>%
        dplyr::select(-m) %>%
        tidyr::unnest()

# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
        geom_point() +
        geom_line(aes(y = fitted))

This is what the output looks like

RDRR
  • 860
  • 13
  • 16
  • 1
    It would be great if you could explain what's going on with the nesting and mapping. I was quite surprised that the first map took `data` as the argument rather than a dot or the actual name of the dataframe. This worked for me though, so thank you! – MokeEire May 17 '19 at 19:53
  • 1
    @MokeEire by default, `nest` will assign _data_ as the name of the new column. `nest` basically makes a dataframe from each group to give a series of smaller dataframes. So by taking _data_ as its first argument, `map` will take each nested dataframe in turn and compute loess on each. – RDRR May 29 '19 at 15:27
  • How would someone apply this not only to different groups; but to different groups and different columns? I have many indicators that need detrending. So must fit many non-parametric splines. – Brad Mar 03 '20 at 08:44
  • How would you extract the x-value when, say, `Meth == 0.8` for all `CpG` groups? I know `cg13176022` will result in an `NA` but I have a similar situation where I need to find the x-value for a modeled y-value for each group. Curious where `predict` comes into play. Thanks. – TheSciGuy Aug 04 '20 at 18:20
3

You may have already figured this out -- but if not, here's some help.

Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.

So for your case:

fems <- fems %>% 
  group_by(CpG) %>% 
  arrange(CpG, AVGMOrder) %>% 
  mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
    data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))

Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.

jcmb
  • 216
  • 2
  • 3
0

Unfortunately, the approaches described above did not work in my case. Thus, I implemented the Loess prediction into a regular function, which worked very well. In the example below, the data is contained in the df data frame while we group by df$profile and want to fit the Loess prediction into the df$daily_sum values.

# Define important variables
span_60 <- 60/365  # 60 days of a year
span_365 <- 365/365  # a whole year


# Group and order the data set
df <- as.data.frame(
  df %>% 
    group_by(profile) %>%
    arrange(profile, day) %>%
    )
)


# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
  
  # Declare that the loess column exists, but is blank
  df$loess_60 <- NA
  df$loess_365 <- NA
  
  # Identify all unique profilee IDs
  all_ids <- unique(x$profile)
  
  # Iterate through the unique profilee IDs, determine the length of each vector (which should correspond to 365 days)
  # and isolate the according rows that belong to the profilee ID.
  for (i in all_ids) {
    len_entries <- length(which(x$profile == i))
    queried_rows <- result <- x[which(x$profile == i), ]
    
    # Run the loess fit and write the result to the according column
    fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_60))
    fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_365))
    x[which(x$profile == i), "loess_60"] <- fit_60
    x[which(x$profile == i), "loess_365"] <- fit_365
  }
  
  # Return the initial data frame
  return(x)
}


# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)