I have a single data frame with information on many surgeons and their patients, for use in producing a Kaplan-Meier survival curve and fitting a Cox proportional hazards model. The data includes a surgeon ID (sequential starting at 1), patient age, patient sex, status (0 = censored, 1 = event), and days between the index event (surgery) and the end event (reoperation) or censoring (patient died, moved away, etc.).

I would like to produce one new data frame for each surgeon to support my analysis, create a new variable ("SurgeonGroup") based on the surgeon's ID - the SurgeonGroup is either "You" for records with that surgeon's ID, or "Other Surgeons" for all other values - and save the new data frame sequentially (DataProvider1, DataProvider2, etc.) so each surgeon can be compared to their peers in the survival curve and hazard ratio analysis. For example, the SurgeonGroup variable will be used to compare the surgeon with their peers using the coxph function as follows:

    coxph(Surv(Days, Status) ~ PatientAge + PatientSex + SurgeonGroup, data = DataProvider1) %>%
        tbl_regression(exp = TRUE)

The following code produces a smaller sample data frame with only 5 surgeons, creates a simple function, and creates 5 different data frames for 5 different providers by calling that function 5 times. However, since my original data frame has many more surgeons, writing out the data frame assignment/function call statement for each one is clunky and has a risk of copy/paste errors.

Is there a simple way to repeat this `DataProviderX <- MyFunction(X)` pattern for any similar dataset, producing as many new data frames as there are unique surgeons? I have searched for loop and apply-function approaches that could be used in this case, but can't seem to make any of them work (iteration is not my strength in R). Any advice would be much appreciated!

Here is my reproducible example:

    # Load the dplyr package
    library(dplyr)

    # Create sample data frame
    Surgeon <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5)
    PatientAge <- c(69,84,94,67,92,76,74,92,76,89,96,99,94,95,84,85,99,93,89,84,74,86,77,88,81,82,89,88,88,81,83,95,81,72,80,92,83,83,96,82,98,79,84,88,91,82,89,88,78,88)
    PatientSex <- c("M","F","F","F","F","F","M","M","F","F","M","M","F","F","F","F","F","M","F","F","F","M","F","M","M","F","F","F","M","M","F","M","F","M","F","M","F","M","M","M","F","M","F","F","M","F","M","F","M","F")
    Status <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
    Days <- c(254,450,488,798,395,667,1836,220,3401,292,52,663,656,52,3797,1097,51,234,367,1641,1402,8,546,913,1849,2171,1474,312,2139,118,572,8,1175,2634,24,36,93,2627,312,1582,220,276,1329,135,116,933,2038,76,1018,1224)

    Data <- data.frame(Surgeon, PatientAge, PatientSex, Status, Days)
     
     
    # Create function: label one surgeon as "You" and all others as "Other Surgeons"
    MyFunction <- function(FunctionID) {
        FunctionData <- Data %>%
            mutate(SurgeonGroup = case_when(Surgeon == FunctionID ~ "You",
                                            TRUE ~ "Other Surgeons"))
        return(FunctionData)
    }

    # One call per surgeon (the repetition I would like to avoid)
    DataProvider1 <- MyFunction(1)
    DataProvider2 <- MyFunction(2)
    DataProvider3 <- MyFunction(3)
    DataProvider4 <- MyFunction(4)
    DataProvider5 <- MyFunction(5)

Norwooder
  • Do not do this! I repeat, do not do this. Other options: work on the data frame as a grouped data frame, or `split` the data frame into a list of data frames. Do not pollute the global environment by creating many unnecessary variables. – Onyambu Mar 24 '23 at 23:14
  • Also, just iterate using your function on the main dataset, i.e. `coxph(...., data=MyFunction(Data,1))` – Onyambu Mar 24 '23 at 23:18
  • Also see [this answer](https://stackoverflow.com/a/24376207/17303805), including the section "Why put the data in a list?". – zephryl Mar 25 '23 at 00:01
  • Thanks, onyambu, for your responses (and zephryl for the helpful link). A couple of questions for onyambu: In response to your first comment, I can split the main data frame into a list of data frames, but that separates out results for each surgeon - while I want to compare each surgeon to the group. But maybe I am missing the intent of your response? Forgive me - I am still new to all of this! Should the second code piece read like this: `coxph(...., data=MyFunction(1))`? When I include `MyFunction(Data, 1)` I receive an error: "unused argument (1)". – Norwooder Mar 25 '23 at 14:00
  • Yes, use `MyFunction(i)`, since `Data` is hardcoded inside the function. – Onyambu Mar 25 '23 at 16:44
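
The list-based approach suggested in the comments above might look roughly like this. It is only a sketch: the small `Data` frame is a made-up stand-in for the question's data, and `add_surgeon_group()` and `provider_data` are hypothetical names that do not appear in the original post.

```r
# Small stand-in for the question's `Data` (hypothetical values).
Data <- data.frame(
  Surgeon = rep(1:3, each = 2),
  Days    = c(254, 450, 52, 663, 1402, 8),
  Status  = c(1, 0, 0, 0, 0, 1)
)

# Same idea as MyFunction(), but the data is passed in as an argument
# instead of being read from a global variable.
add_surgeon_group <- function(data, id) {
  data$SurgeonGroup <- ifelse(data$Surgeon == id, "You", "Other Surgeons")
  data
}

ids <- sort(unique(Data$Surgeon))

# One data frame per surgeon, kept together in a single named list
# rather than as separate global variables.
provider_data <- lapply(ids, function(id) add_surgeon_group(Data, id))
names(provider_data) <- paste0("DataProvider", ids)

# Elements are accessed by name (or by position) when needed.
names(provider_data)
table(provider_data$DataProvider2$SurgeonGroup)
```

Each element still contains every row of `Data`, so each surgeon is compared against all peers; only the `SurgeonGroup` labels differ between elements.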

1 Answer

I am not familiar with coxph(), so here is a more general approach that also includes the modeling step:

    unique_ids <- unique(Data$Surgeon)

    results <- lapply(unique_ids, function(id) {
      # Label the rows for this surgeon. `Data` is modified only inside this
      # function call; the global `Data` is left unchanged.
      Data$SurgeonGroup <- ifelse(Data$Surgeon == id, "You", "Other Surgeons")

      # Run your model and save the output. `model()` and the formula are
      # placeholders: substitute your own call, e.g. the coxph() call from
      # the question.
      result <- model(outcome ~ predictor, data = Data)

      # Reshape the result into a data frame. Many ways to do that, for example
      # function glance() from package "broom" (https://broom.tidymodels.org/).
      broom::glance(result)
    })

    # Bind all results into a single data frame.
    dplyr::bind_rows(results)

If you would like to get up to speed on functional programming (the style that makes heavy use of functions such as lapply()), check out this chapter in Hadley Wickham's Advanced R: https://adv-r.hadley.nz/fp.html

jakub
  • Thanks, jakub. That does produce the five data frames per my original question - and adding the coxph() call inside the lapply function saves me an extra step. Now I just need to figure out how to save the regression output (surgeon hazard ratio, confidence intervals, and p-value) to a single data frame. The comment from zephryl above may help with that. All of this said, I am afraid that I may have framed this question wrong, and should have asked the more basic question of "What is the best way to carry out a survival analysis for each surgeon in this dataset?" – Norwooder Mar 25 '23 at 15:27
  • Hm, maybe. I think you are really asking three separate questions: 1) How to produce a list of data frames from a single data frame, 2) how to run some code on each element of that list, and 3) how to bind the results together into a data frame. – jakub Mar 25 '23 at 15:54
  • Thank you! Your revised solution gets at the core of what my original question should have been. Reshaping the results is a bit tricky, but with the help of another post I was able to pull the hazard ratio and related confidence intervals and p-value. I have a ton of learning to do, but will get a lot of mileage from this solution in both this and another analysis I am working on. Greatly appreciate it! – Norwooder Mar 25 '23 at 22:00
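
For the reshaping step discussed in these comments, one possible way to collect the surgeon hazard ratio, confidence interval, and p-value into a single data frame is sketched below. It assumes the survival package is installed. The simulated `Data`, the helper `fit_one()`, and the output column names are illustrative and not from the original post, and this is not necessarily the approach the asker ended up using.

```r
library(survival)

# Simulated stand-in for the question's `Data` (hypothetical values).
set.seed(1)
n <- 200
Data <- data.frame(
  Surgeon    = rep(1:4, each = n / 4),
  PatientAge = round(runif(n, 65, 95)),
  PatientSex = sample(c("F", "M"), n, replace = TRUE),
  Status     = rbinom(n, 1, 0.4),
  Days       = round(rexp(n, 1 / 800)) + 1
)

# Fit the Cox model for one surgeon versus all peers and return a
# one-row data frame with the SurgeonGroup estimate.
fit_one <- function(id) {
  d <- Data
  # "Other Surgeons" is the reference level, so the hazard ratio is for "You".
  d$SurgeonGroup <- factor(
    ifelse(d$Surgeon == id, "You", "Other Surgeons"),
    levels = c("Other Surgeons", "You")
  )
  fit <- coxph(Surv(Days, Status) ~ PatientAge + PatientSex + SurgeonGroup,
               data = d)

  term <- "SurgeonGroupYou"
  ci <- exp(confint(fit))[term, ]  # Wald 95% interval on the hazard-ratio scale
  data.frame(
    Surgeon     = id,
    HazardRatio = unname(exp(coef(fit)[term])),
    Lower95     = unname(ci[1]),
    Upper95     = unname(ci[2]),
    PValue      = summary(fit)$coefficients[term, "Pr(>|z|)"]
  )
}

# One row per surgeon, bound into a single data frame.
results <- do.call(rbind, lapply(sort(unique(Data$Surgeon)), fit_one))
results
```

With the sample data from the question there are only two events in total, so coxph() is likely to emit convergence warnings and the estimates would not be meaningful; the pattern itself carries over unchanged to a larger dataset.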