Iterating an operation on columns in a dataframe using a loop

Question

I've got a dataframe with column names that include week and year designators of the format "W1_2019" plus other text. The complete dataframe contains 52 weeks worth of 5 columns each. My goal is to take the following code, which does exactly what I want it to do for weeks 1 & 2, and put it into a loop for x=1 to 52 so I don't have to use 52 times the same half a dozen lines.

eidsr <- dget(file="test1.txt")

mode_xmt <- data.frame(District=eidsr$district) #Initializes dataframe mode_xmt with only 1 column containing District names

wtmp <- select(eidsr, contains("W1_2019"))
wtmp$mode <- "NoRep"
wtmp$mode[wtmp$W1_2019_EIDSR_Total_Malaria_cases>0] <- "Report"
wtmp$mode[wtmp$`W1_2019_EIDSR-Mobile_SMS`==1] <- "Mobile_SMS"
wtmp$mode[wtmp$`W1_2019_EIDSR-Mobile_Internet`==1] <- "Mobile_Internet"

#At this point the dataframe wtmp looks like the example below.

mode_xmt$`2019_W1` <- wtmp$mode #Appends ONLY the W1_2019 column to mode_xmt
rm(wtmp)

wtmp <- select(eidsr, contains("W2_2019"))
wtmp$mode <- "NoRep"
wtmp$mode[wtmp$W2_2019_EIDSR_Total_Malaria_cases>0] <- "Report"
wtmp$mode[wtmp$`W2_2019_EIDSR-Mobile_SMS`==1] <- "Mobile_SMS"
wtmp$mode[wtmp$`W2_2019_EIDSR-Mobile_Internet`==1] <- "Mobile_Internet"

mode_xmt$`2019_W2` <- wtmp$mode
rm(wtmp)

At the end of each operation, my working data are as follows. Dataframe wtmp looks like this:

   `W1_2019_EIDSR-Timely_~ W1_2019_EIDSR_Total_Mala~ W1_2019_EIDSR_Date_R~ `W1_2019_EIDSR-Mobile_~ `W1_2019_EIDSR-Mobi~ mode 
                     <dbl>                     <dbl> <chr>                                   <dbl>                <dbl> <chr>
 1                      NA                         0 NA                                         NA                   NA NoRep
 2                      NA                        NA NA                                         NA                   NA NoRep
 3                      NA                        51 NA                                         NA                   NA Repo~
 4                      NA                        NA NA                                         NA                   NA NoRep
 5                      NA                        64 NA                                         NA                   NA Repo~
 6                      NA                        86 NA                                         NA                   NA Repo~
 7                      NA                        92 NA                                         NA                   NA Repo~
 8                      NA                        47 NA                                         NA                   NA Repo~
 9                      NA                        46 NA                                         NA                   NA Repo~
10                      NA                        35 NA                                         NA                   NA Repo~

mode_xmt, with the new column appended, looks like this:

   District 2019_W01
1        Bo    NoRep
2        Bo    NoRep
3        Bo   Report
4        Bo    NoRep
5        Bo   Report
6        Bo   Report
7        Bo   Report
8        Bo   Report
9        Bo   Report
10       Bo   Report

And once I've done the second iteration for W2, mode_xmt looks like this:

   District 2019_W01 2019_W02
1        Bo    NoRep   Report
2        Bo    NoRep    NoRep
3        Bo   Report   Report
4        Bo    NoRep    NoRep
5        Bo   Report   Report
6        Bo   Report   Report
7        Bo   Report   Report
8        Bo   Report   Report
9        Bo   Report   Report
10       Bo   Report   Report

Lather, rinse, repeat. Times 52. And as @DS_UNI has observed, while separate columns for week and year would be nice, they would defeat the ultimate purpose which is a time-series that stretches over more than one year ... but to keep myself from going completely nuts I'd just be happy if I could iterate the 52 weeks of a single year.

As I said, the above code works. I'm just looking for a way to loop it rather than repeating it ad nauseum.

Here's the text of a dput on the truncated data (save as test1.txt in your working directory):

structure(list(`W1_2019_EIDSR-Timely_Report` = c(NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_), W1_2019_EIDSR_Total_Malaria_cases = c(0,  NA, 51, NA, 64, 86, 92, 47, 46, 35, 33, NA, NA, 77, 35, 7, 24,  27, 14, 72), W1_2019_EIDSR_Date_Received = c(NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_), `W1_2019_EIDSR-Mobile_Internet` = c(NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `W1_2019_EIDSR-Mobile_SMS` = c(NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `W2_2019_EIDSR-Timely_Report`
= c(NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), W2_2019_EIDSR_Total_Malaria_cases = c(55,  NA, 44, NA, 38, 26, 29, 40, 59, 18, 48, NA, NA, 37, 34, 51, 34,  38, 13, 56), W2_2019_EIDSR_Date_Received = c(NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,  NA_character_, NA_character_, NA_character_), `W2_2019_EIDSR-Mobile_Internet` = c(NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `W2_2019_EIDSR-Mobile_SMS` = c(NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), district = c("Bo",  "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo",  "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo", "Bo")), .Names = c("W1_2019_EIDSR-Timely_Report",  "W1_2019_EIDSR_Total_Malaria_cases", "W1_2019_EIDSR_Date_Received",  "W1_2019_EIDSR-Mobile_Internet", "W1_2019_EIDSR-Mobile_SMS",  "W2_2019_EIDSR-Timely_Report", "W2_2019_EIDSR_Total_Malaria_cases",  "W2_2019_EIDSR_Date_Received", "W2_2019_EIDSR-Mobile_Internet",  "W2_2019_EIDSR-Mobile_SMS", "district"), row.names = c(NA, -20L ), class = c("tbl_df", "tbl", "data.frame"))

I would recommend that you take a look on what [tidy data](https://en.wikipedia.org/wiki/Tidy_data) is and how you can reshape your data to optimise your analysis, [this question](https://stackoverflow.com/questions/55628549/double-loop-to-fill-dataframe-how-to-fix-invalid-function-in-complex-assignme/55629572#55629572) handles similar data, tbh I wouldn't recommend using a loop to solve this, — DS_UNI, Apr 16 '19 at 14:52
that being said, it's pretty challenging to help without a reproducible example, this [post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) might give you some ideas on how to provide sample data, reproducible example, and a expected output — DS_UNI, Apr 16 '19 at 14:52
I was afraid you were going to say that ... this data is so messy it's going to take me quite a while to create dummy data to reproduce. I'll try ... :( — DanM, Apr 16 '19 at 14:57
sorry about that! But you can also take a look at the question I added in the first comment, it might help to take a look at the answers there — DS_UNI, Apr 16 '19 at 15:00
I know this is massive newbie stuff, but I'm frequently crashing and burning on questions like this because I can't figure out how to create sample data that looks anything like the *imported* data I'm working with. Do you have any "create dummy data for dummies" links you can point me to? — DanM, Apr 16 '19 at 18:55
the link in the second comment didn't help? are you allowed to provide a sample of your own data? if you can just post the output of `dput(head(eidsr, 20))` , otherwise it seems that the important part for this step of processing the data are the headers, in this case you could add the output of `names(eidsr)` and the range of values these columns usually take — DS_UNI, Apr 16 '19 at 19:08
Bless you, that "dput" command may be exactly what I was looking for. Let me get a smaller version of the data to dput on, and I'll be right back. — DanM, Apr 16 '19 at 19:21
ok perfect! I'm not sure I understand your expected output, can explain a bit or add. When I'm running your code I'm getting a lot of repeated rows, is that intentional? — DS_UNI, Apr 17 '19 at 07:30
I added a starting code as an answer to give an idea of where this is headed, we can work on it until you get what you want, but like I said the expected output is still not clear enough to me — DS_UNI, Apr 17 '19 at 07:40
Those repeated rows are intentional; the "Bo" name is actually a regional name, and every data point with the different reports represents a different (but de-identified) location. — DanM, Apr 17 '19 at 11:43
can a district have more than one report per week? or do you mean it should be just one row per district and week? can you paste an expected output? — DS_UNI, Apr 17 '19 at 11:59
Oh yeah, the full data include over 1300 facilities for 14 districts. That's why the separate lines. These data aren't pretty, but they work. I think the new expected outputs I've put in the original should clarify what I'm doing. — DanM, Apr 17 '19 at 12:12
if there's a row pro facility and week I would keep the facility column and try again with the tidy data approach, as it's not about being pretty, but about being efficient and avoiding mistakes specially if you're working with a big amount of data. I did edit my answer to add the loop approach you want, but I still have a question, what would happen if you have one entry where two or more of these three conditions hold? (i.e., sms==1, internet==1, and malaria >0) — DS_UNI, Apr 17 '19 at 14:13

DS_UNI · Accepted Answer · 2019-04-17T14:02:46.000

Your data should look something like this (I would also prefer to have a column for week and a column for year). And most probably there's a way to manipulate to get what you want.

library(dplyr)
library(reshape2)

eidsr %>% 
  # values should be in a column (not in headers) 
  melt(id.var = 'district') %>% 
  # extract the new variables
  mutate(week_year = substr(variable, 1, 7),
         variable = sub(".*EIDSR[- _]", "", variable)) %>% 
  # assuming missing values don't have a specific meaning you can just remove them
  na.omit()

#     district            variable value week_year
# 21        Bo Total_Malaria_cases     0   W1_2019
# 23        Bo Total_Malaria_cases    51   W1_2019
# 25        Bo Total_Malaria_cases    64   W1_2019
# 26        Bo Total_Malaria_cases    86   W1_2019
# 27        Bo Total_Malaria_cases    92   W1_2019
# 28        Bo Total_Malaria_cases    47   W1_2019
# 29        Bo Total_Malaria_cases    46   W1_2019
# 30        Bo Total_Malaria_cases    35   W1_2019

I can see that you're loosing your patience, so if you MUST use a loop you should use one of the apply functions, and for those you need a function to repeatedly apply on a vector or a list:

wacky_fun <- function(x_chr){
  malaria_col <- paste0(x_chr, '_EIDSR_Total_Malaria_cases')
  sms_col <- paste0(x_chr, '_EIDSR-Mobile_SMS')
  internet_col <- paste0(x_chr, '_EIDSR-Mobile_Internet')

  mode_col <- rep("NoRep", nrow(eidsr))
  mode_col[eidsr[malaria_col]>0] <- "Report"
  mode_col[eidsr[sms_col]==1] <- "Mobile_SMS"
  mode_col[eidsr[internet_col]==1] <- "Mobile_Internet"

  return(mode_col)
}

We'll apply the function on all the weeks in the data

# get the unique weeks in the headers 
weeks <- names(eidsr)[grepl('W[[:digit:]]_[[:digit:]]{4}', names(eidsr))] %>% 
  substr(1, 7) %>% 
  unique()
# apply the function on all the weeks, bind them with the district, and convert to data.frame
cbind('district' = eidsr$district, sapply(weeks, wacky_fun)) %>% 
  as.data.frame()

No, that is not what I'm aiming for. Looks like I can't format comments so will re-edit in the question above. Thanks for your patience. — DanM, Apr 17 '19 at 11:56
Just adding my belated thanks @DS_UNI. Your "wacky fun" worked perfectly with a whole lot fewer lines of code than my kluge did. I've got a lot of analysis to do so further learning of the code will have to take a back seat till later, but I appreciate your help and fully intend to come back to analyze your code till I understand it! — DanM, Apr 19 '19 at 19:38

Iterating an operation on columns in a dataframe using a loop

1 Answers1