3

I have created an RMarkdown that checks for errors, which outputs print statements that specify the error and what row numbers need to be corrected (which would check the errors in df below). I have created another dataframe (df.index in the example below) to track the rows that need to be corrected for each column (that is in df). Essentially, I need to add a column that stores a list of the rows that needs to be corrected for each column in df. Then, as I do more error checks, I will need to append to the list in a given row in df.index and add new lists to other rows for the rows column in the newly created summary dataframe.

I have looked through dozens of SO entries on lists, but cannot find a good answer. Here is what I have tried, which I show with this minimal example. This code does work and it gives me the output that I want. However, it is extremely verbose and will probably be hard for others on my project team to be able to read and make sense of it.

Minimal Example

Data

library(dplyr)

# Dataframe that contains the dataset that I'm checking for errors.
df <-
  structure(
    list(
      `1.1.` = c("Andrew", "Max", "Sylvia", NA, "1",
                 NA, NA, "Jason"),
      `1.2.` = c(1, 2, 2, NA, NA, 5, 3, NA),
      `1.3.` = c(
        "cool",
        "amazing",
        "wonderful",
        "okay",
        NA,
        "sweet",
        "chocolate",
        "fine"
      )
    ),
    class = "data.frame",
    row.names = c(NA, -8L)
  )


# Dataframe that contains the column numbers and names, which will be used to create a summary of what rows need to be changed for each column.
df.index <-
  structure(list(
    number = c("1.1.", "1.2.", "1.3."),
    name = c("name",
             "number", "category")
  ),
  class = "data.frame",
  row.names = c(NA, -3L))

What I have tried

obs <- "1.1."

na.index <- which(is.na(df$`1.1.`))

summary <- df.index %>%
  dplyr::mutate(rows = ifelse(number == obs, list(na.index), NA))

# Check to see if there are any numeric values in this character column. Adding 6 just to have a duplicate for this example.
na.index2 <-
  c(which(!is.na(as.numeric(
    as.character(df$`1.1.`)
  ))), 6)

# Append new list from na.index2 to the existing list in row 1 (or 1.1.), and keep only the unique values, excluding NAs.
summary <- summary %>%
  dplyr::mutate(rows = ifelse(number == obs, list(unique(na.omit(
    unlist(append(rows, list(na.index2)))
  ))), NA))

# Column 1.2. in df.
obs <- "1.2."

na.index3 <- which(df$`1.2.` > 2)

summary <- summary %>%
  dplyr::mutate(rows = ifelse(number == obs, list(na.index3), rows))

na.index4 <- which(df$`1.2.` == 2)

summary <- summary %>%
  dplyr::mutate(rows = ifelse(number == obs, list(unique(na.omit(
    unlist(append(rows[2], list(na.index4)))
  ))), rows))

# Column 1.3. in df.
obs <- "1.3."

na.index5 <- which(df$`1.3.` == "okay")

summary <- summary %>%
  dplyr::mutate(rows = ifelse(number == obs, list(na.index5), rows))

Output (which is also the expected output)

summary

  number     name       rows
1   1.1.     name 4, 6, 7, 5
2   1.2.   number 6, 7, 2, 3
3   1.3. category          4

I get all of the correct rows in the example above, but there has to be a much simpler way to do this, and without having to create obs and having to specify the row number (e.g., rows[2]) when appending a list.

As you can see, not every column has the same error checks. So, I'm hoping to have an easy way to add a list to the rows column in summary as I go through similar checks for each category (like 1.2., 1.3., etc.), as well as being able to append additional lists (like shown here).

AndrewGB
  • 16,126
  • 5
  • 18
  • 49

2 Answers2

4

We could loop across the columns of 'df', get the index of NA and numeric only elements (str_detect) in a list, reshape to 'long' format with `pivot_longer' and join with the 'df.index' data

library(dplyr)
library(stringr)
library(tidyr)
out <-  df %>% 
    summarise(across(everything(), ~ {
       tmp <- df.index$name[match(cur_column(), df.index$number)]
       if(tmp == 'name') list(which(is.na(.)|str_detect(., 
          '^-?[0-9.]+$'))) else if(tmp == "number") 
             list(which(is.na(.)|str_detect(., '\\D'))) else 
          list(which(is.na(.)))})) %>%
    pivot_longer(everything(), names_to = 'number', values_to = 'rows') %>% 
    right_join(df.index) %>%
    select(names(df.index), everything())

-output

> out
# A tibble: 3 x 3
  number name     rows     
  <chr>  <chr>    <list>   
1 1.1.   name     <int [4]>
2 1.2.   number   <int [3]>
3 1.3.   category <int [1]>

> out$rows
[[1]]
[1] 4 5 6 7

[[2]]
[1] 4 5 8

[[3]]
[1] 5

Update

Based on the updated checks, just modify the conditions in else if or else

out <-  df %>% 
    summarise(across(everything(), ~ {
       tmp <- df.index$name[match(cur_column(), df.index$number)]
       if(tmp == 'name') list(which(is.na(.)|str_detect(., 
          '^-?[0-9.]+$'))) else if(tmp == "number") 
             list(which(. >= 2)) else 
          list(which(. %in% 'okay'))})) %>%
    pivot_longer(everything(), names_to = 'number', values_to = 'rows') %>% 
    right_join(df.index) %>%
    select(names(df.index), everything())

-output

> out$rows
[[1]]
[1] 4 5 6 7

[[2]]
[1] 2 3 6 7

[[3]]
[1] 4
akrun
  • 874,273
  • 37
  • 540
  • 662
2

I would create a function for each check that you want to run. I would then come up with a framework to apply these functions to the columns of interest. Here is just one way to go about it.

# your check for numbers
check_1 <- function(x) {
  is_number <- !is.na(as.numeric(as.character(x)))
  is_na <- is.na(x)
  
  which(is_number | is_na)
}

# some random check ("fine" is an error!)
check_2 <- function(x) {
  which(x == "fine")
}

I would then create a list of each check. Here, the element is the name of the column, and the value is the function.

all_checks <- list("1.1." = check_1,
                   "1.3." = check_2)

Now you can map over every check, applying it to the column you want.

check_results <- imap(all_checks, ~ .x(df[[.y]]))
# > check_results
# $`1.1.`
# [1] 4 5 6 7
# 
# $`1.3.`
# [1] 8

You can then use these results to create that summary table, if you would like. Or, just keep this and use it as the list.

map_dfr(check_results, ~ tibble(rows = paste(.x, collapse = ", ")), .id = "number") %>% 
  left_join(df.index, by = "number") %>% 
  relocate(number, name, rows)
# # A tibble: 2 x 3
#   number name     rows      
#   <chr>  <chr>    <chr>     
# 1 1.1.   name     4, 5, 6, 7
# 2 1.3.   category 8  

The benefit of this is that it is now easy to see all your checks, and add more / make updates. You can even add multiple checks to the same column. The resulting check_results list can have multiple elements with the same name. You would just need to do a bit more work to collapse those if you wanted to just flag all the errors for a given column rather than breaking it out by check.

# apply check_1 to more columns
all_checks <- list("1.1." = check_1,
                   "1.2." = check_1,
                   "1.3." = check_2)

check_results <- imap(all_checks, ~ .x(df[[.y]]))
# > check_results
# $`1.1.`
# [1] 4 5 6 7
# 
# $`1.2.`
# [1] 1 2 3 4 5 6 7 8
# 
# $`1.3.`
# [1] 8

Follow on Edit

This can be expanded upon if you are trying to add in more options. For example, here is how you could pass additional arguments to check functions.

# this nor takes the column as x, but some other y and z as well
check_3 <- function(x, y, z) {
  message("passed in y:\t", y)
  message("passed in z:\t", z)
  
  x_criteria <- x == "Max"
  
  which(x_criteria)
}

You can make your all_checks list as sophisticated as you need and pull the information from the relevant list element.

# now we have a list with the function and parameters
all_checks <- list("1.1." = list(fn = check_3, pars = list(z = 5, y = 10)),
                   "1.3." = list(fn = check_2))

# we use exec to execute the function with the parameter list
imap(all_checks, ~ exec(.x$fn, x = df[[.y]], !!!.x$pars))
# passed in y:  10
# passed in z:  5
# $`1.1.`
# [1] 2
#
# $`1.3.`
# [1] 8
  • This is great for my situation! The datasets I'm working with have about 100 columns and hundreds of rows of data; I also have to do about 30 different error checks. So, I think this is a better way to organize it (and so it's easier to follow for others), so that it is easier to follow. The other solution is great if there aren't that many columns and fewer error checks. Thanks! – AndrewGB Aug 10 '21 at 18:52
  • 1
    I do a lot of ETL through R and have a few things like this myself. I like having the list where I can just see it all. Another thing you can do is have separate lists for different types of checks, rules to apply to multiple columns (that you read from a data frame), etc. You can start building a framework that works for you as you do it more. Ends up being pretty powerful. –  Aug 10 '21 at 18:58
  • Great, thanks! Yeah, some of the errors that I will be checking will be a little more complex, as some of the checks will be dependent upon what is in some other columns, so will definitely keep the separate lists idea in mind when getting to those. – AndrewGB Aug 10 '21 at 19:24
  • Quick follow up: What would be the easiest way to pass additional parameters to each function you are applying? For example, check_1 has more than `x`, like `check_1 <- function(x, y, z)...`. How do you pass that in `imap`? – AndrewGB Aug 10 '21 at 19:41
  • 1
    @AndyBrown I added a revision to the end. Long story short, you can convert `all_checks` to a nested list with some structure and pull list elements off as you need them. The trick is how to do that most efficiently with `imap`. –  Aug 10 '21 at 20:05
  • Thanks! I have a follow up question about trying to do something similar when trying to apply to an entire dataframe, as I couldn't figure out to integrate it with this, and was wondering if you would take a look. But I have created a new question for that: https://stackoverflow.com/questions/68737808/how-to-pass-column-names-into-a-function-when-using-an-apply-or-map-function-in – AndrewGB Aug 11 '21 at 07:36