
I have a large list of tables, each including multiple variables. As a simple example, to understand the structure of my data, please see below:

dat1 <- structure(list(
  NR = c("DBD-0006", "DBD-0057", "DBD-0095", "GHP-0169", "GHP-0237",
         "NNB-0243", "NNB-0303", "NNB-0306", "NNB-0359", "NNB-0364"),
  DATE = c("13-07-2011", "15-12-2010", "09-03-2011", "14-09-2011", "30-06-2010",
           "16-05-2016", "04-07-2012", "11-07-2012", "05-12-2012", "12-12-2012"),
  CODE = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
  DATE2 = c("18-07-2011", "15-12-2010", "09-03-2011", "14-09-2011", "05-01-2012",
            "11-05-2016", "05-07-2012", "11-07-2012", "06-12-2012", "12-12-2012"),
  type = c("YY.90.01", "50.19", "50.37", "50.37", "50.37", "YY.90.00",
           "50.37", "YY.50.01", "YY.82.01", "YY.50.02"),
  center = c("DBD", "DBD", "DBD", "GHP", "GHP", "NNB", "NNB", "NNB", "NNB", "NNB")
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

dat2 <- dat3 <- dat1

tables <- list(df1 = dat1,
               df2 = dat2,
               df3 = dat3)

In my data there are several non-ASCII characters and I need to identify where in each dataset they appear. I have written a for loop which yields the columns that include these characters. However, it takes hours to run! I need to accelerate it, and I think a function from the apply family could speed it up. Below you can see the code:

nonUTF8 <- list()
for (table in names(tables)) {              ## this gives the table names
  for (item in names(tables[[table]])) {    ## this gives the variables within each table
    nonUTF8[[table]] <- tables[[table]][, grepl("[^\x01-\x7F]", tables[[table]])]
  }
}

I would appreciate your advice.

Rara
  • 2
    The base difference between (e.g.) `lapply` and a well-written `for` loop is negligible; if the loop is slow-running, then the direct `lapply` translation will also run slowly. You say you need to identify the columns (within each table) that include the non-ASCII chars; do you intend to fix them, or do you need to produce a new list of frames with just these columns so that you can use them differently in follow-on processing? – r2evans May 12 '23 at 13:18
  • But to be a little more compact (at very little improvement otherwise), you might use something like `lapply(tables, function(tb) tb[sapply(tb, function(z) any(grepl("[X-Z]", z), na.rm = TRUE))])` (I changed the pattern since your data is all ascii). – r2evans May 12 '23 at 13:21
  • 2
    I second @r2evans's first comment, but an exception is if you run the *apply in parallel across multiple cores: the time to complete will likely come down by approximately a factor of the number of cores. – stevec May 12 '23 at 13:24
  • @r2evans if I simplify your code to `lapply(tables, function(tb) tb[grepl("[X-Z]", tb)])`, it still works — which really surprised me. This is also essentially what OP’s code does, since they never use `item` in their inner loop. Any idea why this works and if it’s a safe/reliable approach? – zephryl May 12 '23 at 13:31
  • @r2evans Thanks for the solution. I just need to extract the columns including them so that somebody else can decide what to do with them. – Rara May 12 '23 at 13:42
  • 1
    @zephryl, if you look under the hood, `grepl("[X-Z]", dat1)` is using `as.character(dat1)`, converting it into a character vector of length `ncol(dat1)`, where each _string_ is literally `c("DBD-0006", "DBD-0057", "DBD-0095", ...)` (the string contains the R code `c(...)`; it is not a vector itself). So while it is likely "safe" since our pattern here does not include any of those characters, it's not safe in the general sense. – r2evans May 12 '23 at 13:43
  • @r2evans Could you kindly explain what "[X-Z]" means? – Rara May 12 '23 at 13:54
  • @Rara, it is a regex character class/range, something you're already using in your question; since none of your data contains non-ascii characters, I thought it more interesting to demonstrate using a filter that actually does something; if you don't understand how the regex is working, see https://stackoverflow.com/a/22944075/3358272 – r2evans May 12 '23 at 14:05
  • @Rara, you would replace it with your `"[^\x01-\x7F]"` – r2evans May 12 '23 at 14:46
  • @r2evans Thanks for the explanations. I have one more question as a next step; could you also help me with that, please? Now I would like to loop through the data to filter each row that contains those special characters. I have written this code but do not know how to refer to the column name in each table: `lapply(tables, function(x) sapply(x, function(y) filter(y, grepl("[X-Z]", column name))))`. My final aim is to have lists of columns per table including only the rows with special characters. – Rara May 15 '23 at 10:13
  • @r2evans This code gives me the desired output for a column in table FAB, for instance: `col_xxx <- filter(FAB, grepl("[X-Z]", col_xxx))` – Rara May 15 '23 at 10:21
  • 1
    `lapply(tables, function(tb) tb %>% select(where(~ any(grepl("[GX-Y]", .)))) %>% filter(if_any(everything(), ~ grepl("[GX-Y]", .))))` – r2evans May 15 '23 at 12:47
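The pipeline in the last comment can also be written in base R. Below is a minimal sketch under stated assumptions: the toy `tables`, the injected `\u00e9` character, and the names `pattern` and `result` are all illustrative (the question's sample data is entirely ASCII, so nothing there would match).

```r
# Toy data with one non-ASCII character injected purely for illustration
tables <- list(
  df1 = data.frame(NR = c("DBD-0006", "DBD-00\u00e9"), CODE = c("1", "1")),
  df2 = data.frame(NR = c("GHP-0169", "GHP-0237"), CODE = c("1", "1"))
)

pattern <- "[^\x01-\x7F]"  # the question's non-ASCII pattern

result <- lapply(tables, function(tb) {
  # step 1: keep only columns that contain at least one non-ASCII value
  keep_cols <- vapply(tb, function(col) any(grepl(pattern, col)), logical(1))
  out <- tb[, keep_cols, drop = FALSE]
  # step 2: keep only rows where at least one of those columns matches
  keep_rows <- Reduce(`|`, lapply(out, function(col) grepl(pattern, col)),
                      rep(FALSE, nrow(out)))
  out[keep_rows, , drop = FALSE]
})

# result$df1 keeps column NR and only its non-ASCII row;
# result$df2 has no matching columns, so it comes back empty
```

Note that `vapply` per column avoids the `as.character(data.frame)` pitfall r2evans describes above, because the pattern is tested against each column vector rather than a deparsed string.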

1 Answer


One approach (the test for non-ASCII characters is taken from here):

  • create helper function which returns the names of a dataframe's columns which contain one or more values with non-ASCII characters:
which_columns <- \(df) {
  column_indices <- which(lapply(df, \(column) any(grepl("[^ -~]", column))) == TRUE)
  names(df)[column_indices]
}
  • apply the helper function with lapply (or Map etc.):
tables |> lapply(FUN = which_columns)
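As a quick sanity check of this helper, here is a hypothetical toy list with one non-ASCII character injected (an assumption for demonstration; the question's sample data is all ASCII, so the helper would flag nothing there):

```r
# the answer's helper, repeated so this snippet is self-contained (R >= 4.1)
which_columns <- \(df) {
  column_indices <- which(lapply(df, \(column) any(grepl("[^ -~]", column))) == TRUE)
  names(df)[column_indices]
}

# toy frames: df1$NR gets a non-ASCII "\u00e9" injected, df2 stays pure ASCII
tables <- list(
  df1 = data.frame(NR = c("DBD-0006", "DBD-00\u00e9"), CODE = c("1", "1")),
  df2 = data.frame(NR = c("GHP-0169", "GHP-0237"), CODE = c("1", "1"))
)

nonascii_cols <- tables |> lapply(FUN = which_columns)
# df1 flags "NR"; df2 flags no columns (character(0))
```

One caveat worth noting: the pattern `"[^ -~]"` (anything outside printable ASCII) is slightly stricter than the question's `"[^\x01-\x7F]"`, so it will also flag control characters such as tabs.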
I_O