Making a while loop more efficient for use on a large data.table to delete rows based on certain conditions

Question

I have a pretty big amount of data in a data table. I would like to delete a number of rows if there is a certain value in a cell.

Below is an excerpt from my data table:

             V1              V2       V3       V4       V5       V6       V7       V8       V9      V10
 1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16
 2: AT0000A1EKT9          .Close      #NV                                                               
 3:                    Ask.Close      #NV                                                               
 4:                    BID.Close      #NV                                                               
 5:               Bid Ask Spread      #NV        0        0        0        0        0        0        0
 6:              TR.IssuerRating      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1
 7: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325
 8:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721
 9:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164
10:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557
11:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2

Using dput(head(x)) so the table can be easily copied

setDT(structure(list(V1 = c("", "AT0000A1EKT9", "", "", "", "", "AT0000A17HT4", 
"", "", "", ""), V2 = c("01.01.16", ".Close", "Ask.Close", "BID.Close", 
"Bid Ask Spread", "TR.IssuerRating", ".Close", "Ask.Close", "BID.Close", 
"Bid Ask Spread", "TR.IssuerRating"), V3 = c("04.01.16", "#NV", 
"#NV", "#NV", "#NV", "ba1", "3.436", "98.092", "97.537", "0.555", 
"P-2"), V4 = c("05.01.16", "", "", "", "0", "ba1", "3.426", "98.149", 
"97.594", "0.555", "P-2"), V5 = c("06.01.16", "", "", "", "0", 
"ba1", "3.376", "98.43", "97.874", "0.556", "P-2"), V6 = c("07.01.16", 
"", "", "", "0", "ba1", "3.347", "98.596", "98.039", "0.557", 
"P-2"), V7 = c("08.01.16", "", "", "", "0", "ba1", "3.388", "98.366", 
"97.81", "0.556", "P-2"), V8 = c("11.01.16", "", "", "", "0", 
"ba1", "3.379", "98.415", "97.859", "0.556", "P-2"), V9 = c("12.01.16", 
"", "", "", "0", "ba1", "3.349", "98.584", "98.027", "0.557", 
"P-2"), V10 = c("13.01.16", "", "", "", "0", "ba1", "3.325", 
"98.721", "98.164", "0.557", "P-2"), V11 = c("14.01.16", "", 
"", "", "0", "ba1", "3.3", "98.863", "98.305", "0.558", "P-2"
), V12 = c("15.01.16", "", "", "", "0", "ba1", "3.26", "99.089", 
"98.53", "0.559", "P-2"), V13 = c("18.01.16", "", "", "", "0", 
"ba1", "3.271", "99.026", "98.468", "0.558", "P-2"), V14 = c("19.01.16", 
"", "", "", "0", "ba1", "3.244", "99.177", "98.618", "0.559", 
"P-2"), V15 = c("20.01.16", "", "", "", "0", "ba1", "3.238", 
"99.211", "98.652", "0.559", "P-2"), V16 = c("21.01.16", "", 
"", "", "0", "ba1", "3.268", "99.044", "98.487", "0.557", "P-2"
)), row.names = c(NA, -11L), class = c("data.table", "data.frame"
)))

I have grouped my data by the ISIN Number in column 1. For some of these ISIN's, I do not have any price data, which can be seen by the #NV next to .Close

What I would like to do is delete the whole ISIN entry if there is a #NV next to .Close.

The data table should look as follows after deleting the rows I want:

             V1              V2       V3       V4       V5       V6       V7       V8       V9      V10
 1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16
 2: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325
 3:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721
 4:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164
 5:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557
 6:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2

I have written a while loop which works fine for small amounts of test data. However, when I apply this while loop to my full data.table the loop is very inefficient and takes a long time to run making it unusable as I have around 1 million rows.

The while loop looks as follows

i <- 1
while(i < dim(test1)[1]){
  if (test1$V2[i] == ".Close" & test1$V3[i] == "#NV"){
    a <- i + 4                    # creating upper range of rows to be deleted
    test1 <- test1[-c(i:a)]       #deleting rows and overwriting data.table
    i <- 1                        #starting loop from beginning again since data.table is smaller
  }
  else{
    i <- i+1
  }
}

Is there any way to make this loop more efficient?

Can you please provide your data in a more usable format? The empty strings make it problematic to just copy and `read.table()` it, please use paste in the output from `dput(head(x))`. Thanks! — r2evans, May 11 '22 at 17:18
I edited my question so now there is an output from dput(head(x)) in the question text — Elias K., May 12 '22 at 06:59

r2evans · Answer 1 · 2022-05-13T11:02:55.540

From a data-integrity perspective, a blank V1 is in a separate group from a non-blank V1: your data has three groups, two with 1 row each (non-blank) and one group with 9 rows (all blank V1). To properly "group by V1", we need to fix this. I recognize sometimes it can be aesthetically preferred to have the non-repeating values (blanks), so I'll keep the original there.

dat[, V1b := V1
  ][!nzchar(V1b),V1b := NA
  ][,V1b := zoo::na.locf(V1b, na.rm = FALSE)]
dat
#               V1              V2       V3       V4       V5       V6       V7       V8       V9      V10      V11      V12
#           <char>          <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>
#  1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16 14.01.16 15.01.16
#  2: AT0000A1EKT9          .Close      #NV                                                                                 
#  3:                    Ask.Close      #NV                                                                                 
#  4:                    BID.Close      #NV                                                                                 
#  5:               Bid Ask Spread      #NV        0        0        0        0        0        0        0        0        0
#  6:              TR.IssuerRating      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1
#  7: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325      3.3     3.26
#  8:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721   98.863   99.089
#  9:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164   98.305    98.53
# 10:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557    0.558    0.559
# 11:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
# 5 variables not shown: [V13 <char>, V14 <char>, V15 <char>, V16 <char>, V1b <char>]

From here, I think the filtering should be easier.

dat[, .SD[!any(grepl("Close$", V2) & V3 == "#NV"),], by = V1b]
#             V1b           V1              V2       V3       V4       V5       V6       V7       V8       V9      V10      V11
#          <char>       <char>          <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>
# 1:         <NA>                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16 14.01.16
# 2: AT0000A17HT4 AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325      3.3
# 3: AT0000A17HT4                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721   98.863
# 4: AT0000A17HT4                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164   98.305
# 5: AT0000A17HT4               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557    0.558
# 6: AT0000A17HT4              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
# 5 variables not shown: [V12 <char>, V13 <char>, V14 <char>, V15 <char>, V16 <char>]

I keep getting an error: `Error in [.data.table(dt1[, :=(V1b, V1)][!nzchar(V1b), :=(V1b, NA)], : Supplied 1057755 items to be assigned to 1057756 items of column 'V1b'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.` When i removed the first row containing dates the method worked — Elias K., May 13 '22 at 09:10

score 1 · Accepted Answer · answered May 12 '22 at 22:36

Wonderful example of the challenges a nicely formatted Excel spreadsheet offers to data analysis.

If I understand correctly, the OP wants to remove all rows of a section if any of the values in column V3 equals #NV. A new section starts with a non empty entry in column V1.

library(data.table)
setDT(test1)[, grp := cumsum(nzchar(V1))][
  , if (!any(V3 == "#NV")) .SD, by = grp][
    , grp := NULL][]

             V1              V2       V3       V4       V5       V6       V7       V8       V9      V10
         <char>          <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>
1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16
2: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325
3:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721
4:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164
5:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557
6:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
6 variable(s) not shown: [V11 <char>, V12 <char>, V13 <char>, V14 <char>, V15 <char>, V16 <char>]

Thank you for your answer. This was indeed an excel sheet that I formatted this way, would you recommend not leaving cells blank when I am planning to use R? — Elias K., May 13 '22 at 07:50

score 0 · Answer 3 · answered May 11 '22 at 16:26

0

A quick solution if V3 is the column you are filtering, uses dplyr::filter() to drop rows meeting a certain condition.

data.filtered = filter(data, !V3 == '#NV')

Also, does this question ask the same thing?

answered May 11 '22 at 16:26

Kodiakflds

603
1
4
15

The problem with this solution is that in some V3 columns, see row 6 for example, there are no #NV entries but another entry, this row would have to be deleted too. If I simply filter the columns by #NV I would not get the desired result as this row would still be displayed. – Elias K. May 11 '22 at 16:32

Making a while loop more efficient for use on a large data.table to delete rows based on certain conditions

3 Answers3