I've been trying to solve this problem for two days and I'm tearing my hair out. I have a dataset with nearly 15 million points, and a few days of data points are artifacts that I need to remove. I know the syntax for deleting rows from my dataset:

DataNoArtifacts <- Data[-(5039761:5041201), ]

This code has worked for me in the past and continues to work. My problem is FINDING the actual values that I need to delete so that I can get the row numbers, or range of row numbers, to put in the code.

When I filter the dataset for the exact date and minutes I need to remove, I can easily find the times. However, filtering assigns them new row numbers. I need to be able to filter the data and still see the original row numbers, but cannot.

So I tried scrolling through my 15,000,000-row dataset to find the rows manually, since that seemed to be the only option. The problem is that if I scroll more than one click or so at a time, the viewer jumps up or down a few thousand rows, which makes it nearly impossible to find the row number for any specific day, MUCH less the exact hour and minute. If I finally get within a couple of weeks of the data point I need to delete, the only way to be sure of finding it is to click one click at a time... through 2 weeks or so of data recorded by the minute.

Some of the days have huge ranges to delete (example: 12/11/2003 has a few ranges of many hours at a time with equipment error), and some have only a few randomly interspersed problematic minutes (example: 3/10/2014 has constant wind speeds between 0 and 3 m/s, with about 40 random blips among the day's 1,440 minutes that are anomalies like 20 m/s).
Long story short: I know how to delete the values I need to delete. But for the life of me R will not cooperate to help me find the rows.

Unfortunately, the data points I need to delete are not exclusively days/hours/minutes above or below a certain value I can filter out. It's the times AROUND specific points that indicated artifacts that I need to delete.

MrFlick

1 Answer

In base R, row names are preserved when subsetting:

dd <- data.frame(x = LETTERS[1:3])
print(dd[dd$x != "B", , drop = FALSE])
  x
1 A
3 C
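
If the goal is only the original row numbers, base R's which() returns the positions at which a logical condition is TRUE, so no scrolling is needed. This is a minimal sketch, assuming a hypothetical data frame Data with a POSIXct column time and a numeric column wind; those column names and the toy values are illustrative and not taken from the question.

```r
# Hypothetical minute-resolution data; the column names are assumptions
Data <- data.frame(
  time = seq(as.POSIXct("2014-03-10 00:00", tz = "UTC"),
             by = "1 min", length.out = 10),
  wind = c(1, 2, 20, 1, 3, 2, 1, 20, 2, 1)
)

# Original row numbers of every observation inside a time window
start <- as.POSIXct("2014-03-10 00:02", tz = "UTC")
end   <- as.POSIXct("2014-03-10 00:04", tz = "UTC")
bad_rows <- which(Data$time >= start & Data$time <= end)
bad_rows

# Drop those rows. Guard against an empty index:
# Data[-integer(0), ] would return zero rows rather than all of them.
DataNoArtifacts <- if (length(bad_rows) > 0) Data[-bad_rows, ] else Data
```

The resulting bad_rows vector can be passed straight to the negative-index deletion from the question, in place of a hand-typed range.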

You are presumably working with the tidyverse, where, as you have pointed out, filtering renumbers the rows:

library(tidyverse)
dd <- as_tibble(dd)
filter(dd, x != "B")
# A tibble: 2 × 1
  x    
  <chr>
1 A    
2 C    

The easy solution, if you have enough memory to handle it (15 million integer indices take only about 60 MB), is to add your own row-number column before filtering:

dd |> mutate(row = seq(n()), .before = 1) |> filter(x != "B")
# A tibble: 2 × 2
    row x    
  <int> <chr>
1     1 A    
2     3 C    
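
Since the question is about removing the times around specific artifact points, the row column can also be used to build a window around each flagged row. The following is a rough sketch under stated assumptions: the tibble Data, its columns time and wind, the threshold of 10 m/s, and the window of one row on either side are all hypothetical choices for illustration.

```r
library(dplyr)

# Hypothetical minute-resolution data; the column names are assumptions
Data <- tibble(
  time = seq(as.POSIXct("2014-03-10 00:00", tz = "UTC"),
             by = "1 min", length.out = 10),
  wind = c(1, 2, 20, 1, 3, 2, 1, 20, 2, 1)
)

# Record each row's position before any filtering
Data <- Data |> mutate(row = seq(n()), .before = 1)

# Locate the spikes; 'row' still refers to positions in the full dataset
spikes <- Data |> filter(wind > 10)
spikes$row

# Flag each spike plus one row on either side (window width is an assumption)
bad_rows <- unique(unlist(lapply(spikes$row, function(r) (r - 1):(r + 1))))
bad_rows <- bad_rows[bad_rows >= 1 & bad_rows <= nrow(Data)]

# Keep everything that was not flagged
DataNoArtifacts <- Data |> filter(!row %in% bad_rows)
```

Filtering on !row %in% bad_rows avoids negative indexing altogether, so it behaves sensibly even when no rows are flagged.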
Ben Bolker