
I am trying to reduce the size of my files by removing rows that contain no additional information. I have rows where the Bid and Ask price do not change from one period to the next for the same ID; in this case I only want to keep the first observation.

I'm not sure how to include data from an Excel file here, so I attached a screenshot instead:

[screenshot of the sample data]
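For illustration, data of the shape I am describing (the values here are made up, not the actual file contents):

data <- data.frame(
  ID          = c(1, 1, 1, 2, 2),
  `Date-Time` = as.POSIXct("2023-05-03 09:30:00") + 0:4 * 60,
  Bid         = c(99.85, 99.85, 99.90, 50.10, 50.10),
  Ask         = c(99.95, 99.95, 100.00, 50.20, 50.20),
  check.names = FALSE
)
# Rows 2 and 5 repeat the previous Bid/Ask for the same ID and should be
# dropped; rows 1, 3 and 4 should be kept.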

Please help me write code that efficiently reduces the size of the data.frame by keeping only the first observation per ID when the Bid/Ask price does not change. Efficiency is key since my files are >5 GB.

I tried using `distinct(data)`, but that does not work since the time column does change. I want a way to keep only distinct Bid/Ask prices within each ID group.

Grouping by time and ID is also not an option, since my dataset is far too large and this would result in code that is too slow.

    (1) Please don't spam tags, this has nothing to do with the RStudio IDE. (2) Please do not post (only) an image of code/data/errors: it breaks screen-readers and it cannot be copied or searched (ref: https://meta.stackoverflow.com/a/285557 and https://xkcd.com/2116/). Please include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly into a [code block]. – r2evans May 03 '23 at 16:23
  • What measure of precision do you need to determine "no change"? In how R portrays data, `99.85` could be masking some lower-digit differences (see the sketch after these comments). See https://stackoverflow.com/q/9508518/3358272, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f – r2evans May 03 '23 at 16:24
  • `Grouping by time and ID is also not an option`, nor is it a good idea, since you are looking for change in values across different times. You _must_ measure distinctness by ID, it is unavoidable. – r2evans May 03 '23 at 16:28
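To illustrate the precision point from the comments, a minimal sketch (the 4-decimal rounding is an arbitrary choice for illustration, not something from the question):

x <- 99.85
y <- 99.85 + 1e-12          # prints identically to x
x == y                      # FALSE: the stored doubles differ
round(x, 4) == round(y, 4)  # TRUE: equal at a chosen precision
isTRUE(all.equal(x, y))     # TRUE: equal within the default tolerance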

2 Answers


The first step I suggest is to create a unique key from the columns whose combination should not repeat. Note that the timestamp should not go into the key, since it changes on every row and would make every key unique:

df$key <- paste(df$id, df$Bid, df$Ask, sep = "|")  # separator avoids accidental collisions

This gives you a unique key for the combinations you do not want repeated. Then apply the `duplicated` function:

df <- df[!duplicated(df$key), ]

This way is very efficient. I have used this code on an 8M-row data frame and it takes only a few seconds. The most important thing is to define the key correctly.

If you ever need a date column in a key and its format is too complicated to process, just create a dummy column with the date as character and build the key from that.
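Putting it together on a toy data frame (the column names and values here are illustrative assumptions; adapt them to your file):

df <- data.frame(
  id  = c(1, 1, 1, 2, 2),
  Bid = c(99.85, 99.85, 99.90, 50.10, 50.10),
  Ask = c(99.95, 99.95, 100.00, 50.20, 50.20)
)

# Build the key from the columns that must not repeat
df$key <- paste(df$id, df$Bid, df$Ask, sep = "|")

# Keep the first occurrence of each key, then drop the helper column
df <- df[!duplicated(df$key), ]
df$key <- NULL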

Rodrigo

As you tagged data.table, I assume you are familiar with the syntax. Instead of explicit duplicate checks, you can simply order your data by Date-Time so that the first observations (the keepers) are on top. With `.SD[1]` you subset the first record of a group, and with ID, Open Bid, Open Ask in the `by` argument you keep only the unique records.

setorder(dt, `Date-Time`)  # earliest rows on top
dt[, .SD[1], by = .(ID, `Open Bid`, `Open Ask`)]
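A self-contained sketch on made-up data (the values and the revert in row 4 are my own illustration). One subtlety: grouping on the price values keeps only the first occurrence of each ID/price combination overall, so if a price returns to an earlier level the re-occurrence is dropped too. If you want one row per consecutive run instead, `rleid()` can define the groups:

library(data.table)

# Made-up data: prices repeat, change, then revert to the earlier level
dt <- data.table(
  ID          = c(1, 1, 1, 1),
  `Date-Time` = as.POSIXct("2023-05-03 09:30:00") + 0:3 * 60,
  `Open Bid`  = c(99.85, 99.85, 99.90, 99.85),
  `Open Ask`  = c(99.95, 99.95, 100.00, 99.95)
)
setorder(dt, `Date-Time`)  # earliest rows on top

# First row per ID/price combination (row 4, the revert, is dropped too)
dt[, .SD[1], by = .(ID, `Open Bid`, `Open Ask`)]

# One row per *consecutive* run of identical prices within each ID
# (keeps row 4) -- closer to "no change from one period to the next"
dt[, run := rleid(`Open Bid`, `Open Ask`), by = ID]
dt[, .SD[1], by = .(ID, run)][, run := NULL][]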
Merijn van Tilborg