
I am trying to reduce the size of my files by removing rows that contain no additional information. I have rows where the Bid and Ask price do not change from one period to the next for the same ID; in this case I only want to keep the first observation.

I'm not sure how to include data from an Excel file here, so I attached a screenshot instead:

[screenshot of the sample data]
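For illustration, data of the shape I am describing (the values here are made up, not the actual file contents):

data <- data.frame(
  ID          = c(1, 1, 1, 2, 2),
  `Date-Time` = as.POSIXct("2023-05-03 09:30:00") + 0:4 * 60,
  Bid         = c(99.85, 99.85, 99.90, 50.10, 50.10),
  Ask         = c(99.95, 99.95, 100.00, 50.20, 50.20),
  check.names = FALSE
)
# Rows 2 and 5 repeat the previous Bid/Ask for the same ID and should be
# dropped; rows 1, 3 and 4 should be kept.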

Please help me write code that efficiently reduces the size of the data.frame by keeping only the first observation per ID when the Bid/Ask price does not change. Efficiency is key since my files are >5 GB.

I tried using `distinct(data)`, but that does not work since the time column does change. I want a way to keep only distinct Bid/Ask prices within each ID group.

Grouping by time and ID is also not an option, since my dataset is far too large and this would result in code that is too slow.

    (1) Please don't spam tags, this has nothing to do with the RStudio IDE. (2) Please do not post (only) an image of code/data/errors: it breaks screen-readers and it cannot be copied or searched (ref: https://meta.stackoverflow.com/a/285557 and https://xkcd.com/2116/). Please include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly into a [code block]. – r2evans May 03 '23 at 16:23
  • What measure of precision do you need to determine "no change"? In how R portrays data, `99.85` could be masking some lower-digit differences (see the sketch after these comments). See https://stackoverflow.com/q/9508518/3358272, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f – r2evans May 03 '23 at 16:24
  • `Grouping by time and ID is also not an option`, nor is it a good idea, since you are looking for change in values across different times. You _must_ measure distinctness by ID, it is unavoidable. – r2evans May 03 '23 at 16:28
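To illustrate the precision point from the comments, a minimal sketch (the 4-decimal rounding is an arbitrary choice for illustration, not something from the question):

x <- 99.85
y <- 99.85 + 1e-12          # prints identically to x
x == y                      # FALSE: the stored doubles differ
round(x, 4) == round(y, 4)  # TRUE: equal at a chosen precision
isTRUE(all.equal(x, y))     # TRUE: equal within the default tolerance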

2 Answers


The first step I suggest is to create a unique key from the columns whose combination should not repeat. Note that the timestamp should not go into the key, since it changes on every row and would make every key unique:

df$key <- paste(df$id, df$Bid, df$Ask, sep = "|")  # separator avoids accidental collisions

This gives you a unique key for the combinations you do not want repeated. Then apply the `duplicated` function:

df <- df[!duplicated(df$key), ]

This way is very efficient. I have used this code on an 8M-row data frame and it takes only a few seconds. The most important thing is to define the key correctly.

If you ever need a date column in a key and its format is too complicated to process, just create a dummy column with the date as character and build the key from that.
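Putting it together on a toy data frame (the column names and values here are illustrative assumptions; adapt them to your file):

df <- data.frame(
  id  = c(1, 1, 1, 2, 2),
  Bid = c(99.85, 99.85, 99.90, 50.10, 50.10),
  Ask = c(99.95, 99.95, 100.00, 50.20, 50.20)
)

# Build the key from the columns that must not repeat
df$key <- paste(df$id, df$Bid, df$Ask, sep = "|")

# Keep the first occurrence of each key, then drop the helper column
df <- df[!duplicated(df$key), ]
df$key <- NULL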

Rodrigo

As you tagged data.table, I assume you are familiar with the syntax. Instead of explicit duplicate checks, you can simply order your data by Date-Time so that the first observations (the keepers) are on top. With `.SD[1]` you subset the first record of a group, and with ID, Open Bid, Open Ask in the `by` argument you keep only the unique records.

setorder(dt, `Date-Time`)  # earliest rows on top
dt[, .SD[1], by = .(ID, `Open Bid`, `Open Ask`)]
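A self-contained sketch on made-up data (the values and the revert in row 4 are my own illustration). One subtlety: grouping on the price values keeps only the first occurrence of each ID/price combination overall, so if a price returns to an earlier level the re-occurrence is dropped too. If you want one row per consecutive run instead, `rleid()` can define the groups:

library(data.table)

# Made-up data: prices repeat, change, then revert to the earlier level
dt <- data.table(
  ID          = c(1, 1, 1, 1),
  `Date-Time` = as.POSIXct("2023-05-03 09:30:00") + 0:3 * 60,
  `Open Bid`  = c(99.85, 99.85, 99.90, 99.85),
  `Open Ask`  = c(99.95, 99.95, 100.00, 99.95)
)
setorder(dt, `Date-Time`)  # earliest rows on top

# First row per ID/price combination (row 4, the revert, is dropped too)
dt[, .SD[1], by = .(ID, `Open Bid`, `Open Ask`)]

# One row per *consecutive* run of identical prices within each ID
# (keeps row 4) -- closer to "no change from one period to the next"
dt[, run := rleid(`Open Bid`, `Open Ask`), by = ID]
dt[, .SD[1], by = .(ID, run)][, run := NULL][]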
Merijn van Tilborg