
This is my first post here but I'm hoping you can help me out with this as it's doing my head in!

I have a csv file containing a lot of data (~250,000 lines) and I need to remove the duplicate entries. There are only certain elements in each row that I would like to test for duplicates, but the other data needs to be shown in the end result. The columns Date, Lat and Lon need to be tested for duplicates. For example, if I start with this data:

Date    Time    Mag Lat Lon Depth   Event
01/01/2008  01:38:25    1.04    35.5152 -120.8587   4.15    71091831
01/01/2008  01:44:27    0.84    38.8215 -122.8132   2.55    51193664
01/01/2008  01:46:59    0.48    38.8298 -122.811    2.44    51193666
01/01/2008  01:44:29    0.86    38.8215 -122.8132   2.76    51276634
01/01/2008  02:02:32    0.32    38.8193 -122.7968   5.86    51193667

It would remove the fourth line as it has the same Date, Lat and Lon as the second line and hence the output would be:

Date    Time    Mag Lat Lon Depth   Event
01/01/2008  01:38:25    1.04    35.5152 -120.8587   4.15    71091831
01/01/2008  01:44:27    0.84    38.8215 -122.8132   2.55    51193664
01/01/2008  01:46:59    0.48    38.8298 -122.811    2.44    51193666
01/01/2008  02:02:32    0.32    38.8193 -122.7968   5.86    51193667

Thanks in advance!

Tom

tom982

1 Answer


Use `duplicated`. Since you want rows treated as duplicates when Date, Lat and Lon all match, pass those three columns. Assuming your data frame is called `dat`:

> dat[!duplicated(dat[, c("Date", "Lat", "Lon")]), ]
        Date     Time  Mag     Lat       Lon Depth    Event
1 01/01/2008 01:38:25 1.04 35.5152 -120.8587  4.15 71091831
2 01/01/2008 01:44:27 0.84 38.8215 -122.8132  2.55 51193664
3 01/01/2008 01:46:59 0.48 38.8298 -122.8110  2.44 51193666
5 01/01/2008 02:02:32 0.32 38.8193 -122.7968  5.86 51193667
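If you're starting from the CSV file itself, the same one-liner slots between a read and a write. A minimal sketch using your sample rows; the file names are placeholders, and with real data you'd replace the `data.frame` call with `read.csv`:

```r
# In practice: dat <- read.csv("earthquakes.csv")
# Here we build the sample data directly so the example is self-contained.
dat <- data.frame(
  Date  = rep("01/01/2008", 5),
  Time  = c("01:38:25", "01:44:27", "01:46:59", "01:44:29", "02:02:32"),
  Mag   = c(1.04, 0.84, 0.48, 0.86, 0.32),
  Lat   = c(35.5152, 38.8215, 38.8298, 38.8215, 38.8193),
  Lon   = c(-120.8587, -122.8132, -122.8110, -122.8132, -122.7968),
  Depth = c(4.15, 2.55, 2.44, 2.76, 5.86),
  Event = c(71091831, 51193664, 51193666, 51276634, 51193667)
)

# duplicated() marks every row after the first occurrence of each
# Date/Lat/Lon combination, so negating it keeps the first of each group.
deduped <- dat[!duplicated(dat[, c("Date", "Lat", "Lon")]), ]

# Write the result; row.names = FALSE suppresses the row-number column.
# write.csv(deduped, "earthquakes_deduped.csv", row.names = FALSE)
```

Note that `duplicated` keeps the *first* occurrence, so which row survives depends on the order of the file; sort first if you want, say, the earliest Time kept. This approach is vectorised and handles ~250,000 rows comfortably.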
Jilber Urbina