I have a large CSV file with millions of records and 6 columns. I want to get the unique values of one column, say "Name", together with the other columns associated with them. For example, if there are 50,000 unique "Name" values, I want the other 5 columns for those 50,000 records. I know how to get the unique values in a column: in the code below, I read only the Name column (the 1st column) into a separate data frame and then return its unique records using the `unique` function. But I am not sure how to get the other 5 columns for those unique records.

m <- read.csv(file="Test.csv", header=T, sep=",", 
              colClasses = c("character","NULL","NULL","NULL","NULL","NULL"))
names <- unique(m, incomparables = FALSE)
Bruno
  • `?merge`, something like `res <- merge(names, m)` will give you all the data associated with the unique names, but will obviously be greater than 50,000 records. Or are you after a specific record for each unique name, for example the first one, or the last one, or some other condition? – tospig Apr 15 '15 at 02:19
  • @Bruno: The other 5 columns will be unique with respect to your Name column (1st column). If you use the `unique` function on your 50,000-record table, it will remove the later duplicate rows, and the result will have fewer than 50K rows. – Ashvin Meena Apr 15 '15 at 09:10
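
To illustrate the distinction raised in the comments, here is a minimal sketch on a made-up data frame. The column names `Name`, `City`, and `Score` and all values are hypothetical, not taken from `Test.csv`. Merging the unique names back onto the data returns every matching row, whereas `duplicated` can be used to keep a single row (first or last) per name.

```r
# Hypothetical toy data standing in for the real CSV
df <- data.frame(
  Name  = c("Ann", "Bob", "Ann", "Cy", "Bob"),
  City  = c("Oslo", "Rome", "Bergen", "Lima", "Rome"),
  Score = c(1, 2, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Unique names as a one-column data frame
unique_names <- unique(df["Name"])

# merge() joins on the shared column "Name", so every original row comes back
all_rows <- merge(unique_names, df)

# Keep only the first row seen for each name
first_rows <- df[!duplicated(df$Name), ]

# Keep only the last row seen for each name
last_rows <- df[!duplicated(df$Name, fromLast = TRUE), ]
```

Which of these is appropriate depends on whether all records per name are wanted or just one representative record.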

1 Answer

Yes, the other columns will be unique with respect to your 1st column. If the same name is repeated with a different entry in at least one of the other 5 columns, that row counts as a unique one.

# Read all six columns; colClasses = "NULL" would drop the other five
m <- read.csv(file = "Test.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
m <- unique(m)             # remove rows duplicated across all columns
Subset <- head(m, 50000)   # first 50,000 rows (fewer if m is shorter)
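
As a small sketch of the whole-row behaviour described above, the example below uses inline CSV text in place of `Test.csv`. The columns `Name`, `Age`, `City` and their values are invented for illustration only.

```r
# Hypothetical CSV content; the real file would be read with read.csv("Test.csv")
csv_text <- "Name,Age,City
Ann,30,Oslo
Ann,30,Oslo
Ann,31,Oslo
Bob,25,Rome"

m <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# unique() on a data frame compares whole rows:
# only the exact duplicate (second row) is dropped
by_row <- unique(m)

# Number of rows kept versus number of distinct names
n_rows  <- nrow(by_row)
n_names <- length(unique(m$Name))
```

Here "Ann" still appears twice in `by_row` because her two remaining rows differ in `Age`, so the row count can exceed the number of distinct names.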

Refer to the following links for a better understanding:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html

Unique on a dataframe with only selected columns

Ashvin Meena