
I have a large dataset I am reading in R. I want to apply the unique() function to it so I can work with it more easily, but when I try to do so, I get this error:

clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb

So I am trying to apply this function part by part by doing this:

clientsmd<-data.frame()
n<-7316738  #Amount of observations in the dataset
t<-0
for(i in 1:200){
  clientsm<-clients[1+(t*round((n/200))):(t+1)*round((n/200)),]
  clientsm<-unique(clientsm)
  clientsmd<-rbind(clientsm)
  t<-(t+1) }

But I get this:

 Error in `[.default`(xj, i) : subscript too large for 32-bit R

I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or any other), but I don't know how to use them for this purpose.

I'd appreciate any kind of guidance, whether it's telling me why my code won't work or showing me how I could take advantage of these packages.

Gotey
  • If `clients` is your whole dataframe, I suppose it has a column with a unique identifier. Say this column is called `id`. It might be worthwhile trying to see if `unique(clients$id)` or preferably `duplicated(clients$id)` works. This also enables you to subset `clients` to get all duplicates, which you can then check further including other columns. – coffeinjunky Mar 14 '16 at 11:56
  • How much RAM do you have, and what's the size of your `data.frame`? It also matters whether you have a 32- or 64-bit operating system. Your problem looks like a simple memory issue; sometimes calling the `gc()` function can help, as can closing R and starting it again, and you may free more RAM by closing other running applications. And don't be scared to get familiar with the `ff` and `ffbase` packages: you can convert your `data.frame` to an `ffdf` like this, `clients_ffdf <- as.ffdf(clients)`, and then work with it practically like a usual `data.frame`. – inscaven Mar 15 '16 at 06:09
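For reference, the chunked loop in the question fails for two reasons: R's `:` operator binds tighter than `*` and `+`, so the subscript expression builds row indices far beyond the table, and `clientsmd <- rbind(clientsm)` overwrites the accumulator instead of appending to it. A corrected base-R sketch of the same idea (the chunk count of 200 is kept from the question, and a small toy table stands in for the real data):

```r
# Toy stand-in for the real table; the chunking logic is what matters here.
set.seed(1)
clients <- data.frame(id = sample(1:50, 500, replace = TRUE))

n <- nrow(clients)
chunks <- 200
size <- ceiling(n / chunks)
clientsmd <- data.frame()
for (t in 0:(chunks - 1)) {
  from <- t * size + 1            # parentheses-free, but arithmetic done first
  to <- min((t + 1) * size, n)    # clamp the last chunk to the table size
  if (from > to) break
  clientsm <- unique(clients[from:to, , drop = FALSE])
  clientsmd <- rbind(clientsmd, clientsm)  # accumulate, not overwrite
}
# Per-chunk unique() still leaves duplicates that span chunk boundaries:
clientsmd <- unique(clientsmd)
```

Note the final `unique()` pass: deduplicating each chunk shrinks the data, but rows repeated across different chunks survive until that last step.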

3 Answers


Is clients a data.frame or a data.table? data.table can handle much larger amounts of data than data.frame.

library(data.table)

clients <- data.table(clients)

clientsUnique <- unique(clients)

or

duplicateIndex <- duplicated(clients)

will give a logical vector marking which rows are duplicates.
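A toy illustration of the two calls above, assuming the data.table package is installed (the column names here are made up):

```r
library(data.table)

# Small example table with repeated rows
clients <- data.table(id = c(1, 2, 2, 3, 3, 3),
                      name = c("a", "b", "b", "c", "c", "c"))

clientsUnique <- unique(clients)       # de-duplicated rows
duplicateIndex <- duplicated(clients)  # TRUE for each repeat of an earlier row
```

`clients[!duplicateIndex]` would give the same result as `unique(clients)`, which is handy if you also want to inspect the duplicates themselves.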

iboboboru

Increase your memory limit as below, then try executing again.

 memory.limit(4000)   ## windows specific command
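A guarded version of the call above: `memory.limit()` exists only on Windows (and was removed in R 4.2), so a sketch that avoids erroring elsewhere might look like this:

```r
# Raise the memory cap to ~4 GB before retrying unique().
# memory.limit() is Windows-only and defunct from R 4.2 onward.
if (.Platform$OS.type == "windows" && getRversion() < "4.2.0") {
  memory.limit(size = 4000)  # limit in Mb
}
```

On a 32-bit R build the addressable limit stays low regardless, so switching to 64-bit R is often the more durable fix.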
CAFEBABE
Sowmya S. Manian

You could use the distinct function from the dplyr package:

df %>% distinct(ID)

where ID is a column that uniquely identifies the rows in your dataframe.
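A small runnable example of this, assuming dplyr is installed; note that `distinct(ID)` alone keeps only the ID column, while adding `.keep_all = TRUE` retains the other columns of the first matching row:

```r
library(dplyr)

# Toy dataframe with a repeated ID
df <- data.frame(ID = c(1, 1, 2, 3, 3),
                 value = c("a", "a", "b", "c", "c"))

unique_df <- df %>% distinct(ID, .keep_all = TRUE)  # one row per ID
```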

Pankaj Kaundal