
I have a large data frame (~4.5 million records). Several of the columns have been anonymised by hashing, and I don't have the key, but I'd like to renumber them to something more legible to aid analysis.

To this end, for example, I've deduced that 'campaignID' has 161 unique elements over the 4.5 million records, and have created a vector to hold these. I've then tried writing a FOR/IF loop to search through the full dataset using the unique element vector: for each value of 'campaignID', it is checked against the unique element vector, and when it finds a match, it returns the index value of the unique element vector as the new campaign ID.

campaigns_length <- length(unique_campaign)
dataset_length <- length(dataset$campaignId)

for (i in 1:dataset_length) {
  for (j in 1:campaigns_length) {
    if (dataset$campaignId[[i]] == unique_campaign[[j]]) {
      dataset$campaignId[[i]] <- j
    }
  }
}

The problem, of course, is that while it works, it takes an enormously long time - I had to stop it after 12 hours! Can anyone think of a better approach that's much, much quicker and computationally less expensive?
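For reference, a minimal made-up version of the setup (the hashed IDs below are invented purely for illustration, not real values from the data):

```r
# Small stand-in for the real data: fake 8-character hashes as campaign IDs
set.seed(1)
hashes <- replicate(5, paste(sample(c(letters, 0:9), 8, replace = TRUE),
                             collapse = ""))
dataset <- data.frame(campaignId = sample(hashes, 20, replace = TRUE),
                      stringsAsFactors = FALSE)
unique_campaign <- unique(dataset$campaignId)
```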

Brian O'Halloran
  • Try the dplyr package; here is a video tutorial https://www.youtube.com/watch?v=wki0BqlztCo – Koundy Jul 11 '14 at 08:17
  • tip: you get more answers if you provide a [reproducible example](http://adv-r.had.co.nz/Reproducibility.html) - i.e. you should include example data. – lukeA Jul 11 '14 at 08:49

3 Answers


You could use match.

dataset$campaignId <- match(dataset$campaignId, unique_campaign)

See Is there an R function for finding the index of an element in a vector?
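A toy illustration of what match does here, with invented hash strings (not real values from the question's data): it replaces the whole double loop with one vectorised lookup.

```r
# match(x, table) returns, for each element of x, its position in table
unique_campaign <- c("f3a1", "9bc2", "77de")            # made-up hashed IDs
campaignId     <- c("9bc2", "f3a1", "9bc2", "77de")
match(campaignId, unique_campaign)
# -> 2 1 2 3: each hash is replaced by its index in unique_campaign
```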

James Trimble

You might benefit from using the data.table package in this case:

library(data.table)

n = 10000000

unique_campaign = sample(1:10000, 169)

dataset = data.table(
    campaignId = sample(unique_campaign, n, TRUE),
    profit = round(runif(n, 100, 1000))
)
dataset[, campaignId := match(campaignId, unique_campaign)]

This example, with 10 million rows, takes only a few seconds to run.

Johan

You could avoid the inner loop with a dictionary-like structure:

id_dict = list()
for (id in seq_along(unique_campaign)) {
    id_dict[[ unique_campaign[[id]] ]] = id
}

for (i in 1:dataset_length) {
    dataset$campaignId[[i]] = id_dict[[ dataset$campaignId[[i]] ]]
}

As pointed out in this post, lists do not have O(1) access, so this will not divide the time required by 161, but by a smaller factor depending on the distribution of ids in your list.
Also, the main reason your code is so slow is that you are using these inefficient lists (dataset$campaignId[[i]] alone can take a lot of time if i is big). Take a look at the hash package, which provides O(1) access to elements (see also this thread on hashed structures in R).
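As a sketch of the same idea with true O(1) lookup, base R environments are hashed under the hood (the hash package builds on them); the IDs below are invented for illustration:

```r
# Build a hashed lookup: hash string -> integer index
unique_campaign <- c("f3a1", "9bc2", "77de")   # made-up hashed IDs
id_dict <- new.env(hash = TRUE)
for (j in seq_along(unique_campaign)) {
  assign(unique_campaign[j], j, envir = id_dict)
}

# Look up each observed ID in the hashed environment
ids <- c("9bc2", "f3a1", "77de")
vapply(ids, function(x) get(x, envir = id_dict), integer(1))
# -> a vector of indices: 2 1 3 (named by the original hashes)
```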

Math