
I wrote a function to anonymize names in a data frame given a key, but it slows to a crawl once it has anonymized a large number of names, and I don't understand why.

The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is a tweet with 32 columns of data. The names are to be anonymized wherever they show up, so I'd like not to limit the function to looking at only a couple of those 32 columns.

The key is a data frame containing 211,121 pairs of real and fake names; both the real and the fake names are unique within the key. The function slows down immensely after about 100k names are anonymized.

The function looks like the following:

pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    # look up the fake name (column 2 of the key) and substitute it in every column
    df <- as.data.frame(apply(df, 2, function(column) {
      gsub(name, key[key$realNames == name, 2], column)
    }))
  }
  df
}

Is there something obvious here that would cause the slowdown? I'm not at all experienced with optimizing code for speed.

EDIT1:

Here are a few lines from the data frame to be anonymized.

"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"@jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"@abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"@tdesj3 @belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"

Here are a few lines from the key.

"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"

EDIT2:

I've simplified the DF down to only the two columns that would need anonymizing, and this made things much faster, but it still peters out after doing about 155k names.

As requested in the comments, here's the dput() output for the first three lines of the DF that's to be anonymized.

structure(list(
  utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
  texte = c("@EmilyIsPro ik lol", "@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "@NikkiErica21 lol yes _Ã\231։")
  ),
  row.names = c(NA, 3L),
  class = "data.frame")

And here's the dput() for the first three lines of the key.

structure(list(
  realNames = c("________", "____________aho", "___________ass"),
  fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
  ),
  row.names = c(NA, 3L),
  class = "data.frame")
joshisanonymous
  • Please share a small, reproducible (copy/pasteable!) sample input. – Gregor Thomas Apr 20 '21 at 16:28
  • It's hard to tell without seeing your data structures, but you're doing a lot of conversion inside the loop. `apply` converts data frames to matrices - you probably shouldn't be using it at all. `as.data.frame` converts back to a data frame. Do you really need to convert your object to a matrix and then back to a data frame in every single iteration? If you can move those operations outside the loop (convert everything once), it will go faster. And once we see the input data, you may not need the conversions at all. – Gregor Thomas Apr 20 '21 at 16:32
  • Also, if you are not using regex special characters, using the `fixed = TRUE` argument will make `gsub()` much faster. And there may be vectorization options so you don't need the loop at all (see the sketch after these comments)... – Gregor Thomas Apr 20 '21 at 16:32
  • Could you please share the data with `dput()` so all the class and structure information is included? `dput(df[1:3, ])` and `dput(key[1:3, ])` would be great. – Gregor Thomas Apr 20 '21 at 18:42
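
A minimal sketch of what the comments above suggest - the matrix conversion hoisted out of the loop and `fixed = TRUE` matching - using the question's `df` and `key`; the function name `pseudonymize_fixed` is illustrative, not from the thread:

pseudonymize_fixed <- function(df, key) {
  mat <- as.matrix(df)  # convert to a character matrix once, outside the loop
  for (i in seq_len(nrow(key))) {
    # fixed = TRUE matches plain strings, skipping regex compilation entirely
    mat[] <- gsub(key$realNames[i], key$fakeNames[i], mat, fixed = TRUE)
  }
  as.data.frame(mat, stringsAsFactors = FALSE)  # convert back once at the end
}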

1 Answer


Acting on the data as a vector rather than a data.frame will be much more efficient. I ran into some encoding issues, so I converted the text to UTF-8 using iconv; if the names contain non-ASCII characters this would need some handling.
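
For reproducibility, the `df1` used below is the three-row data frame from the question's `dput()` output:

df1 <- structure(list(
    utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
    texte = c("@EmilyIsPro ik lol", 
        "@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", 
        "@NikkiErica21 lol yes _Ã\231։")
    ),
    row.names = c(NA, 3L),
    class = "data.frame")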

key1 <- data.frame(
    realNames = c("________", "____________aho", "___________ass", 
        "___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
    fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker", 
        "A_A", "B_B", "C_C", "D_D", "E_E"),
    stringsAsFactors = FALSE
)

pseudonymize1 <- function(df, key) {
    mat <- as.matrix(df)
    dims <- attr(mat, which = "dim")  # remember the dimensions
    cnam <- colnames(df)              # ... and the column names
    # flatten to a plain character vector, re-encoded as UTF-8
    vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
    # one fixed-string substitution per real/fake name pair
    for (name in split(key, f = seq_len(nrow(key)))) {
        vec <- gsub(
            vec, 
            pattern = name$realNames, 
            replacement = name$fakeNames, 
            fixed = TRUE)
    }
    # restore the dimensions and rebuild the data frame
    mat <- vec
    attr(mat, which = "dim") <- dims
    df <- as.data.frame(mat, stringsAsFactors = FALSE)
    colnames(df) <- cnam
    df
}
pseudonymize1(df1, key1)
# utilisateur                                                                       texte
# 1         A_A                                                                 @D_D ik lol
# 2         B_B @C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3         B_B                               @C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢

library(microbenchmark)    
microbenchmark(
    pseudonymize(df1, key1),
    pseudonymize1(df1, key1)
)
# Unit: microseconds
#                     expr      min        lq     mean   median        uq      max neval cld
#  pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371   100   b
# pseudonymize1(df1, key1)  287.683  306.1905  333.678  314.950  339.8705  497.301   100  a 

A concern I have with 155k names is that when searching for each name as a substring you will find names contained in other names. This could be a true name within another true name (e.g. Emily within EmilyIsPro), or a true name within a previously replaced fake name. You will want to test for this, and consider using a random hash instead of a name-like fake name.
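
A rough sketch of such a test, assuming the `key1` layout above: substitute longer real names first to avoid the Emily/EmilyIsPro case, and run a containment check to flag real names hiding inside fake names.

# Substitute longer real names first, so "EmilyIsPro" is replaced before "Emily"
key1 <- key1[order(-nchar(key1$realNames)), ]

# Flag real names that occur inside any fake name; these could be clobbered
# by a later substitution after that fake name has been inserted
inside_fake <- vapply(
    key1$realNames,
    function(nm) any(grepl(nm, key1$fakeNames, fixed = TRUE)),
    logical(1)
)
key1$realNames[inside_fake]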

CSJCampbell