3

I know that in R for loops should be avoided and vectorized operations should be used instead.

I want to solve this with a for loop and then try to use the apply family, then also in Rcpp.

I load a dataset containing one column of passwords (alphanumeric).

Once loaded (a sample, for speed), I want to create new columns with values (0 or 1) based on conditions such as "contains_lower_chars", "contains_numbers" and so on.

Here is what I tried, but it doesn't work: every column I create ends up with the same values.

library(tidyverse)
set.seed(123)
# load dataset from url, skip the first 16 rows
df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.001) %>% 
  rename(password = V1)

patterns = c("[a-z]","[A-Z]","[0-9]+")

df$has_lower <- 0 
df$has_upper <- 0
df$has_numeric <- 0

for(i in 1:nrow(df)){
    for(j in patterns){
        n <- ifelse(grepl(j, df$password[i]),1,0)
        }
    df$has_lower[i] <- n
    df$has_upper[i] <- n 
    df$has_numeric[i] <- n
}

Output I have in mind is:

password    has_lower  has_upper  has_numeric
Bigmaccas   1          1          0
0127515559  0          0          1
dbqky73p    1          0          1
Dirk Eddelbuettel
chopin_is_the_best
  • What does "doesn't work" mean exactly? Are you getting an error? Some unexpected output? When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Apr 02 '18 at 20:39
  • @MrFlick You can reproduce the whole example running the code above. It contains the link to the URL file to populate the df. I will add the output error (each of the column I create has the same value) – chopin_is_the_best Apr 02 '18 at 20:42
  • But you don't give the desired result. And using things like `sample_frac()` isn't reproducible without setting a seed (plus it relies on `dplyr`, which isn't explicitly mentioned in the code). I don't know why you would assume the columns would be different when you assign the same `n` value to each. – MrFlick Apr 02 '18 at 20:44
  • I'll remove the `[rcpp]` tag as this has nothing to do with Rcpp. – Dirk Eddelbuettel Apr 02 '18 at 23:46

3 Answers

1

We can simplify things if we just name your pattern vector. For example

patterns = c(has_lower="[a-z]",
             has_upper="[A-Z]",
             has_numeric="[0-9]+")

for(pattern in names(patterns)) {
  df[, pattern] = as.numeric(grepl(patterns[pattern], df$password))
}

Basically, we just loop through each of the names, grab the regular expression corresponding to that name, then do the matching and add the column.
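
To try the loop without downloading the full password file, here is a minimal self-contained sketch. The three-row data frame is an assumption made for illustration: it reuses the passwords from the question's desired output rather than the real dataset.

```r
# Hypothetical three-row sample (passwords taken from the question's desired output)
df <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"),
                 stringsAsFactors = FALSE)

# Each name is the column to create; each value is the regex to test
patterns <- c(has_lower   = "[a-z]",
              has_upper   = "[A-Z]",
              has_numeric = "[0-9]+")

for (pattern in names(patterns)) {
  # grepl() is vectorized, so one call handles the whole password column
  df[, pattern] <- as.numeric(grepl(patterns[pattern], df$password))
}

print(df)
```

The printed data frame can be compared against the desired output in the question. Note that only one loop is needed, over the patterns, because `grepl()` already processes every row at once.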

MrFlick
  • It works like a charm. My only question is: what does `df[, i]` mean: adding an i-th column? – chopin_is_the_best Apr 02 '18 at 22:08
  • It either extracts or adds a column. `i` can be a number which will return the `i`th column, or it can be a character/string to return a column with that name. – MrFlick Apr 02 '18 at 22:28
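
The extract-or-add behaviour described in the comment above can be sketched on a small data frame. The data and the column name `n_chars` are hypothetical, chosen only for illustration.

```r
# Hypothetical two-row data frame
df <- data.frame(password = c("Bigmaccas", "dbqky73p"),
                 stringsAsFactors = FALSE)

by_position <- df[, 1]           # extract the first column by its index
by_name     <- df[, "password"]  # extract the same column by its name

# Assigning to a name that does not exist yet adds a new column
df[, "n_chars"] <- nchar(df$password)

print(names(df))
```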
0

First, you need to update has_lower, has_upper and has_numeric inside the j loop; otherwise n only holds the result for the last pattern and is assigned to all three columns. To do so, you need to be able to loop over the column names has_lower, has_upper and has_numeric:

names <- c("has_lower","has_upper","has_numeric")

for(i in 1:nrow(df)){
  for(j in 1:length(patterns)){
    # Use the j-th regex (patterns[j]), not the loop index j itself, as the pattern
    df[i,(names[j])] <- as.numeric(grepl(patterns[j], df$password[i]))
  }
}

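A quicker, more compact alternative uses lapply and the fact that grepl is already vectorized. It relies on data.table's := syntax, so df must be a data.table (for example after calling setDT(df)), and the new columns are logical (TRUE/FALSE) rather than 0/1: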

df[, c("has_lower","has_upper","has_numeric"):=lapply(patterns, function(x) grepl(x,df$password))]
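
As the comments below point out, the `:=` line fails on a plain data frame and yields TRUE/FALSE columns. A minimal sketch that handles both points follows, assuming the data.table package is installed. The three-row table is hypothetical, and wrapping `grepl()` in `as.integer()` is my addition to get 0/1 values.

```r
library(data.table)

# Hypothetical three-row sample; data.table() builds a data.table directly,
# while setDT(df) would convert an existing data frame in place
dt <- data.table(password = c("Bigmaccas", "0127515559", "dbqky73p"))

patterns <- c(has_lower   = "[a-z]",
              has_upper   = "[A-Z]",
              has_numeric = "[0-9]+")

# The parenthesised left-hand side tells := to use the values of
# names(patterns) as the names of the columns to create
dt[, (names(patterns)) := lapply(patterns, function(p) as.integer(grepl(p, password)))]

print(dt)
```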

Note (nothing to do with your question):

I advise you to use the fread function to read your dataset since it is quite large.

df = fread('http://datashaping.com/passwords.txt', header = F, skip = 16)%>%
  sample_frac(.001) %>% 
  rename(password = V1)
Frostic
  • Can you explain (in plain words) what `df[i,(names[j])]` means? I am having difficulty understanding this... Regarding `fread`, I posted a question [here](https://stackoverflow.com/questions/49618214/read-a-random-sample-from-url) if you want to read about it. – chopin_is_the_best Apr 02 '18 at 21:22
  • `df[i,(names[j])]` selects the row `i` for the column named after `names[j]`. The brackets around `names[j]` tells R that it has to use the value of the variable `names[j]` to look for the corresponding column in `df`. There is no column named `names[j]` in `df` but there are columns named `has_lower`, `has_upper` etc... – Frostic Apr 02 '18 at 21:28
  • thanks! very clear. In the second method, I get the error `Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").` – chopin_is_the_best Apr 02 '18 at 22:06
  • That’s my bad I assumed ‘df’ was a data.table because I used ‘fread’ to read your data. I will update it. – Frostic Apr 02 '18 at 22:16
  • I added `setDT` and it works, even tho it gives me back "TRUE/FALSE" instead 0/1 – chopin_is_the_best Apr 02 '18 at 22:49
0

A data frame is above all a list.

So, you can simply do:

df[c("has_lower", "has_upper", "has_numeric")] <- 
  lapply(patterns, function(pattern) grepl(pattern, df$password) + 0)

Use + 0L instead of + 0 if you want integers instead of doubles (I would recommend doing neither and keeping logicals).
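
To see the three storage types side by side, here is a small sketch on a hypothetical vector of passwords (the variable names are made up for illustration):

```r
# Hypothetical sample passwords
passwords <- c("Bigmaccas", "0127515559", "dbqky73p")

as_logical <- grepl("[0-9]+", passwords)  # TRUE/FALSE, what grepl() returns
as_double  <- as_logical + 0              # adding a double coerces to double
as_integer <- as_logical + 0L             # adding an integer coerces to integer

print(c(typeof(as_logical), typeof(as_double), typeof(as_integer)))

# Logicals can still be counted or averaged without conversion
print(sum(as_logical))
```

Keeping logicals loses nothing here, since `sum()` and `mean()` treat TRUE as 1 and FALSE as 0.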

F. Privé
  • I get the error `Error in `:=`(c("has_lower", "has_upper", "has_numeric"), lapply(patterns, : could not find function ":="`. What I am doing wrong? I have loaded `data.table` – chopin_is_the_best Apr 02 '18 at 22:03