
I am trying to perform a string operation on one of the columns in my data frame Test_df, which has close to 5 million records. The objective is to count the occurrences of a character in a string (after replacing the nulls), and I am using str_locate for this.

Since this is a row-wise mutation, I tried using the rowwise() function from dplyr.

Test_df <- Test_df %>%
  rowwise() %>%
  mutate(col1 = replace_na(str_locate(as.character(my_string), "2")[1], 999))

This line took more than 5 hours to execute, which was extremely sub-optimal.

I then tried using the purrr::pmap function to speed up the process a little, as per this Stack Overflow thread, but this did not help speed things up.

Test_DF <- Test_DF %>%
  mutate(col1 = purrr::pmap_dbl(list(Test_DF$my_string),
                                function(a) replace_na(str_locate(a, "2")[1], 999)))

Is there a way to use replace_na and str_locate so that the execution is faster? I need to run this on a monthly basis.
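For context, `str_locate()` is already vectorized over its string input, so the `rowwise()` call (and the per-row function application) can be dropped entirely; a minimal sketch, using made-up sample data since the real Test_df is not shown:

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical sample data; the real Test_df (~5 million rows) is not shown.
Test_df <- tibble(my_string = c("a2b", "abc", NA))

# str_locate() returns a two-column matrix (start, end) for the whole vector
# at once; take the "start" column and replace NAs in a single vectorized pass.
Test_df <- Test_df %>%
  mutate(col1 = replace_na(str_locate(as.character(my_string), "2")[, 1], 999))
```

On a 5-million-row vector this runs in seconds rather than hours, because the work happens in one vectorized call instead of 5 million per-row calls.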

Sumedha Nagpal
    If you want to count the number of characters, you can simply use `nchar()` which is already vectorized to you don't need `rowwise()`. – EmilHvitfeldt Dec 04 '19 at 20:28
  • Use `str_count` to count. You shouldn't need to use `rowwise` with it. If you need more help, please share a little bit of sample data and the desired result. Though, without seeing your input, I'm not sure why you need to use `rowwise` even with `str_locate`... – Gregor Thomas Dec 04 '19 at 20:49
  • @Shakir data table vs dplyr is irrelevant here. There aren't joins or grouped operations. OP is operating on each element of a vector---the only performance question that matters is whether or not that operation can be vectorized. – Gregor Thomas Dec 05 '19 at 02:38
  • Yes... I was over complicating things. Thank you everyone for your wise comments. :) – Sumedha Nagpal Dec 06 '19 at 17:49
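As the comments suggest, if the real goal is to count occurrences of a character, `str_count()` is vectorized and sidesteps `rowwise()` entirely; a short sketch with hypothetical data:

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up example data for illustration.
df <- tibble(my_string = c("2b2", "abc", NA))

# str_count() counts matches across the whole vector in one call;
# replace_na() then handles the NA rows, mirroring the original 999 sentinel.
df <- df %>%
  mutate(n_twos = replace_na(str_count(my_string, "2"), 999))
```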

0 Answers