0

I have a dataframe like as shown below

test_df <- data.frame("SN" = c("ABC123","ABC123","ABC123","MNO098","MNO098","MNO098"), 
                      "code" = c("ABC1111","DEF222","GHI133","","MNO1123","MNO567"), 
                      "d_time" = c("2220-08-27","2220-05-27","2220-02-27","2220-11-27","2220-02-27",""))

enter image description here

I am trying to do 2 things

1) create 2 new columns (p_id,v_id) by stripping alphabets from columns SN and code and retain only 9 digits

2) create a lag column (p_vid) based on v_id for each person sorted based on his/her d_time

t_df <- test_df %>% group_by(SN)
t_df %>% arrange((d_time), .by_group = TRUE) ->> sorted_df #sorted based on d_time
transform_ids = function(DF){  # this function is to create person and visit_occurrence ids
  DF %>% 
    mutate(p_id = as.integer(str_remove_all(.$SN,"[a-z]|[A-Z]") %>%   #retaining only the numeric part
                                    str_sub(1,9))) %>%
    mutate(v_id = as.integer(str_remove_all(.$code,"[a-z]|[A-Z]") %>% 
                                              str_sub(1,9))) %>%
    group_by(p_id) %>%
    mutate(pre_vid = lag(v_id)) %>%
    ungroup
}
transform_ids(sorted_df)

But when I do this I encounter the below error

Error in View : Column p_id must be length 3 (the group size) or one, not 6 Error: Column p_id must be length 3 (the group size) or one, not 6 In addition: Warning message: In view(transform_ids(t_df)) : Show Traceback Rerun with Debug Error: Column p_id must be length 3 (the group size) or one, not 6

I expect my output to be like as shown below. Basically I am trying to link each v_id of a person to his previous visit which is p_vid

enter image description here

The Great
  • 7,215
  • 7
  • 40
  • 128

1 Answers1

1

To generate the p_id and v_id columns, just use sub:

t_df$p_id <- gsub("[A-Z]+", "", t_df$SN)
t_df$v_id <- gsub("[A-Z]+", "", t_df$code)

For the p_vid column, use lag() from the dplyr package:

t_df %>%
group_by(p_id) %>%
mutate(p_vid = lag(v_id, order_by=d_time, default=0))

The output from the above actually gives you a tibble. If you want a data frame, just use:

t_df <- as.data.frame(t_df)

Output:

  SN     code    d_time     p_id  v_id  p_vid
  <fct>  <fct>   <fct>      <chr> <chr> <chr>
1 ABC123 ABC1111 2220-08-27 123   1111  222  
2 ABC123 DEF222  2220-05-27 123   222   133  
3 ABC123 GHI133  2220-02-27 123   133   0    
4 MNO098 ""      2220-11-27 098   ""    1123 
5 MNO098 MNO1123 2220-02-27 098   1123  567  
6 MNO098 MNO567  ""         098   567   0    
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Don't we have to ungroup it to get the output as a dataframe? I would like to pass the processed dataframe to another functiom – The Great Nov 01 '19 at 05:01
  • In addition, can you see my updated expected output? – The Great Nov 01 '19 at 05:02
  • Your code also does the same thing/produces same output. Right? – The Great Nov 01 '19 at 05:07
  • I take it back, your expected output looks correct to me now. – Tim Biegeleisen Nov 01 '19 at 05:08
  • Hi @Tim Biegeleisen - How to restrict the number of characters after strip to just 9. If you see in my code, I have used `0-9`. Similarly can you let me know how to do it in gsub? – The Great Nov 01 '19 at 05:53
  • Is this the elegant way to do `substr(gsub("[A-Z|a-z]+{0,1}", "", t_df$SN),0,9)`? – The Great Nov 01 '19 at 05:58
  • Explain what you are trying to do (I don't want to venture a guess). – Tim Biegeleisen Nov 01 '19 at 06:06
  • Let's say instead of extracting all digits (after strip), I would like to extract/retain only the first 2 digits. In this case, the output for `p_id` for first 3 rows would only be `12` and for next 3 rows, it would just be `09` – The Great Nov 01 '19 at 06:08
  • @SSMK You could actually handle this using a single call to `sub`, e.g. `sub("^[A-Z]+(\\d{2}).*$", "\\1", "ABC1111")` which outputs `"11"` – Tim Biegeleisen Nov 01 '19 at 06:14
  • May I know what does `*$` and `\\1` indicate here? In my case numbers can be between alaphabets or at the start. All I know is string will be a mix of numbers and alphabets – The Great Nov 01 '19 at 06:21
  • Then in this case you should remove all letters and then take a substring from the start of just 2 characters. Do not use my comment above in this case. `substr(gsub("[A-Z]+", "", "ABC1DEF1"), 1, 2)` – Tim Biegeleisen Nov 01 '19 at 06:22
  • Thank you. last question. Don't we have to ungroup? When do we use `ungroup`? – The Great Nov 01 '19 at 07:23
  • You don't need an ungroup operation here, because the output still contains all the originated unaggregated rows of the input data frame. – Tim Biegeleisen Nov 01 '19 at 07:29
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/201706/discussion-between-ssmk-and-tim-biegeleisen). – The Great Nov 01 '19 at 07:32
  • Hi, can help me with this? https://stackoverflow.com/questions/61200536/r-unable-to-install-r-packages-cannot-open-the-connection – The Great Apr 14 '20 at 04:15