I have a Reddit dataset where each row represents a single post, along with the username of its author. Because this is Reddit data, the number of posts per username varies a lot, depending on how active a given user is. I am trying to create a unique id for each username. My data are structured as follows:

dput(df[1:5,c(2,3)])

output:

structure(list(date = structure(c(15149, 15150, 15150, 15150, 
15150), class = "Date"), username = c("تتطور", "عاطله فقط", 
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-5L), groups = structure(list(username = c("الواعده", 
"بشروني بوظيفة", "تتطور", "عاطله فقط", 
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))

I ran the following code, in which I tried to replicate the code here

The code runs without errors, but it does not create a unique id per username.

# create an ID per observation
df <- df %>%
  group_by(username) %>%
  mutate(id = row_number()) %>%
  relocate(id)

Print data example with specific columns

dput(df[1:10,c(1,4)])

output:

structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L), 
    username = c("تتطور", "عاطله فقط", "قصه ألم", 
    "بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم", 
    "مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
    )), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده", 
"بشروني بوظيفة", "تتطور", "عاطله فقط", 
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه", 
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))

In Stata, I would do this as follows:

// create an id variable per username
egen id = group(username)
maldini1990
  • If you want all "ids" to be unique, why not just `df$id <- seq_len(nrow(df))`? (aka `df %>% mutate(id = row_number())` without `group_by`.) – r2evans Feb 16 '23 at 15:22
  • But if each row represents a different reddit post, this would create a unique id per post, not username, correct? However, I am interested in having a unique id at the username-level. – maldini1990 Feb 16 '23 at 15:24
  • Does this answer your question? [R - Group by variable and then assign a unique ID](https://stackoverflow.com/questions/39650511/r-group-by-variable-and-then-assign-a-unique-id) – I_O Feb 16 '23 at 15:28
  • `df %>% group_by(username) %>% mutate(id = row_number())` guarantees that for each username in this dataframe, they have all-unique ids. Are you talking about having to maintain uniqueness so that when you get new data the future rows are unique from these? That sounds like a hash across username, tweet content, and perhaps tweet-time. – r2evans Feb 16 '23 at 15:30
  • Thanks, this is essentially the code that I used in my post, and again, while the code works w/out errors, it's not really creating a unique id by username. I wonder if this may occur because some usernames are in Arabic, so R is unable to differentiate between them. See my updated data example. – maldini1990 Feb 16 '23 at 15:38

1 Answer

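That use of group_by does not do what you want: inside a grouped mutate, row_number() counts the rows within each username, so every user's first post gets 1. If you want an id like the one your Stata code creates with egen, you may want to try this: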

df$id = as.integer(factor(df$username)) 

This produced the same ids as Stata's

egen id = group(username)
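
To see the difference on something small, here is a minimal sketch on a made-up data frame (`toy` and its usernames are hypothetical, not your data). It puts the per-username counter from the question next to two ways of getting one id per username: the factor codes used above and `dplyr::cur_group_id()`. Both number the usernames in sorted order, but the sort rules can depend on locale and dplyr version, so for non-ASCII names the two may not assign the same numbers.

```r
library(dplyr)

# Hypothetical toy data: one row per post, usernames repeat
toy <- data.frame(
  username = c("carol", "alice", "bob", "alice", "carol", "carol"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  # one id per distinct username (codes of the sorted factor levels)
  mutate(id_factor = as.integer(factor(username))) %>%
  group_by(username) %>%
  mutate(
    post_no  = row_number(),    # counter within each username, as in the question
    id_group = cur_group_id()   # index of the current group, constant per username
  ) %>%
  ungroup()

print(toy)
```

In this sketch `post_no` restarts at 1 for every username, while `id_factor` and `id_group` stay constant across all rows of the same username, which is the behaviour of `egen id = group(username)`.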

Just FYI, I also tried dplyr::consecutive_id():

 df %>% mutate(
   id_dplyr = dplyr::consecutive_id(username)
   )

but I was unable to reproduce the Stata results with your example.
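
A likely reason, as far as I understand `consecutive_id()`: it starts a new id every time the value differs from the previous row, so it labels runs rather than distinct values, and a username that reappears later gets a new id. A minimal sketch with a made-up vector (`users` is hypothetical) shows this next to a base R id that numbers usernames by first appearance:

```r
library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0

# Hypothetical usernames; "alice" and "bob" both reappear after a gap
users <- c("alice", "bob", "alice", "alice", "bob")

# New id whenever the value changes from the previous row
run_id <- consecutive_id(users)

# One id per distinct value, numbered in order of first appearance
first_seen_id <- match(users, unique(users))

print(data.frame(users, run_id, first_seen_id))
```

Here `run_id` gives "alice" two different ids, while `first_seen_id` keeps one id per username. Unlike the factor approach it numbers usernames by first appearance rather than in sorted order, so the numbers themselves need not match Stata's.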

Zhiqiang Wang