0

In Stata the commands I use:

bysort ID : egen numberhead=total(relationship==1)
assert numberhead==1

#household with multiple number of head
list ID relationship if numberhead>=2

#for household without a head 
list ID relationship if numberhead<1

How can I achieve same in R?

structure(list(ID = c("SS/CR/BIA/ABEYOONG/1/0001/05",     "SS/CR/BIA/ABEYOONG/1/0001/03",     
"SS/CR/BIA/ABEYOONG/1/0001/04",     "SS/CR/BIA/ABEYOONG/1/0001/02", 
"SS/CR/BIA/ABEYOONG/1/0001/01",     "SS/CR/BIA/ABEYOONG/1/0002/01", 
"SS/CR/BIA/ABEYOONG/1/0002/04",     "SS/CR/BIA/ABEYOONG/1/0002/03", 
"SS/CR/BIA/ABEYOONG/1/0002/05",     "SS/CR/BIA/ABEYOONG/1/0002/02", 
"SS/CR/BIA/ABEYOONG/1/0003/01",     "SS/CR/BIA/ABEYOONG/1/0003/03", 
"SS/CR/BIA/ABEYOONG/1/0003/05",     "SS/CR/BIA/ABEYOONG/1/0003/04", 
"SS/CR/BIA/ABEYOONG/1/0003/02",     "SS/CR/BIA/ABEYOONG/1/0004/02", 
"SS/CR/BIA/ABEYOONG/1/0004/07",     "SS/CR/BIA/ABEYOONG/1/0004/06", 
"SS/CR/BIA/ABEYOONG/1/0004/05",     "SS/CR/BIA/ABEYOONG/1/0004/04", 
"SS/CR/BIA/ABEYOONG/1/0004/03",     "SS/CR/BIA/ABEYOONG/1/0004/01", 
"SS/CR/BIA/ABEYOONG/1/0005/01"),     relationship = c(3, 3, 3, 2, 
1, 1, 10, 3, 11, 2, 1, 3, 3, 3, 3, 3, 11, 3, 3, 3, 3, 1, 1)),     row.names = c(NA, 
-23L), class = c("tbl_df", "tbl", "data.frame"))
Nick Cox
  • 35,529
  • 6
  • 31
  • 47
Aquila
  • 1
  • 2
  • I removed the SQL tags because the question has nothing to do with SQL. – Gordon Linoff Aug 23 '21 at 11:39
  • I am newbie in R programming but have using Stata and would like to learn more on R, can anyone help me understand how to get the equivalent of the stata command line above in R. Thank you – Aquila Aug 23 '21 at 11:40
  • 4
    Welcome to SO, Aquila! Questions on SO (and especially the [tag:r} tag, perhaps) really benefit from being a bit more self-contained, to the point of being reproducible. For that, it really helps to have sample data at a minimum; please see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for some good discussion on how to best do that. (Hint: look for the use of `dput(.)` in those suggestions.) Thanks! – r2evans Aug 23 '21 at 11:47
  • 4
    The `dplyr` library will have everything you need if the data is in a data frame. You can `arrange(ID)`. You can `filter(numberhead == 1)`. You can link both actions together with the dpylr pipe, `%>%`. However, if you'd like a comprehensive answer, your best bet is to follow @r2evans advice. – Kat Aug 23 '21 at 13:21
  • In your example data, there is no `numberhead`. Could you explain what you are trying to do and how you expected output should look like based on the data you presented in your question? – Martin Gal Aug 23 '21 at 16:59
  • 1
    @r2evans, I have provided a sample data to my question as you have advised. Thanks for the advice. – Aquila Aug 23 '21 at 17:01
  • @Kat, I would be happy if one can provide the same equivalent of stata command in the above in R , which gives me the results I wanted. I have provided a sample data. Thank you – Aquila Aug 23 '21 at 17:07
  • I am a Stata person, which I mention only because I don't want anyone to guess that I am negative about it. Wanting a translation from Stata to R is reasonable from your point of view and to anyone fluent in both Stata and R. From every other R expert's point of view it is probably better to show R data and explain directly what you want to calculate. – Nick Cox Aug 23 '21 at 18:02
  • @Martin Gal Thank you, the numhead is generated as new varaible in Stata in order is used to validate a condition. I want to find out if household has a multiple head or not. No houshold should have more than 1 head and also no household should lack one. Relationship variable has codes 1= head, 2=spouse 3= child. In stata, based on the data, no household with either multiple head or lack one. Which is what I want. I want to achieve the same result in R, how can that be achieved?. – Aquila Aug 25 '21 at 19:06

1 Answers1

1

Alright, I hope this is what you were looking for Aquila:

# libraries
library(tidyverse)

# collect data
df <-
  structure(
    list(
      ID = c(
        "SS/CR/BIA/ABEYOONG/1/0001/05",
        "SS/CR/BIA/ABEYOONG/1/0001/03",
        "SS/CR/BIA/ABEYOONG/1/0001/04",
        "SS/CR/BIA/ABEYOONG/1/0001/02",
        "SS/CR/BIA/ABEYOONG/1/0001/01",
        "SS/CR/BIA/ABEYOONG/1/0002/01",
        "SS/CR/BIA/ABEYOONG/1/0002/04",
        "SS/CR/BIA/ABEYOONG/1/0002/03",
        "SS/CR/BIA/ABEYOONG/1/0002/05",
        "SS/CR/BIA/ABEYOONG/1/0002/02",
        "SS/CR/BIA/ABEYOONG/1/0003/01",
        "SS/CR/BIA/ABEYOONG/1/0003/03",
        "SS/CR/BIA/ABEYOONG/1/0003/05",
        "SS/CR/BIA/ABEYOONG/1/0003/04",
        "SS/CR/BIA/ABEYOONG/1/0003/02",
        "SS/CR/BIA/ABEYOONG/1/0004/02",
        "SS/CR/BIA/ABEYOONG/1/0004/07",
        "SS/CR/BIA/ABEYOONG/1/0004/06",
        "SS/CR/BIA/ABEYOONG/1/0004/05",
        "SS/CR/BIA/ABEYOONG/1/0004/04",
        "SS/CR/BIA/ABEYOONG/1/0004/03",
        "SS/CR/BIA/ABEYOONG/1/0004/01",
        "SS/CR/BIA/ABEYOONG/1/0005/01"
      ),
      relationship = c(3, 3, 3, 2,
                       1, 1, 10, 3, 11, 2, 1, 3, 3, 3, 3, 3, 11, 3, 3, 3, 3, 1, 1)
    ),
    row.names = c(NA,-23L),
    class = c("tbl_df", "tbl", "data.frame")
  )

#--------- by sort ID : egen numberhead = total(relationship == 1) ----------
# using data object named df
df %>% 
  group_by(ID) %>%              # bysort
  filter(relationship == 1) %>% # to only see these fields
  summarise(numberhead = n())   # create a new variable

# # A tibble: 5 × 2
#   ID                           numberhead
#   <chr>                             <int>
# 1 SS/CR/BIA/ABEYOONG/1/0001/01          1
# 2 SS/CR/BIA/ABEYOONG/1/0002/01          1
# 3 SS/CR/BIA/ABEYOONG/1/0003/01          1
# 4 SS/CR/BIA/ABEYOONG/1/0004/01          1
# 5 SS/CR/BIA/ABEYOONG/1/0005/01          1 

# of the individual IDs, that have relationship == 1
# there is one observation of each

# I don't think there is an equivalent to assert
# you could validate that there are 5 observations for relationship == 1
# to validate this result, though

df %>% 
  filter(relationship == 1) %>% 
  nrow()                         # number of rows
# [1] 5 



#--------- List ID relationship if numberhead >= 2 ----------
# this one is simpler
df %>% 
  filter(relationship >=2) 

# # A tibble: 18 × 2
#    ID                           relationship
#    <chr>                               <dbl>
#  1 SS/CR/BIA/ABEYOONG/1/0001/05            3
#  2 SS/CR/BIA/ABEYOONG/1/0001/03            3
#  3 SS/CR/BIA/ABEYOONG/1/0001/04            3
#  4 SS/CR/BIA/ABEYOONG/1/0001/02            2
#  5 SS/CR/BIA/ABEYOONG/1/0002/04           10
#  6 SS/CR/BIA/ABEYOONG/1/0002/03            3
#  7 SS/CR/BIA/ABEYOONG/1/0002/05           11
#  8 SS/CR/BIA/ABEYOONG/1/0002/02            2
#  9 SS/CR/BIA/ABEYOONG/1/0003/03            3
# 10 SS/CR/BIA/ABEYOONG/1/0003/05            3
# 11 SS/CR/BIA/ABEYOONG/1/0003/04            3
# 12 SS/CR/BIA/ABEYOONG/1/0003/02            3
# 13 SS/CR/BIA/ABEYOONG/1/0004/02            3
# 14 SS/CR/BIA/ABEYOONG/1/0004/07           11
# 15 SS/CR/BIA/ABEYOONG/1/0004/06            3
# 16 SS/CR/BIA/ABEYOONG/1/0004/05            3
# 17 SS/CR/BIA/ABEYOONG/1/0004/04            3
# 18 SS/CR/BIA/ABEYOONG/1/0004/03            3 

# If you want to see only the unique IDs 
df %>% filter(relationship >= 2) %>% 
  select(ID) %>% 
  distinct()
# however every ID is distinct in this data, 
# so the results won't look different

#--------- List ID relationship if numberhead < 1 ----------
df %>% 
  filter(relationship < 1) 

# # A tibble: 0 × 2
# # … with 2 variables: ID <chr>, relationship <dbl> 

# no results

#--------- see it all at one time? ----------

df %>% 
  mutate(relates = cut(relationship, 
                       c(0, 1, max(relationship)))) %>% 
  group_by(relates,ID) %>% 
  summarise(n()) %>% 
  print(n = nrow(df))    # when you have a tbl_df, 
                         # you get pretty print in the console, 
                         # this call will let you see it all

# # A tibble: 23 × 3
# # Groups:   relates [2]
#    relates ID                           `n()`
#    <fct>   <chr>                        <int>
#  1 (0,1]   SS/CR/BIA/ABEYOONG/1/0001/01     1
#  2 (0,1]   SS/CR/BIA/ABEYOONG/1/0002/01     1
#  3 (0,1]   SS/CR/BIA/ABEYOONG/1/0003/01     1
#  4 (0,1]   SS/CR/BIA/ABEYOONG/1/0004/01     1
#  5 (0,1]   SS/CR/BIA/ABEYOONG/1/0005/01     1
#  6 (1,11]  SS/CR/BIA/ABEYOONG/1/0001/02     1
#  7 (1,11]  SS/CR/BIA/ABEYOONG/1/0001/03     1
#  8 (1,11]  SS/CR/BIA/ABEYOONG/1/0001/04     1
#  9 (1,11]  SS/CR/BIA/ABEYOONG/1/0001/05     1
# 10 (1,11]  SS/CR/BIA/ABEYOONG/1/0002/02     1
# 11 (1,11]  SS/CR/BIA/ABEYOONG/1/0002/03     1
# 12 (1,11]  SS/CR/BIA/ABEYOONG/1/0002/04     1
# 13 (1,11]  SS/CR/BIA/ABEYOONG/1/0002/05     1
# 14 (1,11]  SS/CR/BIA/ABEYOONG/1/0003/02     1
# 15 (1,11]  SS/CR/BIA/ABEYOONG/1/0003/03     1
# 16 (1,11]  SS/CR/BIA/ABEYOONG/1/0003/04     1
# 17 (1,11]  SS/CR/BIA/ABEYOONG/1/0003/05     1
# 18 (1,11]  SS/CR/BIA/ABEYOONG/1/0004/02     1
# 19 (1,11]  SS/CR/BIA/ABEYOONG/1/0004/03     1
# 20 (1,11]  SS/CR/BIA/ABEYOONG/1/0004/04     1
# 21 (1,11]  SS/CR/BIA/ABEYOONG/1/0004/05     1
# 22 (1,11]  SS/CR/BIA/ABEYOONG/1/0004/06     1
# 23 (1,11]  SS/CR/BIA/ABEYOONG/1/0004/07     1

Using the data you have provided and adding how to pull that data directly into R. Note that I assume ID is what is column 1 and that what you are calling relationship is the column hhsize.

For collecting the data, you can pull it directly from your personal computer drive or directly from the web.

library(openxlsx)

# from your computer
df2 <- read.xlsx("/path/in/you/computer/file.xlsx") 
# if there was more than one sheet, you would designate which sheet 

# from the web
# for dropbox, look in the path for "d1=0"
# you have to change that to "d1=1" for a direct download
df3 <- read.xlsx("https://www.dropbox.com/scl/fi/73dw92bpcjio3m1k0w5vv/Round-11th-19-08-2020.xlsx?dl=1&rlkey=2xxtyge3rppi0aikkl8nlt6oc")

If you really wanted to rename the columns you can do that this way:

names(df2)[1] <- "ID"

Is this what you are looking for?

#----- perhaps looking for this ------
df3[,c(1,17)] %>%         # only look at IDs and household size
  distinct() %>%          # ignore duplicates, when both fields match
  mutate(relates = cut(hhsize,      # add factor for ranges
                       c(0, 1, 2,
                         max(hhsize)),
                       include.lowest = T)) %>% 
  group_by(relates) %>%    # only group by household size ranges
  summarise(count = n())   # show the count per case
# # A tibble: 3 × 2
#   relates count
#   <fct>   <int>
# 1 [0,1]     505
# 2 (1,2]    1736
# 3 (2,25]  15771 
Kat
  • 15,669
  • 3
  • 18
  • 51
  • thanks for the great work, however, in stata, the assert command ties case 1(household with multiple number of head) and case 2(for household without a head) together. But from your submission in R, I see case 1 and case 2 beign executed independently. In stata, after execution, case 1 and case 2 result is 0 or none, based on the condition of assert. I expect to see same result in R output. The data is categorical with households in roaster, the code 1 in relationship variable stands for head of that household, 2= spouse, 3=child. There cannot be more than 1 head or without head. – Aquila Aug 24 '21 at 06:49
  • @Aquila, I'm not sure what you mean in terms of this data. The assert command does not create a whole new data field. (Well, unless you code it that way.) You have described your factor field of 1, 2, and 3, but that field is not in your data. The assert function is for validations. The way you called it would only validate whether the results were true or false. Perhaps you could show the results you would like as an image from Stata? – Kat Aug 24 '21 at 19:07
  • You are very right, I actually meant to say the assert is for validation, forgive my language. Please kindly see the attached result image from Stata . do "C:\Users\fdasb\AppData\Local\Temp\STD3bd0_000000.tmp" . bysort ID: egen numhead=total(RELATIONSHIP==1) . assert numhead==1 . end of do-file . do "C:\Users\fdasb\AppData\Local\Temp\STD3bd0_000000.tmp" . * Households without Head . list ID RELATIONSHIP if numhead<1 . * Households with multiple numbers of Heads . list ID MNO RELATIONSHIP if numhead>=2 . end of do-file – Aquila Aug 25 '21 at 10:07
  • Below is a sample of data where there were no head bysort HHNSRRNO: egen numhead=total(RELATIONSHIP==1) . assert numhead==1 228 contradictions in 18,064 observations assertion is false * Households without Head . list HHNSRRNO MNO RELATIONSHIP AGEY AGEM if numhead<1 HHNSRRNO MNO RELATI~P AGEY AGEM 756. | NE/GO/AKK/GARIN ALHAJI KAWU/1/0037 3 3 3 . | 4186. | NE/GO/AKK/KAYEL BAGA B'/1/0029 3 3 13 . | 4187. | NE/GO/AKK/KAYEL BAGA B'/1/0029 2 3 9 . | – Aquila Aug 25 '21 at 10:28
  • . * Households with multiple numbers of Heads . list HHNSRRNO MNO RELATIONSHIP AGEY AGEM if numhead>=2 sample of household with multiple heass +--------------------------------------------------------------------+ | HHNSRRNO MNO RELATI~P AGEY AGEM | |--------------------------------------------------------------------| 611. | NE/GO/AKK/GADAWO II/1/0029 3 4 26 . | 612. | NE/GO/AKK/GADAWO II/1/0029 1 1 62 . | – Aquila Aug 25 '21 at 10:29
  • Am wondering if you could provide me with your email so I could send a copy of the kind of data am working with on Stata. There is a specific task I use Stata to wrangle data and would like to simply replicate those same command lines in R. Task like 1. create a universal R script for working on my kind of Data 2. Check for duplicates 3. check for None & Multiple Household Head(s) 4. Split ref no. and check for consistency with variables within the data 5. compare for the spelling of names with other similar variables within the data 6. check for invalid phone numbers 7. split date – Aquila Aug 25 '21 at 11:51
  • I don't think that there is a safe way to share personal contact information on SO. However, you could put it in something like Google Documents, Box, or Dropbox and share the link. You could tell me where you collected the data (assuming it is not sensitive data). Additionally, your images won't show in the comments. You would have to edit your original post. – Kat Aug 25 '21 at 13:18