1

I have a dataset that contains the ZIP code, as well as 2 indicator variables denoting the race and ethnicity of the individual, as well as the count for each combination. The data can be accessed as follows:

Data1990 <- read.table("https://www2.census.gov/programs-surveys/popest/tables/1990-2000/intercensal/st-co/stch-icen1990.txt")
names(Data1990)<-c("Year","ZIP","Age","Race","Ethnic","Count")
Data1990<-Data1990[,c(2,4:6)]

The data looks like this:

   ZIP Race Ethnic Count
1 1001    1      1   239
2 1001    2      1   203
3 1001    1      1   821
4 1001    2      1   769
5 1001    1      1  1089
6 1001    2      1   961

I would like to create a new dataset which contains the proportion of white non-hispanics (Race = 1 or 2 and Ethnicity=1) in that particular ZIP code. I tried getting the sum of the white non-hispanics for each zip code by using a forloop. However, this code just gives the Variable "White" a 0 for each observation.

zip<-unique(Data1990$ZIP)
Data1990New<-data.frame(zip,White=NA)
for(i in zip){
Data1990New[which(zip==i),]$White <- sum(Data1990[(Data1990$Race==1 |Data1990$Race==2)&Data1990$Ethnic==1&
                                      Data1990$ZIP==Data1990New$i,][,4])
}

head(Data1990New)

I would like to do this in a more efficient way since forloops are slow. This is relevant but I think that my question is slightly different since I have an ID variable.

Thanks in advance!

Remy M
  • 599
  • 1
  • 4
  • 17
  • 2
    For loops are not slow (see [here](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf)). Can you provide a small snippet of the data and the algorithm used to calculate what you're after? – Roman Luštrik May 04 '19 at 20:17
  • I am guessing that you want to take the `sum` of 'Count' variable, because the data itself is summarised, I checked the frequency count for ZIP, Ethnic, Race, it is all the same – akrun May 04 '19 at 20:32
  • Yes, I want to take the sum of the 'Count' variable. Sorry if that wasn't clear. – Remy M May 04 '19 at 20:35
  • @RemyM Okay, I updated the post – akrun May 04 '19 at 20:35

2 Answers2

1

The data appears to be balanced, may be we need to filter based on the conditions, grouped by 'ZIP' and get the sum of 'Count'

library(dplyr)
Data1990 %>% 
   filter(Ethnic == 1, Race %in% 1:2) %>% 
   group_by(ZIP) %>%
   summarise(White = sum(Count))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This appears to be setting each observation as 38 – Remy M May 04 '19 at 20:25
  • 1
    @RemyM Pleases check the data, it is having a balanced frequency `Data1990 %>% count(ZIP, Ethnic, Race) %>% pull(n) %>% n_distinct# [1] 1` – akrun May 04 '19 at 20:30
1

Consider calculating the White and then aggregate for proportions:

Data1990$White <- with(Data1990, ifelse((Race==1 | Race==2) & Ethnicity==1, 1, 0))

agg <- do.call(data.frame, 
               aggregate(White ~ ZIP, Data1990, 
                         function(x) c(Total_All = length(x), Total_White = sum(x),
                                       White_Prop = sum(x) / length(x)))
       )

Note: Every zip code renders same size and white totals. Check original data.

Parfait
  • 104,375
  • 17
  • 94
  • 125