0

The code example below is to download census data by state from the census.gov website. The problem I run into is that the leading zeros are deleted when I download as a csv file. How do I maintain the original structure of the files while downloading?

  stateFIPScodes<-seq(10,13,1)

  for(i in 1:length(stateFIPScodes)){
      URL<-paste("https://www.census.gov/popest/data/intercensal/county/files/CO-EST00INT-  ALLDATA-",stateFIPScodes[i],".csv",sep="" )
      destfile<-paste("state2000_2010_",stateFIPScodes[i],".csv" ,sep="") # CSV files drop leading zero!!
      download.file(URL, destfile)
 }

Thank you!

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
Meli
  • 345
  • 5
  • 15
  • 3
    Are you viewing the data with Excel or a text editor? Excel has a tendency to remove the zeros by treating values as numbers instead of strings. If you open the file with a plain text editor and the zeros are there, it indicates that your code is fine and you'll just have to trick Excel. – JorganPubshire Oct 06 '16 at 16:44
  • 1
    @JorganPubshire Thanks. I converted the files to download as text document instead of Excel and that appears to maintain the original structure. – Meli Oct 06 '16 at 16:59

1 Answers1

1

It appears as if the files are downloaded correctly, but you are correct: if you use read.csv to load the data into R, then some of the columns are interpreted as numeric, and therefore they lose the leading zeros.

To take your code that downloads files -

stateFIPScodes<-seq(10,13,1)

for(i in seq_along(stateFIPScodes)){
  code <- stateFIPScodes[[i]]
  URL <- paste0("https://www.census.gov/popest/data/intercensal/county/files/CO-EST00INT-ALLDATA-", code, ".csv")
  destfile <- paste0("state2000_2010_",code,".csv") 
  download.file(URL, destfile)
}

If we use base read.csv we get no trailing zeros:

library(dplyr)

read.csv("state2000_2010_10.csv") %>%
  select(1:5) %>%
  head

#>   SUMLEV STATE COUNTY   STNAME     CTYNAME
#> 1     50    10      1 Delaware Kent County
#> 2     50    10      1 Delaware Kent County
#> 3     50    10      1 Delaware Kent County
#> 4     50    10      1 Delaware Kent County
#> 5     50    10      1 Delaware Kent County
#> 6     50    10      1 Delaware Kent County

That's because the first few columns are being read in as numeric.

read.csv("state2000_2010_10.csv") %>% str()
#> 'data.frame':    780 obs. of  50 variables:
#>  $ SUMLEV      : int  50 50 50 50 50 50 50 50 50 50 ...
#>  $ STATE       : int  10 10 10 10 10 10 10 10 10 10 ...
#>  $ COUNTY      : int  1 1 1 1 1 1 1 1 1 1 ...

There are two ways to resolve this:

  1. Manually pass the data types into read.csv, or just prevent all conversions by adding colClasses = "character".
  2. Use readr::read_csv which handles it correctly.

We could just prevent automatic coercion:

read.csv("state2000_2010_10.csv", colClasses = "character") %>% str()
#> 'data.frame':    780 obs. of  50 variables:
#>  $ SUMLEV      : chr  "050" "050" "050" "050" ...
#>  $ STATE       : chr  "10" "10" "10" "10" ...
#>  $ COUNTY      : chr  "001" "001" "001" "001" ...
#>  $ STNAME      : chr  "Delaware" "Delaware" "Delaware" "Delaware" ...

You would need to choose what columns you wanted to cast to as.numeric.

Or you could select the columns, e.g.

read.csv("state2000_2010_10.csv", 
         colClasses = c(
           SUMLEV = "character",
           STATE = "numeric",
           COUNTY = "character"
         )) %>%
  select(1:5) %>%
  head

#>   SUMLEV STATE COUNTY   STNAME     CTYNAME
#> 1    050    10    001 Delaware Kent County
#> 2    050    10    001 Delaware Kent County
#> 3    050    10    001 Delaware Kent County
#> 4    050    10    001 Delaware Kent County
#> 5    050    10    001 Delaware Kent County
#> 6    050    10    001 Delaware Kent County

Second, you could use readr, which has more intelligent column type inference:

#> read_csv("state2000_2010_10.csv") %>%
#>    select(1:5) %>%
#>    head
#> # A tibble: 6 × 5
#>   SUMLEV STATE COUNTY   STNAME     CTYNAME
#>    <chr> <int>  <chr>    <chr>       <chr>
#> 1    050    10    001 Delaware Kent County
#> 2    050    10    001 Delaware Kent County
#> 3    050    10    001 Delaware Kent County
#> 4    050    10    001 Delaware Kent County
#> 5    050    10    001 Delaware Kent County
#> 6    050    10    001 Delaware Kent County
Michael Griffiths
  • 1,399
  • 7
  • 14
  • Seems overly complex for an answer that could have just been ...use: `colClasses="character", stringsAsFactors=FALSE`. – IRTFM Oct 06 '16 at 18:29