
I need to use the Annual Building Permits by Metropolitan Area data distributed by the US Census Bureau, which are downloadable here as fixed-width format text files. Here is an excerpt of one of the files (I've stripped the column names as they aren't in a nice format and can be replaced after reading the file into a data frame):

999 10180 Abilene, TX                             306     298       8       0       0       0
184 10420 Akron, OH                               909     905       0       4       0       0
999 13980 Blacksburg-Christiansburg-Radford,
  VA                                              543     455       0       4      84       3
145 14010 Bloomington, IL                         342     214       4       0     124       7
160 15380 Buffalo-Cheektowaga-Niagara Falls,*
  NY                                             1964     931      14      14    1005      68
268 15500 Burlington, NC                         1353     938      12      16     387      20

As seen in the above excerpt, many of the entries in the Name column exceed the width of the column (which looks to be 36 characters). I've experimented with the various fixed-width reading functions in both the utils and readr packages but can't find a solution that takes these entries into account. Any tips would be much appreciated.


Edit: The original file excerpt was edited by a mod for formatting, and in the process the example entries that exceed the third column's width were deleted. I've since updated the excerpt to re-include them and have stripped the column names.

I ran @markdly's code, which was submitted before this edit, and it works for all the entries that don't have this issue. I exported the result to a CSV and have included an excerpt below to show what happens with these entries:

"38","999",NA,"13980",NA,"Blacksburg-Christiansburg-Radford,",NA,NA,NA,NA,NA,NA
"39","V","A",NA,NA,NA,"543",455,0,4,84,3
"40","145",NA,"14010",NA,"Bloomington, IL","342",214,4,0,124,7
"51","160",NA,"15380",NA,"Buffalo-Cheektowaga-Niagara Falls,*",NA,NA,NA,NA,NA,NA
"52","N","Y",NA,NA,NA,"1964",931,14,14,1005,68
"53","268",NA,"15500",NA,"Burlington, NC","1353",938,12,16,387,20

Edit 2: Most of the major metro areas I'm actually looking at don't fall into this problem category. So while it would be nice to have the data for the ones that do, if there is no workable solution, is there a way to remove these entries from the data set altogether?


1 Answer


Edit:
Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
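
For reference, here is a very rough base-R sketch of that readLines idea. It is not tested beyond the excerpt shown above, and it assumes the data start after the same 11-line header used below and that continuation lines begin with whitespace (e.g. "  VA    543 ..."):

lines <- readLines("tb3u2016.txt")
lines <- lines[-(1:11)]                        # drop the header block (same skip as below)
lines <- lines[nzchar(trimws(lines))]          # drop blank lines
cont  <- grepl("^\\s", lines)                  # continuation lines start with whitespace
# glue each continuation line onto the line before it, then drop it
lines[which(cont) - 1] <- paste(lines[which(cont) - 1], trimws(lines[cont]))
lines <- lines[!cont]
# every record is now on a single line and could be parsed with, say,
# strcapture() or the same regex used in Step 2 below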

Step 1: Read the file in and assign the split records a common record id

library(tidyverse)
# read the whitespace-separated columns, then give both halves of a split
# record the same record id (the row number of the first half)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
  rownames_to_column() %>%
  mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))

df[37:40, ]
#> # A tibble: 4 x 8
#>   rowname                                                    X1    X2
#>     <chr>                                                 <chr> <int>
#> 1      37 999 13900 Bismarck, ND                            856   629
#> 2      38          999 13980 Blacksburg-Christiansburg-Radford,    NA
#> 3      39   VA                                              543   455
#> 4      40 145 14010 Bloomington, IL                         342   214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> #   record <chr>

Step 2: Combine the split record text, then put the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.

df <- df %>%
  # paste the two halves of a split record back together
  mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
  # pull CSA, CBSA, Name and Total out of the combined text
  extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
  mutate(Name = trimws(Name)) %>%
  # keep only the completed row of each record (drops the redundant first half)
  filter((lead(record) != record) | rowname == 1) %>%
  select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)

df[37:39, ]
#> # A tibble: 3 x 9
#>     CSA  CBSA                                 Name Total    X2    X3    X4
#>   <chr> <chr>                                <chr> <chr> <int> <int> <int>
#> 1   999 13900                         Bismarck, ND   856   629    16     6
#> 2   999 13980 Blacksburg-Christiansburg-Radford,VA   543   455     0     4
#> 3   145 14010                      Bloomington, IL   342   214     4     0
#> # ... with 2 more variables: X5 <int>, X6 <int>
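
The question notes that proper column names can be added back after import. Assuming X2 to X6 line up with the same unit-count columns named in the read_fwf example below (an assumption worth checking against the file header), they could be renamed at this point:

# hypothetical renaming step; assumes X2..X6 match the columns named in
# the read_fwf example below
df <- df %>%
  rename(units_1 = X2, units_2 = X3, units_3_and_4 = X4,
         units_5_or_more = X5, num_struc_5_or_more = X6)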

Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.

Example data

library(readr)

# example data
txt <- "                                                                                        Num of
                                                                                        Struc-
                                                                                        tures
                                                                                        With
                                                                      3 and 4  5 Units  5 Units
CSA CBSA  Name                                   Total 1 Unit 2 Units   Units  or more  or more

999 10180 Abilene, TX                             306     298       8       0       0       0
184 10420 Akron, OH                               909     905       0       4       0       0" 

write_file(txt, "example.txt")

Solution

col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2", 
               "units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#>     CSA blank_1  CBSA blank_2        Name Total units_1 units_2
#>   <int>   <chr> <int>   <chr>       <chr> <int>   <int>   <int>
#> 1   999    <NA> 10180    <NA> Abilene, TX   306     298       8
#> 2   184    <NA> 10420    <NA>   Akron, OH   909     905       0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> #   num_struc_5_or_more <int>
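
Finally, if you read the full source file with this read_fwf approach, the split records come through with NAs (as in the CSV excerpt in the question). If those records aren't needed at all (per Edit 2), a quick fix, as suggested in the comments below, is simply to drop them. Assuming the problem rows look like the excerpt (the name-only half has NA in Total, the leftover fragment has NA in Name), something like this should be enough:

library(dplyr)

# drop both halves of each split record: the name-only rows (NA in Total)
# and the leftover fragment rows (NA in Name)
df_clean <- df %>% filter(!is.na(Name), !is.na(Total))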
markdly
  • Thanks for your reply @markdly. As noted in the edit I made to the post, the edit the mod made for formatting purposes removed the examples of problem entries. These entries exceed the 36-character width of the Name column. I ran your code and it works perfectly for all other entries, but I've added an excerpt of the CSV output to the original post to show what it looks like for the problem ones. – user2424281 Sep 20 '17 at 21:10
  • @user2424281, I see what you mean - any particularly long name results in the record spanning two separate lines. You might need to parse the file line-by-line in that case using something like this answer https://stackoverflow.com/a/35761217/8475145. For a quick fix you could possibly just exclude these problem records if they aren't essential by doing a filter on NA in the Total / units_1 / units_2 column etc... – markdly Sep 21 '17 at 00:46
  • @user2424281, I think you can do this using `read_table` along with some data wrangling. See edited answer for details... – markdly Sep 21 '17 at 13:40
  • Thanks so much for all your help. I appreciate your willingness to return to this problem with the updated info. I'll try your solution myself as soon as I get the chance, but in the meantime I'll go ahead and mark your solution as the answer as it looks like you got it all figured out sufficiently! – user2424281 Sep 21 '17 at 18:55
  • I finally got around to running your code and am experiencing an error after the second chunk. ("Error in if (drop) warningc("drop ignored") : argument is not interpretable as logical") – user2424281 Sep 25 '17 at 19:23
  • Also, I did notice that the record column for the problem entries will have the same record value as the entry above for the first portion, while the second portion will be two higher. I.e. both Bismarck and the first part of Blacksburg have a record value of 37, and the cut-off portion of that entry (VA) has a record of 39. Not sure if this is supposed to be the case or if both of the Blacksburg entries should have a record value of 38. – user2424281 Sep 25 '17 at 19:30
  • @user2424281, the two lines for Blacksburg both have a `record` value of 38 on my end. The `record` value should be the same any time there is a split across multiple lines. – markdly Sep 25 '17 at 21:58