Reading fixed width file is always a challenge because the users need to figure out the width of each column. To complete such task, I use functions from readr
to make the process easier.
The main function to read fixed width file is read_fwf
. In addition, there is a function called fwf_empty
can help users "guess" the column width. But this function may not always correctly identify the column width. Here is an example.
# Load package
library(readr)
# Read the data
filepath <- "http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt"
# Guess based on position of empty columns
col_pos <- fwf_empty(filepath)
# Read the data
dat <- read_fwf(filepath, col_positions = col_pos)
# Check the data frame
head(dat)
# A tibble: 6 × 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <int> <int> <dbl> <dbl>
1 HARTSFIELD INTL ATLANTA 285693 288803 22665665 165668.76 93039.48
2 BALTO/WASH INTL BALTIMORE 73300 74048 4420425 18041.52 19722.93
3 LOGAN INTL BOSTON 114153 115524 9549585 127815.09 29785.72
4 DOUGLAS MUNI CHARLOTTE 120210 121798 7076954 36242.84 15399.46
5 MIDWAY CHICAGO 64465 66389 3547040 4494.78 4485.58
6 O'HARE INTL CHICAGO 322430 332338 25636383 300463.80 140359.38
The fwf_empty
does a fairly good job to identify all columns except column 2 and 3. It assumes that they are from the same column. So we need some extra work.
The output of fwf_empty
is a list of 4 elements, showing the identified begin and end position, skip and column names. We have to update the begin and end position to account for the existence of column 2 and 3.
# Extract the begin position
Begin <- col_pos$begin
# Extract the end position
End <- col_pos$end
# Update the position information
Begin <- c(Begin[1:2], 43, Begin[3:6])
End <- c(End[1], 42, End[2:6])
# Update col_pos
col_pos$begin <- Begin
col_pos$end <- End
col_pos$col_names <- paste0("X", 1:7)
Now we read the data again.
dat2 <- read_fwf(filepath, col_positions = col_pos)
head(dat2)
# A tibble: 6 × 7
X1 X2 X3 X4 X5 X6 X7
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 HARTSFIELD INTL ATLANTA 285693 288803 22665665 165668.76 93039.48
2 BALTO/WASH INTL BALTIMORE 73300 74048 4420425 18041.52 19722.93
3 LOGAN INTL BOSTON 114153 115524 9549585 127815.09 29785.72
4 DOUGLAS MUNI CHARLOTTE 120210 121798 7076954 36242.84 15399.46
5 MIDWAY CHICAGO 64465 66389 3547040 4494.78 4485.58
6 O'HARE INTL CHICAGO 322430 332338 25636383 300463.80 140359.38
This time the read_fwf
function can successfully read the file.