What's wrong with this data set?

Question

I am learning R and I'm trying out this data set. http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt

Unfortunately, using

ap <- read.table("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt")

gives erroneous results. The file is a "free format input file" as described here (http://data.princeton.edu/R/readingData.html). Going by the examples given on that page, my simple code should work, but it doesn't: it produces broken lines and bad entries. What's wrong?

Thank you.

Why do you believe your code should work? The file certainly isn't in the appropriate format for read.table. — Roland, Apr 15 '17 at 18:17
This is a fixed width file. You have to use `read.fwf` and specify the widths — Pierre Lapointe, Apr 15 '17 at 18:17
@Roland, maybe it's obvious to you why it should not work, but as a beginner, it isn't to me. My code was similar to the code given on the site I linked to, and the data file there was in the same format. Therefore, I thought it should work. — mahela007, Apr 15 '17 at 18:43
@P Lapointe.. the data file I'm using and the one I have linked to are in the same format, are they not? — mahela007, Apr 15 '17 at 18:45
As suggested above by P. Lapointe, read.fwf is the function you need to use. Please have a look at the R documentation: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.fwf.html and this example on SO: http://stackoverflow.com/questions/14383710/read-fixed-width-text-file — tagoma, Apr 15 '17 at 19:13
@mahela007 `read.table` will read the file but will try to make columns where spaces are. Also when you have multiple words in the airport name, R doesn't know where to make a column. This file is fixed width. — Pierre Lapointe, Apr 15 '17 at 19:21

score 1 · Answer 1 · answered Apr 15 '17 at 19:23

You have to use read.fwf and specify widths like so:

read.fwf("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt",
 widths=c(21,21,7,7,9,10,15))

                       V1                    V2      V3     V4       V5        V6        V7
1   HARTSFIELD INTL       ATLANTA                285693 288803 22665665 165668.76  93039.48
2   BALTO/WASH INTL       BALTIMORE               73300  74048  4420425  18041.52  19722.93
3   LOGAN INTL            BOSTON                 114153 115524  9549585 127815.09  29785.72
4   DOUGLAS MUNI          CHARLOTTE              120210 121798  7076954  36242.84  15399.46
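
To see why splitting on whitespace breaks for this kind of file, here is a minimal, self-contained sketch in base R. It does not download the real file: the two sample records are rebuilt locally and padded to the widths given above, and the alignment (text left-justified, numbers right-justified) is an assumption made for illustration.

```r
# Column widths taken from the answer above
widths <- c(21, 21, 7, 7, 9, 10, 15)

# Two sample records padded to those widths (assumed layout:
# text left-aligned, numbers right-aligned)
fmt <- "%-21s%-21s%7s%7s%9s%10s%15s"
sample_lines <- c(
  sprintf(fmt, "HARTSFIELD INTL", "ATLANTA", "285693", "288803",
          "22665665", "165668.76", "93039.48"),
  sprintf(fmt, "MIDWAY", "CHICAGO", "64465", "66389",
          "3547040", "4494.78", "4485.58")
)

tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Whitespace splitting (the read.table default) gives a different number
# of fields per line, because some airport names contain a space
n_fields <- count.fields(tmp)
print(n_fields)
# [1] 8 7

# Splitting by character position gives seven columns on every line;
# strip.white = TRUE is passed on to read.table and trims the padding
fw <- read.fwf(tmp, widths = widths, strip.white = TRUE)
print(dim(fw))
# [1] 2 7
```

The uneven field counts (8 versus 7) are what leave read.table unable to build consistent columns, while read.fwf never looks at the spaces at all.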

www · Answer 2 · 2017-04-15T21:56:01.440

Reading a fixed-width file is always a challenge because the user needs to figure out the width of each column. I use functions from readr to make the process easier.

The main function for reading a fixed-width file is read_fwf. In addition, a function called fwf_empty can help "guess" the column positions, although it may not always identify them correctly. Here is an example.

# Load package
library(readr)

# Location of the data file
filepath <- "http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt"

# Guess based on position of empty columns
col_pos <- fwf_empty(filepath)

# Read the data
dat <- read_fwf(filepath, col_positions = col_pos)

# Check the data frame
head(dat) 

# A tibble: 6 × 6
               X1                           X2     X3       X4        X5        X6
            <chr>                        <chr>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL ATLANTA               285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE              73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL BOSTON                114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE             120210 121798  7076954  36242.84  15399.46
5          MIDWAY CHICAGO                64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL CHICAGO               322430 332338 25636383 300463.80 140359.38

fwf_empty does a fairly good job of identifying the columns, except that it treats columns 2 and 3 as a single column. So we need some extra work.

The output of fwf_empty is a list of 4 elements: the identified begin and end positions, skip, and the column names. We have to update the begin and end positions so that columns 2 and 3 are treated as separate columns.

# Extract the begin position
Begin <- col_pos$begin

# Extract the end position
End <- col_pos$end

# Split the merged column: column 2 now ends at 42 and column 3 begins at 43
Begin <- c(Begin[1:2], 43, Begin[3:6])
End <- c(End[1], 42, End[2:6])

# Update col_pos
col_pos$begin <- Begin
col_pos$end <- End
col_pos$col_names <- paste0("X", 1:7)

Now we read the data again.

dat2 <- read_fwf(filepath, col_positions = col_pos)
head(dat2)

# A tibble: 6 × 7
               X1        X2     X3     X4       X5        X6        X7
            <chr>     <chr>  <int>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL   ATLANTA 285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE  73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL    BOSTON 114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE 120210 121798  7076954  36242.84  15399.46
5          MIDWAY   CHICAGO  64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL   CHICAGO 322430 332338 25636383 300463.80 140359.38

This time read_fwf reads the file successfully, with all seven columns separated.
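
If the column widths are already known, readr's fwf_widths can build the column specification directly instead of editing guessed positions. The sketch below assumes the widths from the read.fwf answer (21, 21, 7, 7, 9, 10, 15) fit the file; it runs on two locally rebuilt sample records rather than the real download, and `spec` and `dat3` are names chosen here for illustration.

```r
library(readr)

# Widths from the read.fwf answer (assumed to match the file);
# the names X1..X7 follow readr's default naming
widths <- c(21, 21, 7, 7, 9, 10, 15)
spec <- fwf_widths(widths, col_names = paste0("X", 1:7))

# Stand-in for the real file: two records padded to those widths
fmt <- "%-21s%-21s%7s%7s%9s%10s%15s"
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  sprintf(fmt, "HARTSFIELD INTL", "ATLANTA", "285693", "288803",
          "22665665", "165668.76", "93039.48"),
  sprintf(fmt, "MIDWAY", "CHICAGO", "64465", "66389",
          "3547040", "4494.78", "4485.58")
), tmp)

# Read by explicit widths; read_fwf trims the padding by default
dat3 <- read_fwf(tmp, col_positions = spec)
print(dat3)
```

To try this on the real data, pass filepath instead of tmp and check the first few rows against the raw file before trusting the widths.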
