I am learning R and I'm trying out this data set. http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt

Unfortunately, using

ap <- read.table("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt")

gives erroneous results. The file is a "Free format input file" as described here (http://data.princeton.edu/R/readingData.html). Going by the examples given on that page, my simple code should work, but it doesn't: it results in broken lines and bad entries. What's wrong?

Thank you.

mahela007
  • Why do you believe your code should work? The file certainly isn't in the appropriate format for read.table. – Roland Apr 15 '17 at 18:17
  • This is a fixed width file. You have to use `read.fwf` and specify the widths. – Pierre Lapointe Apr 15 '17 at 18:17
  • @Roland, maybe it's obvious to you why it should not work, but as a beginner, it isn't to me. My code was similar to the code given on the site I linked to, and the data file was in the same format. Therefore, I thought it should work. – mahela007 Apr 15 '17 at 18:43
  • @P Lapointe, the data file I'm using and the one I have linked to are in the same format, are they not? – mahela007 Apr 15 '17 at 18:45
  • As suggested above by P. Lapointe, read.fwf is the function you need to use. Please have a look at the R documentation: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.fwf.html and this example on SO: http://stackoverflow.com/questions/14383710/read-fixed-width-text-file – tagoma Apr 15 '17 at 19:13
  • @mahela007 `read.table` will read the file, but it will try to make a column wherever there are spaces. So when an airport name has multiple words, R doesn't know where the column ends. This file is fixed width. – Pierre Lapointe Apr 15 '17 at 19:21
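
To see the point made in the comments, here is a minimal sketch that counts the whitespace-separated fields per line, which is how read.table splits a line by default. The two records are made-up, shortened lines in the style of the airport file, not the real file contents.

```r
# Two illustrative fixed-width records; the first airport name contains a space.
records <- c(
  "HARTSFIELD INTL      ATLANTA               285693",
  "MIDWAY               CHICAGO                64465"
)

# count.fields() splits on whitespace by default, just like read.table().
con <- textConnection(records)
fields_per_line <- count.fields(con)
close(con)

# The two lines yield different field counts, so the columns cannot line up.
print(fields_per_line)
```

Because the multi-word name is split into two fields, the lines no longer have the same number of columns, which is the kind of breakage described in the question.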

2 Answers

You have to use read.fwf and specify widths like so:

read.fwf("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt",
 widths=c(21,21,7,7,9,10,15))

                       V1                    V2      V3     V4       V5        V6        V7
1   HARTSFIELD INTL       ATLANTA                285693 288803 22665665 165668.76  93039.48
2   BALTO/WASH INTL       BALTIMORE               73300  74048  4420425  18041.52  19722.93
3   LOGAN INTL            BOSTON                 114153 115524  9549585 127815.09  29785.72
4   DOUGLAS MUNI          CHARLOTTE              120210 121798  7076954  36242.84  15399.46
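
If you want to try this without downloading anything, the same idea can be sketched on a few in-memory lines. This is only an illustration: the records are built here with sprintf, only three of the seven columns are included, and the column names (airport, city, count) are my own assumptions, not names from the data set.

```r
# Build two sample records padded to widths 21, 21 and 7
# (an illustrative subset of the columns, not the real file).
records <- sprintf("%-21s%-21s%7d",
                   c("HARTSFIELD INTL", "MIDWAY"),
                   c("ATLANTA", "CHICAGO"),
                   c(285693L, 64465L))

con <- textConnection(records)
ap <- read.fwf(con,
               widths = c(21, 21, 7),
               col.names = c("airport", "city", "count"),  # hypothetical names
               strip.white = TRUE,         # drop the padding around text fields
               stringsAsFactors = FALSE)   # keep text columns as character
close(con)

str(ap)
```

Passing strip.white = TRUE removes the trailing padding that otherwise stays attached to the text columns, and col.names replaces the default V1, V2, ... headers.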
Pierre Lapointe
Reading a fixed-width file is always a challenge because you need to figure out the width of each column. To make this easier, I use functions from the readr package.

The main function for reading a fixed-width file is read_fwf. In addition, a function called fwf_empty can help "guess" the column positions, although it may not always identify them correctly. Here is an example.

# Load package
library(readr)

# Location of the data file
filepath <- "http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt"

# Guess based on position of empty columns
col_pos <- fwf_empty(filepath)

# Read the data
dat <- read_fwf(filepath, col_positions = col_pos)

# Check the data frame
head(dat) 

# A tibble: 6 × 6
               X1                           X2     X3       X4        X5        X6
            <chr>                        <chr>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL ATLANTA               285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE              73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL BOSTON                114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE             120210 121798  7076954  36242.84  15399.46
5          MIDWAY CHICAGO                64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL CHICAGO               322430 332338 25636383 300463.80 140359.38

fwf_empty does a fairly good job of identifying the columns, except that it treats columns 2 and 3 as a single column. So we need some extra work.

The output of fwf_empty is a list of 4 elements: the identified begin and end positions, skip, and the column names. We have to update the begin and end positions so that columns 2 and 3 are split apart.

# Extract the begin position
Begin <- col_pos$begin

# Extract the end position
End <- col_pos$end

# Update the position information
Begin <- c(Begin[1:2], 43, Begin[3:6])
End <- c(End[1], 42, End[2:6])

# Update col_pos
col_pos$begin <- Begin
col_pos$end <- End
col_pos$col_names <- paste0("X", 1:7)

Now we read the data again.

dat2 <- read_fwf(filepath, col_positions = col_pos)
head(dat2)

# A tibble: 6 × 7
               X1        X2     X3     X4       X5        X6        X7
            <chr>     <chr>  <int>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL   ATLANTA 285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE  73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL    BOSTON 114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE 120210 121798  7076954  36242.84  15399.46
5          MIDWAY   CHICAGO  64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL   CHICAGO 322430 332338 25636383 300463.80 140359.38

This time read_fwf reads the file successfully, with all seven columns separated.
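
If the column widths are already known, an alternative to guessing is to pass them directly with readr's fwf_widths. The following is only a sketch on a small temporary file: the records, the three-column layout, and the column names (airport, city, count) are assumptions made for illustration, not the real file.

```r
library(readr)

# Write two illustrative records, padded to widths 21, 21 and 7,
# to a temporary file so the example does not need network access.
records <- sprintf("%-21s%-21s%7d",
                   c("HARTSFIELD INTL", "MIDWAY"),
                   c("ATLANTA", "CHICAGO"),
                   c(285693L, 64465L))
tmp <- tempfile(fileext = ".txt")
writeLines(records, tmp)

# Describe the layout by widths; the column names are hypothetical.
spec <- fwf_widths(c(21, 21, 7), col_names = c("airport", "city", "count"))

# read_fwf trims the padding around each field by default.
dat3 <- read_fwf(tmp, col_positions = spec)
print(dat3)
```

This avoids editing the begin and end positions by hand, at the cost of having to know the widths up front.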

www