How to create a dataframe from a site which appears to store each row as a list?

Question

Thanks in advance for the help. I was practicing pulling data from websites when I ran across this one: http://lib.stat.cmu.edu/datasets/sleep. I proceeded as follows:

(A) Get a sense of the data (in R): I essentially typed the following

readLines("http://lib.stat.cmu.edu/datasets/sleep", n=100)

(B) I notice that the data I would want really starts on the 51st line, so I write this code:

sleep_table <- read.table("http://lib.stat.cmu.edu/datasets/sleep", header=FALSE, skip=50)

(C) I get the following error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
line 1 did not have 14 elements

I took this approach from another Stack Overflow question (import dat file into R). However, that question deals with a .dat file, whereas mine concerns data at a particular URL. What I'd like to know is how to get the data from line 51 onward (as numbered by readLines) into a data frame with no headers. I'll add those later with colnames(sleep_table) <- c("etc.", "etc2", "etc3", ...).

P.S. - I may not have named my question properly (so if you have a recommendation for that, I am all ears). — Jonathan Charlton, May 29 '13 at 16:37

score 3 · Answer 1 · answered May 29 '13 at 16:51

Since "Lesser short-tailed shrew" and "Pig" are followed by different numbers of separator spaces, and the fields are not tab-separated, read.table will not help. Luckily, the file appears to be fixed-width. Note that the solution is not complete: there are a few nasty lines at the end of the file, and you will probably have to convert the character columns to numeric, but that is left as an easy exercise.

# 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
# African elephant         6654.000 5712.000  -999.0  -999.0     3.3    38.6   645.0       3       5       3
# African giant pouched rat   1.000    6.600     6.3     2.0     8.3     4.5    42.0       3       1       3
sleep_table <- read.fwf("http://lib.stat.cmu.edu/datasets/sleep", widths = c(25,rep(8,10)),
                          header=FALSE, skip=51)
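
For what it's worth, here is a minimal sketch of the "easy exercise" (trimming the names and converting to numbers) that runs without the network. The two rows are rebuilt with sprintf so that the widths are guaranteed to be 25 followed by 8s. That layout, the reduced column count, and the helper name to_num are assumptions for illustration only, so check the widths against a few raw lines of the real file before relying on them.

```r
# Made-up rows in an assumed layout: a 25-character name, then 8-character fields.
fmt <- "%-25s%8.3f%8.3f%8.1f%8d"
rows <- c(
  sprintf(fmt, "African elephant", 6654, 5712, -999, 3L),
  sprintf(fmt, "African giant pouched rat", 1, 6.6, 6.3, 3L)
)

# Read from an in-memory connection instead of the URL.
con <- textConnection(rows)
fw <- read.fwf(con, widths = c(25, rep(8, 4)),
               header = FALSE, stringsAsFactors = FALSE)
close(con)

# Hypothetical helper: force a column to numeric and turn the -999 code into NA.
to_num <- function(x) {
  x <- as.numeric(as.character(x))
  x[which(x == -999)] <- NA
  x
}

fw[[1]] <- trimws(as.character(fw[[1]]))  # drop the padding after the name
fw[-1] <- lapply(fw[-1], to_num)          # every other column becomes numeric
str(fw)
```

If the widths are off by even one character, the numbers get cut in the wrong place and the conversion produces NA or wrong values, so a quick look at str(fw) is a cheap sanity check.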

And nevertheless, check Gabor's answer. He is the global master of regexp, and his solution would have helped even in cases where there are column-crossings, i.e. where the file is not fixed-width. – Dieter Menne, May 29 '13 at 18:07

score 3 · G. Grothendieck · Accepted Answer · 2013-05-29T17:41:43.817

Use the fact that the good lines end in a one digit field and that every field except the first is numeric:

URL <- "http://lib.stat.cmu.edu/datasets/sleep"
L <- readLines(URL)

# lines ending in a one digit field
good.lines <- grep(" \\d$", L, value = TRUE)

# insert commas before numeric fields
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)

# re-read
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE, strip.white = TRUE, 
         na.strings = "-999.0")
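
To see what the grep and gsub steps do without touching the URL, here is a small sketch on the two data rows quoted earlier in this thread plus one made-up non-data line. The names demo, good, csv and demo_df are only for this illustration.

```r
# Two data rows quoted earlier in this thread plus one made-up non-data line.
demo <- c(
  "African elephant         6654.000 5712.000  -999.0  -999.0     3.3    38.6   645.0       3       5       3",
  "This sentence stands in for the descriptive text in the file.",
  "African giant pouched rat   1.000    6.600     6.3     2.0     8.3     4.5    42.0       3       1       3"
)

# Keep only lines whose last field is a single digit preceded by a space.
good <- grep(" \\d$", demo, value = TRUE)

# Put a comma in front of every "space + first character of a number" pair.
csv <- gsub("( [-0-9.])", ",\\1", good)
print(csv)

# The commas now separate the fields, so spaces inside names are harmless.
demo_df <- read.table(text = csv, sep = ",", as.is = TRUE,
                      strip.white = TRUE, na.strings = "-999.0")
str(demo_df)
```

The pattern only looks at a space followed by a digit, minus sign or dot, so a species name containing something like " 2" would also get a comma. That does not happen in these sample rows, but it is worth keeping in mind for other files.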

If you are interested in the headings too here is some code for that. Omit the rest if you are not interested in headings.

# get headings - of the lines starting at left edge these are the ncol(DF) lines
#  starting with the one containing "species"
headings0 <- grep("^[^ ]", L, value = TRUE)
i <- grep("species", headings0)
headings <- headings0[seq(i, length.out = ncol(DF))]

# The headings are a bit long so we shorten them to the first word
names(DF) <- sub(" .*$", "", headings)

This gives:

> head(DF)
                    species     body  brain slow paradoxical total maximum
1          African elephant 6654.000 5712.0   NA          NA   3.3    38.6
2 African giant pouched rat    1.000    6.6  6.3         2.0   8.3     4.5
3                Arctic Fox    3.385   44.5   NA          NA  12.5    14.0
4    Arctic ground squirrel    0.920    5.7   NA          NA  16.5      NA
5            Asian elephant 2547.000 4603.0  2.1         1.8   3.9    69.0
6                    Baboon   10.550  179.5  9.1         0.7   9.8    27.0
  gestation predation sleep overall
1       645         3     5       3
2        42         3     1       3
3        60         1     1       1
4        25         5     2       3
5       624         3     5       4
6       180         4     4       4
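
The heading extraction can be tried the same way on a toy input. The lines below are invented and are not the real header of the file. They only mimic the idea that the variable descriptions start at the left edge while other text is indented, and n_cols stands in for ncol(DF).

```r
# Invented lines: left-aligned descriptions mixed with indented text.
L_demo <- c(
  "Some introductory sentence.",
  "   indented continuation text",
  "species of animal",
  "body weight in kg",
  "brain weight in g",
  "   indented note",
  "Another left-aligned line after the variable list"
)
n_cols <- 3  # stands in for ncol(DF)

# Lines that start with a non-space character.
headings0 <- grep("^[^ ]", L_demo, value = TRUE)

# Position of the line mentioning "species", then the next n_cols lines from there.
i <- grep("species", headings0)
headings <- headings0[seq(i, length.out = n_cols)]

# Drop everything from the first space onward, keeping the first word.
short <- sub(" .*$", "", headings)
print(short)
```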

UPDATE: minor simplification in white space trimming

UPDATE 2: shorten headings

UPDATE 3: added na.strings = "-999.0"

edited May 29 '13 at 17:41

answered May 29 '13 at 17:03

G. Grothendieck


Hi Mr. Grothendieck, thank you for this answer. I am not familiar with the meaning of "^ *| *$" in the whitespace-trimming section. My current reading is "anything following a space or a number sign," which I know is not correct. What is the correct way to interpret it? Also, would you mind explaining what the code does when it inserts commas before the numeric fields? If these two questions are better posed separately, I can post them as new questions. – Jonathan Charlton May 29 '13 at 17:16

@JRC, (1) Have simplified it a bit by using `strip.white=` argument to `read.table` so we no longer need the `gsub` to strip the white space. (2) We can't split fields on white space using `read.table` since the first field with the species name has some entries that have white space within them so we add commas before each numeric allowing us to re-read it as a comma separated file. – G. Grothendieck May 29 '13 at 17:20
Hello, Mr. Grothendieck. Based on Dieter's comment below together with your answer, I learned that my additional question above concerned a topic in R (and other languages) that I didn't know much about formally: regular expressions (regex). To help future R programmers, I'm posting a link that is extremely helpful for understanding your solution: http://net.tutsplus.com/tutorials/javascript-ajax/you-dont-know-anything-about-regular-expressions/ . Another link is http://gskinner.com/RegExr/ (this one lets you build and learn regexes). Thank you for your solution! – Jonathan Charlton May 31 '13 at 03:25

@JRC, Another source of regular expression links is the gsubfn R package's home page at http://gsubfn.googlecode.com - they are at the end. – G. Grothendieck May 31 '13 at 03:40
Thank you Mr. Grothendieck. This is such an incredible site (both Stackoverflow and gsubfn.googlecode.com) - so incredibly helpful. I am very grateful. – Jonathan Charlton May 31 '13 at 04:13
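
On the "^ *| *$" question raised in the comments: the earlier revision that used it is not shown here, so the sketch below assumes it was passed to gsub as the pattern. It reads as two alternatives separated by |. The first, "^ *", is a run of spaces anchored at the start of the string, and the second, " *$", is a run of spaces anchored at the end. Replacing either with "" trims both ends and leaves interior spaces alone. The vector padded is made up for illustration.

```r
# Made-up strings with padding on one or both sides.
padded <- c("  Arctic Fox   ", "Pig      ")

# Remove spaces at the start (^ *) or at the end ( *$) of each string.
trimmed <- gsub("^ *| *$", "", padded)
print(trimmed)
```

The space inside "Arctic Fox" survives because it touches neither anchor. In the current version of the answer the same job is done by strip.white = TRUE in read.table.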
