4

I'm struggling with using readLines() and read.Table() to get a well formatted data frame in R.

I want to read files like this which are Hockey stats. I'd like to get a nicely formatted data frame, however, specifying the concrete amount of lines to read is difficult because in other files like this the number of players is different. Also, non-players, signed as #.AC, #.HC and so on, should not be read in.

I tried something like this

LINES <- 19  
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE,
                  encoding="UTF-8", nrows=LINES)

but as mentioned above, the value for LINES is different each time.

I also tried readLines as in this post, but had no luck with it.

Is there a way to integrate a condition in read.table, like (pseudo code)

if (first character == "AC") {
    break read.table
  }

Sorry if this looks strange, I don't have that much experience in scripting or coding.

Any help is appreciated, thanks a lot!

Greetz!

Community
  • 1
  • 1
chringel21
  • 126
  • 3
  • 10
  • I want to help but your link is down at least in my country (Senegal) can you post another link (dropbox) to get the data – dickoa May 05 '13 at 12:58
  • @dickoa Sure, there you go: [File 1](https://www.dropbox.com/s/62ozracew6eyzl6/teammonthly_10-2011_337457_8074.txt) [File 2](https://www.dropbox.com/s/7xk1yna7ns679ch/teammonthly_12-2011_337457_8074.txt) – chringel21 May 05 '13 at 14:24

1 Answers1

3

Your data show a couple of difficulties which should be handled in a sequence, which means you should not try to read the entire file with one command:

Read plain lines and find start and stop row

Depending on the specification of the files you read in my suggestion is to first find the the first row you actually want to read in by any indicator. So this can be a lone number which is always the same or as in my example two lines after the line "TEAM STATS". Finding the last line is then simple again by just looking for the first line containing only whitespaces after the start line:

lines <- readLines( Datei )
start <- which(lines == "TEAM STATS") + 2
end   <- start + min( grep( "^\\s+$", lines[ start:length(lines) ] ) ) -2
lines <- lines[start:end]

Read the data to data.frame

In your case you meet a couple of complications:

  • Your header line starts with an # which is on default recognized as a comment character, ignoring the line. But even if you switch this behavior off (comment.char = "") it's not a valid column name.
  • If we tell read.table to split the columns along whitespaces you end up with one more column in the data, than in the header since the Player column contains white spaces in the cells. So the best is at the moment to just ignore the header line and let read.table do this with it's default behavior (comment.char = "#"). Also we let the PLAYER column be split into two and will fix this later.
  • You won't be able to use the first column as row.names since they are not unique.
  • The rows have unequal length, since the POS column is not filled everywhere.

:

tab <- read.table( text = lines[ start:end ], fill = TRUE, stringsAsFactors=FALSE )
# fix the PLAYER column
tab$V2 <-  paste( tab$V2, tab$V3 )
tab <- tab[-3]

Fix the header

Just split the start line at multiple whitespaces and reset the first entry (#) by a valid column name:

colns <- strsplit( lines[start], "\\s+" )[[1]]
colns[1] <- "code"
colnames(tab) <- colns

Fix cases were "POS" was empty

This is done by finding the rows which last cell contains NAs and shift them by one cell to the right:

colsToFix <- which( is.na(tab[, "SHO%"]) )
tab[ colsToFix, 4:ncol(tab) ] <- tab[ colsToFix, 3:(ncol(tab)-1) ]
tab[ colsToFix, 3  ] <- NA

> str(tab)
'data.frame':   25 obs. of  20 variables:
 $ code  : chr  "93" "91" "61" "88" ...
 $ PLAYER: chr  "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
 $ POS   : chr  "F" "F" "D" "F" ...
 $ GP    : chr  "8" "6" "7" "8" ...
 $ G     : int  10 1 4 3 4 2 0 2 1 0 ...
 $ A     : int  5 11 5 5 3 4 6 3 3 4 ...
 $ PTS   : int  15 12 9 8 7 6 6 5 4 4 ...
 $ PIM   : int  12 10 12 6 2 36 37 29 6 0 ...
 $ PPG   : int  3 0 1 1 1 1 0 0 1 0 ...
 $ PPA   : int  1 5 2 2 1 2 4 2 1 1 ...
 $ SHG   : int  0 1 0 1 1 0 0 0 0 0 ...
 $ SHA   : int  0 0 1 0 1 0 0 1 0 0 ...
 $ GWG   : int  2 0 1 0 0 0 0 0 0 0 ...
 $ FG    : int  1 0 1 1 1 0 0 0 0 0 ...
 $ OTG   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ UAG   : int  1 0 1 0 0 0 0 0 0 0 ...
 $ ENG   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SHOG  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SHOA  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SHO%  : num  0 0 0 0 0 0 0 0 0 0 ...
Beasterfield
  • 7,023
  • 2
  • 38
  • 47
  • Thanks for your detailed guide on how to solve my problem. I'll be trying out your suggestions! – chringel21 May 05 '13 at 14:27
  • Thanks again, it works. What really helped was the first part, treat the whole file as one string and then look for matching patterns. :D – chringel21 May 09 '13 at 09:43