This first puts a newline character, "\n", after every "END_ROW_WORD" marker, then pastes the result into a long character string. Then, it uses read.table to read the data in from a text connection.
end <- "END_ROW_WORD"
inx <- c(0, grep(end, dat[[1]]))
s <- NULL
for(i in seq_along(inx)[-1]){
s <- c(s, dat[[1]][(inx[(i - 1)] + 1):inx[i]], "\n")
}
con <- textConnection(paste(s, collapse = " "))
result <- read.table(con, fill = TRUE)
close(con)
result
# V1 V2 V3 V4 V5 V6
#1 D 2 f h k END_ROW_WORD
#2 k 1 2 END_ROW_WORD
#3 e g j 2 k END_ROW_WORD
DATA.
dat <-
structure(list(V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD",
"k", "1", "2", "END_ROW_WORD", "e", "g", "j", "2", "k", "END_ROW_WORD"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -16L
))
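For what it's worth, the for loop above can probably be avoided. The following is only a sketch on the same sample values: it assumes the marker always fills a whole element, so it compares with == instead of grep, and it uses the text argument of read.table so that no connection has to be closed by hand. The names x, s2 and result2 are mine, chosen only for this illustration.

```r
# Same values as dat[[1]] above, repeated here so the snippet runs on its own
x <- c("D", "2", "f", "h", "k", "END_ROW_WORD",
       "k", "1", "2", "END_ROW_WORD",
       "e", "g", "j", "2", "k", "END_ROW_WORD")
end <- "END_ROW_WORD"

# Append a newline to every element that is exactly the end marker
# (assumption: exact match, unlike the regex match done by grep)
s2 <- ifelse(x == end, paste0(x, "\n"), x)

# read.table(text = ...) creates and closes the text connection internally
result2 <- read.table(text = paste(s2, collapse = " "), fill = TRUE)
result2
```

This should give the same three rows and six columns as the loop version, with the short row padded by fill = TRUE.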
EDIT.
After the OP edited the question, I revised the code to see whether that file can be read properly into a data.frame. The main difficulty is that the file has many non-printable characters, and read.table was having trouble getting to the end of the file.
Credit for the solution to this problem goes to the accepted answer in read.csv warning 'EOF within quoted string' prevents complete reading of file. I upvoted both the question and that answer.
Credit must also be given to @kath: the idea in that answer of using a string replacement to put newline characters as EOL markers is much better than my ugly for loop above. Unlike kath, I use base R only, as I don't find it necessary to load an external package.
Now the revised code.
# Use this first pattern if AUCTION also marks the end of a row
#pattern <- "(^LOT|^AUCTION)"
pattern <- "(^LOT)"
dat <- readLines("data_.csv")
s <- gsub("[[:cntrl:]]", "", dat)
s <- sub(pattern, "\\1\n", s)
con <- textConnection(paste(s, collapse = "\t"))
result <- read.table(con, sep = "\t", fill = TRUE, quote = "", row.names = NULL)
close(con)
head(result)
tail(result)
str(result)
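To make the two substitution steps easier to follow, here is a small sketch on made-up input. The vector raw is hypothetical and is not taken from the OP's file; it only stands in for what readLines might return, with a couple of control characters thrown in.

```r
# Hypothetical lines standing in for readLines() output;
# "\001" and "\002" are control characters
raw <- c("a\001", "b", "LOT", "c", "\002d", "LOT")

# Step 1: remove every control character
clean <- gsub("[[:cntrl:]]", "", raw)

# Step 2: put a newline right after each LOT found at the start of an element
marked <- sub("(^LOT)", "\\1\n", clean)

clean
marked
```

After the first step the elements should contain only printable characters, and after the second step each LOT element ends in "\n", which is what later lets read.table split the pasted string into rows.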
I thought that there might be some empty rows, so I checked with the following code.
#
# See if there are any empty rows
#
empty <- apply(result, 1, function(x) nchar(trimws(paste0(x, collapse = ""))) == 0)
sum(empty)
#[1] 0