r - convert single column data frame to data frame with rows based on one fixed text

Question

Update 1

Linking the actual dataset since the solutions given for the example data are not working out for me.

Link: https://app.box.com/s/65j1enr13pi51i44mfrymccklw1artot

Please note that LOT is the end of the row marker.

--

I have a data frame like the following (single column):

D
2
f
h
k
END_ROW_WORD
k
1
2
END_ROW_WORD
e
g
j
2
k
END_ROW_WORD

I'd like to convert it into the following format:

As you can see, there is a specific word (END_ROW_WORD) that marks the end of each row.

And what is now the end row marker? You have `PAST LOT`, `LOT` but also `PAST AUCTION` and `AUCTION`. — Rui Barradas, Feb 23 '18 at 22:34
The problem with your complete dataset is that a) you have several words per row (not single letters as shown here) and b) all kinds of special characters, which make finding an appropriate delimiter difficult. — kath, Feb 23 '18 at 22:51

score 1 · Answer 1 · answered Feb 23 '18 at 21:47

Here is a similar approach to Alejandro's, but using split instead of a for loop:

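# Length of each output row = distance between consecutive END_ROW_WORD positions
rowlens <- diff(c(0, which(df[[1]] == "END_ROW_WORD")))
# Split the column into one vector per output row
rows <- split(df[[1]], rep(seq_along(rowlens), rowlens))
# Pad shorter rows with NA so every row has the same length
rows <- lapply(rows, `length<-`, max(lengths(rows)))
as.data.frame(do.call(rbind, rows))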
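
A variant of the same splitting idea, written out as a self-contained sketch on the question's example data. It assumes the column is a plain character vector and that the marker itself should be dropped from the output (the question does not say either way). The names `x`, `is_end`, `row_id` and `wide` are made up for illustration.

```r
# Example data from the question, as a plain character vector
x <- c("D", "2", "f", "h", "k", "END_ROW_WORD",
       "k", "1", "2", "END_ROW_WORD",
       "e", "g", "j", "2", "k", "END_ROW_WORD")

is_end <- x == "END_ROW_WORD"

# Row id of each element: number of markers seen strictly before it, plus one
row_id <- cumsum(c(FALSE, is_end[-length(is_end)])) + 1L

# One vector per output row, with the marker itself dropped (an assumption)
rows <- split(x[!is_end], row_id[!is_end])

# Pad every row with NA up to the longest row, then bind the rows together
width  <- max(lengths(rows))
padded <- lapply(rows, function(r) { length(r) <- width; r })
wide   <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

Using `cumsum()` for the row ids avoids computing row lengths first. To keep the marker in the last column, split `x` by `row_id` directly.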

score 1 · Answer 2 · answered Feb 23 '18 at 21:58

A solution without for-loops, but with stringr

library(stringr)
new_text <- str_c(df$V1, collapse = " ")
new_text <- str_replace_all(new_text, "END_ROW_WORD", "END_ROW_WORD\n")
read.table(text = new_text, fill = TRUE)

#   V1 V2 V3           V4 V5           V6
# 1  D  2  f            h  k END_ROW_WORD
# 2  k  1  2 END_ROW_WORD                
# 3  e  g  j            2  k END_ROW_WORD

Data

df <- 
  structure(list(V1 = structure(c(3L, 2L, 6L, 8L, 10L, 5L, 10L, 1L, 2L, 5L, 4L, 7L, 9L, 2L, 10L, 5L),
                                .Label = c("1", "2", "D", "e", "END_ROW_WORD", "f", "g", "h", "j", "k"),
                                class = "factor")),
            .Names = "V1", class = "data.frame", row.names = c(NA, -16L))

Updated the question with the actual dataset, as the solution didn't work out for me. — user709413, Feb 23 '18 at 22:27

score 0 · Answer 3 · answered Feb 23 '18 at 21:42

This might not be the best way to do it, but it works:

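# `data` is assumed to be the single column as a vector, e.g. df[[1]]
pos_help = which(grepl("END_ROW_WORD", data))  # positions of the end-of-row markers

d = list()
for(i in seq_along(pos_help)){
  if(i == 1){
    # first row: from the start up to the first marker
    d[[i]] = data[1:pos_help[1]]
  } else {
    # later rows: from just after the previous marker up to this one
    d[[i]] = data[(pos_help[i-1]+1):pos_help[i]]
  }
}
# pad each row with NA to the longest length, then bind the rows together
dataFrame = do.call(rbind,lapply(d, "length<-", max(lengths(d))))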

Rui Barradas · Accepted Answer · 2018-02-24T10:12:57.313

This first puts a newline character, "\n", after every "END_ROW_WORD" marker, then pastes the result into a long character string.
Then, it uses read.table to read the data in from a text connection.

end <- "END_ROW_WORD"

inx <- c(0, grep(end, dat[[1]]))
s <- NULL
for(i in seq_along(inx)[-1]){
    s <- c(s, dat[[1]][(inx[(i - 1)] + 1):inx[i]], "\n")
}

con <- textConnection(paste(s, collapse = " "))
result <- read.table(con, fill = TRUE)
close(con)
result
#  V1 V2 V3           V4 V5           V6
#1  D  2  f            h  k END_ROW_WORD
#2  k  1  2 END_ROW_WORD                
#3  e  g  j            2  k END_ROW_WORD

DATA.

dat <-
structure(list(V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD", 
"k", "1", "2", "END_ROW_WORD", "e", "g", "j", "2", "k", "END_ROW_WORD"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -16L
))

EDIT.

After the question's edit by the OP, I revised the code to see if that file can be properly read into a data.frame. The main difficulty is that the file has many non-printable characters, and read.table was having trouble getting to the end of the file.

Credit for the solution to this problem goes to the accepted answer to "read.csv warning 'EOF within quoted string' prevents complete reading of file". I upvoted both the question and that answer.

Credit must also be given to @kath: the idea in that answer of using a string replacement to insert newline characters as end-of-line markers is much better than my ugly for loop above. Unlike kath, I use base R only; I don't find it necessary to load an external package.
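
For the example data, the string-replace idea might look like this in base R. This is only a sketch: `dat` is rebuilt inline so the snippet runs on its own, `txt` and `res` are illustrative names, and `fixed = TRUE` is used so the marker is matched literally rather than as a regular expression.

```r
# Example data from the question
dat <- data.frame(V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD",
                         "k", "1", "2", "END_ROW_WORD",
                         "e", "g", "j", "2", "k", "END_ROW_WORD"),
                  stringsAsFactors = FALSE)

# Collapse the column into one string, then append a newline to every marker
txt <- paste(dat$V1, collapse = " ")
txt <- gsub("END_ROW_WORD", "END_ROW_WORD\n", txt, fixed = TRUE)

# read.table(text = ...) parses the string as if it were the contents of a file;
# fill = TRUE pads rows that have fewer fields than the widest row
res <- read.table(text = txt, fill = TRUE, stringsAsFactors = FALSE)
```

This only works as long as no item contains whitespace, because the default separator of `read.table` splits fields on any run of spaces.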

Now the revised code.

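# Use this first pattern if AUCTION also marks the end of a row
#pattern <- "(^LOT|^AUCTION)"
pattern <- "(^LOT)"

dat <- readLines("data_.csv")
s <- gsub("[[:cntrl:]]", "", dat)   # strip non-printable characters
s <- sub(pattern, "\\1\n", s)       # newline after each end-of-row marker

txt <- paste(s, collapse = "\t")
# drop the tab that follows each newline, otherwise every row after the
# first starts with an empty field and its columns are shifted by one
txt <- gsub("\n\t", "\n", txt, fixed = TRUE)

con <- textConnection(txt)
result <- read.table(con, sep = "\t", fill = TRUE, quote = "", row.names = NULL)
close(con)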

head(result)
tail(result)
str(result)
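
To try the same steps without the linked file, here is a small sketch on a made-up stand-in. The contents of `lines` are invented for illustration (several words per line, a stray single quote, one control character, `LOT` closing each record) and are not taken from the real dataset. `header = FALSE` and `comment.char = ""` are set explicitly as defensive assumptions.

```r
# Invented stand-in for the real file
lines <- c("Oak table", "5' wide\001", "LOT",
           "Brass lamp", "LOT")
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

s <- readLines(path)
s <- gsub("[[:cntrl:]]", "", s)          # strip non-printable characters
s <- sub("(^LOT)", "\\1\n", s)           # newline after each end-of-row marker

txt <- paste(s, collapse = "\t")         # tab-separate the fields
txt <- gsub("\n\t", "\n", txt, fixed = TRUE)  # no empty field at the start of later rows

# quote = "" stops the lone single quote from opening a quoted string
result <- read.table(text = txt, sep = "\t", fill = TRUE, quote = "",
                     header = FALSE, comment.char = "",
                     stringsAsFactors = FALSE)
unlink(path)
```

A tab is used as the separator because the items contain spaces. If the real data could contain tabs as well, some other character that never occurs in the data would be needed.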

I thought that there would be some empty rows, so I checked it with the following code.

#
# See if there are any empty rows
#
empty <- apply(result, 1, function(x) nchar(trimws(paste0(x, collapse = ""))) == 0)
sum(empty)
#[1] 0

Thanks. Somehow the solution did not work out when I applied it on the actual dataset. I've updated the question now. — user709413, Feb 23 '18 at 22:21

score 0 · Answer 5 · answered Feb 23 '18 at 22:05

Without a loop, but using map and split (because why not :p):

library(tidyverse)
df <- tibble(x=c(
  "D",
  "2",
  "f",
  "h",
  "k",
  "END_ROW_WORD",
  "k",
  "1",
  "2",
  "END_ROW_WORD",
  "e",
  "g",
  "j",
  "2",
  "k",
  "END_ROW_WORD"
)  

)
split(df,cut(1:16,breaks=c(0,which(df == "END_ROW_WORD")))) %>%
  map_dfc(~rbind(.x,tibble(x=rep(NA,(6-nrow(.x)))))) %>% 
  t() %>% as.data.frame()

Vincent Guyader

Thanks. I've linked the real data set as the solution for the example data didn't give desired result. – user709413 Feb 23 '18 at 22:30
