I have a very large dataset with many columns, exported from an application. The issue is that the file is "empty character" separated. Reading in the file with readLines yields a list of strings, each string with the same number of characters.
A possible way of determining the positions of the columns is checking whether each string, let s say at position 5, has an empty character. Therefore, can keep on searching starting from vector 1 until a non-empty character is found.
d <- data.frame("V1" = c(" f ggh", "aa hh", "a qq" ), stringsAsFactors =
F)
first.char <- function(col){
current <- 0
j <- 1
while(j <= length(d)){
tmp <- substr(d[j], col, col)
if(!grepl("^\\s*$", tmp)){
current <- 1
break}
j <- j+1
}
return(current)
}
row_dummies <- lapply( c(1:6), first.char) %>% unlist
This approach works but is very slow when scaling up (have a list of 1 million strings, each 1500 characters long). I also tried to transform each vector into a data.table and then use str split (Split text string in a data.table columns), but that seems to be even more inefficient, given that in most of the cases it is not necessary to check all rows.
Any suggestions or advice?
UPDATE: The example above is too trivial. This one is a bit better:
text <- c("df ggh a a h h a qq",
" aa hh ab qt",
" fggh aa hh a ")
The desired output is
list( c("df ggh", "a a", "h h", "a", "qq"),
c(NA, "aa", "hh", "ab", "qq"),
c(" fggh", "aa", "hh", "a", NA)
)
str_locate_all works well, as it indicates where to split the strings:
cuts_in <- sapply(text, function(x) x %>% str_locate_all(. , "\\s") )
cuts_in <- lapply(cuts_in, data.table) # to data.table
cuts_in <- rbindlist(cuts_in)
cuts_in <- cuts_in[, .N, by=start]
cuts_in[ N==3 ,"start"]
start
1: 7
2: 11
3: 15
4: 18
However, probably not the most efficient way (have 15 files, one million rows per file, each row has 1500 characters). For instance, there is no need to check the character at position 1 in rows 2 and 3 given that row 1 position 1 is not a space. read_table2 does not seem to be a solution either:
read_table2(text, col_names = FALSE)
X1 X2 X3 X4 X5 X6 X7 X8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 df ggh a a h h a qq
2 aa hh ab qt NA NA NA NA
3 fggh aa hh a NA NA NA NA