I don't know if there are good tools that look for multi-char delimiters, and you aren't the first to ask about it. Most (incl read.table
, read.delim
, and readr::read_delim
) require a single-byte separator.
One method, though certainly not efficient for large files, is to load them in line-wise and do the splitting yourself.
(Consumable data that the bottom.)
x <- readLines(textConnection(file1))
x <- x[x != 'header'] # or x <- x[-(1:5)]
(I'm guessing it isn't always the literal header
, so I'm assuming it's either a fixed count or you can easily "know" which is which.)
spl <- strsplit(x, ' ')
str(spl)
# List of 3
# $ : chr [1:31] "01130009.JPG" "JPEG" "" "" ...
# $ : chr [1:20] "01130009.JPG" "JPEG" "" "" ...
# $ : chr [1:7] "01130009.JPG" "JPEG" "" "" ...
This seems ok, except that in your examples, there are lots of blanks on the right ...
spl[[1]]
# [1] "01130009.JPG"
# [2] "JPEG"
# [3] ""
# [4] ""
# [5] "2/5/2018 3:53:44 PM"
# [6] "G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg"
# [7] "Gray Fox"
# [8] ""
# [9] ""
# [10] ""
# [11] ""
# [12] ""
# [13] ""
# [14] ""
# [15] ""
# [16] ""
# [17] ""
# [18] ""
# [19] ""
# [20] ""
# [21] ""
# [22] ""
# [23] ""
# [24] ""
# [25] ""
# [26] ""
# [27] ""
# [28] ""
# [29] ""
# [30] ""
# [31] ""
So if you know how many columns there are, then you can easily remove extras:
spl <- lapply(spl, `[`, 1:7)
and then check the output:
as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5
# 1 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# V6
# 1 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# 2 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# 3 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# V7
# 1 Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3 Gray Fox
This works equally well with your second example:
x <- readLines(textConnection(file2))
x <- x[x != 'header'] # or x <- x[-(1:5)]
spl <- lapply(strsplit(x, ' '), `[`, 1:7)
as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5
# 1 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG 2/5/2018 3:53:44 PM
# V6
# 1 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# 2 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# 3 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# V7
# 1 Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3 Gray Fox
Consumable data:
# note: replaced single '\' with double '\\' for R string-handling only
file1 <- 'header
header
header
header
header
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg Gray Fox
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg Gray Fox '
file2 <- 'header
header
header
header
header
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB Gray Fox
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG JPEG 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB Gray Fox '