3

I have multiple .txt files which look like this:

header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Gray Fox                                                                           
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop                                         
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther Downg   Gray Fox   

The width of the last 2 columns varies, but there is always 3 spaces between all the columns (3rd column is empty in this case).

I'm using this code to read in the example .txt:

read.fwf(filename.txt,skip=5,widths=c(12,16,19,76,83),fill=T,fileEncoding = "UTF-16")

But this code won't work properly on this .txt:

header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Gray Fox                                                                           
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Direct Register Walk, Gait, Gray Fox, Stop                                         
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\AAA AAAAAAAA\AAAAA AA\BBBB BBBB & BBBBB BBBBB\CAM_07-0008\Farther DowngBBB   Gray Fox   

Is there a way to read in a .txt file with a fixed deliminator (3 spaces) instead of having to define the width of each column, since the column width varies between files.

The files also have some issues with encoding, so here is the example file I'm using

Liza
  • 1,066
  • 2
  • 16
  • 26
  • have you tried with read.table ? – YOLO Mar 16 '18 at 17:45
  • Yes, I tried read.table(filename.txt,skip=5,sep=" ",fileEncoding="UTF-16") and with sep=" " and sep="" and every time I get an error Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 103 elements – Liza Mar 16 '18 at 17:50
  • That is because it separates on any white space, not just the three spaces in a row. If you don't have too many files, an easy solution could be just to replace all occurrences of three spaces in those with tabs, then use the `read.table` call you just tried. – C. Braun Mar 16 '18 at 17:53
  • Are these big files? You could read them all as raw text, replace three spaces with a tab with `gsub` and then parse that as table input. Or are you on a unix-like machine? You can read from a `pipe()` and do that translation quickly with something like `awk`. – MrFlick Mar 16 '18 at 17:53
  • I have hundreds of those files. But they are not too big 50 to 300 lines in each or so. – Liza Mar 16 '18 at 18:00
  • I'm not on a unix-like machine, unfortunately – Liza Mar 16 '18 at 18:00
  • I have added code for reading files in my answer. – rnso Mar 16 '18 at 18:32

2 Answers2

3

I don't know if there are good tools that look for multi-char delimiters, and you aren't the first to ask about it. Most (incl read.table, read.delim, and readr::read_delim) require a single-byte separator.

One method, though certainly not efficient for large files, is to load them in line-wise and do the splitting yourself.

(Consumable data that the bottom.)

x <- readLines(textConnection(file1))
x <- x[x != 'header'] # or x <- x[-(1:5)]

(I'm guessing it isn't always the literal header, so I'm assuming it's either a fixed count or you can easily "know" which is which.)

spl <- strsplit(x, '   ')
str(spl)
# List of 3
#  $ : chr [1:31] "01130009.JPG" "JPEG" "" "" ...
#  $ : chr [1:20] "01130009.JPG" "JPEG" "" "" ...
#  $ : chr [1:7] "01130009.JPG" "JPEG" "" "" ...

This seems ok, except that in your examples, there are lots of blanks on the right ...

spl[[1]]
#  [1] "01130009.JPG"                                                                
#  [2] "JPEG"                                                                        
#  [3] ""                                                                            
#  [4] ""                                                                            
#  [5] "2/5/2018 3:53:44 PM"                                                         
#  [6] "G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg"
#  [7] "Gray Fox"                                                                    
#  [8] ""                                                                            
#  [9] ""                                                                            
# [10] ""                                                                            
# [11] ""                                                                            
# [12] ""                                                                            
# [13] ""                                                                            
# [14] ""                                                                            
# [15] ""                                                                            
# [16] ""                                                                            
# [17] ""                                                                            
# [18] ""                                                                            
# [19] ""                                                                            
# [20] ""                                                                            
# [21] ""                                                                            
# [22] ""                                                                            
# [23] ""                                                                            
# [24] ""                                                                            
# [25] ""                                                                            
# [26] ""                                                                            
# [27] ""                                                                            
# [28] ""                                                                            
# [29] ""                                                                            
# [30] ""                                                                            
# [31] ""                                                                            

So if you know how many columns there are, then you can easily remove extras:

spl <- lapply(spl, `[`, 1:7)

and then check the output:

as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
#             V1   V2 V3 V4                  V5
# 1 01130009.JPG JPEG       2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG       2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG       2/5/2018 3:53:44 PM
#                                                                             V6
# 1 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# 2 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
# 3 G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
#                                           V7
# 1                                   Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3                                   Gray Fox

This works equally well with your second example:

x <- readLines(textConnection(file2))
x <- x[x != 'header'] # or x <- x[-(1:5)]
spl <- lapply(strsplit(x, '   '), `[`, 1:7)
as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
#             V1   V2 V3 V4                  V5
# 1 01130009.JPG JPEG       2/5/2018 3:53:44 PM
# 2 01130009.JPG JPEG       2/5/2018 3:53:44 PM
# 3 01130009.JPG JPEG       2/5/2018 3:53:44 PM
#                                                                                   V6
# 1 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# 2 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
# 3 G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB
#                                           V7
# 1                                   Gray Fox
# 2 Direct Register Walk, Gait, Gray Fox, Stop
# 3                                   Gray Fox

Consumable data:

# note: replaced single '\' with double '\\' for R string-handling only
file1 <- 'header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox                                                                           
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop                                         
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox   '
file2 <- 'header
header
header
header
header
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Gray Fox                                                                           
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Direct Register Walk, Gait, Gray Fox, Stop                                         
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA AA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther DowngBBB   Gray Fox   '
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • I tried readLines, and there is smth with encoding in those txt files, so when I do readLines I get "ÿþh" "" "", and readLines doesn't take fileEncoding argument, right? I should fix my example, because this works for the example data – Liza Mar 16 '18 at 18:04
  • [`readLines(..., encoding="utf-8")`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readLines.html)? – r2evans Mar 16 '18 at 18:08
  • No, that doesn't work. I uploaded my txt in dropbox https://www.dropbox.com/s/5st9k602bcfhzwk/first%20one%20-%20Copy.txt?dl=0 – Liza Mar 16 '18 at 18:13
  • I don't know command line tools, but I guess if you sub the triple space for a tab before passing to fread, it should work. `library(data.table); library(magrittr); file2 %>% gsub(" {3}(?=[^ ])","\t", ., perl=TRUE) %>% fread(skip=5, fill=TRUE)` Oh, nvm, just noticed the other answer covers it. – Frank Mar 16 '18 at 18:18
  • 1
    `readLines(file("first one - Copy.txt", encoding='utf-16'))` worked without warning/error. – r2evans Mar 16 '18 at 18:22
  • If you have issues with performance (due to *many* or *really large* files), you should lean closer to unixy cli tools, such as preprocessing your files (on the command line) with `sed -e "s/ /|/g" files*`. If you have [Git for Windows](https://gitforwindows.org/) installed, you have `sed` available (somewhere); if not, install it, since you should probably be using version control anyway :-) – r2evans Mar 16 '18 at 23:25
2

One can read the file skipping header rows, then use gsub function to replace 3 spaces with a convenient separator (vertical bar used here):

> mytext = "01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Direct Register Walk, Gait, Gray Fox, Stop
01130009.JPG   JPEG         2/5/2018 3:53:44 PM   G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg   Gray Fox"

> ddf = read.table(text=gsub("   ", "|", mytext), header=F, sep="|")
> ddf 
            V1   V2 V3 V4                  V5                                                                           V6
1 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
2 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
3 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
                                          V7
1                                   Gray Fox
2 Direct Register Walk, Gait, Gray Fox, Stop
3                                   Gray Fox

Edit: As suggested by @r2evans in comments below, text has to be trimmed to remove trailing spaces using gsub(" *$", "", ...). Alternatively, following function is from How to trim leading and trailing whitespace in R?:

trim.trailing <- function (x) sub("\\s+$", "", x)

For text files, one can use readLines to read the text file:

> mytext = readLines(file('testfile.txt')) # read file text
> mytext = mytext[-c(1:5)]           # remove first 5 rows ('header')
> mytext = gsub("\\s+$", "", mytext) # remove trailing spaces
> mytext = gsub("   ", "|", mytext)  # change separator
> ddf = read.table(text=mytext, header=F, sep='|') # read columns from text
> ddf
            V1   V2 V3 V4                  V5                                                                           V6
1 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
2 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
3 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
                                          V7
1                                   Gray Fox
2 Direct Register Walk, Gait, Gray Fox, Stop
3                                   Gray Fox

Alternatively, one can first read them to a data.frame of one variable, then manipulate the rows to get desired result:

> ddf1 = read.table(file='testfile.txt', sep = '\n', skip=5)
> mytext = gsub("\\s+$", "", unlist(ddf1$V1))
> ddf2 = read.table(text=gsub("   ", "|", mytext), header=F, sep='|')
> ddf2
            V1   V2 V3 V4                  V5                                                                           V6
1 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
2 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
3 01130009.JPG JPEG NA NA 2/5/2018 3:53:44 PM G:\\AAA AAAAAAAA\\AAAAA\\BBBB BBBB & BBBBB BBBBB\\CAM_07-0008\\Farther Downg
                                          V7
1                                   Gray Fox
2 Direct Register Walk, Gait, Gray Fox, Stop
3                                   Gray Fox
rnso
  • 23,686
  • 25
  • 112
  • 234
  • I think your sample data has removed all of the trailing spaces from the data ... I suspect that whatever created the data files space-filled the rows. Minor, but that means `read.table` will see different numbers of columns in each row. – r2evans Mar 16 '18 at 18:00
  • (This can probably be mitigated with another `gsub(" *$", "", ...)`.) – r2evans Mar 16 '18 at 18:02
  • For this to work I'd have to open every single file and do mytext ="...", is there a way to read directly from .txt? I have hundreds of them, and I don't want to open every single one of them – Liza Mar 16 '18 at 18:11
  • @Liza If you know a command-line way to correct the delimiter, fread accepts that: https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread – Frank Mar 16 '18 at 18:20
  • One can use `read.table` function to read a text file: http://astrostatistics.psu.edu/su07/R/html/utils/html/read.table.html – rnso Mar 16 '18 at 18:22