read data file with mixed numbers and texts in r

Question

I have data files that contain the following: the first 10 columns are numbers, the last column is text. They are separated by space. The problem is that the text in the last column may also contain space. So when I used read.table() I got the following error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :  
line 21 did not have 11 elements

what's the easiest way of reading the first 10 columns into a data matrix, and the last column into a string vector? Should I use readLines() first then process it?

score 1 · Answer 1 · 2017-06-03T19:26:58.240

If you cannot re-export or recreate your data files with different, non-whitespace separators or with quotation marks around the last column to avoid that problem, you can use read.table(... , fill = TRUE) to read in a file with unequal columns and then combine columns 11+ with dat$col11 <- do.call(paste, c(dat[11:nrow(dat)], sep=" ")) (or something like that) and then drop the now unwanted columns with dat[11:(nrow(dat)-1)] <- NULL. Finally, you may need to trim the whitespace from the end of the eleventh column with trimws(dat$col11).

Note that fill only considers the first five lines of your file, so you may need to find out the number of 'pseudo-columns' in the longest line manually and specify an appropriate number of col.names in read.table (see the linked answer).

Thanks for the `fill = TRUE` hint. I'll change my data generation program later. But now I want to read the results because it took a whole night to generate the data. — Yugong, Jun 03 '17 at 18:39

Yugong · Answer 2 · 2017-06-03T18:41:56.273

0

Hinted by the useful fill = TRUE option of read.table() function, I used the following to solve my problem:

dat <- read.table(fname, fill = T)
dat <- dat[subset(1:nrow(dat),!((1:nrow(dat)) %in% (which(dat[,11]=="No") + 1))),]

The fill = TRUE option puts everything after the first space of the 11th column to a new row (redundant rows that the original data do not have). The code above removes the redundant rows based on three assumptions: (1) the number of space separators in the 11th column is no more than 11 such that we know there is only one more row of text after a line whose 11th column contains space (that's what the +1 does); (2) we know the line whose 11th column starts with a certain word (in my case it is "No") (3) Keeping only the first word in the 11th column would be sufficient (without ambiguity).

edited Jun 03 '17 at 18:41

answered Jun 03 '17 at 18:35

Yugong

23
6

Strangely enough, the `fill = TRUE` option sometimes puts the extra content of the last column to new columns and sometimes puts them in an extra row! So the solution here doesn't always work. I'll edit it if I find a solution. – Yugong Jun 03 '17 at 19:18
The reason for the `read.table(..., fill = T)` behavior in my comment above: `read.table()` uses the first 5 rows to determine the number of columns. So if the row whose 11th column has two words separated by a space occurs within the first 5 rows, the function will determine there are 12 columns and put the extra word in an extra column. If the row occurs beyond the first 5 rows, the extra word will be put to an extra row. – Yugong Jun 03 '17 at 21:31

Lamia · Answer 3 · 2017-06-03T22:49:37.423

0

You could reformat your file before reading it in R. For example, using perl in a terminal:

perl -pe 's/(?<=[0-9]) /,/g' myfile.txt > myfile.csv

This replaces every space preceded by a number by a comma. Then read it into R using read.csv:

df = read.csv("myfile.csv")

edited Jun 03 '17 at 22:49

answered Jun 03 '17 at 21:38

Lamia

3,845
1
12
19

Thanks. I think this should work although it's a bit too complicated. I think the solution below is easier. – Yugong Jun 03 '17 at 22:33
I may remove the second part of the answer as I agree it is a bit too complicated for what you're trying to do. But the first part using perl is actually quite simple, and once you have your reformatted file, you can read it anytime more easily. – Lamia Jun 03 '17 at 22:36
I agree. But I suggest you keep it because it may be helpful for other people or other purposes. – Yugong Jun 03 '17 at 22:59

score 0 · Accepted Answer · answered Jun 03 '17 at 22:24

The following solved my problem:

nc <- max(count.fields(fname), sep = " ")
data <- read.table(fname, fill = T, col.names = paste0("V", seq_len(nc)), sep = " ", header = F)

Then the first 10 columns will be the numeric results I want and the remaining nc-10 columns can be combined into one string vector. The most helpful post is: How can you read a CSV file in R with different number of columns

read data file with mixed numbers and texts in r

4 Answers4