How do I extract data as a data frame from a text file in R? The data has names in it and the middle names are messing with my method

Question

I have a text file where strings are separated by whitespaces. I can easily extract these into R as a data frame, by first using the scan command and then seeing that each record has 15 strings in them.

So data[1:15} is one row, data[16:30} is the other row and so on. In each of these records, the name is composed of two strings, say FOO and BAR. But some records have names such as FOO BOR BAR or even FOO BOR BOO BAR. This obviously messes with my 15 string theory. How can I easily extract the data into a data frame?

So my data is in my working directory called results.txt.

I use this to scan my data:

mech <- scan("results.txt", "")

Then I can make the data frames like this:

d1 <- t(data.frame(mech[1:15]))
d2 <- t(data.frame(mech[16:30]))
d3 <- t(data.frame(mech[31:45]))

My plan was to iterate this in a for loop and rbind the data into one consolidated data frame.

d1 results in something like

1 FOO  BAR 2K12/ME/01 96 86 86 92 73  86  72 168  82  30 84.93

d2 results in

2 FOO2 BAR2 2K12/ME/02 72 83 61 75 44  88  75 165  91  30 72.60

Here, FOO and BAR are first and last names, respectively. Most records are like this. But d3:

3 FOO3 BOR BAR3 2K12/ME/03 72 83 61 75 44  88  75 165  91  30

Because of the extra middle name, I lose the final string of the text, the part right after 30. This then spills over to the next record. So row 46:60, instead of starting with 4, begins with the omitted data from the previous record.

How can I extract the data by treating the names as a single string?

EDIT: Stupid of me for not providing the data frame itself. Here is a sample.

 1      FOO BAR                         2K12/ME/01                  96            86            86            92             73             86             72            168            82                       30     84.93
 2      FOO2 BAR2                        2K12/ME/02                  72            83            61            75             44             88             75            165            91                       30     72.60
 3      FOO3 BOR BAR3                      2K12/ME/03                  63            84            62            62             50             79             74            157            85                       30     69.13
 4      FOO4 BOR BAR4                  2K12/ME/04                  89            88            74            79             77             83             68            182            82                       30     81.93

Thanks for posting. Can you give us a reproducible example? See this post for more-http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. — John_dydx, Jun 26 '15 at 06:31
It is not clear. You may need to create some example along with expected result based on that example as @John suggested — akrun, Jun 26 '15 at 06:35
Is the element following the names always `2K12/ME/XX`? If not, please post a more representative sample of the data. — Ritchie Sacramento, Jun 26 '15 at 06:59
can you try editing in a text editor? Maybe its a mix of spaces and tabs or something like that you can use search and replace on. This happens to me often and that is usually my first approach. — Stedy, Jun 26 '15 at 06:59
@Jay Yes the the element following the names is always 2K12/ME/XX. It can also go up to 2K12/ME/XXX, that is a three digit number. — Aditya Salapaka, Jun 26 '15 at 07:23
Do you have line breaks in your original file? Does the third string actually contain an additional number (making it 16 elements long)? — Roland, Jun 26 '15 at 07:25
@Stedy I can, but I want to be able to run this script on other similar documents, too. I tried to import the text data into excel, fix the names, and export that data into csv. This works, but is time consuming and not a very reproducible way of doing things... — Aditya Salapaka, Jun 26 '15 at 07:32
@Roland In the text file I have, the data is already formatted into columns and there is a linebreak after each record. But when I extract this data into R using `scan()`, the next record beigns immediately after the previous record ends. So the character vector into which the data was scanned, it would read as: ... “82” “30” “84.93” “2” “FOO2” “BAR2” — Aditya Salapaka, Jun 26 '15 at 07:32
@Roland Yeah the third string actually does contain an additional number, making it 16 elements long. — Aditya Salapaka, Jun 26 '15 at 07:47

Roland · Answer 1 · 2015-06-26T09:26:19.580

s1 <- "1      FOO BAR                         2K12/ME/01                  96            86            86            92             73             86             72            168            82                       30     84.93
2      FOO2 BAR2                        2K12/ME/02                  72            83            61            75             44             88             75            165            91                       30     72.60
3      FOO3 BOR BAR3                      2K12/ME/03                  63            84            62            62             50             79             74            157            85                       30     69.13
4      FOO4 BOR BAR4                  2K12/ME/04                  89            88            74            79             77             83             68            182            82                       30     81.93"
s2 <- readLines(textConnection(s1)) #read from your file here
s2 <- strsplit(s2, "\\s+") #splits by white space
s3 <- lapply(s2, function(s) {
  n <- length(s)
  s[2] <- paste(s[2:(2 + (n - 14))], collapse = " ")
  s[-(3:(2 + (n - 14)))]  
})

DF <- do.call(rbind, s3)
DF <- as.data.frame(DF, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)
str(DF)

#'data.frame':  4 obs. of  14 variables:
# $ V1 : int  1 2 3 4
# $ V2 : chr  "FOO BAR" "FOO2 BAR2" "FOO3 BOR BAR3" "FOO4 BOR BAR4"
# $ V3 : chr  "2K12/ME/01" "2K12/ME/02" "2K12/ME/03" "2K12/ME/04"
# $ V4 : int  96 72 63 89
# $ V5 : int  86 83 84 88
# $ V6 : int  86 61 62 74
# $ V7 : int  92 75 62 79
# $ V8 : int  73 44 50 77
# $ V9 : int  86 88 79 83
# $ V10: int  72 75 74 68
# $ V11: int  168 165 157 182
# $ V12: int  82 91 85 82
# $ V13: int  30 30 30 30
# $ V14: num  84.9 72.6 69.1 81.9

The data you used for your explanation doesn't match with mine! Like an idiot, I didn't add my own sample data, which I have done now. — Aditya Salapaka, Jun 26 '15 at 09:20
Edited. I assume that you don't have white space at the beginning of each line. Adjust as needed. — Roland, Jun 26 '15 at 09:26
There is a white space at the beginning of each line. I removed it using this function. `s3 <- lapply(s2, function(s) { c <- unlist(s) c <- c[-1] c <- list(c) })` But then I ran your lapply function to create s4, passing s3 in the function, and I got the error "Error in s[2:(2 + (n - 14))] : only 0's may be mixed with negative subscripts " — Aditya Salapaka, Jun 26 '15 at 10:12
I'm not sure what you are trying there. Just add `s <- s[-1]` to the beginning of the function in my `lapply` loop. — Roland, Jun 26 '15 at 10:16
Your method worked great, but it doesn't account for single names or names greater than lengths of 3. A record had a one-word name, and the script failed there. But still, thanks! Learnt something new here, will certainly help me in the future. — Aditya Salapaka, Jun 26 '15 at 10:46

Ritchie Sacramento · Accepted Answer · 2015-06-26T09:56:08.873

One approach is to use regex to enclose the names in quotes and then a simple read table. This approach has the advantage of allowing for cases with any number of names.

s1 <- "1      FOO BAR                         2K12/ME/01                  96            86            86            92             73             86             72            168            82                       30     84.93
2      FOO2 BAR2                        2K12/ME/02                  72            83            61            75             44             88             75            165            91                       30     72.60
3      FOO3 BOR BAR3                      2K12/ME/03                  63            84            62            62             50             79             74            157            85                       30     69.13
4      FOO4 BOR BAR4                  2K12/ME/04                  89            88            74            79             77             83             68            182            82                       30     81.93" 

s2 <- gsub("^ *|(?<= ) | *$", "", s1, perl = T)

read.table(text=gsub("(?<=[[:digit:]] )(.*)(?= 2K12)", "'\\1'", s2, perl = T), header = F)

Which gives:

  V1            V2         V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13   V14
1  1       FOO BAR 2K12/ME/01 96 86 86 92 73 86  72 168  82  30 84.93
2  2     FOO2 BAR2 2K12/ME/02 72 83 61 75 44 88  75 165  91  30 72.60
3  3 FOO3 BOR BAR3 2K12/ME/03 63 84 62 62 50 79  74 157  85  30 69.13
4  4 FOO4 BOR BAR4 2K12/ME/04 89 88 74 79 77 83  68 182  82  30 81.93

The data you used for your explanation doesn't match with mine! Like an idiot, I didn't add my own sample data, which I have done now. — Aditya Salapaka, Jun 26 '15 at 09:20
I tried to use your solution, though and it did work to an extent. It enclosed the full name in ''. Like 'FOO BOR BAR'. But when I tried to use the read.table function, it failed citing "line 1 did not have 16 elements" — Aditya Salapaka, Jun 26 '15 at 09:28
The answer has been updated to strip out additional whitespace after you provided a more complete sample of your data. Try again and post back if it doesn't work on your full file. — Ritchie Sacramento, Jun 26 '15 at 09:40
Hey, this worked! Thanks! BTW a minor niggle, some of the records have an extra string which says "ME-XXX" where XXX can be any number. How can I omit these? — Aditya Salapaka, Jun 26 '15 at 10:45

How do I extract data as a data frame from a text file in R? The data has names in it and the middle names are messing with my method

2 Answers2