I have output data from a piece of equipment. Unfortunately the output data is not organized very well, and I have been writing a code in R to break it down. Essentially the data is a separate list of information (basic descriptive information, and raw data for two different measurements A and B for each time interval) for each subject pasted into one long document. For example:
Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: Subject2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
I have written a code in R that works, but is not very elegant, using split(seq_along), for-loops, and do.call (based primarily on this stack overflow question and this blog post).
# First read text file in as a character vector called ‘example’
scan("example_file.txt", what="character", strip.white=T, sep="\n") -> example
# Separate the header text (before the colon) from the proceeding data
# and make that text name the components of the vector
regmatches(example, regexpr(example, pattern="[[:alnum:]]+:", useBytes = F)) -> names(example)
gsub(example, pattern="[[:print:]]+: ", replacement="", useBytes = F)-> example.2
# Then, split character vector into a list based on how many lines are
# dedicated to each subject (in this example, 11 lines); based on SE
# answer cited above
strsplit(example.2, "([A-Z]:)") -> example.3
split(as.list(example.3), ceiling(seq_along(example.2)/11)) -> example.4
# Use a for-loop to systematically add the data together for subjects 1
# and 2 for time interval 1, using the method detailed from a blog post
# (cited above)
my.list <- list()
for(i in 1:2){
strsplit(as.character(example.4[[i]][5]), split="[[:blank:]]+") -> A
strsplit(as.character(example.4[[i]][9]), split="[[:blank:]]+")-> B
as.vector(c(as.character(example.4[[i]][3]), "A", unlist(A))) -> A_char
as.vector(c(as.character(example.4[[i]][3]), "B", unlist(B))) -> B_char
paste(as.character(example.4[[i]][3]), "Measure_A") -> a_name
paste(as.character(example.4[[i]][3]), "Measure_B") -> b_name
my.list[[a_name]] <- A_char
my.list[[b_name]] <- B_char
}
final.data <- do.call(rbind, my.list)
as.data.frame(final.data) -> final.data
names(final.data) <- c("Subject", "Measure", "V1", "V2", "V3", "V4")
I can extract the data for a single time interval for A and B across all subjects using my code (for example, the lines "1: 1 2 4 1" and "1: 2 3 0 1" above) and put put all the information together in a data frame. Where is gets messy is when I want to do this for all of the time intervals, not just one time interval. I can't figure out how to do this without running separate for-loops for each time interval. I tried doing a for-loop within a for-loop, but that didn't work. I also couldn’t figure out how to do this with the apply()-type functions.
If I only had 3 time intervals, as per this example, this issue wouldn’t be so bad, but my actual data is a lot longer. Any suggestions for a more elegant and concise approach would be appreciated!
P.S. I am aware that the final data frame that the above code gives has redundant row names. However, this is a helpful way of making sure that the final data frame’s subject and measure information lines up with the labels I had applied to earlier R objects.