Use function to convert messy data in many data frames to single tidier format

Question

Hi have a set of data frames that come from exported Excel files from some proprietary software. The data looks like so:

> head(Ball)
                 Col1   Col2                Col3   Col4                Col5   Col6                Col7   Col8                Col9  Col10               Col11  Col12
1 2014-07-25 00:00:00   <NA> 2014-07-25 00:00:00   <NA> 2014-07-25 00:00:00   <NA> 2014-07-23 00:00:00   <NA> 2014-07-23 00:00:00   <NA> 2014-07-23 00:00:00   <NA>
2 1899-12-31 07:49:00   <NA> 1899-12-31 06:49:00   <NA> 1899-12-31 06:48:00   <NA> 1899-12-31 08:27:00   <NA> 1899-12-31 08:26:00   <NA> 1899-12-31 07:20:00   <NA>
3                   X      Y                   X      Y                   X      Y                   X      Y                   X      Y                   X      Y
4                   0      0                   0      0                   0      0                   0      0                   0      0                   0      0
5        0.0502222222 2.1945        0.0502222222 1.9437                0.05  1.254        0.0501123596 1.6302        0.0501086957      0                0.05      0
6        0.1004444444 5.7684        0.1004444444 4.7652                 0.1 4.2636        0.1002247191 4.2636        0.1002173913 0.3135                 0.1 2.1318
                Col13  Col14
1 2014-07-23 00:00:00   <NA>
2 1899-12-31 07:19:00   <NA>
3                   X      Y
4                   0      0
5        0.0501123596 1.7556
6        0.1002247191  4.389

This data contains a variable number of usable rows thus many columns have NAs for their last rows in the data frame. I'm trying to get all these data frames (Ball and about 10 others) into a single tidier format with the data from row 4 to the last non-NA row, for each pair of columns. The end result looks like so:

> head(df)
  id name routine trial     volume   flow
1  1 Ball    tech post1 0.00000000 0.0000
2  1 Ball    tech post1 0.05022222 2.1945
3  1 Ball    tech post1 0.10044444 5.7684
4  1 Ball    tech post1 0.15066667 6.8343
5  1 Ball    tech post1 0.20088889 7.2732
6  1 Ball    tech post1 0.25111111 7.5867

Where id is a random identifier linked to the name, name is the name of the data frame imported, routine is assigned a value based on the date on the first line, trial is also assigned a value based on the hour on the second line, volume are the values starting on row 4 under each X and flow the values starting on row 4 under each Y.

Here is the function I came up with, x is the original data frame (in this case "Ball") and y the new data frame to which rows should be added.

tidier <- function(x, y) {
    for(col in ncol(x) / 2) {
         end.current <- length(x[,col][!is.na(x[,col])])
         length.current <- end.current - 3
         id = rep(1, length.current)
         name = rep("Ball", length.current)
         routine <- rep("tech", length.current)
         trial <- rep("pre2", length.current)
         volume <- as.numeric(Ball[4:end.current, col])
         flow <- as.numeric(Ball[4:end.current, col + 1])
         temp.df <- data.frame(id, name, routine, trial, volume, flow)
         df <- rbind(y, temp.df)
         col <- col + 2
         return(df)
    }
}

I haven't got to the point of setting conditional values for id, name, routine and trial based on the values found in the original data frame. Running the function returns only the original df data frame, without any rows added. I don't get any errors and cannot figure out how to make this work. Hopefully this is clear enough, I'm new to building functions and any help in making this work would be greatly appreciated.

This is way too long and probably contains information not needed to reproduce the problem. OTOH, it's difficult to actually reproduce your problem. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for some guidance. — krlmlr, Sep 30 '14 at 15:34
Thanks for the information, I'll be considering for future questions. — Jonathan, Oct 01 '14 at 15:31

agstudy · Accepted Answer · 2014-09-30T18:44:11.223

I think you over complicate the task:

Use read.table with skip option to skip first lines
Convert extract x columns as a single vector : volume
Convert extract y columns as a single vector : flow
Use recycling to create other columns

here my code:

## here you change text by your file=file_name
d_f <- read.table(text=dat,header=TRUE,skip=3)[,-1]
## extract only X column and expand them in a single vector
volume <- unlist(as.list(d_f[grep('X',colnames(d_f))]))
## extract only Y column and expand them in a single vector
flow <- unlist(as.list(d_f[grep('Y',colnames(d_f))]))
## create you data frame using recycling for other columns
data.frame(id=1,name='Ball',routine='tech',
       trial='pos1',volume=volume,flow=flow)

   id name routine trial     volume   flow
X1    1 Ball    tech  pos1 0.00000000 0.0000
X2    1 Ball    tech  pos1 0.05022222 2.1945
X3    1 Ball    tech  pos1 0.10044444 5.7684
X.11  1 Ball    tech  pos1 0.00000000 0.0000
X.12  1 Ball    tech  pos1 0.05022222 1.9437
X.13  1 Ball    tech  pos1 0.10044444 4.7652
X.21  1 Ball    tech  pos1 0.00000000 0.0000
X.22  1 Ball    tech  pos1 0.05000000 1.2540
X.23  1 Ball    tech  pos1 0.10000000 4.2636

Thanks, this helped me figure out a simpler solution! – Jonathan Oct 01 '14 at 15:32 — Jonathan, Oct 01 '14 at 15:32

Use function to convert messy data in many data frames to single tidier format

1 Answers1