Objective

I have 100 hdf5 files in a folder. For a reproducible example let's consider only 2 files, namely:

> list.files(pattern="*.hdf5")
[1] "Cars_20160601_01.hdf5" "Cars_20160601_02.hdf5"  

Each hdf5 file contains 2 groups, data and frame. From the data group I want to extract 2 objects, VDS_Veh_Speed and VDS_Chassis_CG_Position. The frame group contains 3 objects, of which only the object frame is relevant.
I want to read all of these files and extract the relevant variables described above.

What I tried:

library(rhdf5)  # provides h5read()

# Create a list of all the hdf5 files
temp = list.files(pattern="*.hdf5")

# Read all files and create data frames from each using the file name as df name
for (i in unique(temp)){
  data <- h5read(file = i, name = "data") # ED data
  frame <- h5read(file = i, name = "frame") # Frame numbers
  ED <- data.frame(frames = frame$frame, 
                   speed.kph.ED = round(data$VDS_Veh_Speed*1.46667*0.3048*3.6,2),
                   pedal_pos = data$CFS_Accelerator_Pedal_Position)#fps

  df <- h5read(file = i, name = "data/VDS_Chassis_CG_Position")
  df <- as.data.frame(df)
  colnames(df) <- c("y", "x", "z")
  df$speed <- ED$speed.kph.ED 
  df$pedal_pos <- ED$pedal_pos
  df$file.ID <- i
  assign(i, df)
}  

Now, because I have all the files in the Global environment, I removed the extra objects and only kept the new dfs:

# Remove extra objects
rm(data, df, ED, frame, i, temp)

Finally, I made a list of the dfs in the environment and then created a single data frame:

DF_obj <- lapply(ls(), get)
fdc <- do.call("rbind", DF_obj)   

This works for me. But, as mentioned in the comments, assign should be avoided. Also, I have to manually use rm(), without which this code won't work. Is there any way to avoid assign in this context?

If you need the data files, here is the link to the 2 mentioned above: https://1drv.ms/f/s!AsMFpkDhWcnw6g7StJp9dzZ-nCr4

umair durrani
  • `fortunes::fortune(236)` – alistaire Sep 26 '16 at 22:50
  • @alistaire what does that mean? – umair durrani Sep 27 '16 at 00:59
  • It's a quote in the [fortunes](https://cran.r-project.org/web/packages/fortunes/index.html) package that suggests `assign` is best avoided. – alistaire Sep 27 '16 at 01:03
  • Thanks for that. Could you please suggest some alternative? – umair durrani Sep 27 '16 at 01:13
  • Use `lapply` instead of `for` so you end up with a list instead of a mess in your global environment. – alistaire Sep 27 '16 at 05:10
  • Perhaps not a dupe, but this is well-covered in [How do I make a list of data frames?](http://stackoverflow.com/a/24376207/903061) – Gregor Thomas Sep 27 '16 at 20:14
  • @Gregor Thanks for the link. In my case, the files are hdf5 format. The problem is that I don't want to directly put, say, the `data` group from each file in a list. For each file I need to first extract the relevant variables from different groups and then combine them into a data frame. – umair durrani Sep 27 '16 at 20:34
  • That's not any different, you just have one additional step - extracting the relevant variables. – Gregor Thomas Sep 27 '16 at 20:39
  • @umairdurrani, the data files you've linked to are actually `.daq` files rather than `.hdf5`? I wanted to try what alistaire suggested on your data (instead of merely reading fortune(236)). – Angelo Oct 02 '16 at 14:58
  • @Angelo, Sorry, I have uploaded the hdf5 files in the same folder now. – umair durrani Oct 03 '16 at 15:22

1 Answer

The answer is basically the same as your code, with a couple of minor changes. Instead of using assign() to create data frames in your global environment, we create a list and assign each data frame to an element of it with ordinary df_list[[i]] <- df assignment. This avoids potential bugs and name clashes, and removes the need for clean-up with rm().

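library(rhdf5)  # provides h5read()

# pattern is a regular expression, not a glob, so anchor the extension explicitly
temp = list.files(pattern = "\\.hdf5$")
df_list = list()  # initialize a list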

# Read all files into a list of data frames
for (i in unique(temp)){
  data <- h5read(file = i, name = "data") # ED data
  frame <- h5read(file = i, name = "frame") # Frame numbers
  ED <- data.frame(frames = frame$frame, 
                   speed.kph.ED = round(data$VDS_Veh_Speed*1.46667*0.3048*3.6,2),
                   pedal_pos = data$CFS_Accelerator_Pedal_Position)#fps

  df <- h5read(file = i, name = "data/VDS_Chassis_CG_Position")
  df <- as.data.frame(df)
  colnames(df) <- c("y", "x", "z")
  df$speed <- ED$speed.kph.ED 
  df$pedal_pos <- ED$pedal_pos

  # assign to the list. We can take care of the id cols automatically
  df_list[[i]] <- df
} 

# The character index i already names each element; this just makes it explicit
names(df_list) <- unique(temp)
fdc <- data.table::rbindlist(df_list, idcol = "file.ID")

data.table::rbindlist is faster than do.call(rbind), and its idcol argument fills in the ID column for us from the names of the list.
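
If you would rather follow the lapply suggestion from the comments and stay in base R, the same pattern can be sketched as below. This is only an illustration under assumptions: read_one() is a hypothetical stand-in for the per-file h5read() steps and returns made-up toy values, so the snippet runs without rhdf5 or the real files.

```r
# Hypothetical stand-in for the per-file HDF5 extraction; it returns
# made-up toy values so the sketch runs without rhdf5 or real files.
read_one <- function(path) {
  data.frame(x = c(1, 2), y = c(3, 4), speed = c(50, 60))
}

files <- c("Cars_20160601_01.hdf5", "Cars_20160601_02.hdf5")

# lapply() returns one data frame per file; setNames() labels each
# element with its file name, so nothing lands in the global environment.
df_list <- setNames(lapply(files, read_one), files)

# Prepend an ID column taken from the list names (the base R counterpart
# of rbindlist's idcol), then stack all data frames into one.
with_id <- Map(
  function(df, id) data.frame(file.ID = id, df, stringsAsFactors = FALSE),
  df_list, names(df_list)
)
fdc <- do.call(rbind, unname(with_id))
rownames(fdc) <- NULL
```

For 100 files rbindlist should still be the quicker option; the point here is only that a named list replaces both assign() and the rm() clean-up.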

Gregor Thomas
  • Thanks for your answer. However, this doesn't work because the `i` in the loop is the name of a file because of using `unique(temp)`. So, `df_list[[i]] <- df` throws error. This is the reason that I tried to use `assign`. – umair durrani Sep 30 '16 at 15:14
  • Why doesn't that work? What error? `my_list = list(); my_list[["item_one"]] = mtcars; my_list[["your_filename.hdf5"]] = iris; i = "another_filename.hdf5"; my_list[[i]] <- data.frame(x = 1:2)` works just fine. – Gregor Thomas Sep 30 '16 at 17:13
  • @umairdurrani Did you skip the `df_list = list()` line initializing the list before the loop?? – Gregor Thomas Sep 30 '16 at 17:20
  • No, I copied your code and tried it. When you do `for (i in unique(temp))`, the first `i = Cars_20160601_01.hdf5`. So, when `df_list[[i]] <- df` is run, you get an error. – umair durrani Oct 01 '16 at 03:10
  • `list.files` returns a vector of class `character`, so the first `i` should be `"Cars_20160601_01.hdf5"` (notice the quotes). The code from my first comment easily demonstrates that character indices work for creating new list elements. I don't know what the issue is because you won't say what the error is, but your assumption that the problem is using an element from a character vector to index a list is wrong. – Gregor Thomas Oct 01 '16 at 07:09
  • I understand your point now. I probably had something messed up in my RStudio session before. After restart, your code worked flawlessly. It is faster too. Thank you very much! I didn't know that lists also have names in R. – umair durrani Oct 03 '16 at 15:20