
I have a directory of text files named using the following convention: "Location[A-Z]_House[0-15]_Day[0-15].txt", so an example is LA_H05_D14.txt. Is there a way of splitting the names so that they can be made a factor? More specifically, I would like to use the letter [A-Z] that comes after the leading "L" (Location). E.g. LB_H01_D01.txt would be location "B", and all data belonging to Location B would be labelled "B".
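
To illustrate what I mean, for a single name I imagine something along these lines (untested sketch using base R):

#pull the letter that follows the leading "L" out of one example filename
sub("^L([A-Z])_.*$", "\\1", "LB_H01_D01.txt")
# "B"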

I have imported all the data from the files into one data frame:

library(dplyr)

l <- list.files(pattern = "txt$", full.names = TRUE)

Df <- bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  setNames(temp, c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "AQI(US)", "AQI(CN)",
                   "PM10(ug/m3)", "Outdoor AQI(US)", "Outdoor AQI(CN)", "Temperature(C)",
                   "Temperature(F)", "Humidity(%RH)", "CO2(ppm)", "VOC(ppb)"))
}), .id = "id")

The data looks like this with an "id" column:

head(Df)
  id       Date     Time  Timestamp PM2_5(ug/m3) AQI(US) AQI(CN) PM10(ug/m3) Outdoor AQI(US) Outdoor AQI(CN) Temperature(C) Temperature(F)
1  1 2017/10/17 20:31:38 1508272298        102.5     175     135         512               0               0             30           86.1
2  1 2017/10/17 20:31:48 1508272308         93.6     171     124         477               0               0             30           86.1
3  1 2017/10/17 20:31:58 1508272318         98.0     173     129         397               0               0             30           86.0
4  1 2017/10/17 20:32:08 1508272328         98.0     173     129         422               0               0             30           86.0
5  1 2017/10/17 20:32:18 1508272338        104.3     176     137         466               0               0             30           86.0
6  1 2017/10/17 20:32:28 1508272348        101.6     175     134         528               0               0             30           86.0
  Humidity(%RH) CO2(ppm) VOC(ppb)
1            43      466       -1
2            43      467       -1
3            42      468       -1
4            42      469       -1
5            42      471       -1
6            42      471       -1
  • I guess the reason you're not getting a very meaningful "id" column is that the list returned by `lapply` is not named. If you use `bind_rows(setNames(lapply(l, ...), l), .id = "id")` it should work. Afterwards you can extract the relevant part of the filename (see the sketch after these comments). – talat Nov 29 '17 at 15:44
  • Possible duplicate: https://stackoverflow.com/q/34313895 – Jaap Nov 29 '17 at 17:25
  • Also related: https://stackoverflow.com/q/32888757 – Jaap Nov 29 '17 at 17:26
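
A minimal sketch of the approach from the first comment, assuming `l` holds the paths from `list.files()` as above (the column renaming is omitted for brevity, and this is untested against the real files):

library(dplyr)

#name the list elements with the file names so .id carries them into the "id" column
Df <- bind_rows(
  setNames(lapply(l, function(i) read.table(i, stringsAsFactors = FALSE, sep = ";")),
           basename(l)),
  .id = "id"
)

#the location letter is the single character after the leading "L"
Df$Location <- factor(sub("^L([A-Z])_.*$", "\\1", Df$id))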

2 Answers


Independent of the issue concerning the content of the id column, you might use the following code to extract the information from the filenames:

#you may use the original filenames
filenames <- basename(l)
#or the content of the id column (if the filenames were carried into Df as ids)
filenames <- as.character(Df$id)
#for demonstration, here is a definition of exemplary filenames
filenames <- c("LA_H01_D01.txt"
               ,"LA_H02_D02.txt"
               ,"LD_H01_D14.txt"
               ,"LD_H01_D15.txt")

filenames <- gsub("_H|_D", "_", filenames)
filenames <- gsub(".txt|^L", "", filenames)

fileinfo <- as.data.frame(do.call(rbind, strsplit(filenames, "_")))
colnames(fileinfo) <- c("Location", "House", "Day")

fileinfo[, c("House", "Day")] <- apply(fileinfo[, c("House", "Day")], 2, as.numeric)
#      Location House Day
# 1        A     1   1
# 2        A     2   2
# 3        D     1  14
# 4        D     1  15

#add the information to your Df as new columns
Df <- cbind(Df, fileinfo)
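
Since the goal is a factor for Location: depending on your R version and `stringsAsFactors` defaults, `Location` may come back as character rather than factor after the steps above. Converting explicitly is a one-liner (a minimal sketch, no-op if it is already a factor):

Df$Location <- factor(Df$Location)
levels(Df$Location)
# e.g. "A" "D" for the exemplary filenames above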

#the whole thing as a function used in your data import
add_fileinfo <- function(df, filename) {

  filename <- gsub("_H|_D", "_", filename)
  filename <- gsub(".txt|^L", "", filename)

  fileinfo <- as.data.frame(do.call(rbind, strsplit(filename, "_")))
  colnames(fileinfo) <- c("Location", "House", "Day")

  fileinfo[, c("House", "Day")] <- apply(fileinfo[, c("House", "Day")], 2, as.numeric)

  cbind(df,  fileinfo[rep(seq_len(nrow(fileinfo)), each= nrow(df)),])

}

Df <- bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  #assign the renamed table back to temp (otherwise the new column names are lost)
  temp <- setNames(temp, c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "AQI(US)", "AQI(CN)",
                           "PM10(ug/m3)", "Outdoor AQI(US)", "Outdoor AQI(CN)", "Temperature(C)",
                           "Temperature(F)", "Humidity(%RH)", "CO2(ppm)", "VOC(ppb)"))
  temp <- add_fileinfo(temp, i)
  temp
}), .id = "id")
Manuel Bickel
  • Thank you for this. The files are called "LA_H05_D14.txt" and if we split this we lose some information associated with location D – HCAI Nov 29 '17 at 16:10
  • I have edited my answer in response to your comment, it should now work with your filenames. Tell me if not. All types of combinations of words and digits can easily be handled with regex if you have more complex formats. – Manuel Bickel Nov 29 '17 at 16:29
  • OOOH this is looking cool! Seems gsub is really powerful! The only thing I'm stumbling over now is that fileinfo has 116 rows but Df has 96000 because each file I imported had many hundreds of rows. What do you think about that? – HCAI Nov 29 '17 at 16:45
  • Updated the answer again. I had overlooked that you're importing tables. If I understand it correctly, you want the same information in your factor columns for each id. Therefore, you simply expand the fileinfo to the number of rows in your table. Since I do not have your files at hand I could not test if everything works; I might have overlooked semicolons or so. If speed is an issue we would rewrite the function (since growing objects is not a good idea in general) and only repeat the file info after importing the data by using the grouping of the id column (a rough sketch of that idea follows below). Can provide such a solution if necessary. – Manuel Bickel Nov 29 '17 at 17:06
  • Thank you very much for such clear explanation! One quick question about the temp<-add_fileinfo(temp,i) function: I guess the counter 'i' acts as a for loop without explicitly saying for i=1..lastName do ...? – HCAI Nov 30 '17 at 10:25
  • Glad it works. Your object `temp` is the table you read in; it is a `data.frame`. Your outer loop to read in files uses `i` as the iteration variable, which is actually the filename. The first argument of my function has to be a `data.frame`, the second a `character()`, i.e., the filename. I could also have written more clearly: `add_fileinfo(df = temp, filename = i)`. Does that help? – Manuel Bickel Nov 30 '17 at 10:31
  • Yes, that's perfect, thank you! Super quick question, do you know why dataframe headers appear like Df$`CO2(ppm)` rather than Df$CO2(ppm)? – HCAI Nov 30 '17 at 10:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/160184/discussion-between-manuel-bickel-and-hcai). – Manuel Bickel Nov 30 '17 at 10:42
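
A rough sketch of that last idea (adding the file information once per row from the `id` column after import, rather than growing each table), assuming `Df$id` holds the file names as discussed in the comments on the question:

library(dplyr)

Df <- Df %>%
  mutate(
    Location = factor(sub("^L([A-Z])_H[0-9]+_D[0-9]+\\.txt$", "\\1", id)),
    House    = as.numeric(sub("^L[A-Z]_H([0-9]+)_D[0-9]+\\.txt$", "\\1", id)),
    Day      = as.numeric(sub("^L[A-Z]_H[0-9]+_D([0-9]+)\\.txt$", "\\1", id))
  )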

Something like this (generic) solution should get you going.

mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)

Then, merge

myfulldata = merge(mydata1, mydata2)

As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.
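
For instance, a toy merge on a shared key column might look like this (column names made up for illustration):

mydata1 <- data.frame(SubjectID = 1:3, Age    = c(21, 34, 29))
mydata2 <- data.frame(SubjectID = 1:3, Height = c(180, 165, 172))

#merge() joins on the shared "SubjectID" column by default
myfulldata <- merge(mydata1, mydata2)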

What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two… so I could merge the 20 data frames together with 19 merge statements like this:

mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)

That’s tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge(). Here’s the code to define the function:

multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x){read.csv(file = x, header = TRUE)})
  Reduce(function(x, y) {merge(x, y)}, datalist)
}
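
Usage is then a single call; the path below is just a placeholder:

myfulldata <- multmerge("path/to/your/csv/folder")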

Here is a good resource that should help you out.

https://stats.idre.ucla.edu/r/codefragments/read_multiple/

ASH