Read multiple csv data and create new columns at one time

Question

I have a file and there are many csv data in it.
I want to read them and create new columns at one time and then combine to one datatable. I explain more here.

Look at this pic:

I want to create 2 new columns YEAR and MONTH based on the csv data title.
ex. Take 201508 Sales Report(London) as an example. I want to create YEAR = 2015 and MONTH = 8.

I don't know how to do but I can read them at one time without create new columns.

my_read_data <- function(path){  
data <- data.table::fread(path, header = T, strip.white = T, fill = T)  
data <- data[data[[5]] != 0,]  
data <- subset(data, select = c(-1,-7,-10,-12,-13,-14,-15,-17))  
}  
file.list <- dir(path = "//path/", pattern='\\.csv', full.names = T)  
df.list <- lapply(file.list, my_read_data)  
dt <- rbindlist(df.list)

How to modify my code?
Actually I'm not sure whether my code is correct or not.
Appreciate.

Thanks to @Jaap, my new code is:

my_read_data <- function(x){
data <- data.table::fread(x, header = T, strip.white = T, fill = T)
data <- data[data[[5]] != 0,]
data <- subset(data, select = c(-1,-7,-10,-12,-13,-14,-15,-17))
}
file.list <- list.files(path = "/path/", pattern = '*.csv')
dt.list <- sapply(file.list, my_read_data, simplify=FALSE)

However, I get an error.

Error in data.table::fread(x, header = T, strip.white = T, fill = T) :   
File not found: C:\Users\PECHEN\AppData\Local\Temp\RtmpiihFR4\filea0c4d726488   

In addition: Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c (TWM-201508 Sales Report(London).csv) > C:\Users\PECHEN\AppData\Local\Temp\RtmpiihFR4\filea0c4d726488' had status 1 
2: In shell(paste("(", input, ") > ", tt, sep = "")) :
  '(TWM-201508 Sales Report(London).csv) > C:\Users\PECHEN\AppData\Local\Temp\RtmpiihFR4\filea0c4d726488' execution failed with error code 1

Moreover, I edit my code:

my_read_data <- function(x){
data <- data.table::fread(x, header = T, strip.white = T, fill = T)
data <- data[data[[5]] != 0,]
data <- subset(data, select = c(-1,-7,-10,-12,-13,-14,-15,-17))
}
file.list <- dir(path = "/path/", pattern='\\.csv', full.names = T)  
df.list <- lapply(file.list, my_read_data)  
dt <- rbindlist(df.list, idcol = 'id')[, `:=` (YEAR = substr(id,5,8), MONTH = substr(id,9,10))]

I use YEAR = substr(id,5,8), MONTH = substr(id,9,10) since each data title has four charater before numbers. ex. AAA-201508Sales Report
However, it doesn't work.
Thanks to @Peter TW, it works.

Get your files in a list with `list.files` read them and bind them together by using `rbindlist`'s `idcol`-parameter ([see here for an example](https://stackoverflow.com/questions/32888757/reading-multiple-files-into-r-best-practice)). Finally use `DT[, YEAR := substr(id,1,4)]` to get the year and `DT[, MONTH := substr(id,5,6)]` to get the month. — Jaap, Jun 01 '17 at 09:04
can you explain more? I think this is great but i cannot understand — Peter Chen, Jun 01 '17 at 09:19
You have to set ur working director first so `YEAR = substr(id,5,8), MONTH = substr(id,9,10)` may work. If you don't set it, you should consider all your path — , Jun 02 '17 at 01:56

score 3 · Accepted Answer · answered Jun 01 '17 at 09:41

3

Expanding on my comment and supposing that all the files have the same structure, the following should work:

library(data.table)
# get list of file-names
file.list <- list.files(pattern='*.csv')

# read the files with sapply & fread
# this will create a named list of data.tables
dt.list <- sapply(file.list, fread, simplify=FALSE)

# bind the list together to one data.table
# using the 'idcol'-parameter puts the names of the data.tables in the id-column
# create the YEAR & MONTH variables with 'substr'
DT <- rbindlist(dt.list, idcol = 'id')[, `:=` (YEAR = substr(id,1,4), MONTH = substr(id,5,6))]

This will result in one data.table with all the data and a YEAR and MONTH column added.

If you want to exclude certain columns from the files, you can use the drop-parameter of fread as follows:

dt.list <- sapply(file.list, fread, drop = c(1,7,10,12:15,17), simplify=FALSE)

answered Jun 01 '17 at 09:41

Jaap

81,064
34
182
193

1

Unless you're using `USE.NAMES`, using `sapply` with `simplify = FALSE` is exactly the same as using `lapply` (`sapply` calls `lapply`) – MichaelChirico Jun 01 '17 at 21:46
what does `simplify = False` mean? – Peter Chen Jun 02 '17 at 01:11
This answer is great. If I want to create a function like my above question `my_read_data` and use your way, how can I do? I get an error and I will edit my above question – Peter Chen Jun 02 '17 at 01:18
@PeterChen sorry to respond so late, do you still have that problem? – Jaap Jun 03 '17 at 09:09
No, thanks. But could u plz explain `simplify = False`? – Peter Chen Jun 04 '17 at 11:20
1

@PeterChen By using `simplify=FALSE`, the list of data.table's becomes a named list (with the file-names as names). `simplify=TRUE` is the default and will create an unnamed list. When applying `rbindlist` to an unnamed list the `id` column gets just numbers. Applying `rbindlist` to a named list (i.e. the result of using `simplify=FALSE`) the `id` column gets the names of the respective files. [See here](https://stackoverflow.com/a/32888918/2204410) for a more extensive answer on reading a list of files. – Jaap Jun 04 '17 at 11:27

score 0 · Answer 2 · answered Jun 01 '17 at 09:12

Here's how you can include the columns with dplyr:

nam <- c("201508 Sales Report(London)", "201509 Sales Report(London)", "201604 Sales Report(London)-Monthly")

dat <- data.frame(file=nam, var=nam)
dat %>% 
   separate(var, into=c(paste0("parts", 1:5))) %>% 
   mutate(Year=substring(parts1, 1,4), Month=substring(parts1, 5,6)) %>% 
   select(Year, Month, file)

#   Year Month                                file
# 1 2015    08         201508 Sales Report(London)
# 2 2015    09         201509 Sales Report(London)
# 3 2016    04 201604 Sales Report(London)-Monthly

Read multiple csv data and create new columns at one time

2 Answers2

Linked