Select multiple column from different directory and then merge them in R

Question

I have a directory and in my directory there are multiple files. Each file having different csv. For example in my directory there is 5 files (in actual case more than 100), and each file having 10 .csv file.

My question is that I want to select different columns (all files having equal no of columns) from that files and merge them and create a single data frame.

My data frame looks like -

Main Directory -

File1-

a.csv -

 store_code  Tran1  Tran2  Tran3
     10       5454   7645   7111
     10       4154   7675   7000
     10       1454   2145   8431
      .........

b.csv -

 store_code  Tran1  Tran2  Tran3
     10       5004   6645   7291
     10       4109   1675   7000
     10       9454   1045   1031
      .........

 File2-

 c.csv - 

 store_code  Tran1  Tran2  Tran3
     20       1054   2045   1111
     20       2954   3075   7080
     20       1454   2145   8431
      .........

  d.csv - 

 store_code  Tran1  Tran2  Tran3
     20       1994   2045   9011
     20       2004   3075   8080
     20       1004   2145   1031
      .........

I have done the merging of all files from a directory. But unable do select multiple column and then merge. For this example suppose I want to select two columns store_code and Tran2 and then merge all them.

For merging purpose I have done the following code -

i.Main directory set as a working directory where all files are there.
ii.csv_files <- dir(pattern='.*[.]csv', recursive = T)

library(dplyr)
df <- rbind_all(lapply(csv_files, read.csv))

My expected output -

store_code    Tran3 
10            7111 
10            7000 
10            1111  
..            ---- 
20            9011
20            8080

You've got two columns labelled `Tran2` in every CSV file. Is that a typo? — Maurits Evers, Mar 10 '18 at 03:05
So you only want to read 2 columns from each file then merge all together? — Tung, Mar 10 '18 at 03:05
Your expected output is not useful. Please **format** your code properly, and provide the output that are expecting **based on your sample data**. — Maurits Evers, Mar 11 '18 at 04:51

Maurits Evers · Answer 1 · 2018-03-10T03:19:59.550

I'm not entirely sure what you're after; unfortunately you don't provide your expected output.

Perhaps something like this?

library(tidyverse);
Reduce(
    function(x, y) full_join(x, y, by = c("store_code", "Tran2")),
    lapply(csv_files, read.csv));
#  store_code Tran1.x Tran2 Tran3.x Tran1.y Tran3.y Tran1.x.x Tran3.x.x
#1         10    5454  7645    7111      NA      NA        NA        NA
#2         10    4154  7675    7000      NA      NA        NA        NA
#3         10    1454  2145    8431      NA      NA        NA        NA
#4         10      NA  6645      NA    5004    7291        NA        NA
#5         10      NA  1675      NA    4109    7000        NA        NA
#6         10      NA  1045      NA    9454    1031        NA        NA
#7         20      NA  2045      NA      NA      NA      1054      1111
#8         20      NA  3075      NA      NA      NA      2954      7080
#9         20      NA  2145      NA      NA      NA      1454      8431
#  Tran1.y.y Tran3.y.y
#1        NA        NA
#2        NA        NA
#3        NA        NA
#4        NA        NA
#5        NA        NA
#6        NA        NA
#7      1994      9011
#8      2004      8080
#9      1004      1031

Explanation: Do an outer full_join on all CSV files, merge by store_code and Tran2. Because you have two Tran2 columns per file, I've relabelled the last column as Tran3.

From here, all that's left to do is some tidying and renaming of columns to reflect the source file origin.

Sample data

This is the dput of lapply(csv_files, read.csv):

list(structure(list(store_code = c(10L, 10L, 10L), Tran1 = c(5454L,
4154L, 1454L), Tran2 = c(7645L, 7675L, 2145L), Tran3 = c(7111L,
7000L, 8431L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)), structure(list(
store_code = c(10L, 10L, 10L), Tran1 = c(5004L, 4109L, 9454L
), Tran2 = c(6645L, 1675L, 1045L), Tran3 = c(7291L, 7000L,
1031L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)), structure(list(
store_code = c(20L, 20L, 20L), Tran1 = c(1054L, 2954L, 1454L
), Tran2 = c(2045L, 3075L, 2145L), Tran3 = c(1111L, 7080L,
8431L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)), structure(list(
store_code = c(20L, 20L, 20L), Tran1 = c(1994L, 2004L, 1004L
), Tran2 = c(2045L, 3075L, 2145L), Tran3 = c(9011L, 8080L,
1031L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)))

My expected output looks like - store_code Tran2 10 7645 10 7000 10 2145 ........................... 10 6645 10 1675 ----------------------- Means only two columns with all merged row. — Nikita Agarwal, Mar 10 '18 at 03:55
I mistake from my end is ..last column name is Tran3..I run your code but it's not working...I think know what is my expected output..any solutions? — Nikita Agarwal, Mar 10 '18 at 05:44
@NikitaAgarwal Don't put code in comments. Edit your question to include your expected output. See [here how to provide a minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including sample data and expected output. — Maurits Evers, Mar 10 '18 at 07:41

Select multiple column from different directory and then merge them in R

1 Answers1

Sample data