-1

I have a directory and in my directory there are multiple files. Each file having different csv. For example in my directory there is 5 files (in actual case more than 100), and each file having 10 .csv file.

My question is that I want to select different columns (all files having equal no of columns) from that files and merge them and create a single data frame.

My data frame looks like -

Main Directory -

File1-

a.csv -

 store_code  Tran1  Tran2  Tran3
     10       5454   7645   7111
     10       4154   7675   7000
     10       1454   2145   8431
      .........

b.csv -

 store_code  Tran1  Tran2  Tran3
     10       5004   6645   7291
     10       4109   1675   7000
     10       9454   1045   1031
      .........

 File2-

 c.csv - 

 store_code  Tran1  Tran2  Tran3
     20       1054   2045   1111
     20       2954   3075   7080
     20       1454   2145   8431
      .........

  d.csv - 

 store_code  Tran1  Tran2  Tran3
     20       1994   2045   9011
     20       2004   3075   8080
     20       1004   2145   1031
      .........

I have done the merging of all files from a directory. But unable do select multiple column and then merge. For this example suppose I want to select two columns store_code and Tran2 and then merge all them.

For merging purpose I have done the following code -

i.Main directory set as a working directory where all files are there.
ii.csv_files <- dir(pattern='.*[.]csv', recursive = T)

library(dplyr)
df <- rbind_all(lapply(csv_files, read.csv))

My expected output -

store_code    Tran3 
10            7111 
10            7000 
10            1111  
..            ---- 
20            9011
20            8080
halfer
  • 19,824
  • 17
  • 99
  • 186
Nikita Agarwal
  • 343
  • 1
  • 3
  • 13

1 Answers1

1

I'm not entirely sure what you're after; unfortunately you don't provide your expected output.

Perhaps something like this?

library(tidyverse);
Reduce(
    function(x, y) full_join(x, y, by = c("store_code", "Tran2")),
    lapply(csv_files, read.csv));
#  store_code Tran1.x Tran2 Tran3.x Tran1.y Tran3.y Tran1.x.x Tran3.x.x
#1         10    5454  7645    7111      NA      NA        NA        NA
#2         10    4154  7675    7000      NA      NA        NA        NA
#3         10    1454  2145    8431      NA      NA        NA        NA
#4         10      NA  6645      NA    5004    7291        NA        NA
#5         10      NA  1675      NA    4109    7000        NA        NA
#6         10      NA  1045      NA    9454    1031        NA        NA
#7         20      NA  2045      NA      NA      NA      1054      1111
#8         20      NA  3075      NA      NA      NA      2954      7080
#9         20      NA  2145      NA      NA      NA      1454      8431
#  Tran1.y.y Tran3.y.y
#1        NA        NA
#2        NA        NA
#3        NA        NA
#4        NA        NA
#5        NA        NA
#6        NA        NA
#7      1994      9011
#8      2004      8080
#9      1004      1031

Explanation: Do an outer full_join on all CSV files, merge by store_code and Tran2. Because you have two Tran2 columns per file, I've relabelled the last column as Tran3.

From here, all that's left to do is some tidying and renaming of columns to reflect the source file origin.


Sample data

This is the dput of lapply(csv_files, read.csv):

list(structure(list(store_code = c(10L, 10L, 10L), Tran1 = c(5454L,
4154L, 1454L), Tran2 = c(7645L, 7675L, 2145L), Tran3 = c(7111L,
7000L, 8431L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)), structure(list(
store_code = c(10L, 10L, 10L), Tran1 = c(5004L, 4109L, 9454L
), Tran2 = c(6645L, 1675L, 1045L), Tran3 = c(7291L, 7000L,
1031L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)), structure(list(
store_code = c(20L, 20L, 20L), Tran1 = c(1054L, 2954L, 1454L
), Tran2 = c(2045L, 3075L, 2145L), Tran3 = c(1111L, 7080L,
8431L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)), structure(list(
store_code = c(20L, 20L, 20L), Tran1 = c(1994L, 2004L, 1004L
), Tran2 = c(2045L, 3075L, 2145L), Tran3 = c(9011L, 8080L,
1031L)), .Names = c("store_code", "Tran1", "Tran2", "Tran3"
), class = "data.frame", row.names = c(NA, -3L)))
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • My expected output looks like - store_code Tran2 10 7645 10 7000 10 2145 ........................... 10 6645 10 1675 ----------------------- Means only two columns with all merged row. – Nikita Agarwal Mar 10 '18 at 03:55
  • I mistake from my end is ..last column name is Tran3..I run your code but it's not working...I think know what is my expected output..any solutions? – Nikita Agarwal Mar 10 '18 at 05:44
  • @NikitaAgarwal Don't put code in comments. Edit your question to include your expected output. See [here how to provide a minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including sample data and expected output. – Maurits Evers Mar 10 '18 at 07:41