-2

I am attempting to combine multiple CSV datasets and each of the datasets have variable names with no observations. I want to create a new combined CSV file with the name of the CSV file as the variable name and the list of the variable names from the previous file as observations.

The column names from each dataset may overlap somewhat, but many variables are different. I would like to keep all variables from each dataset.

Currently, I only know how to write code to merge the files and have the variable names listed continuously as separate columns.

This the code I've been using to combine the files:

library(dplyr)
library(readr)
newFile <- list.files(full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows

This is one example of what the current CSV files look like

This is what I would like the dataset to look like once I combine them

r2evans
  • 141,215
  • 6
  • 77
  • 149
saze4
  • 1
  • 1
  • 1
    Can you provide a few lines from the top of 2-3 files? (And your expected output from that sample content.) It doesn't have to be real data, just something representative enough. – r2evans Jul 16 '19 at 17:52
  • Some [example data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would be great. We don't need a whole lot, just something that's structurally the same as your data. Are column names unique?, do number of rows vary? With a given input, what output do you expect? – AkselA Jul 16 '19 at 17:54
  • @r2evans Just updated my question with more information, thanks for the suggestion! – saze4 Jul 16 '19 at 18:36
  • @AkselA I updated my post with that info, thanks for the feedback! – saze4 Jul 16 '19 at 18:39
  • 1
    sav44, please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557/3358272 (and https://xkcd.com/2116/) – r2evans Jul 16 '19 at 18:50

1 Answers1

1

You can try using map() and unnest() to do this. This will create a single data frame combining all of the data read in from the CSV files. It will create an additional grouping variable using the file name. Assumes the column names are the same across all files.

# list files
d <- list.files(full.names = F)
# for each file, read in, expand rows (unnest)
# read all cols in as character
# and add label from filename
dd <- data.frame(filename=d) %>% 
  mutate(cont=map(filename, ~read_csv(file.path('',.), # !!! specify your file path here !!!
                                      col_types = cols(.default = "c")))) %>% 
  unnest()

UPDATE

I think the map and unnest functions will still work here, based on your example data and expected output. If you gather the combined data, you can then filter and find which variables are in each CSV file. It's not pretty but it might suit your purposes.

# create toy CSV files in wd
df1 <- data.frame(matrix(runif(100),10)) # 10 vars named X1 to X10
df2 <- data.frame(matrix(runif(120),12)) # as above but more obs
df3 <- data.frame(matrix(runif(120),8)) # 15 vars named X1 to X15
df4 <- data.frame(matrix(runif(100),10)) # 10 vars names a to j
names(df4) <- letters[1:10]
write_csv(df4,'df4.csv')

d <- list.files(pattern='\\.csv',full.names = T)

dd <- data.frame(filename=d) %>% 
  mutate(cont=map(filename,~read_csv(file.path('.',.),
    col_types = cols(.default = "c")))) %>% 
  unnest() %>% 
  gather(k,v,-filename) %>% filter(!is.na(v)) %>% 
  distinct(filename,k)

dd %>% reshape2::dcast(k~filename)

     k ./df1.csv ./df2.csv ./df3.csv ./df4.csv
1    a      <NA>      <NA>      <NA>         a
2    b      <NA>      <NA>      <NA>         b
3    c      <NA>      <NA>      <NA>         c
4    d      <NA>      <NA>      <NA>         d
5    e      <NA>      <NA>      <NA>         e
6    f      <NA>      <NA>      <NA>         f
7    g      <NA>      <NA>      <NA>         g
8    h      <NA>      <NA>      <NA>         h
9    i      <NA>      <NA>      <NA>         i
10   j      <NA>      <NA>      <NA>         j
11  X1        X1        X1        X1      <NA>
12 X10       X10       X10       X10      <NA>
13 X11      <NA>      <NA>       X11      <NA>
14 X12      <NA>      <NA>       X12      <NA>
15 X13      <NA>      <NA>       X13      <NA>
16 X14      <NA>      <NA>       X14      <NA>
17 X15      <NA>      <NA>       X15      <NA>
18  X2        X2        X2        X2      <NA>
19  X3        X3        X3        X3      <NA>
20  X4        X4        X4        X4      <NA>
21  X5        X5        X5        X5      <NA>
22  X6        X6        X6        X6      <NA>
23  X7        X7        X7        X7      <NA>
24  X8        X8        X8        X8      <NA>
25  X9        X9        X9        X9      <NA>
kstew
  • 1,104
  • 6
  • 21
  • I should have mentioned above that the column names vary between the dataset (just updated it), would there be a way to modify the code above for this? Thanks! – saze4 Jul 16 '19 at 18:33
  • In the intermediate step after calling `unnest()`, it will append any new variables to the end of the data and input `NA` where the other datasets are missing those variables. – kstew Jul 16 '19 at 19:14