I'm trying to combine multiple .csv files into one in a nice and easy script. Currently, I have the code

library(dplyr)  # for %>% and bind_rows
library(readr)  # for read_csv

data_files <- list.files(path = file_source, pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()

but when I inspect the output, some values have been replaced with NA. I believe this is because some values are non-numeric, e.g. SMITH_201, so the column is guessed as numeric and those values are coerced to NA. Is there a way I can avoid this so that the non-numeric values are preserved?

EDIT:

An example of what I'm trying to do. I have multiple .csv files such as those below

file_A.csv looks like this

x         y
1         1
2         1
3         1
4         1

file_B.csv looks like this

x         y
5         2
6         2
A3        2
A4        1  

and I want to combine them to be a single .csv

x         y
1         1
2         1
3         1
4         1
5         2
6         2
A3        2
A4        1
    You could make it read in the columns as character using the `col_types` argument. See `?read_csv` – IceCreamToucan Nov 13 '19 at 16:00
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 13 '19 at 16:01
  • Replace `read_csv` with ```read.csv(. , colClasses=c('character'))``` – M-- Nov 13 '19 at 16:01
  • @IceCreamToucan How exactly would I work `col_types` into the code? – alssm Nov 13 '19 at 16:22
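
A minimal sketch of one way to work col_types into the original pipeline, assuming the two-column layout shown above (extra arguments to lapply() are passed straight through to read_csv()):

library(dplyr)
library(readr)

# "cn" reads the first column (x) as character and the second (y) as numeric,
# so values like A3 are no longer coerced to NA.
list.files(path = file_source, pattern = "*.csv", full.names = TRUE) %>%
  lapply(read_csv, col_types = "cn") %>%
  bind_rows()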

1 Answer

You can make this a bit more compact by using purrr.

library(purrr)
library(readr)

list.files(path = file_source, pattern = "*.csv", full.names = TRUE) %>%
  map_dfr( ~ read_csv(., col_types = "cn"))

This is saying there are two columns: the first is character ("c") and the second is numeric ("n"). You could alternatively use col_types = "c?", in which case readr will correctly guess the second column as numeric. From the help file (?read_csv):

Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or _/- to skip the column.
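
For reference, a sketch of the same call spelled out with the named cols() specification, which can be easier to read when there are many columns (the names x and y come from the example files above):

library(purrr)
library(readr)

# Equivalent to col_types = "cn", but with each column named explicitly.
list.files(path = file_source, pattern = "*.csv", full.names = TRUE) %>%
  map_dfr(~ read_csv(., col_types = cols(x = col_character(), y = col_double())))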


Here is a second way if you don't want to manually specify column types.

my_files <- list.files(path = file_source, pattern = "*.csv", full.names = TRUE)

file_list <- lapply(my_files, read_lines, skip = 1)
file_header <- read_lines(my_files[1], n_max = 1)

read_csv(c(file_header, unlist(file_list)))

# A tibble: 8 x 2
  x         y
  <chr> <dbl>
1 1         1
2 2         1
3 3         1
4 4         1
5 5         2
6 6         2
7 A3        2
8 A4        1

What we are doing here is reading the files in line by line without parsing them as CSV yet. We take the header row from the first file and everything except the header row from every file, then pass that combined character vector to read_csv, which parses it as a single CSV and correctly guesses the column types across the combined data.
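
If the end goal is a single .csv file on disk rather than a tibble in memory, a minimal sketch of the final step (the output filename "combined.csv" is just an example):

library(readr)

combined <- read_csv(c(file_header, unlist(file_list)))

# Write the combined data back out as one .csv file.
write_csv(combined, "combined.csv")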