merging numerous .csv datafile into one data frame

Question

I have numerous .csv files, that I have save in one folder on my PC. Then, I create a list of theses dataset as follows:

> file_list <- list.files()
> file_list
 [1] "ABWAbwut50.csv"        "ABWEinfam50.csv"       "ABWFeldwaldasph50.csv" "ABWGarage50.csv"      
 [5] "ABWGemeindestr50.csv"  "ABWHotel50.csv"        "ABWInd50.csv"          "ABWIntflaechen50.csv" 
 [9] "ABWKantonsstr50.csv"   "ABWMehrfam50.csv"      "ABWNutzwald50.csv"     "ABWSchutzwald50.csv"  
[13] "ABWstahlmitvieh50.csv" "ABWStromut50.csv"      "ABWWeideland50.csv"

The .csv file sontain identical columns, decimals use ., columns are separated by ;. I tried to combine these datasets using following code:

for (file in file_list){
  if (!exists("dataset")){
    dataset <- read_delim(file, ";", escape_double = FALSE, trim_ws = TRUE)
  }
}
dataset

but it only reads the first file. How can I get it to combine all 15 .csv files into one data frame?

when I run different code i got the following error message:

> View(dataset)
> dataset <- do.call("rbind",lapply(file_list,
+                                   FUN=function(files){read.table(files,
+                                                                  header=TRUE, sep=";")}))
 Show Traceback

 Rerun with Debug
 Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 103 did not have 8 elements

I assume something went wrong and one of the files (actually I know its only couple of rows within a file) has only 7 columns instead of 8. I do not want to be looking into every file separately to try to find if there is some anomaly. How can I have these lines that do not follow the pattern removed automatically?

My datafile looks something like:

> dput(dataset[1:10,])
structure(list(Berechnung = c("EconoMe original", "Berechnung 1", 
"Berechnung 2", "Berechnung 3", "Berechnung 4", "Berechnung 5", 
"Berechnung 6", "Berechnung 7", "Berechnung 8", "Berechnung 9"
), Situation = c("Nach Massnahme Neue Gerinnefuehrung Gafenbach", 
"Nach Massnahme Neue Gerinnefuehrung Gafenbach", "Nach Massnahme Neue Gerinnefuehrung Gafenbach", 
"Nach Massnahme Neue Gerinnefuehrung Gafenbach", "Nach Massnahme Neue Gerinnefuehrung Gafenbach", 
"Nach Massnahme Neue Gerinnefuehrung Gafenbach", "Nach Massnahme Neue Gerinnefuehrung Gafenbach", 
"Nach Massnahme Neue Gerinnefuehrung Gafenbach", "Nach Massnahme Neue Gerinnefuehrung Gafenbach", 
"Nach Massnahme Neue Gerinnefuehrung Gafenbach"), NK = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0), PID = c(2639L, 2639L, 2639L, 2639L, 
2639L, 2639L, 2639L, 2639L, 2639L, 2639L), Case = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), Differenz = c(0, 0, 0, 0, 0, 0, 
0, 0, 0, 0), Prozess = c("Murgang", "Murgang", "Murgang", "Murgang", 
"Murgang", "Murgang", "Murgang", "Murgang", "Murgang", "Murgang"
), Objektart = c("Abwasser unter Terrain", "Abwasser unter Terrain", 
"Abwasser unter Terrain", "Abwasser unter Terrain", "Abwasser unter Terrain", 
"Abwasser unter Terrain", "Abwasser unter Terrain", "Abwasser unter Terrain", 
"Abwasser unter Terrain", "Abwasser unter Terrain")), .Names = c("Berechnung", 
"Situation", "NK", "PID", "Case", "Differenz", "Prozess", "Objektart"
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"

You may use the `fill=TRUE` argument i.e. `read.table(files,fill = TRUE, header=TRUE, sep=";")` or if some have only 7 columns compared to 8, then use `bind_rows` from `dplyr` or `rbindlist` from data.table` — akrun, Apr 27 '17 at 10:13
@arkun thanks, I have tried that previous and get error `unused argument (fill = TRUE)` — Nneka, Apr 27 '17 at 10:15
[See here for an extensive example](http://stackoverflow.com/a/32888918/2204410). It is easier to use `bind_rows` from `dplyr` or `rbindlist` from `data.table`. — Jaap, Apr 27 '17 at 10:16
@Danka I think then issue is that one of the column is missing,. Try `library(dplyr); bind_rows(yourlistofdata)` — akrun, Apr 27 '17 at 10:16
@akrun but i do not have all the date in one file. It is 15 separete datafiles (.csv) that are saved in one folder. How can I addjust bind_rown for different datafiles? when i apply `bind_rows(file_list)` I get `cannot convert object to a data frame`. — Nneka, Apr 27 '17 at 10:19
Can you post a small reproducible example that shows the error because your example here works with even one column removed i.e. `bind_rows(list(df1, df1[-1]))# A tibble: 20 × 8` — akrun, Apr 27 '17 at 10:20

score 1 · Answer 1 · answered Apr 27 '17 at 10:41

1

One of the files probably contains ; in the text. This solution uses your first coding example with a modification that will check which files contain problems.

file_list <- list.files()
# setup the dataset
dataset <- read.table(file_list[1], sep = ";", header = TRUE)

# cycle through all other files
for (file in file_list[-1]){
    temp <- try(read.table(file, sep = ";", header = TRUE))
    # check if the file can be read as a table
    if(class(temp) == "try-error"){
        message(paste("One file skipped. Correct mistakes in file", file))
        print(temp)
        next
    }
    dataset <- rbind(dataset, temp)
}

answered Apr 27 '17 at 10:41

nya

2,138
15
29

you are right! One of the descriptions indeed contained a `;`. However you solution created a dataframe with 1 variable only. I have obtained numerous error messages that indicate which files need correcting. Is there a way to automatically fix the error withou having to go back to the files? – Nneka Apr 27 '17 at 11:34
I have corrected all the mistakes in my file and it does exactlly what I was hoping for. However the dilema still remains: Is there a shortcut so i do not have to go through all "contaminated" files individually? – Nneka Apr 27 '17 at 12:49
Maybe have a look at `purrr:safely` – FlorianGD Apr 27 '17 at 13:18
@Danka Several options come to mind, depending on where your datasets might have the incorrectly placed semicolons. [Here](http://stackoverflow.com/a/38606660/5967807) is one for whitespace. If your data has one word in the second and seventh columns as in the example, you could selectively remove all extra characters. In general, there need to be a rule that could be universal. Or: try what FlorianGD suggested (which I don't know). – nya Apr 27 '17 at 18:27
1

@nya actually, I was wrong about the merging. It only merged 2 files. I have used `master <- do.call("rbind", lapply(file_list, read.csv, sep = ";", header = TRUE)) ` instead, which seems to work. – Nneka Apr 28 '17 at 07:07

merging numerous .csv datafile into one data frame

1 Answers1