
I followed Hadley's thread, "Issue in Loading multiple .csv files into single dataframe in R using rbind", to read multiple CSV files and combine them into one data frame. I also experimented with lapply vs. sapply, as discussed in "Grouping functions (tapply, by, aggregate) and the *apply family".

Here's my first CSV file:

dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A", 
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L, 
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L, 
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L, 
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name", 
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA, 
-5L))

Here's my second CSV file:

dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A", 
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L, 
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L, 
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L, 
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name", 
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA, 
-5L))

Here's my code:

dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)

merged_file <- do.call(rbind, lapply(list(tc1,tc2), read.csv))

While this works beautifully, I wanted to change lapply to sapply. From the above thread, I understand that sapply would turn the factors read from the CSV files into matrices, but I am unsure why the fields are flipped. For instance, the Income field occupies rows 3 and 8 instead of sitting in a single column.

Here's the code:

tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)

# change lapply to sapply    
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))

Here's the output:

      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2

I'd appreciate any help. I am fairly new to R and not sure what's going on.

smci
watchtower
  • Why do you want to change `lapply` to `sapply`? `lapply` is the appropriate function here, and it's more efficient. Btw, `paste` is vectorized. – Rich Scriven Sep 23 '16 at 17:56
  • @RichScriven - I am just experimenting to understand the reason why the output is different when I use `sapply` instead of `lapply`. – watchtower Sep 23 '16 at 20:38
  • *"While this works beautifully"* It doesn't even work at all, as a reproducible example. We don't have your paths so it will fail. It's easiest to read dataframes from a `textConnection()` instead of a file. I edited your code. – smci Sep 03 '18 at 01:08
  • The issue had nothing to do with factors, it's generic sapply vs lapply. Duplicate of [Why does sapply return a matrix that I need to transpose...](https://stackoverflow.com/questions/4140371/why-does-sapply-return-a-matrix-that-i-need-to-transpose-and-then-the-transpose) – smci Sep 03 '18 at 01:28

1 Answer

The issue has nothing to do with factors; it's the generic sapply vs. lapply difference. Why does sapply get it so wrong whereas lapply gets it right? Remember that in R, data frames are lists of columns, and each column can have a distinct type.
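A minimal sketch of the "list of columns" point, using a hypothetical two-column toy data frame (`df` and its column names are made up for illustration, not taken from the question). `stringsAsFactors = TRUE` is set explicitly because it is no longer the default in R 4.0 and later.

```r
# Hypothetical toy data frame, not the asker's data.
df <- data.frame(name   = c("A", "C"),
                 income = c(55L, 23L),
                 stringsAsFactors = TRUE)

is.list(df)        # a data frame is a list...
length(df)         # ...with one element per column
sapply(df, class)  # and each column keeps its own type
```

Because a data frame is just a list underneath, anything that treats it as a plain list (as sapply's simplification step does) sees its columns, not its rows.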

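  • lapply returns a list of data frames, which do.call hands to rbind as separate arguments. rbind's data-frame method then concatenates them row-wise and keeps corresponding columns together, so your factors emerge correctly.
  • sapply however...
    • runs the same lapply and then tries to simplify the result. Every data frame here is a list of the same length (6 columns), so sapply packs the individual columns into a 6x2 matrix of mode list: one row per column name, one column per file.
    • do.call then passes each of those 12 cells (plain column vectors of length 5) to rbind as a separate argument...
    • ...and rbind on plain vectors makes each vector a row, which is the unwanted transpose (columns now correspond to rows).
    • a matrix can hold only one type, unlike a data frame, so the factors lose their labels and show up as bare integer codes (garbage!).
    • the result is one very-garbage 12x5 matrix of numbers, six rows per file, instead of a 10x6 data frame. (The 10-row output in the question presumably came from the original files, which had only the 5 columns shown in the dput output.)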
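The difference can be inspected directly. The sketch below reuses the `dat1`/`dat2` strings from the question; `read_one` is a hypothetical helper added only for illustration. It reads from the strings with `read.csv(text = ...)` rather than text connections, and sets `stringsAsFactors = TRUE` explicitly so the behaviour matches older R versions where that was the default.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse one CSV string into a data frame with factors.
read_one <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

as_list   <- lapply(list(dat1, dat2), read_one)  # list of two data frames
as_matrix <- sapply(list(dat1, dat2), read_one)  # simplified by sapply

class(as_list[[1]])  # still a data frame
dim(as_matrix)       # 6 x 2: one row per column name, one column per file
typeof(as_matrix)    # "list": each cell holds one column vector

good <- do.call(rbind, as_list)    # rbind on data frames
bad  <- do.call(rbind, as_matrix)  # rbind on 12 separate vectors

dim(good)  # 10 rows, 6 columns
dim(bad)   # 12 rows, 5 columns: every original column became a row
```

Here `good$Location` is still a factor, while `bad` is a plain integer matrix in which the factor columns appear only as their internal codes.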

Summary: just use lapply
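For files on disk, the same lapply pattern might look like the sketch below. The two temporary CSV files and their contents are hypothetical stand-ins so that the example runs anywhere; in practice `paths` would hold your real file paths.

```r
# Hypothetical stand-in files, written to a temp location for the demo.
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(Income = c(55L, 23L), Location = c("EMEA", "AP")),
          paths[1], row.names = FALSE)
write.csv(data.frame(Income = c(34L, 45L), Location = c("LATAM", "AP")),
          paths[2], row.names = FALSE)

# lapply keeps each file as its own data frame; extra arguments
# after the function name are passed on to read.csv.
frames <- lapply(paths, read.csv, stringsAsFactors = TRUE)

# rbind's data-frame method stacks rows and merges factor levels.
merged <- do.call(rbind, frames)
str(merged)

unlink(paths)  # clean up the temporary files
```

Note that the factor levels of `Location` differ between the two files; rbind on data frames reconciles them, which the matrix produced via sapply cannot do.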

smci