13

In R, some functions only work on a data.frame and others only on a tibble or a matrix.

Converting my data using as.data.frame or as.matrix often solves this, but I am wondering how the three differ?

moodymudskipper
Sylvia Rodriguez
  • 2
    A matrix and a data frame do different things. A data frame can contain different data types, like characters, numbers, factors and times, all at once. A matrix can only contain a single type. A matrix is therefore more limited in terms of functionality, but because it is guaranteed to be of a single type, it can be stored as a contiguous array in memory, which allows for more efficient computations. Base R does not have the tibble; it is an add-on from an external package and inherits from data frame, so the two are often mutually compatible. – Allan Cameron Nov 16 '20 at 10:55
  • 3
    Thanks a lot for clarifying. Really no opinion here, just not knowing the difference. If it was possible to use a universal type that served all purposes, someone would probably have thought of it. I figured there must be pros and cons to each different type. Just all out of ignorance, but okay for you to close it, if you consider this ignorance an opinion :) – Sylvia Rodriguez Nov 16 '20 at 11:35
  • 1
    Matrices are used in linear algebra, regression, etc., while data frames are used to represent data sets, as in relational database tables. tibble is not part of R but is part of the tidyverse and "Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors)." See https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html – G. Grothendieck Nov 16 '20 at 14:59
  • 1
    @G.Grothendieck Concerning `data.frame` vs. `tibble`: Is "converting character vectors to factors" the only difference? I think R changed the default `stringsAsFactors` behavior. See also https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html – Christoph Feb 25 '22 at 14:22

2 Answers

18

Because they serve different purposes.

Short summary:

  • A data frame is a list of equal-length vectors. This means that adding a column is as easy as adding a vector to a list. It also means that each column has a single data type, but different columns can have different types. This makes data frames useful for data storage.

  • A matrix is a special case of an atomic vector that has two dimensions. This means that the whole matrix has to have a single data type, which makes matrices useful for algebraic operations. It can also make numeric operations faster in some cases, since you don't have to perform type checks. However, if you are careful enough with data frames, the difference will not be big.

  • A tibble is a modernized version of a data frame used in the tidyverse. Tibbles use several techniques to make them 'smarter', for example lazy loading.
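
The first two points can be checked directly in base R. This is a minimal sketch; the object and column names (`df`, `m`, `id`, `name`, `score`) are made up for illustration, and `stringsAsFactors = FALSE` is passed explicitly so the result does not depend on the R version.

```r
# A data frame is a list of equal-length vectors (its columns).
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(df)          # TRUE
typeof(df)           # "list"

# Adding a column works like adding an element to a list.
df$score <- c(2.5, 3.0, 4.5)
sapply(df, class)    # one class per column: integer, character, numeric

# A matrix is an atomic vector with a "dim" attribute, so it has one type.
m <- matrix(1:6, nrow = 2)
is.atomic(m)         # TRUE
is.list(m)           # FALSE
typeof(m)            # "integer"
dim(m)               # 2 3
```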

Long description of matrices, data frames and other data structures as used in R.

So to sum up: matrix and data frame are both 2D data structures. Each of them serves a different purpose and thus behaves differently. Tibble is an attempt to modernize the data frame and is used in the widely used tidyverse.

If I try to rephrase it from a less technical perspective: Each data structure is making tradeoffs.

  • A data frame trades a little of its efficiency for convenience and clarity.
  • A matrix is efficient, but harder to wield since it enforces restrictions upon its data.
  • A tibble trades even more efficiency for even more convenience, while also trying to mask that tradeoff with techniques that postpone the computation to a time when it doesn't appear to be its fault.
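
The restriction a matrix enforces is easy to see when converting, as in the question. A small base R sketch, with hypothetical example data: converting a mixed-type data frame with `as.matrix` coerces everything to one type, while an all-numeric data frame converts cleanly and can then be used in matrix algebra.

```r
# Mixed column types: as.matrix() has to pick a single type for all cells.
mixed <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
m_mixed <- as.matrix(mixed)
typeof(m_mixed)      # "character": the numbers were turned into strings

# All-numeric columns: the matrix stays numeric and supports linear algebra.
num <- data.frame(a = c(1, 2), b = c(3, 4))
m_num <- as.matrix(num)
typeof(m_num)        # "double"
prod <- m_num %*% t(m_num)   # matrix product
prod
```

Going the other way, `as.data.frame(m_num)` gives back a data frame with one column per matrix column.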
Shamis
  • 1
    Woahhh, this is a really awesome explanation. Thank you. I have worked with all three types before, but I never knew exactly why/how they were different. This is an excellent detailed description. Thank you! – Sylvia Rodriguez Nov 16 '20 at 11:39
  • 1
    The explanation about tibbles is wrong. There is no postponing of computations, and I'm not sure what lazy loading means here. Maybe there is confusion with a special type of tibble that inherits from "tbl_df" and "tbl_lazy" and is designed to work with databases. – moodymudskipper Oct 01 '22 at 10:03
1

About the difference between data frames and tibbles, the 2 main differences are explained here: https://www.rstudio.com/blog/tibble-1-0-0/

Besides, my understanding is the following:

  • If you subset a tibble, you always get back a tibble.
  • Tibbles can have complex entries.
  • Tibbles can be grouped.
  • Tibbles display better.
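
A rough sketch of the subsetting and "complex entries" points, assuming the tibble package is installed (the code skips the tibble part otherwise). Note that "always get back a tibble" applies to subsetting with `[`; `[[` and `$` still return the column itself. The names `df`, `tb` and `tb2` are only for illustration.

```r
# Base data frame: selecting one column with `[` drops to a plain vector.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df[, "x"])                  # "integer"
class(df[, "x", drop = FALSE])    # "data.frame"

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))
  # Same subsetting on a tibble keeps the tibble class.
  print(class(tb[, "x"]))         # "tbl_df" "tbl" "data.frame"

  # A list column can hold a different kind of object in every row.
  tb2 <- tibble::tibble(x = 1:3, z = list(1:2, "a", TRUE))
  print(is.list(tb2$z))           # TRUE
  print(tb2)                      # compact printing with column types
}
```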

kAmJi