596

I'm trying to initialize a data.frame without any rows. Basically, I want to specify the data types for each column and name them, but not have any rows created as a result.

The best I've been able to do so far is something like:

df <- data.frame(Date=as.Date("01/01/2000", format="%m/%d/%Y"), 
                 File="", User="", stringsAsFactors=FALSE)
df <- df[-1,]

Which creates a data.frame with a single row containing all of the data types and column names I wanted, but also creates a useless row which then needs to be removed.

Is there a better way to do this?

Jaap
Jeff Allen

17 Answers

785

Just initialize it with empty vectors:

df <- data.frame(Date=as.Date(character()),
                 File=character(), 
                 User=character(), 
                 stringsAsFactors=FALSE) 

Here's another example with different column types:

df <- data.frame(Doubles=double(),
                 Ints=integer(),
                 Factors=factor(),
                 Logicals=logical(),
                 Characters=character(),
                 stringsAsFactors=FALSE)

> str(df)
'data.frame':   0 obs. of  5 variables:
 $ Doubles   : num 
 $ Ints      : int 
 $ Factors   : Factor w/ 0 levels: 
 $ Logicals  : logi 
 $ Characters: chr 

N.B. :

Initializing a data.frame with an empty column of the wrong type does not prevent further additions of rows having columns of different types.
This method is just a bit safer in the sense that you'll have the correct column types from the beginning, hence if your code relies on some column type checking, it will work even with a data.frame with zero rows.
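
To illustrate the note above, here is a minimal sketch (the column names are reused from the example, the assigned value is made up) showing that a later assignment can change a column's declared type:

```r
# Start with a zero-row data frame whose first column is declared as integer
df <- data.frame(Ints = integer(), Logicals = logical())
class(df$Ints)

# Assigning a string into a new first row coerces the whole column
df[1, "Ints"] <- "a"
class(df$Ints)
```

So the declared type documents intent and supports type checks on the empty frame, but it is not enforced afterwards.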

digEmAll
  • 3
    Would it be the same if I initialize all fields with NULL? – yosukesabai Aug 20 '13 at 15:04
  • 11
    @yosukesabai: no, if you initialize a column with NULL the column won't be added :) – digEmAll Aug 20 '13 at 16:32
  • I see that... why I thought it would work...? So this means I have to know type of data on each column ahead of time and initialize properly? – yosukesabai Aug 20 '13 at 16:38
  • 7
    @yosukesabai: `data.frame`'s have typed columns, so yes, if you want to initialize a `data.frame` you must decide the type of the columns... – digEmAll Aug 21 '13 at 07:06
  • For the sake of completeness this would be good to give a second example with all the possible primitive types that could be assumed to make this answer a solid reference. – jxramos Jun 09 '15 at 20:47
  • 1
    @jxramos: well, actually `data.frame` is not really restrictive on the "primitivity" of the columns types (for example, you can add a column of dates or even a column containing list of elements). Also, this question is not an absolute reference, since for example if you don't specify the correct type of the column you will not block further row addition having column of different types... so, I will add a note, but not an example with all primitive types because it does not cover all the possibilities... – digEmAll Jun 10 '15 at 10:38
  • 1
    That's all good and true, the initially specified type is not a limiting contract for any given column but it's still useful for communicating intent and whatever degree of readability it offers. I wound up using your example with some double() columns in my application, which funnily enough was written by another using pretty much the same approach as the Question author's solution. I too wanted to see an easier way to do it without recourse to a throwaway row. Exhaustive coverage may be too much, but a good sampling beyond character seems reasonable too. – jxramos Jun 10 '15 at 21:17
  • @digEmAll how do you specify the number of rows? – Herman Toothrot Dec 15 '16 at 16:24
  • 3
    @user4050: the question was about creating an empty data.frame, so when the number of rows is zero...maybe you want to create a data.frame full on NAs... in that case you can use e.g. `data.frame(Doubles=rep(as.double(NA),numberOfRow), Ints=rep(as.integer(NA),numberOfRow))` – digEmAll Dec 15 '16 at 16:45
  • Without ``stringsAsFactors=FALSE``, ``character()`` is coerced to a factor! ``str(data.frame(a=character())) 'data.frame': 0 obs. of 1 variable: $ a: Factor w/ 0 levels:`` – PatrickT Oct 01 '17 at 20:38
  • Yes, that's why I set that parameter – digEmAll Oct 02 '17 at 19:53
  • 2
    how do you append to such a data frame without triggering `data has 0` rows error? – 3pitt Jan 09 '18 at 14:45
  • @MikePalmice: use `rbind` or `DF[nrow(DF)+1,] <- the row to append` – digEmAll Jan 09 '18 at 19:55
  • @digEmAll is there any advantage of creating empty data frame vs empty but with `NA`s vs actually **filled** with `NA`s? – vasili111 Sep 21 '19 at 23:45
  • @vasili111: they are different approach for different situations. For instance a data.frame preallocated with NAs is applicable only when you know the number of rows in advance; in this case may be more efficient but really it depends on the problem you're going to solve – digEmAll Sep 25 '19 at 14:55
202

If you already have an existing data frame, say df, that has the columns you want, then you can create an empty data frame by removing all of its rows:

empty_df = df[FALSE,]

Notice that df still contains the data, but empty_df doesn't.

I found this question looking for how to create a new instance with empty rows, so I think it might be helpful for some people.
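
One caveat, sketched below with made-up example data: when the data frame has a single column, row-only indexing drops the result to a plain vector. Passing `drop = FALSE` keeps the data frame.

```r
# A one-column data frame (hypothetical example data)
one_col <- data.frame(x = 1:3)

# With a single column, row-only indexing drops to a plain vector
dropped <- one_col[FALSE, ]
is.data.frame(dropped)

# drop = FALSE keeps the data.frame class and the column type
empty_df <- one_col[FALSE, , drop = FALSE]
is.data.frame(empty_df)
```

Writing `df[FALSE, , drop = FALSE]` is therefore the safer form when the number of columns is not known in advance.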

MichaelChirico
toto_tico
  • 2
    Wonderful idea. Keep none of the rows, but ALL the columns. Whoever downvoted missed something. – Ram Narasimhan Jun 04 '18 at 21:39
  • 1
    Nice solution, however I found that I get a data frame with 0 rows. In order to keep the size of the data frame the same, I suggest new_df = df[NA,]. This also allows to store any previous column into the new data frame. For example to obtain the "Date" column from original df (while keeping rest NA): new_df$Date <- df$Date. – Katya Sep 01 '18 at 10:45
  • 2
    @Katya, if you do `df[NA,]` this will affect the index as well (which is unlikely to be what you want), I would instead use `df[TRUE,] = NA`; however notice that this will overwrite the original. You will need to copy the dataframe first `copy_df = data.frame(df)` and then `copy_df[TRUE,] = NA` – toto_tico Sep 03 '18 at 07:49
  • 4
    @Katya, or you can also easily add empty rows to the `empty_df` with `empty_df[0:nrow(df),] <- NA`. – toto_tico Sep 03 '18 at 07:50
  • toto_tico, thank you for the additions, indeed the labels were affected, so I deleted them with: row.names(df) <- c(). However I think that your suggestion is better, because it allowed to create the correct size df with col names and correct row names: newDf <- df then newDf[,]<-NA By the way, how do you get your typed code to come up in grey? :-) – Katya Sep 10 '18 at 14:05
  • 1
    @Katya, you use the backquote (\`) around what you would like to mark as code, and there is other stuff such as *italics* using \*, and **bold** using \*\*. You probably want to read all the [Markdown Syntax of SO](https://stackoverflow.com/editing-help). Most of it only makes sense for answers though. – toto_tico Sep 10 '18 at 17:20
  • @toto_tico This is a great solution! Save typing out potentially a large number of columns. +1 – horaceT Apr 19 '21 at 17:48
  • 1
    `empty_df = df[FALSE,]` creates a `numeric (empty)` instead of retaining the `dataframe` type. Has this behaviour changed since this solution was added? – Larry Cai May 14 '21 at 08:02
  • @LarryCai I get the same behavior – user551504 Oct 26 '21 at 14:43
  • @LarryCai, could you post a small example? I tested the one in the question `df <- data.frame(Date=as.Date("01/01/2000", format="%m/%d/%Y"), File="", User="", stringsAsFactors=FALSE)`, and still works for me. – toto_tico Oct 29 '21 at 09:06
88

You can do it without specifying column types:

df = data.frame(matrix(vector(), 0, 3,
                       dimnames=list(c(), c("Date", "File", "User"))),
                stringsAsFactors=FALSE)
MERose
zeleniy
  • 4
    In that case, the column types default as logical per vector(), but are then overridden with the types of the elements added to df. Try str(df), df[1,1]<-'x' – Dave X Aug 28 '14 at 16:50
69

You could use read.table with an empty string for the input text as follows:

colClasses = c("Date", "character", "character")
col.names = c("Date", "File", "User")

df <- read.table(text = "",
                 colClasses = colClasses,
                 col.names = col.names)

Alternatively specifying the col.names as a string:

df <- read.csv(text="Date,File,User", colClasses = colClasses)

Thanks to Richard Scriven for the improvement
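
As a quick check of the `read.table` approach, this sketch reuses the `colClasses` and `col.names` values from above and inspects the result:

```r
colClasses <- c("Date", "character", "character")
col.names <- c("Date", "File", "User")

# Empty input text plus explicit names and classes gives a typed, zero-row frame
df <- read.table(text = "", colClasses = colClasses, col.names = col.names)

nrow(df)           # number of rows
sapply(df, class)  # class of each column
```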

Rentrop
  • 4
    Or even `read.table(text = "", ...)` so you don't need to explicitly open a connection. – Rich Scriven Oct 28 '14 at 18:19
  • 1
    snazzy. probably the most extensible/automable way of doing this for _many_ potential columns – MichaelChirico May 03 '16 at 01:31
  • 3
    The `read.csv` approach also works with `readr::read_csv`, as in `read_csv("Date,File,User\n", col_types = "Dcc")`. This way you can directly create an empty tibble of the required structure. – Heather Turner Feb 20 '17 at 19:37
54

Just declare

table = data.frame()

When you rbind the first row, the columns are created from that row.
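
A small sketch of this behaviour, with made-up names (`results`, `first_row`, `File`, `Size`) that are not part of the original answer:

```r
# Start with a data frame that has no rows and no columns
results <- data.frame()

# The first row bound onto it supplies both the column names and the types
first_row <- data.frame(File = "a.csv", Size = 10L, stringsAsFactors = FALSE)
results <- rbind(results, first_row)

str(results)
```

Note that the column names and types here come entirely from the first row, so nothing is declared up front.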

dpel
Daniel Fischer
  • 3
    Doesn't really meet the OP's requirements of "I want to specify the data types for each column and name them". *If* the next step is an `rbind` this would work well, if not... – Gregor Thomas Sep 02 '15 at 00:31
  • 1
    Anyway, thanks for this simple solution. I wanted also to initialize a data.frame with specific columns since I thought rbind can only be used if the columns corresponds between the two data.frame. This seems not to be the case. I was surprised that I can so simply initialize a data.frame when using rbind. Thanks. – giordano Dec 06 '16 at 17:11
  • 7
    The best proposed solution here. For me, using the proposed way, worked perfectly with `rbind()`. – Kots Oct 04 '18 at 11:20
30

The most efficient way to do this is to use structure to create a list that has the class "data.frame":

structure(list(Date = as.Date(character()), File = character(), User = character()), 
          class = "data.frame")
# [1] Date File User
# <0 rows> (or 0-length row.names)

To put this into perspective compared to the presently accepted answer, here's a simple benchmark:

s <- function() structure(list(Date = as.Date(character()), 
                               File = character(), 
                               User = character()), 
                          class = "data.frame")
d <- function() data.frame(Date = as.Date(character()),
                           File = character(), 
                           User = character(), 
                           stringsAsFactors = FALSE) 
library("microbenchmark")
microbenchmark(s(), d())
# Unit: microseconds
#  expr     min       lq     mean   median      uq      max neval
#   s()  58.503  66.5860  90.7682  82.1735 101.803  469.560   100
#   d() 370.644 382.5755 523.3397 420.1025 604.654 1565.711   100
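
One thing to keep in mind with this approach: `structure()` skips the validation that `data.frame()` performs, and the object built above has no `row.names` attribute. The sketch below sets it explicitly. Whether the missing attribute matters for downstream code is an assumption to check in your own setting.

```r
# Same columns as above, with an explicit zero-length row.names attribute
s <- structure(list(Date = as.Date(character()),
                    File = character(),
                    User = character()),
               class = "data.frame",
               row.names = integer())

is.data.frame(s)
dim(s)
```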
Thomas
  • A `data.table` usually contains a `.internal.selfref` attribute, which cannot be faked without calling the `data.table` functions. Are you sure you are not relying on an undocumented behavior here? – Adam Ryczkowski Feb 10 '17 at 16:26
  • @AdamRyczkowski I think you're confusing the base "data.frame" class and the add-on "data.table" class from the [data.table package](https://cran.r-project.org/package=data.table). – Thomas Feb 10 '17 at 17:35
  • Yes. Definitely. My bad. Ignore my last comment. I came across this thread when searching for the `data.table` and assumed that Google did find what I wanted and everything here is `data.table`-related. – Adam Ryczkowski Feb 11 '17 at 14:22
  • 1
    @PatrickT There's no checking that what your code is doing makes any sense. `data.frame()` provides checks on naming, rownames, etc. – Thomas Oct 02 '17 at 07:00
17

If you are looking for brevity:

read.csv(text="col1,col2")

so you don't need to specify the column names separately. You get the default column type logical until you fill the data frame.

Community
marc
  • read.csv parses the text argument so you get the column names. It is more compact than read.table(text="", col.names = c("col1", "col2")) – marc Jan 27 '15 at 16:10
  • I get : `Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 2` – Climbs_lika_Spyder May 17 '15 at 21:29
  • This doesn't meet OP's requirements, *"I want to specify the data types for each column"*, though it could probably be modified to do so. – Gregor Thomas Oct 10 '17 at 17:07
  • Very late to the party, but `readr` can do it: `read_csv2("a;b;c;d;e\n", col_types = "icdDT")`. The `\n` is needed so that the input is recognized as a string rather than a file (or use `c("a;b;c;d;e", "")`). As a bonus, column names won't be modified (e.g. `col-1` or `why spaces`) – Marek Feb 12 '21 at 07:39
15

I created an empty data frame using the following code:

df = data.frame(id = numeric(0), jobs = numeric(0))

and tried to bind a row to populate it as follows:

newrow = c(3, 4)
df <- rbind(df, newrow)

but this produced incorrect column names:

  X3 X4
1  3  4

The solution is to convert newrow to a data frame with matching column names:

newrow = data.frame(id=3, jobs=4)
df <- rbind(df, newrow)

which now gives the correct data frame, with the expected column names:

  id jobs
1  3   4 
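
An alternative that avoids building a one-row data frame is to assign a list into the next row index, as suggested in a comment on the top answer. A minimal sketch, reusing the `id` and `jobs` columns from above:

```r
df <- data.frame(id = numeric(0), jobs = numeric(0))

# Assign into row nrow(df) + 1; the existing column names are kept
df[nrow(df) + 1, ] <- list(3, 4)

df
```

Using a list rather than `c()` on the right-hand side avoids coercing values to a common type when the columns have different types.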
Shrikant Prabhu
9

To create a data frame of a given size filled with NA, pass the number of rows and columns to the following function:

create_empty_table <- function(num_rows, num_cols) {
    frame <- data.frame(matrix(NA, nrow = num_rows, ncol = num_cols))
    return(frame)
}

To create an empty frame while specifying the class of each column, simply pass a vector of the desired data types into the following function:

create_empty_table <- function(num_rows, num_cols, type_vec) {
  frame <- data.frame(matrix(NA, nrow = num_rows, ncol = num_cols))
  # Convert each all-NA column to the requested type
  for(i in seq_len(ncol(frame))) {
    if(type_vec[i] == 'numeric') {frame[,i] <- as.numeric(frame[,i])}
    if(type_vec[i] == 'character') {frame[,i] <- as.character(frame[,i])}
    if(type_vec[i] == 'logical') {frame[,i] <- as.logical(frame[,i])}
    if(type_vec[i] == 'factor') {frame[,i] <- as.factor(frame[,i])}
  }
  return(frame)
}

Use as follows:

df <- create_empty_table(3, 3, c('character','logical','numeric'))

Which gives:

   X1  X2 X3
1 <NA> NA NA
2 <NA> NA NA
3 <NA> NA NA

To confirm your choices, run the following:

lapply(df, class)

#output
$X1
[1] "character"

$X2
[1] "logical"

$X3
[1] "numeric"
DSides
Cybernetic
7

If you want to create an empty data.frame with dynamic names (colnames in a variable), this can help:

names <- c("v","u","w")
df <- data.frame()
for (k in names) df[[k]]<-as.numeric()

You can also use a different type for each column if needed, for example:

names <- c("u", "v")
df <- data.frame()
df[[names[1]]] <- as.numeric()
df[[names[2]]] <- as.character()
Gregor Thomas
Ali Khosro
7

Using data.table, we can specify the data type of each column (note that the result is a data.table, which also inherits from data.frame):

library(data.table)    
data=data.table(a=numeric(), b=numeric(), c=numeric())
Rushabh Patel
6

If you don't mind not specifying data types explicitly, you can do it this way:

headers<-c("Date","File","User")
df <- as.data.frame(matrix(,ncol=3,nrow=0))
names(df)<-headers

#then bind incoming data frame with col types to set data types
df<-rbind(df, new_df)
3

If you want to declare such a data.frame with many columns, it'll probably be a pain to type all the column classes out by hand. Especially if you can make use of rep, this approach is easy and fast (about 15% faster than the other solution that can be generalized like this):

If your desired column classes are in a vector colClasses, you can do the following:

library(data.table)
setnames(setDF(lapply(colClasses, function(x) eval(call(x)))), col.names)

lapply will result in a list of desired length, each element of which is simply an empty typed vector like numeric() or integer().

setDF converts this list by reference to a data.frame.

setnames adds the desired names by reference.
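
For readers who would rather not depend on data.table, the same idea can be sketched in base R. The helper name `make_empty_df` and its arguments are hypothetical, and this version makes no claim to match the speed figures below:

```r
# Build a zero-row data frame from parallel vectors of column names and
# class constructor names such as "integer", "character" or "factor"
make_empty_df <- function(col_names, col_classes) {
  # Calling each constructor with no arguments yields an empty typed vector
  cols <- lapply(col_classes, function(cls) do.call(cls, list()))
  names(cols) <- col_names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

empty <- make_empty_df(c("id", "label", "flag"),
                       c("integer", "character", "logical"))
str(empty)
```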

Speed comparison:

classes <- c("character", "numeric", "factor",
             "integer", "logical","raw", "complex")

NN <- 300
colClasses <- sample(classes, NN, replace = TRUE)
col.names <- paste0("V", 1:NN)

setDF(lapply(colClasses, function(x) eval(call(x))))

library(microbenchmark)
microbenchmark(times = 1000,
               read = read.table(text = "", colClasses = colClasses,
                                 col.names = col.names),
               DT = setnames(setDF(lapply(colClasses, function(x)
                 eval(call(x)))), col.names))
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval cld
#  read 2.598226 2.707445 3.247340 2.747835 2.800134 22.46545  1000   b
#    DT 2.257448 2.357754 2.895453 2.401408 2.453778 17.20883  1000  a 

It's also faster than using structure in a similar way:

microbenchmark(times = 1000,
               DT = setnames(setDF(lapply(colClasses, function(x)
                 eval(call(x)))), col.names),
               struct = eval(parse(text=paste0(
                 "structure(list(", 
                 paste(paste0(col.names, "=", 
                              colClasses, "()"), collapse = ","),
                 "), class = \"data.frame\")"))))
#Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval cld
#     DT 2.068121 2.167180 2.821868 2.211214 2.268569 143.70901  1000  a 
# struct 2.613944 2.723053 3.177748 2.767746 2.831422  21.44862  1000   b
MichaelChirico
3

If you already have a data frame, you can extract its metadata (column names and types), e.g. if you are tracking down a bug that is only triggered by certain inputs and need an empty dummy data frame:

columns_and_types <- sapply(df, class)

# prints the column names, e.g. c("col1", "col2")
dput(as.character(names(columns_and_types)))

# prints the column classes, e.g. c("integer", "factor")
dput(as.character(as.vector(columns_and_types)))

Then use read.table to create the empty data frame:

read.table(text = "",
   colClasses = c('integer', 'factor'),
   col.names = c('col1', 'col2'))
toto_tico
3

I keep this function handy for whenever I need it, and change the column names and classes to suit the use case:

make_df <- function() { data.frame(name=character(),
                     profile=character(),
                     sector=character(),
                     type=character(),
                     year_range=character(),
                     link=character(),
                     stringsAsFactors = FALSE)
}

make_df()
[1] name       profile    sector     type       year_range link      
<0 rows> (or 0-length row.names)
stevec
1

If your column names are dynamic, you can create an empty row-named matrix and transform it into a data frame.

nms <- sample(LETTERS, sample(1:10, 1))
as.data.frame(t(matrix(nrow=length(nms),ncol=0,dimnames=list(nms))))
jpmarindiaz
1

This question didn't specifically address my concerns (outlined here) but in case anyone wants to do this with a parameterized number of columns and no coercion:

> require(dplyr)
> dbNames <- c('a','b','c','d')
> emptyTableOut <- 
    data.frame(
        character(), 
        matrix(integer(), ncol = 3, nrow = 0), stringsAsFactors = FALSE
    ) %>% 
    setNames(nm = c(dbNames))
> glimpse(emptyTableOut)
Observations: 0
Variables: 4
$ a <chr> 
$ b <int> 
$ c <int> 
$ d <int>

As divibisan states on the linked question,

...the reason [coercion] occurs [when cbinding matrices and their constituent types] is that a matrix can only have a single data type. When you cbind 2 matrices, the result is still a matrix and so the variables are all coerced into a single type before converting to a data.frame

d8aninja