
This post provides an elegant way to create an empty pandas DataFrame of a specified data type, and if you use np.nan as the fill value when you initialize it, the data type is set to float:

df_training_outputs = pd.DataFrame(np.nan, index=index, columns=column_names)
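(A self-contained version of that snippet; the index and column_names here are hypothetical stand-ins, since the originals aren't shown:)

import numpy as np
import pandas as pd

index = pd.RangeIndex(5)            # hypothetical row index
column_names = ['a', 'b', 'c']      # hypothetical column names

# np.nan is a float, so every column comes out as float64
df_training_outputs = pd.DataFrame(np.nan, index=index, columns=column_names)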

But I want to create an empty DataFrame with different data types in each column. It seems the dtype keyword argument will only accept one.

Background: I am writing a script that generates data incrementally, so I need somewhere to store it during execution. I thought an empty DataFrame (large enough to hold all the expected data) would be the best way to do this. This must be a fairly common task, so if someone has a better way, please advise.

– Bill
  • Maybe it would be effective to use a separate Series for each column and concatenate them when a DataFrame is needed? – knagaev May 23 '16 at 11:52
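A minimal sketch of knagaev's suggestion, with made-up column names and data: keep one container per column so each keeps its own dtype, and concatenate only when the DataFrame is needed:

import pandas as pd

# Accumulate each column separately; dtypes stay independent.
ints, floats = [], []
for i in range(3):                  # stand-in for the incremental script
    ints.append(i)
    floats.append(i / 2)

df = pd.concat(
    {'a': pd.Series(ints, dtype='int64'),
     'b': pd.Series(floats, dtype='float64')},
    axis=1,
)
# df.dtypes -> a: int64, b: float64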

1 Answer


One way to create an empty DataFrame with columns of different types is to provide an empty numpy array with the correct structured dtype:

>>> import pandas as pd
>>> import numpy as np
>>> # 'u4' = uint32, 'S20' = 20-byte bytestring (-> object), 'f8' = float64
>>> df = pd.DataFrame(np.empty(0, dtype=[('a', 'u4'), ('b', 'S20'), ('c', 'f8')]))

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 3 columns):
a    0 non-null uint32
b    0 non-null object
c    0 non-null float64
dtypes: float64(1), object(1), uint32(1)
memory usage: 76.0+ bytes
– aldanor
  • Thanks. This works. However, perhaps not surprisingly, I noticed a huge speed difference doing it this way (3.23 s to complete) compared to the earlier method above (168 ms), where the dataframe was created entirely with the same data type (float). So in my case I think it's better to first fill the dataframe with floats and then convert the desired columns to integers at the end (sketched after these comments). – Bill May 23 '16 at 06:41
  • To clarify: by speed difference, I mean the time it takes to fill the resulting dataframe with values using setter methods such as `df.at[] = ...` – Bill May 23 '16 at 06:55
  • @Bill there's nothing surprising here really, for homogeneous arrays pandas may use a single 2-D container as a backend. – aldanor May 23 '16 at 07:03
  • @Bill You could also try just using a raw numpy record array and then converting it to a dataframe at the very end; this way it's zero-copy and could be even faster than the homogeneous dataframe approach (also sketched after these comments). – aldanor May 23 '16 at 09:18
  • Thanks @aldanor. I realize now a dataframe wasn't the right approach for capturing this data. I need to build the data in separate but fast and efficient data objects such as pandas.Series or numpy arrays and then combine them at the end into a data frame. – Bill May 23 '16 at 16:58
  • Can someone explain the syntax of this solution a bit? It is not intuitive to me. Thank you. – Windstorm1981 May 27 '17 at 17:19
  • @Windstorm1981 this creates an empty numpy structured array with the desired dtype -- which pandas then uses to set dtypes of all columns in the dataframe. – aldanor May 28 '17 at 21:23
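A sketch of the fill-as-float-then-convert workflow Bill describes in the comments above (sizes and column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, index=range(1000), columns=['a', 'b'])
for i in range(1000):               # fast: one homogeneous float block
    df.at[i, 'a'] = i
    df.at[i, 'b'] = i / 2
# Convert once at the end; 'a' must be fully populated (no NaN) first.
df['a'] = df['a'].astype('uint32')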
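And a sketch of the raw record-array idea from aldanor's comment (again with illustrative names and sizes); whether the final conversion copies may depend on the pandas version:

import numpy as np
import pandas as pd

rec = np.empty(1000, dtype=[('a', 'u4'), ('b', 'f8')])   # preallocated
for i in range(1000):
    rec[i] = (i, i / 2)             # fast in-place row assignment
df = pd.DataFrame(rec)              # convert only at the very end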