I don't quite understand how a row of a pandas DataFrame can be represented by a Series.

I understand that the underlying representation of a pandas Series is a NumPy array, that is, an array of homogeneous values. So I see why a DataFrame column is represented by a Series: a column holds one attribute for different entities, so all of its values belong to the same data type.

But how can a DataFrame row (i.e. a set of potentially different attributes with different data types) be represented by a Series?

My guess is that the values of all those different attributes are represented by a more abstract data type such as 'object', and that the underlying (homogeneous) NumPy array is an array of 'object's.

Can someone please confirm that my understanding is right?

Thanks

Tomas

jpp
Tomas Sedlacek

1 Answer

Internally, pandas represents each series, or column, of data with a specific data type, or dtype:

import pandas as pd

df = pd.DataFrame([[2, True, 3.5, 'hello'], [4, False, 5.12, 'again']])

print(df)

   0      1     2      3
0  2   True  3.50  hello
1  4  False  5.12  again

print(df.dtypes)

0      int64
1       bool
2    float64
3     object
dtype: object

When you ask for a row of data which contains mixed types, pandas converts the values to a common dtype, which for a mix like this one is dtype=object. Such a series can hold virtually anything:

# extract first row
print(df.iloc[0])

0        2
1     True
2      3.5
3    hello
Name: 0, dtype: object
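As a rough check of how the row dtype depends on the columns involved, here is a minimal sketch using the same example frame. The variable names are only illustrative, and the exact upcasting rules may differ between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, True, 3.5, 'hello'], [4, False, 5.12, 'again']])

mixed_row = df.iloc[0]              # int, bool, float and str together
numeric_row = df[[0, 2]].iloc[0]    # int and float columns only
int_bool_row = df[[0, 1]].iloc[0]   # int and bool columns only

print(mixed_row.dtype)     # object
print(numeric_row.dtype)   # float64: the int is upcast to float
print(int_bool_row.dtype)  # object: pandas does not merge bool with numbers
```

So a row only falls back to object when its columns have no common numeric dtype.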

Notice that this object series holds values of several different types. For efficiency, you should aim to perform operations on series which are held in contiguous memory blocks. This is the case for int, float, datetime and bool series, but not for object series, which contain pointers to data rather than the data itself.
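One way to see the difference is to compare shallow and deep memory usage. This is a small sketch with two made-up series (`floats` and `objects` are illustrative names); the exact byte counts depend on platform and version, so none are shown:

```python
import pandas as pd

floats = pd.Series([3.5, 5.12])                        # values stored inline
objects = pd.Series(['hello', 'again'], dtype=object)  # pointers to Python strings

# deep=True also counts the Python objects that an object series points to
floats_shallow = floats.memory_usage()
floats_deep = floats.memory_usage(deep=True)
objects_shallow = objects.memory_usage()
objects_deep = objects.memory_usage(deep=True)

print(floats_deep == floats_shallow)   # True: nothing lives outside the buffer
print(objects_deep > objects_shallow)  # True: the strings are stored elsewhere
```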

You can get a numpy array from your series:

print(repr(df.iloc[0].values))

array([2, True, 3.5, 'hello'], dtype=object)

But an object array is not the same as a regular, homogeneous array:

Creating an array with dtype=object is different. The memory taken by the array now is filled with pointers to Python objects which are being stored elsewhere in memory (much like a Python list is really just a list of pointers to objects, not the objects themselves).
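The pointer behaviour described above can be sketched with plain NumPy. Here `payload` is just a hypothetical Python list used for illustration:

```python
import numpy as np

payload = ['hello']               # an ordinary Python object

arr = np.empty(2, dtype=object)   # two slots, each holding a reference
arr[0] = payload
arr[1] = 3.5

# The array stores a reference to the list, not a copy of it
print(arr[0] is payload)          # True

# Mutating the list is therefore visible through the array
payload.append('again')
print(arr[0])                     # ['hello', 'again']
```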

jpp
  • I am not sure if this behaviour is documented somewhere but as far as I can tell, if the columns are all sub dtypes of [`np.number`](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#scalars), it does upcasting to numbers again (ints, floats, complex etc.) but if it contains something else (including booleans) the dtype is always object. – ayhan May 21 '18 at 21:03
  • @user2285236, This appears to be *sometimes* the case. In my example, `df.iloc[0, [0, 2]].values.dtype` has dtype `object`, but `df.iloc[:, [0, 2]].values.dtype` has dtype `float`. – jpp May 21 '18 at 21:32