Appending a Series to an empty DataFrame column always gives the same result after a loop

Question

import pandas as pd

df = pd.DataFrame(columns=["A", "B"])

df2 = pd.DataFrame({"C": [5, 6, 7, 8, 9], "D": [1, 2, 3, 4, 5]})

for i in range(5):
    df["A"] = df["A"].append(df2["C"], ignore_index=True)

# print(df)

   A    B
0  5  NaN
1  6  NaN
2  7  NaN
3  8  NaN
4  9  NaN

As you can see, df starts out as an empty DataFrame. After appending column C of df2 to column A of df, column A always contains 5, 6, 7, 8, 9, no matter how many times the loop runs.

If we also append to column B of df after column A, the result is still the same.

for i in range(5):
    df["A"] = df["A"].append(df2["C"], ignore_index=True)
    df["B"] = df["B"].append(df2["D"], ignore_index=True)

# print(df)

   A    B
0  5  NaN
1  6  NaN
2  7  NaN
3  8  NaN
4  9  NaN

If we append df2 to df as a whole, the result is as expected.

for i in range(5):
    df = df.append(df2, ignore_index=True)

      A    B    C    D
0   NaN  NaN  5.0  1.0
1   NaN  NaN  6.0  2.0
2   NaN  NaN  7.0  3.0
3   NaN  NaN  8.0  4.0
4   NaN  NaN  9.0  5.0
...
21  NaN  NaN  6.0  2.0
22  NaN  NaN  7.0  3.0
23  NaN  NaN  8.0  4.0
24  NaN  NaN  9.0  5.0

It would be much simpler if you explain what you are trying to achieve with input and expected output. — Vishnudev Krishnadas, Mar 23 '21 at 07:04
@Vishnudev After a 5 loop assignment, IMO, `df["A"]` should have `25` values. Why it only has `5`? Why `df["B"]` is `NaN` after appending values to it? — Ynjxsjmh, Mar 23 '21 at 07:10

Ynjxsjmh · Answer 1 · 2021-03-23T07:15:32.947

TL;DR: When a Series is assigned to a DataFrame column, it is conformed (reindexed) to the DataFrame's index. From the second iteration on, the result of append() has more elements than the index of df, so the extra elements are dropped and the column value doesn't change.

There is no problem with the append() function; the problem is in the df["A"] assignment.

With df["A"] = xx, we are calling __setitem__():

    def __setitem__(self, key, value):
        key = com.apply_if_callable(key, self)

        # see if we can slice the rows
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            # either we have a slice or we have a string that can be converted
            #  to a slice for partial-string date indexing
            return self._setitem_slice(indexer, value)

        if isinstance(key, DataFrame) or getattr(key, "ndim", None) == 2:
            self._setitem_frame(key, value)
        elif isinstance(key, (Series, np.ndarray, list, Index)):
            self._setitem_array(key, value)
        else:
            # set column
            self._set_item(key, value)

In this case we are not slicing the dataframe like df[:], so indexer is None. The key is "A", a plain string, so we end up calling:

self._set_item(key, value)

Let's see how _set_item() is defined:

    def _set_item(self, key, value):
        """
        Add series to DataFrame in specified column.
        If series is a numpy-array (not a Series/TimeSeries), it must be the
        same length as the DataFrames index or an error will be thrown.
        Series/TimeSeries will be conformed to the DataFrames index to
        ensure homogeneity.
        """
        self._ensure_valid_index(value)
        value = self._sanitize_column(key, value)
        NDFrame._set_item(self, key, value)

        # check if we are modifying a copy
        # try to set first as we want an invalid
        # value exception to occur first
        if len(self):
            self._check_setitem_copy()

The docstring says "Series/TimeSeries will be conformed to the DataFrames index to ensure homogeneity." This explains why the dataframe df doesn't change: after the first loop, the result of append() is longer than the index of df, so the elements whose labels are not in that index are dropped.

If so, why does appending to df succeed in the first loop? The answer lies in self._ensure_valid_index(value):
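
The alignment can be reproduced in isolation. This is a minimal sketch, assuming a recent pandas in which Series.append is no longer available (it was removed in pandas 2.0), so pd.concat stands in for it; the names extra and appended are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 7, 8, 9]})   # index labels 0..4
extra = pd.Series([5, 6, 7, 8, 9])

# Stand-in for df["A"].append(extra, ignore_index=True):
# ten values with index labels 0..9
appended = pd.concat([df["A"], extra], ignore_index=True)

# Column assignment aligns the Series on df's index,
# so the entries labelled 5..9 are dropped
df["A"] = appended

print(len(appended))  # 10
print(len(df))        # 5
```

The frame keeps its five rows, which matches what the question observes from the second iteration on.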

    def _ensure_valid_index(self, value):
        """
        Ensure that if we don't have an index, that we can create one from the
        passed value.
        """

If the dataframe has no index yet, this method reindexes it to the index of value, giving a frame with len(value) rows in which every existing column is filled with NaN. Then NDFrame._set_item(self, key, value) replaces column key with value.

In the second example, we append to column B after column A:
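
The effect on an empty frame can be checked directly. A small sketch, assuming current pandas behaves the same way as the version quoted above for this case; only public API is used, and the values are taken from the question.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"])
print(len(df.index))         # 0

# The frame has no index yet, so it adopts the index of the assigned Series
df["A"] = pd.Series([5, 6, 7, 8, 9])

print(len(df.index))         # 5
print(df["B"].isna().all())  # True: column B was filled with NaN
```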

for i in range(5):
    df["A"] = df["A"].append(df2["C"], ignore_index=True)
    df["B"] = df["B"].append(df2["D"], ignore_index=True)

In the first iteration, assigning to column A creates the index and fills column B with NaN. df["B"].append(df2["D"], ignore_index=True) therefore appends the new values after those five NaN values. When the result is assigned to df["B"], it is conformed to the DataFrame's index, so only the first five entries (the NaN values) are kept. That's why df["B"] remains NaN.

In the third example, we simply rebind the name df to the result of append(); DataFrame.__setitem__() is not involved.
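
To make the B column case concrete, here is a sketch that starts from the state df is in after the first assignment to A. That starting frame is built by hand as an assumption, and pd.concat again stands in for Series.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Assumed state after the first assignment to A: B is all NaN
df = pd.DataFrame({"A": [5, 6, 7, 8, 9], "B": [float("nan")] * 5})
d = pd.Series([1, 2, 3, 4, 5])

# Five NaN values (labels 0..4) followed by 1..5 (labels 5..9)
appended = pd.concat([df["B"], d], ignore_index=True)

# Only labels 0..4 survive the alignment, and those are the NaN values
df["B"] = appended

print(df["B"].isna().all())  # True
```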

for i in range(5):
    df = df.append(df2, ignore_index=True)
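
On pandas 2.0 and later, where DataFrame.append has been removed, the same whole-frame loop can be sketched with pd.concat. This is an equivalent written under that assumption, not the code from the question; depending on the pandas version it may emit a FutureWarning about empty entries.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"])
df2 = pd.DataFrame({"C": [5, 6, 7, 8, 9], "D": [1, 2, 3, 4, 5]})

for _ in range(5):
    # Rebinding df to a new frame: no column assignment, so no alignment
    df = pd.concat([df, df2], ignore_index=True)

print(len(df))           # 25
print(list(df.columns))  # ['A', 'B', 'C', 'D']
```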

@Vishnudev Yes, I met this question and don't understand why. Trying to search the web, it seems nobody has the same question. After debugging, I think I find the answer. Now share it out, if anyone has the same doubt, this would be a nice start. — Ynjxsjmh, Mar 23 '21 at 07:21
