
Given a dataframe such as the following:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])

   A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2 

I want to add a column 'D' initialized with value False. Column 'D' will be used in future processing of the dataframe:

    A   B   C      D
0  A0  B0  C0  False
1  A1  B1  C1  False
2  A2  B2  C2  False

I generated a list of False values based on the df1 index and used it to create a df2, which was then concatenated with df1:

Dlist = [False for item in list(range(len(df1.index)))]
d = {'D':Dlist}
df2 = pd.DataFrame(d, index = df1.index)
result = pd.concat([df1, df2], axis=1, join_axes=[df1.index])

A couple of questions: does the list comprehension in the first line need to be so involved? I tried the following, thinking that df1.index is list-like, but it didn't work.

Dlist = [False for item in df1.index]

More broadly, is there a better approach for doing this with dataframe operations? If I were dealing with a CSV file containing the data for df1, I could easily add 'D' to the file before generating the dataframe.

In terms of philosophy, is modifying dataframes in place, or the CSV files they came from, unavoidable when processing data? It certainly doesn't seem like a good idea when dealing with data in very large files.

hugo

1 Answer


You can just assign the scalar directly to a new column:

In [16]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                 'B': ['B0', 'B1', 'B2'],
    ...:                 'C': ['C0', 'C1', 'C2']},
    ...:                 index=[0, 1, 2])

In [17]: df1
Out[17]:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

In [18]: df1['D'] = False

In [19]: df1
Out[19]:
    A   B   C      D
0  A0  B0  C0  False
1  A1  B1  C1  False
2  A2  B2  C2  False
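This works because a scalar on the right-hand side is broadcast across the whole index, and the new column gets a proper bool dtype. A minimal sketch confirming both, using a hypothetical non-default index to show the broadcast follows the index:

```python
import pandas as pd

# Hypothetical frame with a non-default index
df = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[10, 20, 30])

df['D'] = False  # the scalar is broadcast across every row

print(df['D'].dtype)     # bool
print(df['D'].tolist())  # [False, False, False]
print(list(df.index))    # [10, 20, 30]
```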

You can also use .assign, which returns a new DataFrame, if you don't want to modify the original:

In [20]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                 'B': ['B0', 'B1', 'B2'],
    ...:                 'C': ['C0', 'C1', 'C2']},
    ...:                 index=[0, 1, 2])

In [21]: df1
Out[21]:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

In [22]: df1.assign(D=False)
Out[22]:
    A   B   C      D
0  A0  B0  C0  False
1  A1  B1  C1  False
2  A2  B2  C2  False

In [23]: df1
Out[23]:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

And using pd.concat here is really not useful; you could have simply assigned the list directly. Either way, it's still much, much slower than scalar assignment:

In [44]: import timeit

In [45]: setup = 'import pandas as pd; df = pd.DataFrame({"a":list(range(100000))})'

In [46]: lstcomp = "df['D'] = [False for item in range(len(df.index))]"

In [47]: assgnmt = "df['D'] = False"

In [48]: timeit.timeit(lstcomp, setup, number=100)
Out[48]: 0.6879564090049826

In [49]: timeit.timeit(assgnmt, setup, number=100)
Out[49]: 0.008814844011794776
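
For completeness, assigning the list directly (no second dataframe, no concat) would look like this, reusing the question's df1:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])

# Assign the list straight to the new column; its length must match len(df1)
df1['D'] = [False] * len(df1)

print(df1)
```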

As for your list comprehension, it is not necessary, and it is definitely over-complicated. You said you tried iterating over the index but that "it didn't work", without explaining how it failed. It works for me:

In [24]: [False for item in list(range(len(df1.index)))]
Out[24]: [False, False, False]

In [25]: [False for item in df1.index]
Out[25]: [False, False, False]

Note, yours is doubly inefficient because it calls list on the range object, building a whole list in memory instead of taking advantage of range's fixed-memory behavior (not to mention iterating twice).
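One more note on the concat call itself: the join_axes argument was deprecated in pandas 0.25 and removed in 1.0, so on a current pandas the equivalent would be a reindex after the concat. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])
df2 = pd.DataFrame({'D': [False] * len(df1)}, index=df1.index)

# Equivalent of pd.concat(..., join_axes=[df1.index]) on pandas >= 1.0
result = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(result)
```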

juanpa.arrivillaga