
Given a dataframe such as the following:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])

   A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2 

I want to add a column 'D' initialized with value False. Column 'D' will be used in future processing of the dataframe:

    A   B   C      D
0  A0  B0  C0  False
1  A1  B1  C1  False
2  A2  B2  C2  False

I generated a list of False values based on the df1 index and used it to create a df2, which was then concatenated with df1:

Dlist = [False for item in list(range(len(df1.index)))]
d = {'D':Dlist}
df2 = pd.DataFrame(d, index = df1.index)
result = pd.concat([df1, df2], axis=1, join_axes=[df1.index])

A couple of questions: does the list comprehension in the first line need to be so involved? I tried the following, thinking that df1.index is list-like, but it didn't work.

Dlist = [False for item in df1.index]

More broadly, is there a better approach for doing this with dataframe operations? If I were dealing with a CSV file containing the data for df1, I could easily add 'D' to the file before generating the dataframe.

In terms of philosophy, is modifying dataframes in place, or the CSV files they came from, unavoidable when processing data? It certainly doesn't seem like a good idea when dealing with data in very large files.

hugo

1 Answer


You can just assign the scalar directly to a new column:

In [16]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                 'B': ['B0', 'B1', 'B2'],
    ...:                 'C': ['C0', 'C1', 'C2']},
    ...:                 index=[0, 1, 2])

In [17]: df1
Out[17]:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

In [18]: df1['D'] = False

In [19]: df1
Out[19]:
    A   B   C      D
0  A0  B0  C0  False
1  A1  B1  C1  False
2  A2  B2  C2  False
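This works because a scalar on the right-hand side is broadcast across the whole index, and the new column gets a proper bool dtype. A minimal sketch confirming both, using a hypothetical non-default index to show the broadcast follows the index:

```python
import pandas as pd

# Hypothetical frame with a non-default index
df = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[10, 20, 30])

df['D'] = False  # the scalar is broadcast across every row

print(df['D'].dtype)     # bool
print(df['D'].tolist())  # [False, False, False]
print(list(df.index))    # [10, 20, 30]
```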

You can also use .assign, which returns a new DataFrame, if you don't want to modify the original:

In [20]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
    ...:                 'B': ['B0', 'B1', 'B2'],
    ...:                 'C': ['C0', 'C1', 'C2']},
    ...:                 index=[0, 1, 2])

In [21]: df1
Out[21]:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

In [22]: df1.assign(D=False)
Out[22]:
    A   B   C      D
0  A0  B0  C0  False
1  A1  B1  C1  False
2  A2  B2  C2  False

In [23]: df1
Out[23]:
    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

And using pd.concat here is really not useful; you could have simply assigned the list directly. Either way, it's still much, much slower than scalar assignment:

In [44]: import timeit

In [45]: setup = 'import pandas as pd; df = pd.DataFrame({"a":list(range(100000))})'

In [46]: lstcomp = "df['D'] = [False for item in range(len(df.index))]"

In [47]: assgnmt = "df['D'] = False"

In [48]: timeit.timeit(lstcomp, setup, number=100)
Out[48]: 0.6879564090049826

In [49]: timeit.timeit(assgnmt, setup, number=100)
Out[49]: 0.008814844011794776
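
For completeness, assigning the list directly (no second dataframe, no concat) would look like this, reusing the question's df1:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])

# Assign the list straight to the new column; its length must match len(df1)
df1['D'] = [False] * len(df1)

print(df1)
```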

As for your list comprehension, it is not necessary, and it is definitely over-complicated. You said you tried iterating over the index but that "it didn't work", without explaining how it failed. It works for me:

In [24]: [False for item in list(range(len(df1.index)))]
Out[24]: [False, False, False]

In [25]: [False for item in df1.index]
Out[25]: [False, False, False]

Note, yours is doubly inefficient because it calls list on the range object, building a whole list in memory instead of taking advantage of range's fixed-memory behavior (not to mention iterating twice).
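One more note on the concat call itself: the join_axes argument was deprecated in pandas 0.25 and removed in 1.0, so on a current pandas the equivalent would be a reindex after the concat. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])
df2 = pd.DataFrame({'D': [False] * len(df1)}, index=df1.index)

# Equivalent of pd.concat(..., join_axes=[df1.index]) on pandas >= 1.0
result = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(result)
```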

juanpa.arrivillaga