
I would like to select a subset of columns from a DataFrame without copying the data. According to this answer, that seems to be impossible when the columns have different dtypes. Can anybody confirm? It seems to me that there must be a way, since the feature is so essential.

For example, df.loc[:, ['a', 'b']] produces a copy.

Konstantin
  • If you are referencing an example that shows it to be true, have you tried that it works? – mrCarnivore Nov 27 '17 at 09:40
  • I've tried the `df.loc` example and `_is_view` is set to false. Also, assignment does not propagate to the original DataFrame. So, it produces a copy. (I edited the question to reflect that.) – Konstantin Nov 27 '17 at 09:42
  • I smell an XY problem... what is it you are trying to achieve here? – cs95 Nov 27 '17 at 09:49
  • For example, selecting a subset of columns, then using `itertuples()` to create a list to pass as parameters argument for the `executemany` function of pyodbc. – Konstantin Nov 27 '17 at 09:58
  • Why is a copy a problem again? If it's related to pyodbc, you should tag that too and say so in your question. – Bharath M Shetty Nov 27 '17 at 10:41
  • pyodbc is just one use case. The general question is how to select a subset of columns from a dataframe without copying. Exactly as it's stated above. – Konstantin Nov 27 '17 at 10:49
  • If I want to pass a subset of columns as an argument to a function, the same question arises. – Konstantin Nov 27 '17 at 10:50
  • It's an obvious problem for large data sets! Not every problem revealing bad design is an XY problem. – Joseph Garvin Jan 30 '18 at 18:28

1 Answer


This post applies only to DataFrames whose columns all share the same dtype.

It is possible when the columns to be selected sit at a regular stride from one another, so that they can be picked with a slice inside .iloc. Selecting any two columns is therefore always possible, because two positions always define a stride; for more than two columns, the spacing between consecutive columns must be constant. In every case we need to know the positional column IDs and the stride.
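The view-versus-copy distinction mirrors NumPy's own indexing rules, so it can be illustrated without pandas. The following is a minimal sketch, assuming a single-dtype 2-D array stands in for the data behind a same-dtype DataFrame; the array `arr` and its contents are made up for illustration, and how pandas stores and exposes its data may differ between versions.

```python
import numpy as np

# Hypothetical stand-in for a same-dtype table: 3 rows, 7 columns.
arr = np.arange(21).reshape(3, 7)

strided = arr[:, 1::3]   # basic slicing: columns 1 and 4
fancy = arr[:, [1, 4]]   # integer-list indexing: the same two columns

# Both selections hold the same values at this point.
same_values = np.array_equal(strided, fancy)

# Only the slice shares memory with the original array.
strided_is_view = np.shares_memory(arr, strided)
fancy_is_view = np.shares_memory(arr, fancy)

strided[0, 0] = -1   # writes through to arr[0, 1]
fancy[0, 1] = -2     # changes only the copy; arr[0, 4] is untouched
```

Here `strided_is_view` is true and `fancy_is_view` is false, which is why a slice is needed to avoid the copy.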

Let's try to understand these with the help of some sample cases.

Case #1 : Two columns starting at 0th col ID

In [47]: df1
Out[47]: 
   a  b  c  d
0  5  0  3  3
1  7  3  5  2
2  4  7  6  8

In [48]: np.array_equal(df1.loc[:, ['a', 'b']], df1.iloc[:,0:2])
Out[48]: True

In [50]: np.shares_memory(df1, df1.iloc[:,0:2]) # confirm view
Out[50]: True

Case #2 : Two columns starting at 1st col ID

In [51]: df2
Out[51]: 
   a0  a  a1  a2  b  c  d
0   8  1   6   7  7  8  1
1   5  8   4   3  0  3  5
2   0  2   3   8  1  3  3

In [52]: np.array_equal(df2.loc[:, ['a', 'b']], df2.iloc[:,1::3])
Out[52]: True

In [54]: np.shares_memory(df2, df2.iloc[:,1::3]) # confirm view
Out[54]: True

Case #3 : Three columns starting at 1st col ID and a stride of 2 columns

In [74]: df3
Out[74]: 
   a0  a  a1  b  b1  c  c1  d  d1
0   3  7   0  1   0  4   7  3   2
1   7  2   0  0   4  5   5  6   8
2   4  1   4  8   1  1   7  3   6

In [75]: np.array_equal(df3.loc[:, ['a', 'b', 'c']], df3.iloc[:,1:6:2])
Out[75]: True

In [76]: np.shares_memory(df3, df3.iloc[:,1:6:2]) # confirm view
Out[76]: True

Select 4 columns :

In [77]: np.array_equal(df3.loc[:, ['a', 'b', 'c', 'd']], df3.iloc[:,1:8:2])
Out[77]: True

In [78]: np.shares_memory(df3, df3.iloc[:,1:8:2])
Out[78]: True
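All of the cases above reduce to turning a list of column positions into a slice. A minimal sketch of that step follows; the helper name `positions_to_slice` is hypothetical and not part of pandas or NumPy, and it assumes the positions are given in ascending order.

```python
from typing import Optional, Sequence


def positions_to_slice(positions: Sequence[int]) -> Optional[slice]:
    """Return a slice selecting exactly `positions`, or None if the
    positions are not ascending with a constant stride."""
    if len(positions) == 0:
        return None
    # A single position is treated as a stride-1 slice of length one.
    step = positions[1] - positions[0] if len(positions) > 1 else 1
    if step <= 0:
        return None
    # Every consecutive gap must equal the first gap.
    if any(b - a != step for a, b in zip(positions, positions[1:])):
        return None
    return slice(positions[0], positions[-1] + 1, step)


# Positions taken from the sample cases above.
print(positions_to_slice([0, 1]))        # slice(0, 2, 1)
print(positions_to_slice([1, 4]))        # slice(1, 5, 3)
print(positions_to_slice([1, 3, 5, 7]))  # slice(1, 8, 2)
print(positions_to_slice([1, 3, 6]))     # None: the stride is not constant
```

With a DataFrame `df`, the positions could come from `df.columns.get_loc(name)` for each name (assuming unique column labels), and a non-None result could then be passed as `df.iloc[:, positions_to_slice(positions)]`. When the helper returns None, no single slice covers the columns, so this approach does not apply.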
Divakar
  • I forgot to mention that the dtypes of columns are different. I edited the question. – Konstantin Nov 27 '17 at 10:07
  • @Konstantin You should have mentioned that earlier. I don't think this will work with different dtypes. Keeping this post for future readers who have the regular same-dtype case. – Divakar Nov 27 '17 at 10:15
  • I'm not sure why you've assumed that all dtypes are equal? That's a very narrow use case. Nevertheless, sorry for wasting your time. – Konstantin Nov 27 '17 at 10:33