
I'm working with fairly large datasets that are close to my available memory. I want to select a subset of columns based on column names and then save this data. I don't think I can use regular slicing, as in :2 notation, so I need to select based on label or location. But it seems the only way to do this produces a copy, increasing memory usage considerably whenever I want to save a subset of the data. Is it possible to select a view without using slices? Or is there some creative way to use slices that can allow me to select arbitrarily located columns?

Consider the following:

import pandas as pd

df = pd.DataFrame([[1, 2, 1], [3, 4, 1]], columns=list('abc'))

# you can get a view using :2 slicing
my_slice = df.iloc[:, :2]
my_slice.iloc[0, 0] = 100

df
     a  b  c
0  100  2  1
1    3  4  1

my_slice
     a  b
0  100  2
1    3  4

This returns a view and hence doesn't copy, but I had to index by slicing.

Now I try alternatives.

my_slice = df.iloc[:, [0, 1]]
my_slice.iloc[0, 0] = 99

my_slice
    a  b
0  99  2
1   3  4

df
     a  b  c
0  100  2  1
1    3  4  1

Or

my_slice = df.loc[:, ['a', 'b']]
my_slice.iloc[0, 0] = 55

my_slice
    a  b
0  55  2
1   3  4

df
     a  b  c
0  100  2  1
1    3  4  1

Thus, the last two attempts returned a copy. Again, this is just a simple example. In reality, I have many more columns and the location of the subset of columns I want to save may not be amenable to slicing. This post is related, as it discusses selecting columns from dataframes, but it doesn't focus on being able to select views.
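One way to check the view-versus-copy question directly, rather than inferring it from whether a write propagates, is to ask NumPy whether the underlying arrays overlap. The sketch below is only a diagnostic for a single-dtype frame like the example above; the helper name `shares_memory_with` is made up for illustration. Note that in newer pandas versions with Copy-on-Write enabled, writing through `my_slice` may no longer change `df` even when the two initially share memory, so the memory check is the more direct test.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 1], [3, 4, 1]], columns=list('abc'))


def shares_memory_with(parent, child):
    """Return True if the two frames' underlying arrays overlap in memory.

    Hypothetical helper. Only meaningful when each frame has a single dtype,
    because to_numpy() on a mixed-dtype frame builds a new array.
    """
    return np.shares_memory(parent.to_numpy(), child.to_numpy())


by_slice = df.iloc[:, :2]         # positional slice
by_list = df.iloc[:, [0, 1]]      # list of positions
by_label = df.loc[:, ['a', 'b']]  # list of labels

print("slice shares memory:", shares_memory_with(df, by_slice))
print("list shares memory: ", shares_memory_with(df, by_list))
print("label shares memory:", shares_memory_with(df, by_label))
```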

jtorca
  • If it's just to save the data, then pretty much everything has an option to subset output by column labels... eg `df.to_csv('somefile.csv', columns=['a', 'f', 'z'])` – Jon Clements Jan 27 '20 at 03:20
  • Good point. In my case I was using `to_hdf`, which doesn't appear to throw an error when I include a `columns` argument, but it does not save a subset of the columns. – jtorca Jan 27 '20 at 03:28
  • Looks like it might be called `data_columns` according to [`DataFrame.to_hdf`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_hdf.html) ? – Jon Clements Jan 27 '20 at 03:31
  • I think `data_columns` allows you to select some columns that you'd like to be able to use to query, but does not select a subset to be saved. – jtorca Jan 27 '20 at 03:34
  • Okies... well, it's nearly 4am here and I'm knackered... but since a pandas DataFrame is a souped up 2D numpy array which you can get direct access to by (`df.values`) then the gory details are at https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html for what makes a view and what doesn't and all that... – Jon Clements Jan 27 '20 at 03:40
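The `columns` suggestion from the first comment can be sketched as follows. This is a minimal illustration that writes to an in-memory buffer instead of a file so that it is self-contained; the choice of columns `'a'` and `'c'` is arbitrary. It only shows that `to_csv` restricts the output to the named columns; it does not verify whether pandas makes an intermediate copy internally, and it does not address the `to_hdf` case discussed above.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2, 1], [3, 4, 1]], columns=list('abc'))

# Write only the non-adjacent columns 'a' and 'c', without first building
# a separate subset DataFrame in user code.
buffer = io.StringIO()
df.to_csv(buffer, columns=['a', 'c'], index=False)

print(buffer.getvalue())
```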
