How to iterate over columns of pandas dataframe to run regression

Question

I have this code using Pandas in Python:

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})  
returns = prices.pct_change()

I know I can run a regression like this:

regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()

but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?

In particular, I want to regress each of the other ticker symbols (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals from each regression.

I've tried various versions of the following, but nothing I've tried gives the desired result:

resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid

Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?

You can subscript the columns like so: `for i in range(len(returns.columns) - 1): sm.OLS(returns[returns.columns[i]], returns[returns.columns[i+1]]).fit()` or similar – EdChum, Jan 29 '15 at 15:56

score 551 · Accepted Answer · answered Sep 14 '15 at 06:42

for column in df:
    print(df[column])
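
Applied to the question, the same loop can collect the residuals for each ticker. The sketch below is only an illustration under stated assumptions: the price data is randomly generated, and the no-intercept OLS slope is computed directly with pandas arithmetic instead of statsmodels (sm.OLS(y, x) without an added constant fits the same model). It also assumes that the NaN row which pct_change() leaves at the top needs to be dropped before fitting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the price data downloaded in the question.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 + rng.normal(size=(50, 4)).cumsum(axis=0),
    columns=["FIUIX", "FSAIX", "FSAVX", "FSTMX"],
)
returns = prices.pct_change().dropna()  # drop the NaN first row

x = returns["FSTMX"]
resids = {}
for column in returns:
    if column == "FSTMX":
        continue  # do not regress the benchmark on itself
    y = returns[column]
    beta = (x * y).sum() / (x * x).sum()  # OLS slope with no intercept
    resids[column] = y - beta * x

resids = pd.DataFrame(resids)
print(resids.head())
```

With statsmodels installed, the two lines computing beta and the residual could presumably be swapped for sm.OLS(y, x).fit().resid as in the question.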

The Unfun Cat

I seem to only get back the column header when I use this method. So for example: print(df) shows me the data in the dataframe columns but for c in df: print(c) only prints the header not the data. – Reddspark Mar 20 '17 at 09:30
Ok ignore me -- I was doing print(column) not print (df[column]) – Reddspark Mar 20 '17 at 11:26
Watch out for columns with the same name! – freethebees Aug 29 '17 at 13:53
It's nice and concise. I'd expect `for x in df` to iterate over rows, though. :-/ – Eric Duminil Jan 29 '18 at 14:47
`for idx, row in df.iterrows()` iterates over rows. Since column-based operations are vectorized, it is natural that the main iteration is over columns :) – The Unfun Cat Jan 30 '18 at 10:52
why isn't there `df.itercols()` for iterating over columns instead? – develarist Feb 03 '21 at 23:46
@develarist It seems like `df.itercols()` is called `df.iteritems()`. See https://stackoverflow.com/a/36372667/2470337 answer below. – Dr_Zaszuś Jul 12 '21 at 12:14
Beware: `df[column]` can be a `DataFrame` instead of a `Series` if you have columns with duplicated names. – loicgasser Jun 30 '22 at 16:48
Beware: this only iterates over column _names_, not columns. (As such, it answers the OP's detailed question, but NOT the headline title that they used!) – dsz Aug 23 '22 at 05:45

score 123 · Answer 2 · edited Nov 25 '18 at 22:53

You can use iteritems():

for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))

edited Nov 25 '18 at 22:53

mmBs

answered Apr 02 '16 at 11:31

mdh

Great answer. By the way, `df.iteritems()` can be also written as `df.items()` giving the same result. – Dr_Zaszuś Feb 01 '22 at 18:20
In fact, pandas >= 2.0 only has `.items()` but no `.iteritems()`. – Gregor Sturm Aug 22 '23 at 11:24
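
Following the two comments above, a minimal sketch of the same loop written with `.items()` might look like this. The DataFrame is made up for illustration, and `values.iloc[0]` is used instead of `values[0]` on the assumption that the first entry by position is what is wanted, whatever the index labels are.

```python
import pandas as pd

# Hypothetical data, just to have something to iterate over.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

first_values = {}
for name, values in df.items():
    # `values` is the whole column as a Series; .iloc[0] is its first entry by position
    first_values[name] = values.iloc[0]
    print("{name}: {value}".format(name=name, value=values.iloc[0]))
```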

Abhinav Gupta · Answer 3 · 2018-09-13T22:18:57.220

This answer shows how to iterate over selected columns as well as all columns in a DataFrame.

df.columns returns an Index containing all the column names in the DataFrame. That isn't very helpful if you want to iterate over all the columns, since iterating over the DataFrame itself already does that. But it comes in handy when you want to iterate over only the columns of your choosing.

We can slice df.columns just like a Python list. For example, to iterate over all columns but the first one, we can do:

for column in df.columns[1:]:
    print(df[column])

Similarly to iterate over all the columns in reversed order, we can do:

for column in df.columns[::-1]:
    print(df[column])

We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:

for ind, column in enumerate(df.columns):
    print(ind, column)
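
Slicing selects columns by position. For the case in the question, where one named column has to be left out, a sketch along the following lines may be more convenient. It uses `Index.drop` on `df.columns`; the data and the choice of "FSTMX" as the column to skip are assumptions for illustration.

```python
import pandas as pd

# Hypothetical returns data with the benchmark in the last column.
df = pd.DataFrame(
    {"FIUIX": [0.1, 0.2], "FSAIX": [0.3, 0.4], "FSTMX": [0.5, 0.6]}
)

# Index.drop returns a new Index without the given label; df itself is unchanged.
others = df.columns.drop("FSTMX")

visited = []
for column in others:
    visited.append(column)
    print(column, df[column].mean())
```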

score 22 · Answer 4 · answered Jan 29 '15 at 15:51

You can index dataframe columns by the position using ix.

df1.ix[:,1]

This returns the first column for example. (0 would be the index)

df1.ix[0,]

This returns the first row.

df1.ix[:,1]

This would be the value at the intersection of row 0 and column 1:

df1.ix[0,1]

and so on. So you can call enumerate() on returns.keys() and use the resulting number to index the DataFrame.

JAB

`ix` is deprecated, use `iloc` – Yohan Obadia Feb 08 '18 at 10:47
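
As the comment says, `ix` is gone from current pandas. A rough `iloc` equivalent of the calls above, on a made-up DataFrame, could look like the sketch below. Note that positions start at 0, so position 1 is the second column of the frame.

```python
import pandas as pd

# Hypothetical frame standing in for df1.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

second_col = df1.iloc[:, 1]  # column at position 1
first_row = df1.iloc[0]      # row at position 0
cell = df1.iloc[0, 1]        # value at row 0, column 1

# enumerate() gives the position to pass to iloc for each column
for pos, name in enumerate(df1.columns):
    print(pos, name, df1.iloc[:, pos].tolist())
```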

score 17 · Answer 5 · answered Jul 22 '15 at 17:40

A workaround is to transpose the DataFrame and iterate over the rows.

for column_name, column in df.transpose().iterrows():
    print(column_name)

kdauria

Transposition is rather expensive :) – The Unfun Cat Sep 23 '18 at 08:33
Might be expensive, but this is a great solution for relatively small dataframes. Thanks kdauria! – elPastor Feb 11 '20 at 21:12
I guess this suggestion is deprecated. With recent versions of pandas, better use [DataFrame.items()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html). Also, transposition may lead to data type conversions if the DataFrame consists of different dtypes. – normanius Apr 11 '22 at 14:48
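
The dtype point in the last comment can be seen with a small sketch. The frame below is invented for illustration; it compares the dtype each column has when iterated with `items()` against what comes back after transposing.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical frame with one integer column and one string column.
df = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})

# Iterating with items() yields each column with its own dtype.
dtypes_items = {name: col.dtype for name, col in df.items()}

# Transposing first forces a common dtype, so the integer column comes back as object.
dtypes_transposed = {name: col.dtype for name, col in df.T.iterrows()}

print(dtypes_items)
print(dtypes_transposed)
```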

score 10 · Answer 6 · answered Mar 22 '17 at 22:38

Using a list comprehension, you can get all the column names (headers):

[column for column in df]

MEhsan

Shorter version: `list(df.columns)` or `[c for c in df]` – The Unfun Cat Mar 26 '18 at 08:07

score 8 · Answer 7 · answered Apr 23 '18 at 17:36

Based on the accepted answer, if an index corresponding to each column is also desired:

for i, column in enumerate(df):
    print(i, df[column])

The above df[column] type is Series, which can simply be converted into numpy ndarrays:

import numpy as np
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))

Gaurav · Answer 8 · 2017-04-29T04:09:48.607

I'm a bit late but here's how I did this. The steps:

1. Create a list of all columns.
2. Use itertools to take combinations of x columns.
3. Append each result's R-squared value to a result DataFrame, along with the list of excluded columns.
4. Sort the result DF in descending order of R-squared to see which is the best fit.

This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.

import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

import statsmodels.formula.api as smf
import itertools

# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
print(len(itercols))

# collect one result row per combination
# (DataFrame.append was removed in pandas 2.0, so build the results DF at the end)
rows = []

# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    excluded = "+".join(y for y in itercols if y not in x)
    rows.append([f.rsquared, lmstr, excluded])

# results DF
regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])

regression_res.sort_values(by="Rsq", ascending = False)

score 0 · Answer 9 · answered Feb 04 '21 at 13:38

I landed on this question as I was looking for a clean iterator of columns only (Series, no names).

Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:

x, y = df[['x', 'y']]  # does not work

There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.

One (inelegant) way is to do:

x, y = (v for _, v in df[['x', 'y']].items())

but it's less pythonic than I'd like.

Hey @Pierre D I came across your answer & was looking for something similar. I don't know if [this link](https://stackoverflow.com/questions/68200351/fastest-way-to-iterate-pandas-series-column) helps or not but it may be worth a look. — JC23, Jun 30 '21 at 20:15
I had a similar [question](https://stackoverflow.com/questions/51225275/) regarding the assignment. `x, y = df[["x", "y"]].T.values` works. — normanius, Apr 11 '22 at 14:53
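
For the assignment case discussed in this answer, one more option is to index by name inside a generator expression, which avoids discarding the name half of each `items()` tuple. This is only a sketch on invented data; the `to_numpy().T` line is a variation on the `.T.values` trick from the comment above and yields NumPy arrays rather than Series.

```python
import pandas as pd

# Hypothetical frame with more columns than we want to unpack.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

# Unpack two columns as Series, in the order the names are listed.
x, y = (df[name] for name in ["x", "y"])

# Same idea, but as plain NumPy arrays (one array per column).
x_arr, y_arr = df[["x", "y"]].to_numpy().T

print(x.tolist(), y.tolist())
```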

score 0 · Answer 10 · answered Nov 07 '22 at 01:31

Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:

for series in (df.iloc[:,i] for i in range(df.shape[1])):
   ...

dsz

Good point about iterating over columns rather than names but you can do it using `items` as said in answers above: `for _, col in data_df.items():` – Jérôme May 11 '23 at 12:54
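
The duplicate-name problem mentioned in this answer is easy to reproduce. In the sketch below (made-up data with two columns both called "a"), lookup by name returns a DataFrame holding both columns, while positional access with `iloc` returns one Series per column.

```python
import pandas as pd

# Hypothetical frame with a duplicated column name.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

by_name = df["a"]  # both "a" columns come back together as a DataFrame

# Positional access yields exactly one Series per column, duplicates included.
by_position = [df.iloc[:, i] for i in range(df.shape[1])]

print(type(by_name).__name__, by_name.shape)
print([s.tolist() for s in by_position])
```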

score -1 · Answer 11 · answered Dec 12 '21 at 09:20

Assuming X holds the factors (features) and y the labels, with the labels spread over multiple columns:

columns = [c for c in _df.columns if c in ['col1', 'col2','col3']]  #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)

X, y =  _df.iloc[:,:4].values, _df.index.values

JeeyCi
