5

I have some relatively simple code that I'm struggling to put together. I have a CSV that I've read into a dataframe. The CSV is panel data (i.e., unique company and year observations for each row). I have two columns that I want to perform a function on and then I want to create new variables based on the output of the function.

Here's what I have so far with code:

# Loop through the rows of the dataframe
for index, rows in df.iterrows():
    # Iterate over the values from column 6 onward
    for row in rows[6:]:
        data = perform_function1(row)
        output = perform_function2(data)
        df.ix[index, 'new_variable'] = output
        print output

I want this code to start at column 6 and iterate to the end of the file (e.g., I have two columns I want to perform the function on, Column6 and Column7), and then create new columns based on the functions that were performed (e.g., Output6 and Output7). The code above returns the output for Column7, but I can't figure out how to capture the outputs from both columns (i.e., a new variable that isn't overwritten on each pass of the loop). I searched Stack Overflow and didn't see anything that immediately related to my question (maybe because I'm too big of a noob?). I would really appreciate your help.

Thanks,

TT

P.S. I'm not sure if I've provided enough detail. Please let me know if I need to provide more.

TaterTots
  • Can you give an example of your input and desired output? By "a new variable" do you actually mean "a new column name"? – BrenBarn Jun 12 '15 at 20:04
  • Yes--a new column name. My input is a paragraph of text. The function calculates a readability statistic based on the input text. So, the new variable (or new column) would basically be the readability statistic (Flesch–Kincaid Score) based on the text. – TaterTots Jun 12 '15 at 20:07
  • `for row in rows[6:]` is confusing, because `rows` is a single row of the dataset and `for row in rows[6:]` is iterating the columns actually. – GeauxEric Jun 13 '15 at 01:03

3 Answers

4

Operating iteratively doesn't take advantage of Pandas' capabilities. Pandas' strength is in applying operations efficiently across the whole dataframe, rather than in iterating row by row. It's great for a task like this where you want to chain a few functions across your data. You should be able to accomplish your whole task in a single line.

df["new_variable"] = df.iloc[:, 6:].apply(perform_function1).apply(perform_function2)

perform_function1 will be applied to each column in that slice, and perform_function2 will be applied to the results of the first function.
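As the comment thread below notes, the functions here expect individual strings rather than whole rows or columns. In that case the chained `apply` pattern still works on a single column (a Series), where each element is passed through both functions in turn. A minimal sketch with stand-in functions (the real `perform_function1`/`perform_function2` are not shown in the question):

```python
import pandas as pd

# stand-ins for perform_function1/perform_function2, which here
# take a single string and return a single value
def perform_function1(text):
    return text.split()      # e.g. tokenize the text

def perform_function2(tokens):
    return len(tokens)       # e.g. compute a score from the tokens

s = pd.Series(["some text here", "a b c"])
# chained apply: each element flows through both functions
result = s.apply(perform_function1).apply(perform_function2)
print(result.tolist())  # [3, 3]
```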

ASGM
  • Thanks! I tried this approach and received the following error: TypeError: ('expected string or buffer', u'occurred at index CaseNum') I believe it's because my functions were written to handle individual strings (iterating down a column) rather than applying it to an entire row. – TaterTots Jun 13 '15 at 02:47
  • Wait, are both functions meant to apply to individual strings instead of a whole row, or just `perform_function1`? Maybe it would help to include your functions (if they're not too complicated). – ASGM Jun 15 '15 at 10:26
  • The functions are kind of complicated. They're meant to apply to individual strings. – TaterTots Jun 15 '15 at 15:25
  • @TaterTots what do they return? Does `perform_function1` return a single value or multiple values? How about `perform_function2`? – ASGM Jun 16 '15 at 00:03
  • Both of the functions return single values. The answer to this problem was really simple. I just had to make a dynamic variable name by adding a counter and adding this code in the second for loop: `name = df.columns[i+6]` `df.ix[index, 'new_var' + '_' + str(name)] = perform_function2` Thanks for the help! – TaterTots Jun 23 '15 at 04:01
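The counter-based fix TaterTots describes in the last comment can be sketched in full like this (the column names, data, and function bodies are hypothetical stand-ins):

```python
import pandas as pd

def perform_function1(text):
    return text.split()   # stand-in, e.g. tokenize

def perform_function2(tokens):
    return len(tokens)    # stand-in, e.g. score the tokens

df = pd.DataFrame({
    "Column6": ["some text here", "more text"],
    "Column7": ["a b c", "d e"],
})

# one new output column per input column, so nothing is overwritten
for name in list(df.columns):
    df["new_var_" + str(name)] = (
        df[name].apply(perform_function1).apply(perform_function2)
    )

print(df["new_var_Column6"].tolist())  # [3, 2]
print(df["new_var_Column7"].tolist())  # [3, 2]
```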
1

If you want to apply a function to certain columns in a dataframe:

# Get the Series for the sixth column
column6 = df.iloc[:, 5]
# perform_function1 applied to each element
output6 = column6.apply(perform_function1)
df["new_variable"] = output6
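A usage sketch of the same pattern, selecting a column positionally with `iloc` and applying a per-string function (the data and the lambda are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"CaseNum": [1, 2], "Text": ["a b c", "d e"]})

# select the second column (position 1) as a Series
column2 = df.iloc[:, 1]
# apply a per-string function to each element
df["new_variable"] = column2.apply(lambda s: len(s.split()))
print(df["new_variable"].tolist())  # [3, 2]
```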
GeauxEric
0

Pandas is quite slow acting row-by-row: you're much better off using the append, concat, merge, or join functionalities on the whole dataframe.

To give some idea why, let's consider a random DataFrame example:

import numpy as np
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df2 = df.copy()
# operation to concatenate two dataframes
%timeit pd.concat([df2, df])
1000 loops, best of 3: 737 µs per loop
# row lookup by label
%timeit df.loc['2013-01-01']
1000 loops, best of 3: 251 µs per loop
# single element operation
%timeit df.loc['2013-01-01', 'A'] = 3
1000 loops, best of 3: 218 µs per loop

Notice how efficiently Pandas handles whole-DataFrame operations, and how inefficiently it handles operations on single elements?

If we expand this, the same tendency occurs, only much more pronounced:

df = pd.DataFrame(np.random.randn(200, 300))
# single element operation
%timeit df.loc[1,1] = 3
10000 loops, best of 3: 74.6 µs per loop
df2 = df.copy()
# full dataframe operation
%timeit pd.concat([df2, df])
1000 loops, best of 3: 830 µs per loop

Per element, Pandas performs the operation on the whole 200x300 DataFrame roughly 6,000 times faster than it performs the single-element operation. In short, element-by-element iteration defeats the whole purpose of using Pandas. If you need to access a dataframe element by element, consider using a dictionary instead.
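To see the same principle without machine-dependent timings, a whole-column (vectorized) operation produces the same result as an element-by-element loop, with one operation instead of one `.loc` lookup per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# element-by-element: one .loc lookup per row (slow)
squared_loop = [df.loc[i, "A"] ** 2 for i in df.index]

# whole-column: a single vectorized operation (fast)
squared_vec = df["A"] ** 2

print(squared_vec.tolist())  # [0, 9, 36, 81]
assert squared_vec.tolist() == squared_loop
```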

Alex Huszagh