I have a pandas dataframe that looks like this:
ID1  ID2  Len1  Date1   Type1  Len2  Date2   Type2  Len_Diff  Date_Diff  Score
123  456        1-Apr   M            6-Apr   L
234  567        20-Apr  S            19-Apr  S
345  678        10-Apr  M            1-Jan   M
I want to fill in the columns Len1, Len2, Len_Diff and Date_Diff by calculating them from the dataset. Each ID corresponds to a text file; its text can be retrieved with a get_text function, and the length of that text can then be calculated.
As of now, I have code that can do this individually for each column:
def len_text(key):
    text = get_text(key)
    return len(text)

df['Len1'] = df['ID1'].map(len_text)
df['Len2'] = df['ID2'].map(len_text)
df['Len_Diff'] = (df['Len1'] - df['Len2']).abs()
df['Date_Diff'] = (df['Date1'] - df['Date2']).abs()
df['Same_Type'] = np.where(df['Type1'] == df['Type2'], 1, 0)
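(As an aside, the Date_Diff line only works if Date1/Date2 are already datetime or numeric; if they arrive as strings like 1-Apr, they would first need pd.to_datetime. A minimal sketch with toy data:)

```python
import pandas as pd

# Toy frame; in the real data Date1/Date2 arrive as strings like "1-Apr".
df = pd.DataFrame({'Date1': ['1-Apr', '20-Apr'],
                   'Date2': ['6-Apr', '19-Apr']})

# Parse the day-month strings into datetimes so subtraction yields Timedeltas.
d1 = pd.to_datetime(df['Date1'], format='%d-%b')
d2 = pd.to_datetime(df['Date2'], format='%d-%b')
df['Date_Diff'] = (d1 - d2).abs().dt.days
```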
How can I add all these columns to the dataframe in one step? I want them in one step so that I can wrap the code in a single try/except block to handle ValueErrors raised when the text fails to decode.
try:
    <code to add all five columns at once>
except ValueError:
    print "Failed to decode"
Wrapping each line above in its own try/except block would make the code ugly.
There are other questions, such as Changing certain values in multiple columns of a pandas DataFrame at once, that deal with multiple columns, but they are all about one calculation/change affecting multiple columns. What I want is different calculations adding different columns.
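One way to keep a single try/except is to build all the derived columns in one assign() call. A sketch with toy data, where get_text is a hypothetical stand-in for the real function (assumed to possibly raise ValueError):

```python
import numpy as np
import pandas as pd

def get_text(key):
    # Hypothetical stand-in for the real get_text; may raise ValueError.
    return 'x' * key

df = pd.DataFrame({'ID1': [3, 5], 'ID2': [4, 2],
                   'Type1': ['M', 'S'], 'Type2': ['L', 'S']})

try:
    # All five derived columns are built here; any ValueError raised by
    # get_text is caught by the single except below.
    lens1 = df['ID1'].map(lambda k: len(get_text(k)))
    lens2 = df['ID2'].map(lambda k: len(get_text(k)))
    df = df.assign(Len1=lens1,
                   Len2=lens2,
                   Len_Diff=(lens1 - lens2).abs(),
                   Same_Type=np.where(df['Type1'] == df['Type2'], 1, 0))
except ValueError:
    print("Failed to decode")
```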
UPDATE: From the answers given below, I tried two different ways to approach the problem, with partial luck so far. Here's what I did:
Approach 1:
# Add calculated columns Len1, Len2, Len_Diff, Date_Diff and Same_Type
def len_text(key):
    try:
        text = get_text(key)
        return len(text)
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError,
            requests.exceptions.Timeout, ValueError) as e:
        return 0
df.loc[:, ['Len1','Len2','Len_Diff','Date_Diff','Same_Type']] = pd.DataFrame([
    df['ID1'].map(len_text),
    df['ID2'].map(len_text),
    np.abs(df['ID1'].map(len_text) - df['ID2'].map(len_text)),
    np.abs(df['Date1'] - df['Date2']),
    np.where(df['Type1'] == df['Type2'], 1, 0)
])
print df.info()
Result1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 10 columns):
ID1          570 non-null int64
Date1        570 non-null int64
Type1        566 non-null object
Len1         0 non-null float64
ID2          570 non-null int64
Date2        570 non-null int64
Type2        570 non-null object
Len2         0 non-null float64
Date_Diff    0 non-null float64
Len_Diff     0 non-null float64
dtypes: float64(4), int64(4), object(2)
memory usage: 58.0+ KB
None
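The all-NaN columns seem to come from how the frame is assembled: pd.DataFrame([s1, s2, ...]) stacks the Series as rows, so the result has the original row index as its columns, and the subsequent .loc assignment aligns on labels, finds none in common, and fills NaN. A small demonstration with toy data, using pd.concat to attach the Series as columns instead:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
s1 = df['A'] * 10
s2 = df['A'] + 1

# A list of Series becomes the ROWS of the new frame: shape (2, 3) here,
# with the original row index (0, 1, 2) as its columns.
stacked = pd.DataFrame([s1, s2])
assert stacked.shape == (2, 3)

# A dict of Series keeps each Series as a COLUMN; concatenating along
# axis=1 aligns on the row index, so no NaNs appear.
df = pd.concat([df, pd.DataFrame({'B': s1, 'C': s2})], axis=1)
```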
Approach2:
def len_text(col):
    try:
        return col.map(get_text).str.len()
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError,
            requests.exceptions.Timeout, ValueError) as e:
        return 0
formulas = """
Len1 = @len_text(ID1)
Len2 = @len_text(ID2)
Len_Diff = Len1 - Len2
Len_Diff = Len_Diff.abs()
Same_Type = (Type1 == Type2) * 1
"""
try:
    df.eval(formulas, inplace=True, engine='python')
except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError,
        requests.exceptions.Timeout, ValueError) as e:
    print e
print df.info()
Result2:
"__pd_eval_local_len_text" is not a supported function <class 'pandas.core.frame.DataFrame'> RangeIndex: 570 entries, 0 to 569 df columns (total 7 columns): ID1 570 non-null int64 Date1 570 non-null int64 Type1 566 non-null object ID2 570 non-null int64 Date2 570 non-null int64 Type2 570 non-null object Len1 570 non-null int64 dtypes: int64(5), object(2) memory usage: 31.2+ KB None /Users/.../anaconda2/lib/python2.7/site-packages/pandas/computation/eval.py:289: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy target[parsed_expr.assigner] = ret