I have a pandas dataframe that looks like this:
ID1  ID2  Len1  Date1   Type1  Len2  Date2   Type2  Len_Diff  Date_Diff  Score
123  456        1-Apr   M            6-Apr   L
234  567        20-Apr  S            19-Apr  S
345  678        10-Apr  M            1-Jan   M
I want to fill in the columns Len1, Len2, Len_Diff and Date_Diff by calculating them from the dataset. Each ID corresponds to a text file; its text can be retrieved with a get_text function, and the length of that text can then be calculated.
As of now, I have code that can do this individually for each column:
def len_text(key):
    text = get_text(key)
    return len(text)

df['Len1'] = df['ID1'].map(len_text)
df['Len2'] = df['ID2'].map(len_text)
df['Len_Diff'] = (df['Len1'] - df['Len2']).abs()
df['Date_Diff'] = (df['Date1'] - df['Date2']).abs()
df['Same_Type'] = np.where(df['Type1'] == df['Type2'], 1, 0)
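(As an aside, the Date_Diff line only works if Date1/Date2 are already datetime or numeric; if they arrive as strings like 1-Apr, they would first need pd.to_datetime. A minimal sketch with toy data:)

```python
import pandas as pd

# Toy frame; in the real data Date1/Date2 arrive as strings like "1-Apr".
df = pd.DataFrame({'Date1': ['1-Apr', '20-Apr'],
                   'Date2': ['6-Apr', '19-Apr']})

# Parse the day-month strings into datetimes so subtraction yields Timedeltas.
d1 = pd.to_datetime(df['Date1'], format='%d-%b')
d2 = pd.to_datetime(df['Date2'], format='%d-%b')
df['Date_Diff'] = (d1 - d2).abs().dt.days
```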
How can I add all these columns to the dataframe in one step? I want them in one step so that I can wrap the code in a single try/except block to handle ValueErrors raised when the text fails to decode.
try:
    <code to add all five columns at once>
except ValueError:
    print "Failed to decode"
Wrapping each line above in its own try/except block would make the code ugly.
There are other questions, such as Changing certain values in multiple columns of a pandas DataFrame at once, that deal with multiple columns, but they are all about one calculation/change affecting multiple columns. What I want is different calculations adding different columns.
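One way to keep a single try/except is to build all the derived columns in one assign() call. A sketch with toy data, where get_text is a hypothetical stand-in for the real function (assumed to possibly raise ValueError):

```python
import numpy as np
import pandas as pd

def get_text(key):
    # Hypothetical stand-in for the real get_text; may raise ValueError.
    return 'x' * key

df = pd.DataFrame({'ID1': [3, 5], 'ID2': [4, 2],
                   'Type1': ['M', 'S'], 'Type2': ['L', 'S']})

try:
    # All five derived columns are built here; any ValueError raised by
    # get_text is caught by the single except below.
    lens1 = df['ID1'].map(lambda k: len(get_text(k)))
    lens2 = df['ID2'].map(lambda k: len(get_text(k)))
    df = df.assign(Len1=lens1,
                   Len2=lens2,
                   Len_Diff=(lens1 - lens2).abs(),
                   Same_Type=np.where(df['Type1'] == df['Type2'], 1, 0))
except ValueError:
    print("Failed to decode")
```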
UPDATE: From the answers given below, I tried two different ways to approach the problem, with partial luck so far. Here's what I did:
Approach 1:
# Add calculated columns Len1, Len2, Len_Diff, Date_Diff and Same_Type
def len_text(key):
    try:
        text = get_text(key)
        return len(text)
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError,
            requests.exceptions.Timeout, ValueError) as e:
        return 0
df.loc[:, ['Len1','Len2','Len_Diff','Date_Diff','Same_Type']] = pd.DataFrame([
    df['ID1'].map(len_text),
    df['ID2'].map(len_text),
    np.abs(df['ID1'].map(len_text) - df['ID2'].map(len_text)),
    np.abs(df['Date1'] - df['Date2']),
    np.where(df['Type1'] == df['Type2'], 1, 0)
])
print df.info()
Result1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 10 columns):
ID1          570 non-null int64
Date1        570 non-null int64
Type1        566 non-null object
Len1         0 non-null float64
ID2          570 non-null int64
Date2        570 non-null int64
Type2        570 non-null object
Len2         0 non-null float64
Date_Diff    0 non-null float64
Len_Diff     0 non-null float64
dtypes: float64(4), int64(4), object(2)
memory usage: 58.0+ KB
None
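The all-NaN columns seem to come from how the frame is assembled: pd.DataFrame([s1, s2, ...]) stacks the Series as rows, so the result has the original row index as its columns, and the subsequent .loc assignment aligns on labels, finds none in common, and fills NaN. A small demonstration with toy data, using pd.concat to attach the Series as columns instead:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
s1 = df['A'] * 10
s2 = df['A'] + 1

# A list of Series becomes the ROWS of the new frame: shape (2, 3) here,
# with the original row index (0, 1, 2) as its columns.
stacked = pd.DataFrame([s1, s2])
assert stacked.shape == (2, 3)

# A dict of Series keeps each Series as a COLUMN; concatenating along
# axis=1 aligns on the row index, so no NaNs appear.
df = pd.concat([df, pd.DataFrame({'B': s1, 'C': s2})], axis=1)
```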
Approach2:
def len_text(col):
    try:
        return col.map(get_text).str.len()
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError,
            requests.exceptions.Timeout, ValueError) as e:
        return 0
formulas = """
Len1 = @len_text(ID1)
Len2 = @len_text(ID2)
Len_Diff = Len1 - Len2
Len_Diff = Len_Diff.abs()
Same_Type = (Type1 == Type2) * 1
"""
try:
    df.eval(formulas, inplace=True, engine='python')
except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError,
        requests.exceptions.Timeout, ValueError) as e:
    print e
print df.info()
Result2:
"__pd_eval_local_len_text" is not a supported function <class 'pandas.core.frame.DataFrame'> RangeIndex: 570 entries, 0 to 569 df columns (total 7 columns): ID1 570 non-null int64 Date1 570 non-null int64 Type1 566 non-null object ID2 570 non-null int64 Date2 570 non-null int64 Type2 570 non-null object Len1 570 non-null int64 dtypes: int64(5), object(2) memory usage: 31.2+ KB None /Users/.../anaconda2/lib/python2.7/site-packages/pandas/computation/eval.py:289: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy target[parsed_expr.assigner] = ret