12

I am very new to Pandas (i.e., less than 2 days). However, I can't seem to figure out the right syntax for combining two columns with an if/else condition.

Actually, I did figure out one way to do it using 'zip'. This is what I want to accomplish, but it seems there might be a more efficient way to do this in pandas.

For completeness sake, I include some pre-processing I do to make things clear:

records_data = pd.read_csv(open('records.csv'))

## pull out a year from column using a regex
source_years = records_data['source'].map(extract_year_from_source) 

## this is what I want to do more efficiently (if its possible)
records_data['year'] = [s if s else y for (s,y) in zip(source_years, records_data['year'])]
pocketfullofcheese
  • 8,427
  • 9
  • 41
  • 57
  • I wonder whether you can write `records_data['year'] = records_data['source'] if records_data['source'] else records_data['year']`. – minopret Nov 28 '12 at 01:56
  • nope! --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () ----> 1 x=records_data['source'] if records_data['source'] else records_data['year'] ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() – pocketfullofcheese Nov 28 '12 at 01:59
  • I'm looking at [Combining overlapping data sets](http://pandas.pydata.org/pandas-docs/stable/basics.html#combining-overlapping-data-sets). Maybe you can figure that out faster than I can. I was merely intrigued by your question. I have no familiarity with Pandas. – minopret Nov 28 '12 at 02:05
  • I imagine your zip operation can be eliminated by the technique of [How to slice by columns in pandas](http://stackoverflow.com/questions/10665889/how-to-slice-by-columns-in-pandas). But I still think there's a better way that one could figure out from the documentation that I linked. – minopret Nov 28 '12 at 02:16
  • seems that you could use `parse_dates` and an index col to get your dates parsed by `read_csv`. btw you do not need to `open` your file, when you pass it to `read_csv`. Please provide some example data! – bmu Nov 28 '12 at 07:39

2 Answers2

17

In pandas >= 0.10.0 try

df['year'] = df['year'].where(source_years!=0,df['year'])

and see:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking

As noted in the comments, this DOES use np.where under the hood - the difference is that pandas aligns the series with the output (so for example you can only do a partial update)

denfromufa
  • 5,610
  • 13
  • 81
  • 138
Jeff
  • 125,376
  • 21
  • 220
  • 187
10

Perhaps try np.where:

import numpy as np
df['year'] = np.where(source_years,source_years,df['year'])
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • This is great! Unlike pandas' `.where` this allows me to specify only a scalar and it'll do the dirty work of filling in the series itself. A contrived example: `df['c'] = np.where(df.a > df.b, 1, -1)` – fantabolous Aug 18 '14 at 03:34