-1

I am trying to filter some data and am running into errors. Here is the code I have:

 url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
 source = requests.get(url).text
 s = StringIO(source)
 election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
        convert_dates="coerce", convert_numeric=True)
 election_data.head(n=3)
 last_day = max(election_data["Start Date"])
 filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]

As you can see, last_day is the maximum of the "Start Date" column in election_data. I would like to keep only the rows where the difference between that maximum and the row's start date is less than or equal to 5 days. I have tried for loops and various combinations of list comprehensions.

 filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]

This line would normally work; however, Python 3 gives me the following error:

 <map object at 0x10798a2b0> 
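
For context: in Python 3, `map` returns a lazy iterator rather than a list, and pandas does not accept that iterator as a boolean mask. A minimal sketch of the difference, using a small made-up frame (the dates are illustrative and not taken from the poll data):

```python
import pandas as pd

# Hypothetical stand-in for the poll data; only "Start Date" matters here.
df = pd.DataFrame({"Start Date": pd.to_datetime(
    ["2012-11-04", "2012-11-01", "2012-10-20", "2012-10-30"])})
last_day = df["Start Date"].max()

# Python 3: map() is lazy, so materialize it into a list of booleans first.
mask = list(map(lambda x: (last_day - x).days <= 5, df["Start Date"]))

# A plain list of booleans is a valid row mask.
filtered = df[mask]
print(filtered)
```

Here `.days` works because iterating over the column yields scalar Timestamp values, so each difference is a scalar Timedelta.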
KatieRose1029

2 Answers

0

Your first attempt is almost right. The issue is

(last_day - election_data['Start Date']).days

which should instead be

(last_day - election_data['Start Date']).dt.days

The subtraction returns a Series of timedeltas, and a Series has no days attribute; that attribute lives on scalar Timedelta and TimedeltaIndex objects, while a Series exposes it through the .dt accessor. A fully working example is below.

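import pandas as pd
# url is the CSV address from the question; parse the date columns while loading
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]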

Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
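
The `.dt.days` filter can also be tried without downloading anything. The sketch below assumes a small synthetic frame: the `Start Date` column name matches the question, while the dates and the `poll_id` column are made up for illustration.

```python
import pandas as pd

# Synthetic data standing in for the downloaded CSV (values are illustrative).
data = pd.DataFrame({
    "Start Date": pd.to_datetime(
        ["2012-11-04", "2012-11-01", "2012-10-20", "2012-10-30"]),
    "poll_id": [1, 2, 3, 4],  # hypothetical column, not in the real data
})

# Timestamp minus a datetime Series gives a Series of timedelta64[ns].
age = data["Start Date"].max() - data["Start Date"]

# .dt.days extracts the whole-day component of each timedelta.
recent = data.loc[age.dt.days <= 5]

# Alternative mask: compare the timedeltas against a Timedelta directly.
# For date-only values (no time of day) this selects the same rows.
recent_alt = data.loc[age <= pd.Timedelta(days=5)]
print(recent)
```

The two masks can differ when the timestamps carry a time of day, because `.dt.days` drops the fractional day while the `Timedelta` comparison does not.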

Igor Raush
-1

If I understand your question correctly, you just want to keep the rows whose Start Date is at most 5 days before the last day. This sounds like something pandas indexing could easily handle, using .loc.

If you want an entirely new DataFrame object with the filtered data:

# election_data is your frame
last_day = max(election_data["Start Date"])
date = last_day - pd.Timedelta(days=5)  # cutoff date: 5 days before the last day
new_df = election_data.loc[election_data["Start Date"] >= date]

Or if you just want the Start Date column post-filtering:

last_day = max(election_data["Start Date"])
date = last_day - pd.Timedelta(days=5)  # cutoff date: 5 days before the last day
filtered_dates = election_data.loc[election_data["Start Date"] >= date, "Start Date"]

Note that your date variable must be comparable with the Start Date column, i.e. a datetime value rather than a plain string or integer. If you want to set it by hand, print(last_day) and count 5 days back.
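
If Start Date arrives as text, it has to be converted before a date comparison makes sense. A minimal sketch, assuming string input with one deliberately unparseable entry (all values are made up) and a hand-written cutoff:

```python
import pandas as pd

# Hypothetical raw column as it might look before any date parsing.
raw = pd.DataFrame({"Start Date": [
    "2012-11-04", "2012-11-01", "not a date", "2012-10-30"]})

# errors="coerce" turns unparseable entries into NaT instead of raising.
raw["Start Date"] = pd.to_datetime(raw["Start Date"], errors="coerce")

# Hand-written cutoff; a Timestamp compares cleanly against a datetime column.
date = pd.Timestamp("2012-10-30")

# NaT never satisfies the comparison, so the bad row is dropped as well.
kept = raw.loc[raw["Start Date"] >= date]
print(kept)
```

Coercing to NaT mirrors the `convert_dates="coerce"` intent in the question; whether silently dropping bad rows is acceptable depends on the data.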

semore_1267
  • `last_day-election_data["Start Date"]<=5` is not a valid comparison. The left side is a `Series` of `timedelta64[ns]` which cannot be compared against an integer. – Igor Raush Dec 07 '16 at 22:46
  • Good note @IgorRaush. Totally forgot about the formatting of the dates. Updated the answer. – semore_1267 Dec 07 '16 at 22:55