2

SO, I have been trying to edit this script in order to define how many days ago the code starts working for each row.

The code I am using currently:

import csv
import datetime
import copy
from collections import defaultdict

with open(r"C:\Temp\test.csv") as i, open(r"C:\Temp\resuls.csv", "wb") as o:
    rdr = csv.reader(i)
    wrt = csv.writer(o)

    data, currdate = defaultdict(lambda:[0, 0, 0, 0]), None
    for line in rdr:
        date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]

        if date != currdate or not currdate:
            for v in data.itervalues(): v[:2] = v[2:]
            currdate = date

        wrt.writerow(line + data[name][:2])

        data[name][3] += 1
        if line[6] == "1": data[name][2] += 1

The data this works on:

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips

And produces:

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,1
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,3
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,2,7
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,5

It is doing the following and appending as two new columns:

  1. Counting the number of times a person appears in the list on dates prior to the date specified in the row and a 1 occurs in column 7.
  2. The number of times a person (column 8) appears in the list on dates prior to the date specified in the row (note the source data are sorted chronologically.)

Currently my source CSV contains years of historic data, what I need to be able to do is to limit the amount of time it remembers the data for. I am not sure if the best way to achieve this is by editing the original code or by re writing the whole thing. For instance if I wanted to perform the same counts for the last 365 days only or the last 60 days only. Any suggestions of a method or edit to achieve this?

AEA
  • 213
  • 2
  • 12
  • 34
  • It might be easier to put the data into a database (e.g. sqlite) and query what you want from there. – georg Nov 21 '13 at 00:42
  • Yeh I did initially consider this, but the amount of data manipulations and standardisations has led to me keeping it in csv format. – AEA Nov 21 '13 at 00:58
  • is the question being efficient and only loading what you need from the csv, or is it what is a good way to easily query by date in python? – Ryan Saxe Nov 21 '13 at 01:09
  • @RyanSaxe I am sorry Ryan, I do not follow? – AEA Nov 21 '13 at 01:11
  • @AEA are you more concerned with a nice API for dates and a "Pythonic" way of writing this, or is it more that you do not want to load such a large csv into python? – Ryan Saxe Nov 21 '13 at 01:12
  • @RyanSaxe The former :) – AEA Nov 21 '13 at 01:20

2 Answers2

4

If I understand your requirement correctly, you want to calculate 1_occurrence (in Column 7) & name_occurrence (Column 8) in the last N days and ignore any data prior to Nth Day

As you want to limit the search by N of days, you need to maintain date wise data. If you do that, your script would still work with some tweak

You can do something like this

def f5(last_N_days):
    with open("data.csv") as i, open("resuls.csv", "wb") as o:
        rdr = csv.reader(i)
        wrt = csv.writer(o)
        data, currdate = defaultdict(lambda:[0, 0,0,0]), None
        cumulativedata = defaultdict(lambda:defaultdict(lambda:[0, 0,0,0]))
        for line in rdr:
            date, name = datetime.datetime.strptime(line[0], '%d/%m/%Y'), line[7]

            if date != currdate or not currdate:
                for k,v in data.iteritems():
                    cumulativedata[currdate][k] = list(v[2:])
                    v[:2]=[v[2] - cumulativedata[clean_before][k][0],
                                 v[3] - cumulativedata[clean_before][k][1]]
                clean_before = date + datetime.timedelta(days=-(last_N_days))
                currdate = date

            wrt.writerow(line + data[name][:2])

            data[name][3] += 1

            if line[6] == "1":
                data[name][2] += 1

f5(1) # Calculate only for last 1 day

which produces the following

02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class tn2,4,10yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,0
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,0,2
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,1
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,1,3
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,0,0
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,2

Note : I assumed every date has an entry. If this is not true (if any gap), you may want to tweak the script accordingly

Just a thought:

If the reading huge historical data is hurting you, then you can limit the csv data read to 2*N_days (N1...Nn,Nn+1...N2n where Nn+1...N2n is your working block)

a solution is proposed here

then workout your search logic within this buffer

Community
  • 1
  • 1
user2390183
  • 975
  • 8
  • 17
  • v[:2] was not used earlier, moved arithmetic calculation to the loop to avoid being calculated for every written. Please check the updated code above. – user2390183 Nov 24 '13 at 09:56
  • I can get this script working perfectly with the example, however I simplified my example a little. (not much) However I cannot get this code to work within my real script. (for some reason) I will include my attempt to apply the code in my question. – AEA Nov 25 '13 at 03:48
  • please see the edit I have made to the post with regards to my issues. – AEA Nov 25 '13 at 04:03
  • As I can understand from your modified script, you are handling header and doing additional calculation for extra columns. This should not impact other logic. Is it possible for you to share input file, to reproduce the problem? – user2390183 Nov 25 '13 at 11:01
  • I cannot share the exact input file but I can create an example. On it now. – AEA Nov 25 '13 at 12:48
  • I tried running the script for 50K+ rows (as your original input). I didnt face any issue. [Input here](http://www.sendspace.com/file/oc4xxb) and [Output here](http://www.sendspace.com/file/dyrwvn) . Looks like you got input in different format now, which means you got different requirement than you originally posted ;). I cant comment on that unless you share the input that you are currently working on – user2390183 Nov 25 '13 at 18:08
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/41894/discussion-between-user2390183-and-aea) – user2390183 Nov 25 '13 at 19:28
  • I am not sure if you get notified of stuff posted in chat, I haven't used it before. But I have put some content in chat. I will add it to the question in case anyone else decides to join the party. – AEA Nov 25 '13 at 22:23
  • I have changed my attempt at applying your methodology to my code in the question as well as a link to pastebin with a csv for which I had hoped said code would work. Kind regards AEA – AEA Nov 26 '13 at 02:34
  • Looks like you are looping unnecessarily. I have commented in chat discussion – user2390183 Nov 26 '13 at 15:07
  • I got it fixed thanks to your comments, I must have been half asleep when I tried to get that working. Great answer! Thanks so much for the help :) – AEA Nov 27 '13 at 00:19
  • No worries. Glad that I could help! – user2390183 Nov 27 '13 at 09:48
0

If you really don't want to use a database, and load the whole csv, I would use pandas for this kind of querying (note,(possibly much) slower than SQL querying)

Here is how you would load the csv and get only the last 60 days:

import pandas as pd
import datetime

today = datetime.date.today()
sixty_days_ago = today - datetime.timedelta(days=60)

df = pd.io.parsers.read_csv("your_csv.csv")
#assume column name is date
condition = (df.date >= sixty_days_ago) & (df.date <= now)

df_between_correct_dates = df[condition]

now df_between_correct_dates will be a pandas DataFrame that includes only rows from your csv that had a date in the date column that is no older than 60 days ago!

Let me know if this does not solve your problem.

Ryan Saxe
  • 17,123
  • 23
  • 80
  • 128
  • Hello Ryan, I feel either my comments or the original question have mislead you about what I want to achieve. I will add some more content at the bottom of my question to clarify the objective of the question. – AEA Nov 21 '13 at 01:42
  • @AEA sorry I actually completely forgot about this haha...looking over your clarification, I am still not 100% sure what you want. Please comment in your code what does what and what doesn't do what you want. I understand that you want something to do with a specific timeframe, but do you also want to do things with the other times? it's not fully clear – Ryan Saxe Nov 23 '13 at 17:29