-2

my data is formatted as such:

Username, Timestamp, Text
Joe Bloggs, Thu Oct 5 09:00:00 +0000 2017, Starting work
Jane Doe, Fri Oct 7 18:00:00 +0000 2017, Finished work
Tom Smith, Sat Oct 8 04:00:00 +0000 2017, Still coding this thing

I have a CSV of 5M rows like this, and I'd like to extract just those within 9am-5pm Mon-Fri.

I've read plenty of posts about dummy data and row-by-row extraction, but I'd like to actually filter the dataset as a whole and the examples are either incomplete or confusing to non-experts.

EDIT:

Thanks to @ivan7707 for the answer. Here is my completed code, I didn't include anything at the start as I knew my code was wildly wrong. (I was having issues with %z so resorted to splitting.)

import csv
from datetime import datetime
main_file = csv.DictReader(open("source.csv","rb"))
for row in main_file: #points to csv
    username = row['Username']
    text = row['Text']
    timestamp = row['Timestamp']

    #Convert timestamp to useable format
    timestamp = timestamp.split()
    timestamp = (timestamp[2] + "-" + timestamp[1] + "-" + timestamp[5] + " " + timestamp[3])
    dt = datetime.strptime(timestamp, "%d-%b-%Y %H:%M:%S")

    if dt.isoweekday() in range(1, 6): #If day is Mon-Fri    
        if dt.hour in range(9, 17): #If hour is 9am-5pm
            output_file.writerow([username,text,timestamp]) #Save

EDIT 2:

Following the conversation ivan7707 and I had in the comments, here is the code that adds a week number to the data:

import csv
from datetime import datetime
main_file = csv.DictReader(open("source.csv","rb"))
for row in main_file:
    username = row['Username']
    text = row['Text']
    timestamp = row['Timestamp']

    #Convert timestamp to usable format as it was erroring with %z (+0000 part)
    timestamp = timestamp.split()
    timestamp = (timestamp[2] + "-" + timestamp[1] + "-" + timestamp[5] + " " + timestamp[3])
    dt = datetime.strptime(timestamp, "%d-%b-%Y %H:%M:%S")

    #Check if timestamp is within Mon-Fri 9am-5pm   
    if dt.isoweekday() in range(1, 6): #Mon-Fri
        if dt.hour in range(9, 17): #9am-5pm                
            weekday_list.append(week)
            output_file.writerow([username,text,timestamp,week]) #Writes to csv

    #Handy bit to iterate one week per 5 business days        
    elif dt.isoweekday() == 7:
        if len(weekday_list) > 1:
            weekday_list = []
            week += 1

Output for weekly script

Username, Timestamp, Text, Week,
Joe Bloggs, 06-10-2017 16:59:59, Hello World!, 1
Jane Doe,  09-10-2017 09:00:01, Hello!, 2
Wyatt Shipman
  • 1,669
  • 1
  • 10
  • 22
Nick B
  • 25
  • 1
  • 1
  • 6
  • Sounds good. Why don't you have a shot at it, then come back to use with any specific problem you're having with your attempt, and we'll be happy to help. – Jean-François Corbett Oct 06 '17 at 10:53
  • Thanks for the response. I've solved the issue thanks to ivan7707 below, though if you know how to save a csv after each week and month that would be really helpful. – Nick B Oct 06 '17 at 11:33

1 Answers1

0

Python's datetime module is your friend here. This should be enough to get you going.

example:

from datetime import datetime


dt = datetime.strptime("21/11/06 16:59", "%d/%m/%y %H:%M")

if dt.isoweekday() in range(1, 6):
    print('weekday')

if dt.hour in range(9, 17:00):
    print('working time')

Be aware of the right number in range not being included Good stack overflow answer

ivan7707
  • 1,146
  • 1
  • 13
  • 24
  • Thanks for your example, these two commands (.isoweekday() and .hour) were all I needed but for the life of me I couldn't find them. – Nick B Oct 06 '17 at 11:11
  • Any clues on how to adapt this to save a csv after each week and month? – Nick B Oct 06 '17 at 11:24
  • @NickB, Hi Nick, I am glad that you were able to figure it out. If this answered your question, please mark it as such. What you are looking for is how to schedule tasks. On windows I use task scheduler to run scripts that I need to run on a schedule. On Linux, cron jobs is the way. If you have issues with that, it should be setup as a separate question. – ivan7707 Oct 06 '17 at 11:47
  • ah that's not quite what I mean - if the timestamps comprise a 5-day week I'd like to export that to a csv. Months I can get around by using the basic month number (i.e. 10 for October). – Nick B Oct 06 '17 at 12:04
  • Hmm, still don't understand. – ivan7707 Oct 06 '17 at 12:18
  • Where there are sequences of "Mon, Tue, Wed, Thu, Fri, Sat, Sun", I'd like to output each of the the Mon-Fri sequences to a csv. With my current code in the first post I can save the entire dataset, but would like to segment it into weekly chunks. – Nick B Oct 06 '17 at 12:46
  • In your current code structure, try creating a list, appending to that list if it is within your criteria, if dt.isoweekday() == 5: 'create csv here' and reset the list. – ivan7707 Oct 06 '17 at 13:24
  • It took a lot of head scratching but I figured it out - updated the code above. Thanks for all your help! – Nick B Oct 06 '17 at 15:04
  • Pain is where the learning happens. Well done for figuring it out. – ivan7707 Oct 06 '17 at 22:41