0

I am new to this and probably might be a naive question to even ask. I want to generate a random dataset with some constraints :

date_1 - already generated in csv (Dated from 1 august 2018- 1 august 2019)
date_2 - 60% of the data lies within the 30 days from the date_1 and 40% of the data lies within 90 days of the date_2.  

capacity_1 - 3500 kgs is the threshold for a day. Cannot exceed the same for date_2 
capacity_2 - leftout weight for the day. its 3500-capacity_1 for a particular day.

The date_1 format that I have is d/m/y

Can anyone advise me as to how to achieve the other columns as well. I am planning to build the dummy data with 100,000 rows.

Edit : Attaching the csv file for the data here

EDIT2 : The input would look like :

date_1   

01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018
01/08/2018

Expected Output :

enter image description here

Here capacity_2 would be 3500-capacity_1 for a particular date_2. capacity_2 basically would give the idea of how much out of 3500 has been used for a particular date.

Thanks

dper
  • 884
  • 1
  • 8
  • 31
  • @AndiDomi Can you help me with a pseudo code for a smaller entries. I didn't get the explained part. – dper Aug 26 '19 at 08:36
  • Sorry for the first unclear comment, an update: I would load the csv with [pandas](https://stackoverflow.com/questions/17465045/can-pandas-automatically-recognize-dates) Then I would take 60% of the first rows in [pandas](https://stackoverflow.com/questions/12021754/how-to-slice-a-pandas-data-frame-by-position), by taking the total number of rows then mulitpy that number by 0.6 getting df[ : first_60] then take df[first_60 : ] for the rest (40%). – Andi Domi Aug 26 '19 at 08:42
  • Then for each column depending on the dataset above i would use [randrange(30)](https://stackoverflow.com/questions/3996904/generate-random-integers-between-0-and-9) or `randrange(90)` to generate a day with [timedelta](https://stackoverflow.com/questions/6871016/adding-5-days-to-a-date-in-python) – Andi Domi Aug 26 '19 at 08:42
  • @AndiDomi I did the splitting of the dataset : `df1 = df[:60000] df2 = df[60000:100000]` How would we generate a day with timedelta, I am not able to get. Could you help ? – dper Aug 27 '19 at 10:22
  • it would help us to have also some rows from your csv file (at least some dummy data to work with) as we dont really know how your data looks like and how the endresult should look. – Andi Domi Aug 27 '19 at 14:17
  • @AndiDomi I have edited the question and provided the link to the csv file... https://drive.google.com/open?id=1rIOvbE4LGuTJh86lQvaSVbpRa5KsSaLH Here is the link as well to the csv file. Can you please help with the same. thanks – dper Aug 28 '19 at 09:12

1 Answers1

0

Have a look at this example:

import pandas as pd
import datetime
import random

data = [
  [1,'a','b','01/08/2018'],
  [2,'a','b','01/08/2018'],
  [3,'a','b','01/08/2018'],
  [4,'a','b','01/08/2018'],
  [5,'a','b','01/08/2018'],
  [6,'a','b','01/08/2018'],
  [7,'a','b','01/08/2018'],
  [8,'a','b','01/08/2018'],
  [9,'a','b','01/08/2018'],
  [10,'a','b','01/08/2018']
  ]
# this should be crated by reading the csv file
df = pd.DataFrame(data) 

df = pd.to_datetime(df[0:][3], format='%d/%m/%Y')

generated_data = []
for index, date_1 in df.iteritems():
  # if in the first 60% generate +-30 days
  if index <= len(df) * 0.6 :
    days_to_add = random.randint(-30,30)
  else : 
    days_to_add = random.randint(-90,90)

  # add that day to the date_1
  date_2 = date_1 + datetime.timedelta(days=days_to_add)
  weight = random.randint(0,3500)
  capacity_1 = random.randint(0, 3500 - weight)
  capacity_2 = 3500 - capacity_1
  generated_data.append([date_1, date_2, weight, capacity_1, capacity_2])

df2 = pd.DataFrame(generated_data)
print(df2)

First we remove from the extract CSV only the third column. Then we use generate the days to add to the date then we date_1 and create date_2. capacity_1 is created by taking 3500-weight as the ceiling, the rest is easy. The output looks like:

0 2018-08-01 2018-07-06  3136    11  3489
1 2018-08-01 2018-08-11  3476    13  3487
2 2018-08-01 2018-07-28  2620   207  3293
3 2018-08-01 2018-07-22  1976  1437  2063
4 2018-08-01 2018-07-06  3057    19  3481
5 2018-08-01 2018-08-19   773  1481  2019
6 2018-08-01 2018-08-06   823  2002  1498
7 2018-08-01 2018-07-01  1166  2200  1300
8 2018-08-01 2018-10-22   156  2567   933
9 2018-08-01 2018-06-18  1248  1842  1658
Andi Domi
  • 731
  • 2
  • 19
  • 48
  • Thanks for your response, I think you misunderstood what I am trying to say, 1.The dates need to be selected randomly and not just index based, as it would result in only initial values assigned to the 30 days. Also, the rule would be for ex- if there are 20 entries for date_2 in 2018-01-01, the summation of capacity_1 should not exceed the total of 3500. it should be 3500 or less. and then capacity_2 = 1-capacity_1 for each day combined. – dper Aug 29 '19 at 09:36
  • Not really clear what your expected output is. Should capacity_1 be updated after each date? Meaning if we have 1 date repeating twice, how would capacity_1 look like. Is it a floating number? Please update your question with an expected output example. – Andi Domi Aug 29 '19 at 11:21
  • I have edited the question with the input and expected output. Thanks – dper Aug 30 '19 at 03:24