I have a dataframe df that looks like this:

df:

id   dob
1    7/31/2018
2    6/1992

I want to generate 88799 random dates between 1960-01-01 and 1990-12-31 to fill the dob column of the dataframe, keeping the format mm/dd/yyyy with no timestamp.

How would I do this?

I tried:

date1 = (1960,01,01)
date2 = (1990,12,31)

for i range(date1,date2):
    df.dob = i
RustyShackleford

1 Answer

I would figure out how many days are in your date range, then draw 88799 random integers in that range, and finally add them to your minimum date as a timedelta with unit='d':

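import numpy as np
import pandas as pd

min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')

# Number of days in the range, counting both endpoints
d = (max_date - min_date).days + 1

# pd.np was removed in pandas 2.0, so call NumPy directly.
# df must already have 88799 rows for this assignment to line up.
df['dob'] = min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')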

>>> df.head()
         dob
0 1963-03-05
1 1973-06-07
2 1970-08-24
3 1970-05-03
4 1971-07-03

>>> df.tail()
             dob
88794 1965-12-10
88795 1968-08-09
88796 1988-04-29
88797 1971-07-27
88798 1980-08-03
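
The same idea can be wrapped in a small helper. This is only a sketch: the function name `random_dates`, its `seed` parameter, and the sequential `id` column are assumptions made for illustration, and it uses NumPy's `default_rng` generator instead of the `np.random.randint` call shown above.

```python
import numpy as np
import pandas as pd


def random_dates(start, end, n, seed=None):
    """Return n random dates between start and end, both inclusive.

    Hypothetical helper, not part of pandas.
    """
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    # Days in the range, counting both endpoints
    n_days = (end - start).days + 1
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so offsets run from 0 to n_days - 1
    offsets = rng.integers(0, n_days, size=n)
    return start + pd.to_timedelta(offsets, unit="D")


dates = random_dates("1960-01-01", "1990-12-31", n=88799, seed=0)

# Build a frame of matching length; the id values here are made up
df = pd.DataFrame({"id": np.arange(1, len(dates) + 1), "dob": dates})
```

Passing a `seed` makes the draw reproducible, which helps when the generated column feeds into tests.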

EDIT: You can format your dates using .strftime('%m/%d/%Y'), but note that this will slow down the execution significantly and that the column will then hold strings rather than datetimes:

# Requires `import numpy as np`; pd.np was removed in pandas 2.0
df['dob'] = (min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')).strftime('%m/%d/%Y')

>>> df.head()
          dob
0  02/26/1969
1  04/09/1963
2  08/29/1984
3  02/12/1961
4  08/02/1988
>>> df.tail()
              dob
88794  02/13/1968
88795  02/05/1982
88796  07/03/1964
88797  06/11/1976
88798  11/17/1965
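
If you still need date arithmetic or sorting later, one option is to keep the datetime column and put the formatted text in a separate column. The sketch below assumes a smaller sample of 1000 rows and a hypothetical column name `dob_str`; neither comes from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
start = pd.Timestamp("1960-01-01")
n_days = (pd.Timestamp("1990-12-31") - start).days + 1

# Real datetimes, as in the first version of the answer
dob = start + pd.to_timedelta(rng.integers(0, n_days, size=1000), unit="D")
df = pd.DataFrame({"dob": dob})

# Text copy in mm/dd/yyyy form; the original column stays a datetime
df["dob_str"] = df["dob"].dt.strftime("%m/%d/%Y")

# The strings can be parsed back when needed
parsed = pd.to_datetime(df["dob_str"], format="%m/%d/%Y")
```

When the format only matters in an exported file, `df.to_csv(..., date_format='%m/%d/%Y')` is another way to get it without converting the column at all.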
sacuL
  • Could use `strftime` to format the date as the OP asked. – Kevin Fang Oct 29 '18 at 22:41
  • @sacuL Thank you. How could I format the date on the fly? – RustyShackleford Oct 29 '18 at 22:41
  • @sacuL Could I please check a couple of points? In the line with `pd.np.random.randint`, do we need the `pd.` prefix, or could we just write `np.random.randint`? I couldn't see any difference in my result either way. Also, for the line `d = (max_date - min_date).days + 1`, can you explain the use of `.days` here? I understand that we are using days as the unit of time (hence `unit='d'` later in the code), but I don't fully understand why I need `.days` when `d` is just the maximum integer value for `randint`. My code fails if I don't include it. Many thanks – mmTmmR Aug 06 '19 at 15:12
  • @mmTmmR `pd.np.random.randint` is there just so you don't have to explicitly import numpy with `import numpy as np`, but it is exactly the same as `np.random.randint` *if* you have imported numpy already. As for `d = (max_date - min_date).days + 1`, that gives the upper bound of the valid integers. `max_date - min_date` is a Timedelta, not a number, and `.days` returns the number of days in that range as an integer, which is what `randint` needs. – sacuL Aug 06 '19 at 15:22
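
The `.days` point from the last two comments can be checked directly. This is a minimal sketch reusing the answer's variable names; the name `span` is introduced here only for illustration.

```python
import pandas as pd

min_date = pd.to_datetime("1960-01-01")
max_date = pd.to_datetime("1990-12-31")

# Subtracting two timestamps gives a pandas Timedelta, not an int
span = max_date - min_date

# .days extracts a plain integer; +1 so that max_date itself can be drawn,
# because randint excludes its upper bound
d = span.days + 1

print(type(span).__name__)  # Timedelta
print(d)                    # 11323
```

With `d = 11323`, the random offsets run from 0 to 11322 days, and adding 11322 days to `min_date` lands exactly on 1990-12-31.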