108

I have a pandas data frame with 50k rows. I'm trying to add a new column that is a randomly generated integer from 1 to 5.

If I want 50k random numbers I'd use:

import random
df1['randNumCol'] = random.sample(range(50000), len(df1))

but for this I'm not sure how to do it.

Side note: in R, I'd do:

sample(1:5, 50000, replace = TRUE)

Any suggestions?

smci
screechOwl
  • In pandas/numpy, there is a direct function `np.random.randint(low, high, size)`. No need to actually generate the range `low:high` and sample from it, as we do in R. – smci Apr 07 '17 at 09:32

3 Answers

162

One solution is to use numpy.random.randint:

import numpy as np
df1['randNumCol'] = np.random.randint(1, 6, df1.shape[0])
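As a quick check of the bounds, here is a minimal sketch. It assumes a hypothetical 50,000-row df1 (the placeholder column x is made up for illustration). The upper bound of np.random.randint is exclusive, so passing 6 yields values from 1 to 5.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 50k-row frame from the question.
df1 = pd.DataFrame({"x": range(50000)})

# low is inclusive, high is exclusive: values fall in 1..5.
df1["randNumCol"] = np.random.randint(1, 6, df1.shape[0])

# Smallest and largest value actually drawn.
lo, hi = df1["randNumCol"].min(), df1["randNumCol"].max()
```

Inspecting lo and hi (or df1['randNumCol'].value_counts()) is an easy way to confirm the range before relying on the column.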

Or, if the numbers are non-consecutive, you can use numpy.random.choice (albeit slower):

df1['randNumCol'] = np.random.choice([1, 9, 20], df1.shape[0])

To make the results reproducible, you can set the seed with numpy.random.seed (e.g. np.random.seed(42)).
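A minimal sketch of what seeding buys you: resetting the same seed before each draw gives identical values. The second half uses np.random.default_rng, which is not part of the answer above; it is shown only as an alternative that keeps the seed local to one generator object instead of NumPy's global state.

```python
import numpy as np

# Global seed, as suggested above: same seed, same draws.
np.random.seed(42)
first = np.random.randint(1, 6, 10)

np.random.seed(42)  # reset to the same seed
second = np.random.randint(1, 6, 10)

same = np.array_equal(first, second)

# Alternative (an assumption, not from the answer): a local Generator.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
a = rng_a.integers(1, 6, size=10)  # high is exclusive here too
b = rng_b.integers(1, 6, size=10)
```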

Matt
36

To add a column of random integers, use randint(low, high, size). There's no need to waste memory materializing range(low, high) and sampling from it; in Python 2.x, range built a full list, which could take a lot of memory if high is large.

df1['randNumCol'] = np.random.randint(1, 6, size=len(df1))

Notes:

smci
4

An option that doesn't require an additional import for numpy:

df1['randNumCol'] = pd.Series(range(1,6)).sample(int(5e4), replace=True).array
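A variant of the same idea that sizes the sample from the frame instead of hard-coding 50,000, sketched under a few assumptions: df1 here is a hypothetical 1,000-row stand-in, random_state is added only for reproducibility, and .to_numpy() is swapped in for .array. Either one drops the sampled index, so the values are assigned by position rather than aligned by label.

```python
import pandas as pd

# Hypothetical stand-in frame; the column x is made up for illustration.
df1 = pd.DataFrame({"x": range(1000)})

values = pd.Series(range(1, 6))  # the integers 1..5

# Sample with replacement, one draw per row of df1.
# random_state makes the draw repeatable (optional).
draws = values.sample(len(df1), replace=True, random_state=0)

# Drop the sampled index so assignment is positional.
df1["randNumCol"] = draws.to_numpy()
```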
shortorian