Sampling rows with sample size greater than length of DataFrame

Question

I've been asked to generate a new variable from an existing one. Specifically, I need to draw values at random (using the random function) from the original variable, so that the new one has at least 10x as many observations as the old one, and then save the result as a new variable.

This is my dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv

The variable I want to work with is area.

This is my attempt, but it gives me a "'module' object is not callable" error:

import pandas as pd
import random as rand

dataFrame = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")

area = dataFrame['area']

random_area = rand(area)

print(random_area)

possible duplicate of https://stackoverflow.com/questions/15923826/random-row-selection-in-pandas-dataframe — Vignesh Krishnan, Jan 05 '19 at 13:24
@vigneshkrishnan Sorry but that does not answer this question, the sample size is greater than the population size. Thanks. — cs95, Jan 05 '19 at 13:27

cs95 · Accepted Answer · 2019-01-05T13:56:56.690

3

You can use the sample function with replace=True:

df = df.sample(n=len(df) * 10, replace=True)

Or, to sample only the area column, use:

area = df.area.sample(n=len(df) * 10, replace=True)
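As a minimal sketch of why replace=True matters here, the snippet below uses a small made-up Series as a stand-in for the area column (the values are not from the real dataset). Without replacement, pandas refuses a sample larger than the data; with replacement, it works. The random_state argument is an optional addition, used only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the 'area' column (values are made up)
area = pd.Series([0.0, 1.5, 3.2, 10.7], name="area")

# Without replacement, asking for more rows than exist raises ValueError
try:
    area.sample(n=len(area) * 10)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# With replacement, the sample may be larger than the data;
# random_state makes the draw repeatable
new_area = area.sample(n=len(area) * 10, replace=True, random_state=0)

print(oversample_failed)  # True
print(len(new_area))      # 40
```

Every value in new_area comes from the original Series; only the number of observations changes.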

Another option uses np.random.choice (with numpy imported as np) and would look something like:

df = df.iloc[np.random.choice(len(df), len(df) * 10)]

The idea is to generate random integer positions in the range 0 to len(df) - 1. The first argument, len(df), is the exclusive upper bound, and the second, len(df) * 10, is the number of positions to generate. Since np.random.choice samples with replacement by default, positions can repeat. We then use the generated positions to index into df with iloc.

If you only want the area column, the following is sufficient:
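To make the index generation concrete, here is a small sketch on a hypothetical toy frame (the column names and values are made up for illustration). It checks that every generated position is a valid row number before passing them to iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are only for illustration
df = pd.DataFrame({"area": [0.0, 1.5, 3.2], "temp": [18.0, 21.5, 25.1]})

# Draw len(df) * 10 integer positions from range(len(df));
# np.random.choice samples with replacement by default
positions = np.random.choice(len(df), len(df) * 10)

# Every position is a valid row number, so iloc can use them directly
in_range = bool(((positions >= 0) & (positions < len(df))).all())
resampled = df.iloc[positions]

print(in_range)         # True
print(resampled.shape)  # (30, 2)
```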

area = df.iloc[np.random.choice(len(df), len(df) * 10), df.columns.get_loc('area')]

Index.get_loc converts the "area" label to its integer position, which iloc requires because it indexes by position rather than by label.
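A short sketch of what get_loc returns, again on a hypothetical toy frame with made-up values. It also compares the positional lookup against selecting the column by label first, which should give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'area' is deliberately the second column
df = pd.DataFrame({"temp": [18.0, 21.5, 25.1], "area": [0.0, 1.5, 3.2]})

# Integer position of the 'area' column
col_pos = df.columns.get_loc("area")
print(col_pos)  # 1

positions = np.random.choice(len(df), len(df) * 10)

# Purely positional lookup: row positions and column position
by_position = df.iloc[positions, col_pos]

# Equivalent: pick the column by label, then the rows by position
by_label = df["area"].iloc[positions]

same_result = by_position.equals(by_label)
print(same_result)
```

Selecting the column by label first is arguably easier to read; the get_loc form keeps everything in a single iloc call.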

df = pd.DataFrame({'A': list('aab'), 'B': list('123')})
df
   A  B
0  a  1
1  a  2
2  b  3

# Sample 3 times the original size
df.sample(n=len(df) * 3, replace=True)

   A  B
2  b  3
1  a  2
1  a  2
2  b  3
1  a  2
0  a  1
0  a  1
2  b  3
2  b  3

df.iloc[np.random.choice(len(df), len(df) * 3)]

   A  B
0  a  1
1  a  2
1  a  2
0  a  1
2  b  3
0  a  1
0  a  1
0  a  1
2  b  3

edited Jan 05 '19 at 13:56

answered Jan 05 '19 at 13:19


Btw. About the code: `df.sample(n=len(df) * 10, replace=True)`. Why are you multiplying the entire dataset by 10? – Jan 05 '19 at 13:28
@OnurOzbek I am not multiplying the dataset by 10, I am specifying the sample size to be len(df) times 10 since your requirement was " have at least 10x as many observations as the old one" – cs95 Jan 05 '19 at 13:29
@OnurOzbek, Re: "You need to explain the syntax". There are better ways to *request help from volunteers*. For example, "please can you explain how `iloc` and `get_loc` work"? You're lucky coldspeed has responded, I would be *less* likely to respond to such a comment from a user. – jpp Jan 05 '19 at 13:38
Thanks, @coldspeed. I've already accepted your answer. I appreciate it. – Jan 05 '19 at 14:10
