0

The table below has summary statistics about the expense for each leader and expense type. I have the stable stored in python as a multi-index data frame object. My goal is to generate random data for each of the leaders and expense type using the mean and standard deviation under each category (run code snippet below to get the table). There is a "count" column which represents how many random numbers I want to generate for each Leader-Expense_Type combination. I've came up with extensive and inefficient looping structures which do not seem to get the job done right. How should I approach this problem?

Note: This is just a sample of the data. There are many more leaders with just as many expense types.

<table border="1" class="dataframe">  <thead>    <tr>      <th></th>      <th></th>      <th colspan="3" halign="left">Expense_Amount</th>    </tr>    <tr>      <th></th>      <th></th>      <th>mean</th>      <th>std</th>      <th>count</th>    </tr>    <tr>      <th>Leader</th>      <th>Expense_Type</th>      <th></th>      <th></th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th rowspan="7" valign="top">Leader1</th>      <th>Airfare</th>      <td>1979.684219</td>      <td>2731.629767</td>      <td>1358</td>    </tr>    <tr>      <th>Booking Fees</th>      <td>118.994538</td>      <td>270.007390</td>      <td>1179</td>    </tr>    <tr>      <th>Conference/Seminars</th>      <td>1553.830923</td>      <td>1319.295946</td>      <td>65</td>    </tr>    <tr>      <th>Hotel</th>      <td>1656.643658</td>      <td>2104.721093</td>      <td>1405</td>    </tr>    <tr>      <th>Meals</th>      <td>435.665122</td>      <td>676.705857</td>      <td>1476</td>    </tr>    <tr>      <th>Mileage</th>      <td>213.785046</td>      <td>284.908031</td>      <td>979</td>    </tr>    <tr>      <th>Taxi/Uber</th>      <td>308.530724</td>      <td>380.288964</td>      <td>1422</td>    </tr>    <tr>      <th rowspan="7" valign="top">Leader2</th>      <th>Airfare</th>      <td>1730.196911</td>      <td>2334.688155</td>      <td>628</td>    </tr>    <tr>      <th>Booking Fees</th>      <td>112.020556</td>      <td>573.407269</td>      <td>576</td>    </tr>    <tr>      <th>Conference/Seminars</th>      <td>1647.576500</td>      <td>1154.320584</td>      <td>80</td>    </tr>    <tr>      <th>Hotel</th>      <td>1693.080356</td>      <td>1953.552474</td>      <td>618</td>    </tr>    <tr>      <th>Meals</th>      <td>574.228548</td>      <td>844.997595</td>      <td>620</td>    </tr>    <tr>      <th>Mileage</th>      <td>215.898798</td>      <td>291.231331</td>      <td>466</td>    </tr>    <tr>      <th>Taxi/Uber</th>      <td>298.655852</td>      <td>340.926518</td>      <td>569</td>    </tr>  </tbody></table>
Agarp
  • 433
  • 7
  • 15
  • I'm not sure there is way that avoids looping, each distribution is unique. I don't see any need for it to be an "extensive?" loop, you just need to call `np.random.normal()` (assuming normality) with the arguments given by the three columns, so your code can be quite concise. Perhaps post what you've attempted so we can see how to improve it. Then we at least have something to benchmark against. – ALollz Feb 27 '19 at 16:28
  • Thanks @ALollz, I posted my solution below. – Agarp Feb 27 '19 at 17:32

2 Answers2

0

You can use df.apply(your_function, axis=1) with

def your_function(df):
    mean = df['mean']
    std = df['std']
    result = mean  # Replace with your number generator
    return result

For more extensive description see this answer: How to apply a function to two columns of Pandas dataframe

ulmefors
  • 516
  • 3
  • 11
  • This does not work because it's a multi-index data frame. I cannot slice it by df["mean"] (or df["std"]). – Agarp Feb 27 '19 at 17:34
0

This is my solution:

# Dictionary to hold generated data
rand_expenses_dict = {}

# Loop over each unique leader
for leader in agg_data.index.get_level_values("Leader").unique():

# Loop over each unique expense type
for expense_type in agg_data.index.get_level_values("Expense_Type").unique():

    # Not al leaders have all expense types
    # The exception handling method will ignore expense types
    # That do not correspond to a leader
    try:

        # Generate random numbers
        rand = (np.round(
                        np.random.normal(
                            loc=agg_data.loc[leader, expense_type][0],
                            scale = agg_data.loc[leader, expense_type][1],
                            size  = int(agg_data.loc[leader, expense_type][2])
                        ),2))

        # Add random numbers to data dictionaty
        rand_expenses_dict[(leader,expense_type)] = rand

    # If it finds an error, go to the next expense
    except:
        pass
Agarp
  • 433
  • 7
  • 15