
I'm switching from R to Python. Unfortunately I'm stumbling upon a variety of loops which happen to run fast in my R scripts and too slow in Python (at least in my literal translations of such scripts). This code sample is one of them.

I'm slowly getting used to the idea that, when it comes to pandas, it's advisable to drop for loops and instead use vectorized functions and apply.

I need a few examples of how exactly to do this, since unfortunately my loops rely too much on classic subsetting, matching, and appending, operations that are too slow in their raw form.

import numpy as np
import pandas as pd

# Create two empty lists to append results during the loop
values = []
occurrences = []

#Create sample dataset, and sample series. It's just a sorted column (time series) and a column of random values:
time = np.arange(0,5000000,1)
variable = np.random.uniform(1,1000,5000000).round()
data = pd.DataFrame({'time' : time, 'variable':variable })

#Time datapoints to match
time_datapoints_to_match = np.random.uniform(0,5000000,200).round()

for i in time_datapoints_to_match:
    time_window = data[(data['time'] > i) & (data['time'] <= i+1000  )] #Subset a time window
    first_value_1pct = time_window['variable'].iloc[0] * 0.01 #extract 1/100 of the first value in time window
    try: #Check if we have a value which is lower than this 1/100 value within the time window
        first_occurence = time_window.loc[time_window['variable'] < first_value_1pct , 'time' ].iloc[0]
    except IndexError: #In case there are no matches, let's return NaN
        first_occurence = float('nan')
    values.append(first_value_1pct)
    occurrences.append(first_occurence)        

#Create DataFrame out of the two output lists
final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})
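For comparison, here is a sketch of one way to speed this up. It exploits the fact that in the sample data `time` is simply `0..N-1`, so each window `(i, i+1000]` can be taken by positional slicing of the raw NumPy array instead of building a 5-million-row boolean mask per iteration. This is an assumption about the real data (a sorted, gap-free integer time column); with irregular timestamps you would need `np.searchsorted` to find the slice bounds first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Same sample data as above: a contiguous time column and random values
time = np.arange(5_000_000)
variable = rng.uniform(1, 1000, 5_000_000).round()

# Sample 200 time points, kept away from the end so every window is full
points = rng.uniform(0, 5_000_000 - 1001, 200).round().astype(int)

values, occurrences = [], []
for i in points:
    # time in (i, i+1000] corresponds to positions i+1 .. i+1000
    window = variable[i + 1 : i + 1001]
    threshold = window[0] * 0.01           # 1/100 of the first value
    hits = np.nonzero(window < threshold)[0]
    occurrences.append(time[i + 1 + hits[0]] if hits.size else np.nan)
    values.append(threshold)

final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})
```

The loop is still there, but each iteration now touches 1,000 elements instead of scanning all 5,000,000, which is where most of the original runtime goes.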
  • Note: You're not actually using datetimes here, but the idea is the same; you should be able to [set an index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html) to allow efficient slicing. – ShadowRanger Jan 15 '20 at 02:50
  • @ShadowRanger This is more about efficient memory management. I want to find a way of subsetting a dataframe 200 or 500 times to perform a calculation and then extract a single value out of it again 200 or 500 times. This is time consuming. I don't see how that other thread answers this. – Nahuel Patiño Jan 15 '20 at 02:54
  • BTW, not responsive to your question, but both Python and R are represented in https://julialang.org/benchmarks/, which try to compare idiomatic code across various languages in the statistical-computing arena. (It's noteworthy that overall, R comes out slowest of the lot -- though Python is pretty pokey too). – Charles Duffy Jan 15 '20 at 03:00
  • @ShadowRanger I'm forced to open the same question again. – Nahuel Patiño Jan 15 '20 at 03:00
  • can you post a small portion of the dataframe (5-10 rows) with an expected output and just the LOGIC, so we can help with a vectorized solution. The for loop isn't easy to follow – Kenan Jan 15 '20 at 03:37

0 Answers