I'm switching from R to Python. Unfortunately I keep running into loops that run fast in my R scripts but far too slowly in Python, at least in my literal translations of those scripts. The code sample below is one of them.
I'm slowly getting used to the idea that, when it comes to pandas, it's advisable to drop for and while loops and instead use vectorized functions and apply. I need a few examples of how exactly to do this, since my loops rely heavily on classic subsetting, matching and appending, operations that are too slow in their raw form.
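A trivial case of what I understand by "vectorized" (the whole-column arithmetic part I get; it's the subsetting and matching pattern further down that I can't translate):

import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
doubled = df['x'] * 2                     # vectorized: one whole-column operation
doubled_slow = [v * 2 for v in df['x']]   # what I keep writing instead: a Python-level loop

Below is one of the offending loops, as a reproducible example: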
import numpy as np
import pandas as pd

# Two empty lists to append results to during the loop
values = []
occurrences = []

# Sample dataset: a sorted 'time' column (a time series) and a column of random values
time = np.arange(0, 5000000, 1)
variable = np.random.uniform(1, 1000, 5000000).round()
data = pd.DataFrame({'time': time, 'variable': variable})

# Time datapoints to match
time_datapoints_to_match = np.random.uniform(0, 5000000, 200).round()

for i in time_datapoints_to_match:
    # Subset a time window
    time_window = data[(data['time'] > i) & (data['time'] <= i + 1000)]
    # Extract 1/100 of the first value in the time window
    first_value_1pct = time_window['variable'].iloc[0] * 0.01
    # Check whether the window contains a value lower than this 1/100 value
    try:
        first_occurence = time_window.loc[time_window['variable'] < first_value_1pct, 'time'].iloc[0]
    except IndexError:  # no match: return NaN
        first_occurence = float('nan')
    values.append(first_value_1pct)
    occurrences.append(first_occurence)

# Create a DataFrame out of the two output lists
final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})
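From reading around, my only guess so far is to exploit the fact that 'time' is already sorted and use np.searchsorted to locate each window, instead of building a boolean mask over all 5 million rows for every timestamp. This is an untested sketch of that idea (it still loops over the 200 match points, only the per-iteration work shrinks, and it has the same edge case as above if a window starts past the end of the data), so I'm not sure it's the right direction:

# Untested sketch: locate each window via binary search on the sorted 'time' column
times = data['time'].to_numpy()
vals = data['variable'].to_numpy()

values = []
occurrences = []
for i in time_datapoints_to_match:
    start = np.searchsorted(times, i, side='right')        # first index with time > i
    end = np.searchsorted(times, i + 1000, side='right')   # first index with time > i + 1000
    threshold = vals[start] * 0.01                         # 1/100 of the first value in the window
    hits = np.nonzero(vals[start:end] < threshold)[0]      # positions below the threshold, if any
    values.append(threshold)
    occurrences.append(times[start + hits[0]] if hits.size else float('nan'))

final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})

Even if that is a sensible workaround, I would still like to see how the pandas-idiomatic tools (apply, or something like pd.merge_asof) are meant to handle this kind of windowed matching.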