So I have a Pandas DataFrame with several columns of float64 values. I'm trying to calculate a slope (the ratio between two columns), but only over a certain value range of one column (e.g. the column has 25000 rows, and I only want values ranging from 5 to 10, which happen to fall in rows 2000-4000). To do so, I was going to iterate in the way demonstrated by the following pseudocode:
for i in range(len(df['Column 1'])):
    if 5.0 <= df.loc[i, 'Column 1'] <= 10.0:
        df.loc[i, 'New Column'] = df.loc[i, 'Column 1'] / df.loc[i, 'Column 2']
Note: the above code isn't meant to work; more just an outline of what I am trying to accomplish
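For what it's worth, a vectorized version of the loop above might look like the following sketch. The column names (`'Column 1'`, `'Column 2'`, `'New Column'`) and the toy data are placeholders standing in for the real DataFrame; the idea is to build a boolean mask for the value range and divide only the masked rows:

```python
import pandas as pd

# Toy data standing in for the real 25000-row DataFrame
df = pd.DataFrame({
    'Column 1': [4.0, 6.0, 8.0, 12.0],
    'Column 2': [2.0, 3.0, 4.0, 6.0],
})

# Boolean mask: True where 'Column 1' lies in [5, 10] (inclusive)
mask = df['Column 1'].between(5.0, 10.0)

# Divide only the masked rows; rows outside the range are left as NaN
df.loc[mask, 'New Column'] = df.loc[mask, 'Column 1'] / df.loc[mask, 'Column 2']
```

This replaces the row-by-row loop with two whole-column operations, which is what the "avoid iteration" advice below is getting at.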
I was looking at ways to iterate through Pandas DataFrames, and came across this link: How to iterate over rows in a Pandas DataFrame.
One of the answers refers to much better ways of manipulating data besides brute iteration: "Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting." Thus, I want to vectorize my approach so I can manipulate multiple rows at a time to drastically decrease my runtime.
I was looking through other questions, and most answers are somewhat helpful, but I need help with the specifics of my particular problem. I think the bulk of what I am trying to accomplish can be summarized with the following list:
- Given a Pandas DataFrame that contains multiple columns, iterate through a single column.
- In the single column, iterate through a certain range of values (e.g. over the course of 10k rows where values increase from 1 to 100 from 1st row to 10kth row, only iterate over values 20-50).
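The two points above can both be expressed as a boolean-indexing selection rather than iteration. Here is a sketch using made-up data matching the example in the second bullet (values 1 to 100, keeping only 20-50); the column name `'A'` is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical column whose values increase from 1 to 100
df = pd.DataFrame({'A': np.arange(1, 101, dtype='float64')})

# Combine two element-wise comparisons with & (bitwise AND on boolean Series)
subset = df[(df['A'] >= 20) & (df['A'] <= 50)]
```

`subset` is then a view of just the in-range rows, and any further column arithmetic on it is vectorized automatically. Note that each comparison must be parenthesized, since `&` binds more tightly than `>=`.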
Sorry in advance for the repetitive nature of my question, I'm just really struggling with this particular problem in trying to create efficient iteration code.