0

Python newbie here. Here's a simplified example of my problem. I have 2 pandas dataframes.

One dataframe lightbulb_df has data on whether a light is on or off and looks something like this:

Light_Time Light On?
5790.76 0
5790.76 0
5790.771 1
5790.779 1
5790.779 1
5790.782 0
5790.783 1
5790.783 1
5790.784 0

Where the time is in seconds since start of day and 1 is the lightbulb is on, 0 means the lightbulb is off.

The second dataframe sensor_df shows whether or not a sensor detected the lightbulb and has different time values and rates.

Sensor_Time Sensor Detect?
5790.8 0
5790.9 0
5791.0 1
5791.1 1
5791.2 1
5791.3 0

Both dataframes are very large with 100,000s of rows. The lightbulb will turn on for a few minutes and then turn off, then back on, etc.

Using the .diff function, I was able to compare each row to its predecessor and depending on whether the result was 1 or -1 create a truth table with simplified on and off times and append it to lightbulb_df.

# use .diff() to compare each row to the last row
lightbulb_df['light_diff'] = lightbulb_df['Light On?'].diff()

# the light on start times are when 
#.diff is less than 0 (0 - 1 = -1)
light_start = lightbulb_df.loc[lightbulb_df['light_diff'] < 0]

# the light off start times (first times when light turns off)
# are when .diff is greater than 0 (1 - 0 = 1)
light_off = lightbulb_df.loc[lightbulb_df['light_diff'] > 0]

# and then I can concatenate them to have 
# a single changed state df that only captures when the lightbulb changes
lightbulb_changes = pd.concat((light_start, light_off)).sort_values(by=['Light_Time'])

So I end up with a dataframe of on start times, a dataframe of off start times, and a change state dataframe that looks like this.

Light_Time Light On? light_diff
5790.771 1 1
5790.782 0 -1
5790.783 1 1
5790.784 0 -1

Now my goal is to search the sensor_df dataframe during each of the changed state times (above 5790.771 to 5790.782 and 5790.783 to 5790.784) by 1 second intervals to see whether or not the sensor detected the lightbulb. So I want to end up with the number of seconds the lightbulb was on and the number of seconds the sensor detected the lightbulb for each of the many light on periods in the change state dataframe. I'm trying to get % correctly detected.

Whenever I try to plan this out, I end up using lots of nested for loops or while loops which I know will be really slow with 100,000s of rows of data. I thought about using the .cut function to divide up the dataframe into 1 second intervals. I made a for loop to cycle through each of the times in the changed state dataframe and then nested a while loop inside to loop through 1 second intervals but that seems like it would be really slow.

I know python has a lot of built in functions that could help but I'm having trouble knowing what to google to find the right one.

Any advice would be appreciated.

user3735930
  • 21
  • 1
  • 4
  • 1
    I think you can try to see the problem from a different perspective. If you find the start and end of the contiguous states [like in this question](https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array) then you can represent the data as intervals over the number line. Doing the same for the other dataframe, you just need to compute the intersection of both intervals [like here](https://stackoverflow.com/questions/40367461/intersection-of-two-lists-of-ranges-in-python) and split as you want (by second?) – Hugo Guillen Oct 26 '22 at 19:52

0 Answers0