
I am working on extracting sensor data recorded during the operation of a machine, stored as a large time series.

The data can be accessed through either a Google Sheets link or a Dropbox link.


The machine undergoes two distinct movements: upward and downward. I have a data frame that includes the following variables:

  • Date: timestamp with millisecond precision
  • Sensor 1
  • Sensor 2
  • UP: 1 when the machine is moving up, -1 otherwise
  • DOWN: 1 when the machine is moving down, -1 otherwise

I have been able to extract data using the "UP"/"DOWN" variables to identify machine movement, but I am running into time lags (shifts) in these variables: the machine begins its upward or downward movement before "UP" or "DOWN" flips to indicate the corresponding direction.

To illustrate the problem visually, please refer to the graph below:

[graph showing the sensor signals, with the UP/DOWN indicators lagging behind the actual movement]

My objective is to find a way to extract each complete movement (UP/DOWN) from start to finish without losing any data.

The solution could be in R or Python. Thanks in advance.
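
For reference, a minimal sketch of the kind of extraction I have been doing so far (a run-length split on the UP flag; the code is illustrative and assumes the data is loaded in a data frame `df` with the columns listed above):

library(dplyr)

up_moves <- df %>%
    mutate(Block = cumsum(UP != lag(UP, default = first(UP)))) %>%
    filter(UP == 1) %>%
    group_by(Block) %>%
    summarise(Start = min(Date), End = max(Date))

# Because UP only flips to 1 after the machine has already started moving,
# every Start above lands slightly later than the true start of the movement.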

  • rather than relying on the UP and DOWN variables (which seem quite dubious if the second chart is anything to go by) could you use a cut-off point for one or both of the Sensors? e.g. Sensor 2 starts off; when the value goes above -0.75 (or the derivative changes dramatically) it's on, and when it drops back below that it's off again, and that is one movement added to a list (see the sketch after these comments). – Mark Jul 13 '23 at 17:37
  • It looks like one issue is that there are multiple observations for several values of the `Date` variable: that could lead to the lag problems you are encountering because the values within the repeated values may end up being sorted in a random order (and thus not correspond to "up/down"). – mikebader Jul 13 '23 at 18:31
  • @mikebader just downloaded the file as XLSX from Google Sheets and run it in R. It does show the dates with milliseconds... There are no duplicates in dates. – AnonX Jul 13 '23 at 19:42
  • @Mark I was thinking about this but I'm afraid to miss a movement... What if my sensor turns off (its value drops near -1) but the machine keeps moving up or down? – AnonX Jul 13 '23 at 21:05
  • I couldn’t access the data on Google Sheets except by downloading it and can’t edit the Google Sheet to widen the column. Can you use `dput()` in R to share the data? (example here: https://stackoverflow.com/questions/49994249/example-of-using-dput#49995752) – mikebader Jul 14 '23 at 00:36
  • @mikebader I used Google Sheets because it's a large data and I couldn't share just with ```dput()``` Here's a dropbox link if you want: https://www.dropbox.com/scl/fi/9d5k3ejpb0k413zii2li8/Data_SO.xlsx?rlkey=la0915rnvtc4tup8s2amn4jev&dl=0 – AnonX Jul 14 '23 at 02:57
  • I had a look over the data and had a think over things @AuronusBen. You say in your question you want "to extract the entire movement (UP/DOWN) from the beginning to the end *without losing any data*". I think the first part of that is possible, the second part (not losing any data) is probably not – Mark Jul 14 '23 at 13:17
  • Your worries about times when the sensors don't work are warranted: the data does indeed include parts where one or both of the Sensors are NA. The issue with this is, how can you then tell how far something is up or down? In the image you showed, the movement is sometimes all the way up, then halfway down, stagnation, then halfway down again. If you only have the up and down data, you aren't able to know which type you have – Mark Jul 14 '23 at 13:20
  • So essentially at that point, you will be guessing, which means some of your guesses will be right, and others wrong. This is funnily enough a good situation for building a machine learning model – Mark Jul 14 '23 at 13:21
  • @Mark my end goal is to create a classifier of good and bad "up/down" movements... that's why I need data that describe these "up/down" movements... – AnonX Jul 14 '23 at 18:39
  • Ah so the plan is to use this as the training data? – Mark Jul 14 '23 at 20:38
  • @Mark yes but I need to clean my data before any manipulation... – AnonX Jul 14 '23 at 22:03
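
A sketch of the cut-off idea Mark describes above, with his derivative variant left as a commented alternative (the -0.75 cut-off is his suggested value and the jump threshold is an assumption; both would need tuning to the data):

library(dplyr)

cutoff <- -0.75  # Mark's suggested value; adjust to the data

events <- df %>%
    arrange(Date) %>%
    mutate(On = Sensor2 > cutoff,
           # a derivative-based variant would instead flag rows where
           # abs(Sensor2 - lag(Sensor2)) exceeds some jump threshold (assumed, e.g. 0.05)
           Event = case_when(On & !lag(On, default = FALSE) ~ "start",
                             !On & lag(On, default = FALSE) ~ "end",
                             TRUE ~ NA_character_)) %>%
    filter(!is.na(Event)) %>%
    select(Date, Event)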

1 Answer

library(tidyverse)
library(readxl)

df <- read_excel("~/Downloads/Data_SO.xlsx")

missing_data <- df[is.na(df$Sensor1) | is.na(df$Sensor2),]$ID

missing_data 
# [1]    62   899  1699  2492  3281  4109  4909  5707  6493  7291  8078  8886
# [13]  9673 10471 10650 11259 12062 12221 12833 13459 13632 13812 14420 15240
# [25] 15419 16026 16822 17003 17621 18423 18604 19217 20013 20807 21016 21135
# [37] 21136 21306 21407 21602 21737 21837 21978 22130 22765 22938 23725 24528
# [49] 25315 26133 26904 27775 27954 28563 28610 28611 28612 28613 29400

# As you can see, there are ~60 missing data points, of a total of ~30000 data points
# Other than a couple of examples, they all tend to be spread out, so even when we
# miss one for a particular second, we will have a couple of other data points for
# that second (and in the couple of cases where we don't, we can just use the
# next second)
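
# Quick sanity check on how those gaps are spaced (illustrative):
summary(diff(missing_data))   # most missing IDs are hundreds of rows apart
sum(diff(missing_data) == 1)  # counts the few back-to-back missing rows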

# plot the data
ggplot(df, aes(x = Sensor2)) + geom_histogram(binwidth = 0.001)

# [histogram of Sensor2 values]

ggplot(df, aes(x = Sensor1)) + geom_histogram(binwidth = 0.001)

# [histogram of Sensor1 values]

# Assuming Sensor2 and Sensor1 are monitoring the same thing, we can see that
# Sensor2 has a much smaller variance in its values than Sensor1. Furthermore,
# with Sensor2, it's clear what the cut-off values should be. With Sensor1, that's
# not the case. So we will use Sensor2 for our predictions.

df <- df %>%
    filter(!ID %in% missing_data) %>%
    select(Datetime = Date, Sensor = Sensor2)  # keep Sensor2, per the histograms above

df %>%
    mutate(On = Sensor > -0.9,
           # find the rows where the sensor switched state (off <-> on)
           Switched = On != lag(On)) %>%
    filter(Switched) %>%
    mutate(Start = lag(Datetime)) %>%
    select(Start, End = Datetime) %>%
    filter(!is.na(Start))

# A tibble: 167 × 2
   Start                   End                    
   <dttm>                  <dttm>                 
 1 2019-02-01 09:30:22.000 2019-02-02 02:47:54.200
 2 2019-02-02 02:47:54.200 2019-02-02 02:49:58.000
 3 2019-02-02 02:49:58.000 2019-02-02 02:50:02.000
 4 2019-02-02 02:50:02.000 2019-02-02 02:50:36.000
 5 2019-02-02 02:50:36.000 2019-02-02 03:11:22.799
 6 2019-02-02 03:11:22.799 2019-02-02 03:11:58.000
 7 2019-02-02 03:11:58.000 2019-02-02 03:12:00.599
 8 2019-02-02 03:12:00.599 2019-02-02 03:14:04.400
 9 2019-02-02 03:14:04.400 2019-02-02 03:24:10.000
10 2019-02-02 03:24:10.000 2019-02-02 03:26:13.799
# ℹ 157 more rows
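
If you also need to know whether each interval was a movement or an idle stretch, the same pipeline can carry the On flag through (a sketch; one extra column on the code above):

df %>%
    mutate(On = Sensor > -0.9,
           Switched = On != lag(On)) %>%
    filter(Switched) %>%
    # each interval ran in the state that held *before* this switch
    mutate(Start = lag(Datetime), Moving = lag(On)) %>%
    select(Start, End = Datetime, Moving) %>%
    filter(!is.na(Start))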
  • Thank you Mark for your suggestion! But I feel like the filter only works half the time... For instance, if I take [index (3)](https://i.imgur.com/biIq6Ma.jpg), its last value contains a value greater than -0.95 (my cut-off) that should be in the next [index (4)](https://i.imgur.com/FXtMeJr.jpg)... Same for index 9, its last value greater than my cut-off should be in index 10... (see the sketch after these comments) – AnonX Jul 15 '23 at 11:06
  • I don't follow at all...but it sounds like you are rejecting a "good enough" solution in favour of a fantasy ideal solution – Mark Jul 15 '23 at 11:50
  • what is the application of this? – Mark Jul 15 '23 at 11:51
  • I'm probably overthinking, just trying to make the best dataset for my next step... that's probably why I'm nitpicking at these details... – AnonX Jul 15 '23 at 12:16
  • if your model has any kind of robustness, and if your situation isn't literally life or death (i.e. it's not performing surgery or in self-driving cars), a bit of fuzz isn't going to make a meaningful difference. At the end of the day, we're literally talking about fractions of a second – Mark Jul 15 '23 at 12:31
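
A sketch of how the boundary issue raised in the first comment could be handled: the switch row itself already belongs to the new state, so the previous interval should close at the timestamp of the raw sample just before it, captured here before filtering (otherwise the same pipeline as the answer):

df %>%
    mutate(On = Sensor > -0.9,
           Switched = On != lag(On),
           PrevObs = lag(Datetime)) %>%  # timestamp of the preceding raw sample
    filter(Switched) %>%
    mutate(Start = lag(Datetime)) %>%    # the previous switch opens this interval
    select(Start, End = PrevObs) %>%     # close it one raw sample earlier
    filter(!is.na(Start))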