2

Objective

I have a large dataset, df, where I have a Length, Date and Edit Column. My goal is to iterate through a large dataset and find the index, start and end times for a given condition.

Working backwards, I need to get the index or row number where Edit is False *with the condition that the previous Edit is True.This will output an 'End' and the value that is in the Length column.

The Start is generated by going backwards from the 'End' index (Edit is False) and when you come across the next (Edit is False) + 1

 Length        Date                               Edit

  20            1/2/2020 1:00:00 AM               False
  21            1/2/2020 1:00:01 AM               True
  81            1/2/2020 1:00:02 AM               True
  81            1/2/2020 1:00:03 AM               True
  90            1/2/2020 1:00:04 AM               False
  20            1/2/2020 1:00:05 AM               True
  90            1/2/2020 1:00:06 AM               True
  81            1/2/2020 1:00:10 AM               True
  90            1/2/2020 1:00:15 AM               False        
  20            1/2/2020 1:00:25 AM               True

This is my desired output:

Start                   End                   Duration   RowNum      Length 

1/2/2020 1:00:05 AM     1/2/2020 1:00:15 AM   10         8              90
1/2/2020 1:00:01 AM     1/2/2020 1:00:04 AM   3          4              90

Starting backwards, we see that the first End time is at, 1/2/2020 1:00:15 AM, because Edit is False, and its previous Edit value is True. The length is 90, and the RowNumber is 8. The Start would go backwards from 1/2/2020 1:00:15 AM until we come to another Edit is False line plus 1 , so it would be: 1/2/2020 1:00:05 AM

dput

structure(list(Length = c(20L, 21L, 81L, 81L, 90L, 20L, 90L, 
81L, 90L, 20L), Date = structure(1:10, .Label = c("1/2/2020 1:00:00 AM", 
"1/2/2020 1:00:01 AM", "1/2/2020 1:00:02 AM", "1/2/2020 1:00:03 AM", 
"1/2/2020 1:00:04 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:06 AM", 
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:25 AM"
 ), class = "factor"), Edit = c(FALSE, TRUE, TRUE, TRUE, FALSE, 
TRUE, TRUE, TRUE, FALSE, TRUE)), class = "data.frame", row.names = c(NA, 
-10L))

This is what I have tried

 library(dplyr)
 library(readr)

 for (i in 1:nrow(df) {


if (df[i] == Edit == "False") {
print(df[rows]) 
}
    else if (df[i] < condition) {
print(df[rows])

}
    }

   mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
   mutate(RowNum = cumsum(!cond)) %>%
   group_by(Length) %>%
   summarize(Start = min(Date),
        End = max(Date),
        Duration = End - Start) %>%

I have a start, I am just unsure of how to put this together. Any help or suggestions are appreciated.

Lynn
  • 4,292
  • 5
  • 21
  • 44

3 Answers3

3

True + False should give 1 (True ==1, False == 0). Basically, one end should be True + False, the other end should be False + True. So you have a window.
Next step is to get rid of the nulls Then look for values in going_forward equal to 1.

df['grouping_forward'] = df.Edit.add(df.Edit.shift(1))
df['grouping_backward'] = df.Edit.add(df.Edit.shift(-1))


(df.dropna()
 .query('grouping_forward==1')
 .assign(Row = lambda x: np.where(x.Edit.eq(0),
                                  x.index,
                                  np.nan),
        Start = lambda x: np.where(x.Edit.eq(1), 
                                   x.Date,
                                   np.datetime64('NaT')),
        End = lambda x: np.where(x.Edit.eq(0),
                                 x.Date,
                                 np.datetime64('NaT'))
    )
 .ffill()
 .query('Edit == 0')
 .drop(['grouping_forward','grouping_backward','Date','Edit'],axis=1)
 .assign(Duration = lambda x: x.End.sub(x.Start).dt.seconds)
  )

    Length  Row     Start                End             Duration
4   90      4.0 2020-01-02 01:00:01 2020-01-02 01:00:04     3
8   90      8.0 2020-01-02 01:00:05 2020-01-02 01:00:15     10
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Thank you @sammywemmy. This Is Python I believe? – Lynn Mar 01 '20 at 09:52
  • 1
    yes. python. the pandas library is used here. and numpy – sammywemmy Mar 01 '20 at 09:52
  • Ok I will try this tomorrow as it is very late now. Thank you – Lynn Mar 01 '20 at 09:53
  • Hi @sammywemmy, I can getting a few errors while parsing. Will this code work for large datasets? Some of the fields in my dataset have NaN values as well – Lynn Mar 01 '20 at 18:55
  • I have imported pandas and numpy, do I need to import any other package? I am getting this error: TypeError: unsupported operand type(s) for -: 'str' and 'str' – Lynn Mar 01 '20 at 18:59
  • Do I need to first establish that False is 0 and True is 1? I am still getting familiar with Python. Would I have to do: df.Edit(False == 0)? @sammywemmy Thank you – Lynn Mar 01 '20 at 19:08
  • 1
    If you are filtering then u do df[‘Edit’]==0 or 1. False is equivalent to 0 and True is equivalent to 1. Ur use case determines ur approach. – sammywemmy Mar 01 '20 at 19:14
  • ok thank you- I was able to do this df.Edit= df.Edit.astype(int) However, I am still getting this error:TypeError: unsupported operand type(s) for -: 'str' and 'str' – Lynn Mar 01 '20 at 19:16
  • Do I need to cast them to integers? @sammywemmy? – Lynn Mar 01 '20 at 19:18
  • 1
    Kindly do a trace to find out which line is giving u their error message. Also is your date column in date format. I assumed it already was – sammywemmy Mar 01 '20 at 19:19
  • No u don’t need to cast them as integers. Leave as is. – sammywemmy Mar 01 '20 at 19:19
  • How can I find if date column is in data format? I will check – Lynn Mar 01 '20 at 19:21
  • I just converted to datetime- I believe it is running now- it is a large dataset at 1.5 millions rows – Lynn Mar 01 '20 at 19:27
  • 1
    Ur best bet Tania is to run the code line by line till u get an erro. Just run line 1, if it works then add line 2 if it fails then u know there is a problem with the line u added and u find out what the error message is and dig into it. That is how we all learn – sammywemmy Mar 01 '20 at 19:28
  • Thank you I am still learning- it works! @sammywemmy I appreciate your time and knowledge, I am still learning, I was researching as well, thank you – Lynn Mar 01 '20 at 19:29
  • 1
    For the date, I will suggest you add a line at the beginning of ur code : df[‘Date’] = pd.to_datetime(df[‘Date’]). U will have to read up more for that. – sammywemmy Mar 01 '20 at 19:30
  • Furthermore, what resources did you use to learn Python please? The code seems more advanced, as I am eager to learn more. Thank you @sammywemmy – Lynn Mar 02 '20 at 17:18
  • 1
    stackoverflow, [a byte of python](https://python.swaroopch.com/), the docs, and everyday practice – sammywemmy Mar 02 '20 at 20:17
  • Ok @sammywemmy !! You are really good! And I am grateful for your helping me with this problem! I will study further – Lynn Mar 02 '20 at 21:47
0

If you use Pandas, you may refer to this question How to iterate over rows in a DataFrame in Pandas?

the index is where the row is, probably the row slice you referred to? and the row is obviously the row. You can select the data field in the row you're interested in.

for 'Edit' data field, when you loop through the dataset, you may write something like:

for index, row in df.iterrows():
    if row['Edit'] == 'False'
        print(index) #where the row is
Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
Jose
  • 1
0

Using dplyr :

library(dplyr)

df %>%
  mutate(Date = lubridate::mdy_hms(Date),
         gr = lag(cumsum(!Edit), default = TRUE)) %>%
  slice(-c(1, n())) %>%
  group_by(gr) %>%
  summarise(Start = min(Date),
            End = max(Date),
            Duration = as.integer(End - Start), 
            RowNum = n(), 
            Length = Length[n()]) %>%
  mutate(RowNum = cumsum(RowNum)) %>%
  slice(n():1) %>%
  select(-gr)

#  Start               End                 Duration RowNum Length
#  <dttm>              <dttm>                 <int>  <int>  <int>
#1 2020-01-02 01:00:05 2020-01-02 01:00:15       10      8     90
#2 2020-01-02 01:00:01 2020-01-02 01:00:04        3      4     90
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • ok thank you @Ronak I will try this- Can you explain why I do not need to use loops for this? It accomplishes the same thing? Also, will this work for very large million row dataset? – Lynn Mar 01 '20 at 07:37
  • 1
    There must be some way to handle this with loops as well but I think it is not needed here. This should be able to handle for large datasets as well. – Ronak Shah Mar 01 '20 at 07:42
  • Ok I will try this soon – Lynn Mar 01 '20 at 07:43
  • Can you explain what gr is? @Ronak? is this a group term? – Lynn Mar 01 '20 at 07:51
  • 1
    Yes, it is used to define groups. So row 2-5 is in one group and 6-9 in another. – Ronak Shah Mar 01 '20 at 07:54
  • Also, @Ronak, slice(-c(1, n())) %>% this command is telling the code to look at the previous row? can you explain what this is please? – Lynn Mar 01 '20 at 08:00
  • It removes first and last row since those are not part of any of the group. – Ronak Shah Mar 01 '20 at 08:09
  • Hi @Ronak, I am getting this error in my large million record dataset: Error: n() should only be called in a data context . Is this due to my dataset being over 1 million rows? – Lynn Mar 01 '20 at 09:05
  • @TanishaHudson No, that is probably because some functions are masked from other packages. Can you restart R and load only dplyr package and try this or explicitly mention `dplyr::function_name` while using each function. – Ronak Shah Mar 01 '20 at 09:31
  • Ok what do you mean by this? dplyr::function_name can you please give example? The error mentions the use of n() , so should I put dplyr::n() ?? – Lynn Mar 01 '20 at 09:39
  • 1
    Try restarting your R session and load only `dplyr` and try the above code. – Ronak Shah Mar 01 '20 at 09:41
  • I am trying now, I followed as you instructed @Ronak – Lynn Mar 01 '20 at 09:45
  • Ok @Ronak, I do not have an error anymore, but my output is showing all NAs for Length and 0 for Duration. I am not sure – Lynn Mar 01 '20 at 09:51
  • I think the data which you have shared is different from the data that you have. Can you check how the both data are different? – Ronak Shah Mar 01 '20 at 10:25
  • Ok I will @ronak – Lynn Mar 01 '20 at 16:58