1

Since half a year I am into python and all it incredible libraries, such as Panads Dataframes.

I am struggling to get the iteration logic (see attached image) implement in my code. The logic is pretty clear to me but unfortunately am not able to get the snippet coded!

I was wondering if there is someone out, who can give me the right hint?

Thank you very much in advance!

Transparent iteration logic

df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'])
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020'])
  • 1
    Please show the code you have so far, and where you are having specific problems. – Ben Aug 14 '20 at 19:39
  • Thank you Ben, for you interest! I was searching the community quite some time and found some hints, but nothing that considered also the previous iterations. The best I could find is this post https://stackoverflow.com/questions/32237862/find-the-closest-date-to-a-given-date with the following iteration code: return min(df1, key=lambda x: abs(x - df2 )) But as said, I am still in the "unknown" how to code. – marco.linos Aug 14 '20 at 19:50
  • @marco.linos you are propably on the wrong site then, stack overflow is meant for code not an interactive tutorial. If you are that new to python you definetely should start with a tutorial. There are ALOT out there, i used udemy and waited for a sale but even you tube has good ones. – Andreas Aug 14 '20 at 23:30
  • Thank you @Andreas, for your interest comment. Maybe important to mention, I do have beginner experience but no expert knowledge. I did see lots of tutorials therefore I have several python webscrapers running with scrapy, selenium and splash. TabPy server also setup up and in an automatic script up and running! Selenium scrapers are integrated with Tableau. But this particular snippets I was not able to solve by myself searching the web and looking python tutorials. Thank you! – marco.linos Aug 15 '20 at 03:22
  • @marco.linos ok, in this case please explain the underlaying logic in a bit more detail please I think it is very good that you provided a copy paste example and a image explaining the logic, but the logic is not obvious to me. E.g. why does the iteration start with 02.01.2020? Is that a typo? Why are in the second iteration 3 datapoints True and not all? How should it be detected that the timewindow has closed? – Andreas Aug 15 '20 at 08:52
  • Thank you @Andreas for your questions and your curiosity!! :-) Let me try to explain the logic with my use case! - **df1**: Regular update is usually one (1) time per month and is usually at the beginning of each month - **df2**: Multiple updates of a date set during each month - **df3**: The snippet should search and link the closest future date of **df2** in regards to **df1** till there is a most recent date in **df2** available. – marco.linos Aug 15 '20 at 21:30
  • _**To answer you questions @Andreas:**_ - _Why does the iteration start with 02.01.2020?_ -> The start date of df1 is 02.01.2020, because it takes place at the beginning of each month. Could also be 01.01., 03.01. or 04.01. ,,,) - _Why 3 datapoints True and not all?_ -> see explanation df3 - _Detection of the timewindow?_: Timewindow is closed with the newest date in df1. For dates in df2 applies the following logic: df2 <= df1. I very much hope these details added value to my copy paste example for better understanding of my use case! – marco.linos Aug 15 '20 at 21:30

1 Answers1

0
import pandas as pd
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'], format="%d.%m.%Y")
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020', '03.11.2021'], format="%d.%m.%Y")


lst_df1 = list(df1.sort_values())
lst_df2 = list(df2.sort_values())

dict_df3 = {} 
 
window_start = lst_df2[0]
window_stop = lst_df2[1]
for date in lst_df1:
    while date > window_stop:
            window_start = lst_df2[0]
            lst_df2 = lst_df2[1:]
            window_stop = lst_df2[0]
            
    dict_df3[date] = window_start

df3 = pd.DataFrame.from_dict(dict_df3, orient='index').reset_index()
Andreas
  • 8,694
  • 3
  • 14
  • 38
  • 1
    thank you so much!! . The result is correct. Now I will go step by step through the code to understand the pd logic! Have nice wonderful! – marco.linos Aug 16 '20 at 09:56
  • Glad I could help, could you please also upvote the answer? Gives +10 Reputation. The basic idea is to have ordered dates, then loop through the df 1 dates. If the current date in df 1 is bigger than the first date in df2 cut off the first date in df 2 and try again. This also means that in df2 there has to always be a larger date than in df1. E.g. if you remove the last date of df2 (in the example) loop will crash. – Andreas Aug 16 '20 at 10:24
  • I did already vote up you solution (arrow up) and the following pop up appeared: _Thanks for the feedback! Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score._ I hope they reached you. By stepping through the snippet, I understood the logic. Very clever! Thank you again! – marco.linos Aug 16 '20 at 16:43