0

I have 2 pandas dataframes. One of them contains year information, the id's are always unique in this table:

                        _id               start_date                 end_date               name
0  abdgff374935hgkfeo549353  2018-09-02 00:00:00.000  2019-08-31 00:00:00.000          2018/2019
1  fsjket43538603463u9gregr  2019-09-01 00:00:00.000  2020-07-31 00:00:00.000          2019/2020

My second one contains group information which also has students information in, there are multiple lines with the same group id, as each line is linked to different students, a student id can also be the same on muliple lines as students enter new groups:

           _id        student_start_date         student_end_date         student_id
0  fjkgn4783u54  2018-10-01 00:00:00.000  2019-08-31 00:00:00.000      2gbger9tu9834
1  ngregreit495  2019-09-10 00:00:00.000  2020-07-31 00:00:00.000      ghuti43594353
1  ngregreit495  2019-11-05 00:00:00.000  2020-07-31 00:00:00.000      ghuti43594354
1  ngregreit495  2019-09-02 00:00:00.000  2020-07-31 00:00:00.000      ghuti43594357

What I am trying to recreate is the below table which adds the year name to the groups table:

           _id        student_start_date         student_end_date         student_id     year_name
0  fjkgn4783u54  2018-10-01 00:00:00.000  2019-08-31 00:00:00.000      2gbger9tu9834     2018/2019
1  ngregreit495  2019-09-10 00:00:00.000  2020-07-31 00:00:00.000      ghuti43594353     2019/2020

Does anyone know how I can loop though both the groups table and the years table to find which groups data fits into which year?

This is what I have so far, but it's very slow:

            for i in Years.index: 
                StartDate = Years.loc[i,'start_date']
                EndDate = Years.loc[i,'end_date']
                YearName = Years.loc[i,'name']
                print(YearName)
                for i in groups.index:
                    StudentStartDate = groups.loc[i,'student_start_date']
                    StudentEndDate = groups.loc[i,'student_end_date']
                    GroupName = groups.loc[i,'name']
                    try:
                        if (StartDate <= StudentStartDate <= EndDate):
                            groups.loc[i,'YearName'] = YearName
                    except TypeError:
                        pass
                    try:
                        if (StartDate <= StudentEndDate <= EndDate):
                            groups.loc[i,'YearName'] = YearName
                    except TypeError:
                        pass
  • Rather than combine the dataframes, you could easily calculate year_name. Please see https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python . If you want to use the existing column, SO has a lot of ways. – rajah9 Mar 20 '20 at 11:07
  • Thanks @rajah9, but sometimes my years have different start/end dates, so I need it to check if the student start_date is between my years start and end dates to find the correct year –  Mar 20 '20 at 11:10

2 Answers2

0

If I have understood correctly, you are looking for something like this:


import pandas as pd

# This is the first dataframe in your question
d = {'_id': ['abdgff374935hgkfeo549353', 'fsjket43538603463u9gregr'], 
     'start_date': ['2018-09-02', '2019-09-10'], 
     'end_date': ['2019-08-31', '2020-07-31'],
     'name': ['2018/2019', '2019/2020']}

df_dates = pd.DataFrame(data=d)

# This is the second dataframe in your question
d = {'_id': ['fjkgn4783u54', 'ngregreit495'], 
     'student_start_date': ['2018-10-01', '2019-09-10'], 
     'student_end_date': ['2019-08-31', '2020-07-31'],
     'student_id': ['2gbger9tu9834', 'ghuti43594353']}

df_students = pd.DataFrame(data=d)

# A single line to replace your for loop, which generates the third table in your question
df_students['year_name'] = df_dates[(df_dates.start_date <= df_students.student_start_date) & (df_students.student_start_date <= df_dates.end_date)].assign(year_name = df_dates.name).year_name

print(df_students)

Result:

            _id student_start_date student_end_date     student_id  year_name
0  fjkgn4783u54         2018-10-01       2019-08-31  2gbger9tu9834  2018/2019
1  ngregreit495         2019-09-10       2020-07-31  ghuti43594353  2019/2020
s.dallapalma
  • 1,225
  • 1
  • 12
  • 35
  • Thanks, but I am getting this error when trying your code: ValueError: Can only compare identically-labeled Series objects –  Mar 20 '20 at 12:25
  • Which version of pandas are you using? Mine is 0.24.2, and it works. You can look at the version with ```pip list```. – s.dallapalma Mar 20 '20 at 12:29
  • 0.25.1. I think the error is because in my dataframes there are multiple rows. The years table contains 5 different years, and the groups table contains over 50 groups with multiple students in each one –  Mar 20 '20 at 12:31
  • I have installed the version 0.25.1 and still works. Did you try the snippet in its own script (apart from your code)? – s.dallapalma Mar 20 '20 at 12:36
  • yes it works with your data, but my data must be formatted differently or something? Does your data print out like mine does? –  Mar 20 '20 at 12:40
  • The snippet print out exactly the third table in your question (based on the previous two tables you provided). I assumed your data-frames have the same number of rows. I do not know the actual content of your data-frames – s.dallapalma Mar 20 '20 at 12:42
  • Oh, no the years one only has about 5 rows whereas the groups one has over 50 –  Mar 20 '20 at 12:43
  • Then you should clarify it in your question and provide more data. Try with the first table with all the 5 data and the second with 10 – s.dallapalma Mar 20 '20 at 12:44
  • Sorry, will alter question. Yes that is why the error occurs, because the dataframes are not the same length. –  Mar 20 '20 at 12:48
0

I have a major problem with the last line from your showed students dataframe: the student_start_date is 2019-09-02 while the year start_date is 2019-09-10. That would mean that a student managed to enter a group before the beginning of the year...

If I exclude that line, I can do a pd.merge_asof with a backward direction:

resul = pd.merge_asof(df2.sort_values('student_start_date'), df1,
                      right_on='start_date', left_on='student_start_date')

It gives:

            _id student_start_date student_end_date     student_id                     _id_y start_date   end_date       name
0  fjkgn4783u54         2018-10-01       2019-08-31  2gbger9tu9834  abdgff374935hgkfeo549353 2018-09-02 2019-08-31  2018/2019
1  ngregreit495         2019-09-10       2020-07-31  ghuti43594353  fsjket43538603463u9gregr 2019-09-10 2020-07-31  2019/2020
2  ngregreit495         2019-11-05       2020-07-31  ghuti43594354  fsjket43538603463u9gregr 2019-09-10 2020-07-31  2019/2020

You can now drop the unwanted columns to get the expected dataframe...

Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • Sorry, that must be an anomoly date thats wrong. Will this still work if students start or leave a group during the year and not at the start? What can I do if the student_start_date is null and it needs to look at the student_end_date instead? –  Mar 20 '20 at 13:55
  • @Pippa97 Currently the merge_asof searches the higher `start_date` immediately before or equal to the `student_start_date`. As I have added no further control it erroneously takes the previous year is student_start_date is just before the year start_date, the reason why I have removed that line. But provided the student_start_date is after the year start_date, it should work. – Serge Ballesta Mar 20 '20 at 14:00
  • Thanks @Serge Ballesta, but I'm getting this error - TypeError: No matching signature found. –  Mar 20 '20 at 14:14
  • @Pippa97: I cannot guess what can produce that error. Maybe you could open a new question refering that one. That way, you could provide a short extract exhibiting the problem, the code raising the error (including the `merge_asof`) and the full stacktrace. If you edit this one now, the current answers would become irrelevant... – Serge Ballesta Mar 20 '20 at 14:39