0

I have a code that creates csv files after certain operations with original dataframe:

import pandas as pd

timetable = pd.read_excel('timetable.xlsx')

data = {"stop_id": timetable['stop_id'], "arrival_time": timetable['arrival_time'], 'route_id': timetable['route_id']}

df = pd.DataFrame(data=data) # Create the DataFrame from the data

g = df.groupby(['stop_id', 'arrival_time']).size()

stops = { i[0] for i in g.index }

for stop in stops:
    times = filter(lambda x: x[0] == stop, g.index)
    data = { "stop_id": [], "arrival_time": [], "number": []}
    for time in times:
        data["stop_id"].append(stop) # add the stop_id
        data["arrival_time"].append(time[1]) # add the current time
        data["number"].append(g[(stop, time[1])]) # add its count
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)

how should I change the code so that it also appends other columns' values? I have a column route_id which has different values for each unique stop_id, and I want to list these route_id values per each arrival_time row. context: a bus (route_id) arrives at stop_id at a certain arrival_time, but there can be several buses arriving at the same arrival_time, so I want to know which route_id arrived at a certain time

the data: https://docs.google.com/spreadsheets/d/1O6QGWZh0Yp2EcJAnlvIJw0xiCH8T1AY_/edit#gid=640877265

the extract from the data:

route_id    stop_id arrival_time
429         2179    4/6/22 19:40:00
429         2179    4/6/22 08:06:00
429         2179    4/6/22 09:20:00
429         2179    4/6/22 11:12:00
429         2179    4/6/22 12:25:00
429         2179    4/6/22 13:39:00
429         2179    4/6/22 17:56:00
429         2179    4/6/22 19:19:00
441         2179    4/6/22 07:16:00
441         2179    4/6/22 10:37:00
441         2179    4/6/22 14:33:00
  • you need to provide a minimal example of the data – mozway May 24 '22 at 08:17
  • you can groupby time and send route_id's to list, check the link for examples https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby/66018377#66018377 If list is not acceptable, you can join route_id's in a similar manner, - plenty of examples on this web – NoobVB May 24 '22 at 08:19
  • @mozway added link to the question – barney stinson 2 May 24 '22 at 08:28
  • a minimal self-contained example would be better – mozway May 24 '22 at 08:31
  • @mozway added the example (although I find it difficult to describe it without looking at the full dataframe) – barney stinson 2 May 24 '22 at 08:37

1 Answers1

1

quite self explanatory:

import pandas as pd

df = pd.read_excel('timetable.xlsx', converters={'stop_id':int,'route_id':int})

# grouping by stop_id & arrival_time, also  joining  route_id to the sorted list, counting size of each stop_id group
# all ends up in multi-index dataframe, .reset_index applied to flatten it. 
df_grouped = df.groupby(['stop_id', 'arrival_time'])\
    .agg(number=('arrival_time', 'size'), route_id=('route_id', sorted))\
    .reset_index()

#creating .csv per unique stop_id df_grouped dataframe 
for stop in df_grouped.stop_id.unique():
    file_name = 'Stop_ID{0}.csv'.format(stop)
    df_grouped[df_grouped['stop_id'] == stop].to_csv(file_name, index=False)

as per comments, the string option instead of list:

import pandas as pd

df = pd.read_excel('timetable.xlsx', converters={'stop_id':int,'route_id':int})
df.route_id = df.route_id.astype(str) # changing dtype to string before grouping
df_grouped = df.groupby(['stop_id', 'arrival_time'])\
    .agg(number=('arrival_time', 'size'), route_id=('route_id', ', '.join))\
    .reset_index()
for stop in df_grouped.stop_id.unique():
    file_name = 'Stop_{0}.csv'.format(stop)
    df_grouped[df_grouped['stop_id'] == stop].to_csv(file_name, index=False)
NoobVB
  • 989
  • 6
  • 10
  • yes, exactly what I need! small question: the output gives ```route_id``` in square brackets, but I don't need brackets. what should I do? – barney stinson 2 May 24 '22 at 10:33
  • these brackets, shows the data type of `List` as I mentioned in my comment previously, if list is not needed, you can join values in the same manner and will get the string instead of list – NoobVB May 24 '22 at 10:43
  • sorry, new to coding, have no idea how to do it :( could you please give a hint? – barney stinson 2 May 24 '22 at 10:52
  • you are very welcome! P.S. just compare the code of yours, - you will notice, lots of your rows were not necessary as you were doing the same steps twice, like: `data = {"stop_id": ....`, - you were creating dataframe, even if you already created one with `timetable = pd.read_excel('timetable.xlsx')` read a bit more about pandas basics – NoobVB May 24 '22 at 11:05