1

This is my first post on StackOverflow. My apologies if I do something wrong or against the web's rules. My problem: I have a csv file I read using pandas in python. The dataframe has five columns, named [yday, wday, time, stop, N]:

yday is the year's day, from 1 to 365;
wday is the week's day, from 1 to 7;
time is a number from 1-144 (I divide the day in gaps of 10 minutes each, 1440 minutes per day/10 minutes = 144);
stop is a the number of the bus stop (1-4);
N is the number of ppl who enter the bus

Well, I would like to have an entry for each gap, giving 144 rows per day, but I have some missing gaps as you can see: CSV example

My goal is to add new rows to fill all the time gaps, like adding (based on the image given):

320,6,81,1,1 <-- Exists

320,6,82,1,na <-- New 

320,6,83,1,na <-- New

320,6,84,1,na <-- New

320,6,85,1,1 <-- Exists

I tried to index my DataFrame with df.set_index['stop','yday','time'] so I could reindex it with 'time' values from 1 to 144, but it doesn't work. I'm new to Python and i'm getting mad trying to solve this.

Thanks in advance and sorry for my english.

r.ook
  • 13,466
  • 2
  • 22
  • 39
Asier
  • 13
  • 2
  • What issue are you having with your reindex? You just need to expand your index with the missing `time` values, afterwards using `df.reindex` should work: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html – r.ook May 01 '20 at 15:40
  • Does this answer your question? [Reindexing a specific level of a MultiIndex dataframe](https://stackoverflow.com/questions/48179297/reindexing-a-specific-level-of-a-multiindex-dataframe) – r.ook May 01 '20 at 16:29

1 Answers1

0

Apologies for the long code. But it solves your purpose:

#Specify the input csv file path here. I am assuming your csv has only 
#the columns and column names you mentioned in your question. If not you have to 
#modify the below code to reflect your changed columns
df = pd.read_csv("Path_to_your_input_csv_file_here") 

df = df.sort_values(['yday', 'wday', 'time'], ascending=True) #Sort df values based upon yday, wday and time first
df = df.reset_index(drop=True) #Reset the indices after sorting

df2 = df.copy(deep=True) #Make a deep copy of this sorted dataframe

#The below for loop iterates through rows of 'df', finds differences between time values and adds up missing rows to 'df2'
for index, row in df.iterrows(): #Iterate through the rows of df
    if index == len(df)-1:
        break
    else:
        if row["yday"] == df.loc[index+1,"yday"] and row["wday"] == df.loc[index+1,"wday"] and row["time"] < df.loc[index+1,"time"]:
            differences = list(range(row["time"]+1,df.loc[index+1,"time"]))
            for item in differences:
                tempdf = pd.DataFrame([[row["yday"], row["wday"],item, row['stop'], 'NA' ]],columns = df2.columns)
                df2 = df2.append(tempdf)

#Now sort 'df2' based upon yday,wday and time
df2 = df2.sort_values(['yday', 'wday', 'time'], ascending=True)
df2 = df2.reset_index(drop=True) #Reset indices

print(df2)

Output:

    yday  wday  time  stop   N
0    320     6    81     1   1
1    320     6    82     1  NA
2    320     6    83     1  NA
3    320     6    84     1  NA
4    320     6    85     1   1
5    320     6    86     1  NA
6    320     6    87     1  NA
7    320     6    88     1  NA
8    320     6    89     1   1
9    320     6    90     1  NA
10   320     6    91     1  NA
11   320     6    92     1  NA
12   320     6    93     1   1
13   320     6    94     1  NA
14   320     6    95     1  NA
15   320     6    96     1  NA
16   320     6    97     1   1

Cheers!

Karthick Mohanraj
  • 1,565
  • 2
  • 13
  • 28