
I am currently formatting data from two different data sets. One dataset holds an observed count of the people in a room, recorded hourly; the other is a count of people derived from wifi logs generated at 5-minute intervals.

After merging these two dataframes into one, I run into an issue: each on-the-hour row (such as "10:00:00") has the data from the observation set, but the 5-minute rows (such as "10:47:14") do not include it.

Here is how the merged dataframe looks:

        room    time                 con   auth  capacity  %     Count  module     size
0       B002    Mon Nov 02 10:32:06  23    23    90        NaN   NaN    NaN        NaN
1       B002    Mon Nov 02 10:37:10  25    25    90        NaN   NaN    NaN        NaN
12527   B002    Mon Nov 02 10:00:00  NaN   NaN   90        50%   45.0   COMP30520  60
12528   B002    Mon Nov 02 11:00:00  NaN   NaN   90        0%    0.0    COMP30520  60

Is there a way for me to go through the dataframe and find all the information regarding the "occupancy", "occupancyCount", "module" and "size" from 11:00:00 and write it to all the cells that are of the same day and where the hour is between 10:00:00 and 10:59:59?

That would allow me to have all the information on each row and then allow me to gather the min(), max() and median() based on 'day' and 'hour'.

To answer the comment asking for the original dataframes, here they are:
first dataframe:

    time                room    module      size
0   Mon Nov 02 09:00:00 B002    COMP30190   29
1   Mon Nov 02 10:00:00 B002    COMP40660   53

second dataframe:

        room    time                  con   auth  capacity  %     Count
0       B002    Mon Nov 02 20:32:06   0     0     NaN       NaN   NaN
1       B002    Mon Nov 02 20:37:10   0     0     NaN       NaN   NaN
2       B002    Mon Nov 02 20:42:12   0     0     NaN       NaN   NaN
12797   B008    Wed Nov 11 13:00:00   NaN   NaN   40        25    10.0
12798   B008    Wed Nov 11 14:00:00   NaN   NaN   40        50    20.0
12799   B008    Wed Nov 11 15:00:00   NaN   NaN   40        25    10.0

This is how these two dataframes were merged:

DFinal = pd.merge(DF, d3, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)

Any help with this would be greatly appreciated.

Thanks a lot,

-Romain

RomainD
  • Do you want the first 2 lines of your output to be: `0 B002 Mon Nov 02 10:32:06 23 23 90 90 0% COMP30520 60` , `1 B002 Mon Nov 02 10:37:10 25 25 90 0% COMP30520 60`? And what's the format of 'time' column? (use `df.time` to check) – danielhadar Jul 19 '16 at 11:56
  • Could you provide what you've done so far? How are you performing the merge and what is the expected result you're aiming for in a more clear manner? – sirfz Jul 19 '16 at 12:00
  • Hi @sirfz, here is how i merged the two dataframes: DFinal = pd.merge(DF, d3, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False) – RomainD Jul 19 '16 at 12:07
  • @RomainD: Please add this code snippet to your question by editing your question. – albert Jul 19 '16 at 12:13
  • Hi @danielhadar, I would like to keep all the records in the final dataframe in order to then generate database tables containing the min(), max() and median() values based on day and hour. Also the 'time' column is an object and follows this format ( %a %b %d %H:%M:%S ) – RomainD Jul 19 '16 at 12:24
  • Are the data of the hourly observations reliable? I mean are there any gaps where there isn't any data for more than 1 hour? – Çağlar Kutlu Jul 19 '16 at 12:32
  • @ÇağlarKutlu, the hourly data is collected on weekdays (over a 2-week period) from 9am until 5pm (this was collected manually by someone). The 5-minute interval data is generated through automated reports based on the network (24 hours per day) – RomainD Jul 19 '16 at 12:38
  • Actually, I was able to fix this by modifying the merging – RomainD Jul 26 '16 at 20:57

3 Answers


Somewhere to start:

b = df[(df['time'] > X) & (df['time'] < Y)]

selects all the elements within times X and Y

And then

df.loc[df.index.isin(b.index)]

This gives you the rows you want (i.e. those between X and Y), and you can assign to them as you see fit. I think you'll want to assign the values of the selected rows to those of row number X?
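A minimal sketch of the select-then-assign idea, assuming the 'time' column has already been parsed to datetimes. The sample values and the year 2015 are assumptions (the strings in the question carry no year), and `in_window` and `source` are illustrative names:

```python
import pandas as pd

# Toy frame modelled on the question; values and year are assumptions.
df = pd.DataFrame({
    "room": ["B002", "B002", "B002", "B002"],
    "time": pd.to_datetime([
        "2015-11-02 10:00:00",
        "2015-11-02 10:32:06",
        "2015-11-02 10:37:10",
        "2015-11-02 11:00:00",
    ]),
    "module": ["COMP30520", None, None, "COMP30520"],
    "size": [60.0, None, None, 60.0],
})

# X and Y bound the hour whose rows should be filled.
X = pd.Timestamp("2015-11-02 10:00:00")
Y = pd.Timestamp("2015-11-02 11:00:00")

# Boolean mask for the rows strictly between X and Y.
in_window = (df["time"] > X) & (df["time"] < Y)

# The row carrying the hourly observation (here: the one stamped X).
source = df.loc[df["time"] == X].iloc[0]

# Copy the hourly values onto every row inside the window.
for col in ["module", "size"]:
    df.loc[in_window, col] = source[col]

print(df)
```

Assigning through `df.loc[mask, column]` writes into the original frame, whereas assigning into a filtered copy such as `b` would not.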

Hope that helps.

Note that these functions are cut-and-paste jobs from
[1] Filter dataframe rows if value in column is in a set list of values
[2] Select rows from a DataFrame based on values in a column in pandas

draco_alpine

If I understood correctly, you want to fill all the missing values in your merged dataframe with the closest data point available in the given hour. I did something similar in the past using a variant of pandas.cut for time series, but I can't seem to find it, and it wasn't very nice anyway.

While I'm not entirely sure, fillna method of the pandas dataframe might be what you want (docs here).

Let your two dataframes be named df_hour and df_cinq, you merged them like this:

df = pd.merge(df_hour, df_cinq, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)

Then you change your index to time and sort it:

df.set_index('time',inplace=True)
df.sort_index(inplace=True)

The fillna method has an option called 'method' that can take these values:

   Method                 Action  
 pad / ffill          Fill values forward  
 bfill / backfill     Fill values backward  

(The 'nearest' option that the pandas docs list alongside these applies to reindex, not to fillna.)

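Forward filling means missing values are filled with the preceding value in the frame. The ffill method is the shorthand for fillna with method='ffill', and newer pandas releases deprecate the 'method' argument of fillna in its favour:

df.ffill(inplace=True)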

The problem with this on your data is that all of the missing data in the non-working hours belonging to the 5-minute observations will be filled with outdated data points. You can use the limit option to limit the amount of consecutive data points to be filled but I don't know if it's useful to you.
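As a rough sketch of the limit option, assuming a 5-minute cadence so that 11 filled rows cover the 55 minutes after an on-the-hour observation. The column name `occupancy` and the values are made up for illustration, and the fill is done per room so values do not leak between rooms:

```python
import pandas as pd

# Toy data: one hourly observation followed by 5-minute rows (values made up).
times = pd.date_range("2016-01-01 09:00:00", periods=16, freq="5min")
df = pd.DataFrame({"room": "B002", "occupancy": float("nan")}, index=times)
df.index.name = "time"
df.loc[times[0], "occupancy"] = 45.0

# Forward-fill within each room, but at most 11 consecutive rows,
# so the 09:00 value stops at 09:55 and never spills into the next hour.
df["occupancy"] = df.groupby("room")["occupancy"].ffill(limit=11)

print(df)
```

Rows from 10:00 onward stay NaN here because no observation exists for that hour, which is the behaviour you would want for the non-working hours.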

Here's a complete script I wrote as a toy example:

import random

import pandas as pd


hourly_count = 8      # hourly observations covering the working day
cinq_count = 24 * 12  # 5-minute samples covering one full day

# One timestamp per hour from 09:00, and one every 5 minutes from 00:02:53.
hour_rng = pd.date_range('2016-01-01 09:00:00', periods=hourly_count,
                         freq=pd.offsets.Hour())
cinq_rng = pd.date_range('2016-01-01 00:02:53', periods=cinq_count,
                         freq='5min')

roomz = 'room0 room1 secretroom'.split()

# Random hourly rows.
hourlydata = {'col1': [], 'col2': [], 'room': []}
for i in range(hourly_count):
    hourlydata['room'].append(random.choice(roomz))
    hourlydata['col1'].append(random.random())
    hourlydata['col2'].append(random.randint(0, 100))

# Random 5-minute rows.
cinqdata = {'col3': [], 'col4': [], 'room': []}
frts = 'apples oranges peaches grapefruits whatmore'.split()
vgtbls = 'onion1 onion2 onion3 onion4 onion5 onion0'.split()
for i in range(cinq_count):
    cinqdata['room'].append(random.choice(roomz))
    cinqdata['col3'].append(random.choice(frts))
    cinqdata['col4'].append(random.choice(vgtbls))

hourlydf = pd.DataFrame(hourlydata)
hourlydf['time'] = hour_rng
cinqdf = pd.DataFrame(cinqdata)
cinqdf['time'] = cinq_rng

# The outer merge keeps every row from both frames.
df = pd.merge(hourlydf, cinqdf, on=['room', 'time'], how='outer')

# Index by time, sort, then forward-fill the gaps (regardless of room).
df.set_index('time', inplace=True)
df.sort_index(inplace=True)
df.ffill(inplace=True)
print(df.loc['2016-01-01 09:00:00':'2016-01-01 17:00:00'])

Actually I was able to fix this by:

First: splitting the "time" feature in order to generate two additional columns, one for the day shown in "time" and one for the hour. I used lambda functions to get these columns (on both dataframes):

# "Mon Nov 02 10:32:06" -> date "Mon Nov 02", hour "10"
df['date'] = df['time'].map(lambda x: x[:10])
df['hour'] = df['time'].map(lambda x: x[11:13])

Based on these two new columns I modified the way the dataframes were being merged.

Here is the code I used to fix it:

dataframeFinal = pd.merge(dataframe1, dataframe2, left_on=["room", "date", "hour"],
                right_on=["room", "date", "hour"], how="outer",
                left_index=False, right_index=False, copy=False)

After this merge I ended up with duplicate time columns ("time_x" and "time_y").
So I replaced the NaN values as follows:

dataframeFinal['time_y'] = dataframeFinal['time_y'].fillna(dataframeFinal['time_x'])

Now the column "time_y" contains all the time values, with no more NaN. I do not need the "time_x" column, so I drop it from the dataframe:

dataframeFinal = dataframeFinal.drop('time_x', axis=1)
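
The same idea can be sketched end to end with parsed datetimes instead of string slices. This is only an illustration under assumptions: the sample values are made up, the year 2015 is assumed because the "time" strings carry none, `add_keys` is a hypothetical helper, and a left merge is used here (rather than the outer merge above) so that only the 5-minute rows are kept:

```python
import pandas as pd

# Hourly observations (made-up values).
hourly = pd.DataFrame({
    "room": ["B002", "B002"],
    "time": ["Mon Nov 02 10:00:00", "Mon Nov 02 11:00:00"],
    "module": ["COMP30520", "COMP30520"],
    "size": [60, 60],
})
# 5-minute wifi log rows (made-up values).
wifi = pd.DataFrame({
    "room": ["B002", "B002", "B002"],
    "time": ["Mon Nov 02 10:32:06", "Mon Nov 02 10:37:10", "Mon Nov 02 11:02:11"],
    "auth": [23, 25, 31],
})


def add_keys(frame, year=2015):
    """Hypothetical helper: parse 'time' and add 'date' and 'hour' merge keys."""
    out = frame.copy()
    parsed = pd.to_datetime(str(year) + " " + out["time"],
                            format="%Y %a %b %d %H:%M:%S")
    out["date"] = parsed.dt.date
    out["hour"] = parsed.dt.hour
    return out


# Drop the hourly 'time' before merging so no time_x / time_y pair appears.
merged = pd.merge(
    add_keys(wifi),
    add_keys(hourly).drop(columns="time"),
    on=["room", "date", "hour"],
    how="left",
)

# Every 5-minute row now carries its hour's module and size,
# so per-hour statistics can be computed directly.
stats = merged.groupby(["room", "date", "hour"])["auth"].agg(["min", "max", "median"])
print(stats)
```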
RomainD