Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY HouseID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example my second query fails to find Date_Time or Max_Date_Time. In the later case, the Ravel code it appears to not find House_Id when I run it.