2

Here's my query:

    WITH desc_table(counter, hourly, current_weather_description, current_icons, time_stamp) AS (
Select count(*) AS counter, CASE WHEN  strftime('%M',  'now') < '30' 
                THEN strftime('%H', 'now')  
                ELSE strftime('%H', time_stamp, '+1 hours') END as hourly, 
                current_weather_description,
                current_icons,
                time_stamp
                From weather_events
                GROUP BY strftime('%H',  time_stamp, '+30 minutes'), current_weather_description
                UNION ALL
                Select count(*) as counter, hourly - 1, current_weather_description, current_icons, time_stamp
                From weather_events
                GROUP BY strftime('%H',  time_stamp, '+30 minutes'), current_weather_description
                Order By counter desc limit 1
                ),
        avg_temp_table(avg_temp, hour_seg, time_stamp) AS (
        select avg(current_temperatures) as avg_temp, CASE WHEN  strftime('%M',  time_stamp) < '30' 
                THEN strftime('%H', time_stamp)  
                ELSE strftime('%H', time_stamp, '+1 hours') END as hour_seg, 
                time_stamp
                from weather_events
                group by strftime('%H',  time_stamp, '+30 minutes')
                order by hour_seg desc
                )

                Select  hourly, current_weather_description
                from desc_table
                join avg_temp_table
                on desc_table.hourly=avg_temp_table.hour_seg

Basically I have some weather data that I group into hour intervals (offset by 30 minutes) and I want to specifically count the number of times I get a particular weather description (and matching icon) in that time interval and select the weather description within that time interval with the highest occurrence (count) (desc_table). Then I want to get the average temperatures within that time segment ((avg_temp_table)(maybe I need a sub-query? To do this avg instead of how I have it) and join the two queries along their hour columns.

I want my anchor to be based off of what hour the query is made (now) and count the occurrences and then the next members will subtract an hour each time and go to the next time interval and count, etc.

Sample data, there will be a lot more rows within each time segment for a regular dataset {current_temperatures, current_weather_description, current_icons, time_stamp}:

"87"    "Rain"  "rainicon"  "2016-01-20 02:15:08"
"65"    "Snow"  "snowicon"  "2016-01-20 02:39:08"
"49"    "Rain"  "rainicon"  "2016-01-20 03:15:08"
"49"    "Rain"  "rainicon"  "2016-01-20 03:39:08"
"46"    "Clear" "clearicon" "2016-01-20 04:15:29"
"46"    "Clear" "clearicon" "2016-01-20 04:38:53"
"46"    "Cloudy" "cloudyicon" "2016-01-20 05:15:08"
"46"    "Clear" "clearicon" "2016-01-20 05:39:08"
"45"    "Clear" "clearicon" "2016-01-20 06:14:17"
"45"    "Clear" "clearicon" "2016-01-20 06:34:23"
"45"    "Clear" "clearicon" "2016-01-20 07:24:54"
"45"    "Rain"  "rainicon"  "2016-01-20 07:44:41"
"43"    "Rain"  "rainicon"  "2016-01-20 08:19:08"
"36"    "Clear" "clearicon" "2016-01-20 08:39:08"
"35"    "Meatballs" "meatballsicon" "2016-01-20 09:18:08"
"18"    "Cloudy" "cloudyicon" "2016-01-20 09:39:08"

Output is a join between the avg temperatures of the time interval (avg_temp_table) with the output of the first aggregate CTE (desc_table) {avg_temp, weather_description, current_icon}:

"87"    "Rain"  "rainicon"
"57"    "Rain"  "rainicon"
"47"    "Clear" "clearicon"
"46"    "Clear" "clearicon"
"46"    "Cloudy" "cloudyicon"
"45"    "Clear" "clearicon"
"44"    "Rain"  "rainicon"
"36"    "Clear" "clearicon"
"18"    "Cloudy" "cloudyicon"

Right now I'm getting a no such column error because my anchor comes from my weather_events table and so do my recursive members. I get the "recursive aggregate queries not supported error" when I change the recursive members from to desc_table. But I don't want to get my recursive members from the desc_table, I want to segment by hour and then go through each hour interval and get the count. I'm guessing I'm doing the anchor incorrectly to begin with also.

M. Smith
  • 379
  • 1
  • 20
  • 1
    You can pass in SQL via rawQuery. It supports basic querries, including DDL. But it may not support the full SQL language. Best bet is to try it and see. – Gabe Sechan Jan 16 '16 at 01:56
  • Well then can you tell me if my query would work for SQL? @Gabe Sechan – M. Smith Jan 16 '16 at 02:31
  • Could you provide an example (a data sample and the expected result) of what you are trying to achieve? Judging by the recursive CTE query you've posted, it seems to me recursion may be unnecessary here. – Andriy M Jan 20 '16 at 14:57
  • @AndriyM edited with sample data and expected result – M. Smith Jan 20 '16 at 15:48
  • @AndriyM I believe you are right. I think I want an ordinary CTE instead. But the documentation does say that the word "recursive" doesn't automatically make it recursive – M. Smith Jan 20 '16 at 16:31

1 Answers1

7

I am still unsure how your desc_table recursive CTE was supposed to pick the highest occurring weather description and its icon for each hour, but that is fine, because, using your verbal description, I think I have figured out a way to do the same without recursion.

First, group the results by hour and description, and count the rows in each group:

SELECT
  strftime('%H', time_stamp, '+30 minutes') AS hour,
  current_weather_description,
  current_icons,
  COUNT(*) AS event_count
FROM
  weather_events
GROUP BY
  strftime('%H', time_stamp, '+30 minutes'),
  current_weather_description

As the next step, group the results of the above query by hour and get the maximum count of events per hour:

SELECT
  hour,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour

That is still not exactly what you want, because you actually want the description and icon matching the maximum count, not the count itself. Well, this is easy to fix – just add those columns to SELECT without adding them to GROUP BY:

SELECT
  hour,
  current_weather_description,
  current_icons,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour

You still need to keep the MAX(event_count) in the the query for the trick to work. The reason it works is because in SQLite, when a SELECT statement contains a single MAX or single MIN call, the value of any selected column that is neither in the GROUP BY nor aggregated will be taken from the row matching the said MAX or MIN value. This non-standard extension to SQL is documented in the release notes for SQLite 3.7.11.

So much for desc_table. As for the avg_temp_table CTE, there seems to be nothing wrong with your current method, except I would probably use the GROUP BY expression as the hour definition instead of the CASE expression you are using, for consistency, and time_stamp seems redundant to the results, too. So the slightly modified CTE would look like this:

SELECT
  strftime('%H', time_stamp, '+30 minutes') AS hour,
  AVG(current_temperatures) AS avg_temp
FROM
  weather_events
GROUP BY
  strftime('%H', time_stamp, '+30 minutes')

And now you only need to join the two sets on the hour column and select the relevant columns for the final output:

SELECT
  t.avg_temp,
  d.current_weather_description,
  d.current_icons
FROM
  avg_temp_table AS t
  INNER JOIN desc_table AS d on t.hour = d.hour
ORDER BY
  t.hour

So here you are. And now I would like to address just one issue with regard to the resulting query, which is

Can the join be avoided?

While your approach to the solution – getting descriptions and average temperatures separately and then joining the two sets together – is straightforward and makes perfect sense, it would be nice to avoid the join and do all the calculations simultaneously. That would most likely make the query faster, because the source would be scanned just once. Can this be achieved?

As it happens, yes, it can. The principal difficulty in combining the two parts lies in the fact that the descriptions are obtained in two steps while calculation of the average temperatures is a single-step operation. Simply putting AVG(current_temperatures) into the first CTE's nested SELECT (which groups by hour and description) and then doing AVG over the results in the outer SELECT (which groups by hour) would not be equivalent, mathematically, to doing AVG once over the entire hour group.

What you need, instead, is to remember that AVG = SUM / COUNT. If you get SUM and COUNT in the first step, and then SUM of the SUMs and SUM of the COUNTs in the second step, you can just divide the first outer SUM by the second outer SUM to get your average.

Here is the new desc_table CTE modified to combine both parts of the query (so it is no longer supposed to be a CTE but the complete query), with the necessary changes highlighted in bold:

SELECT
  SUM(total_temp) / SUM(event_count) AS avg_temp,
  current_weather_description,
  current_icons,
  MAX(event_count) AS max_event_count
FROM
  (
    SELECT
      strftime('%H', time_stamp, '+30 minutes') AS hour,
      current_weather_description,
      current_icons,
      COUNT(*) AS event_count,
      SUM(current_temperatures) AS total_temp
    FROM
      weather_events
    GROUP BY
      strftime('%H', time_stamp, '+30 minutes'),
      current_weather_description
  ) AS s
GROUP BY
  hour
ORDER BY
  hour
;

Obviously, the max_event_count column is redundant to the output – and still crucial to the "greatest N per group" method the query is relying on. Personally, I would not worry about one redundant column in cases like this, but if you have strong reasons for its exclusion from the result set, you can use the above query as a derived table (yes, again) and have the outermost SELECT pull all the columns except max_event_count – for instance, like this:

SELECT
  avg_temp,
  current_weather_description,
  current_icons
FROM
  (
    SELECT
      hour,
      SUM(total_temp) / SUM(event_count) AS avg_temp,
      current_weather_description,
      current_icons,
      MAX(event_count) AS max_event_count
    FROM
      (
        SELECT
          strftime('%H', time_stamp, '+30 minutes') AS hour,
          current_weather_description,
          current_icons,
          COUNT(*) AS event_count,
          SUM(current_temperatures) AS total_temp
        FROM
          weather_events
        GROUP BY
          strftime('%H', time_stamp, '+30 minutes'),
          current_weather_description
      ) AS s
    GROUP BY
      hour
  ) AS s
ORDER BY
  hour desc
;

As you can see, the middle-tier SELECT now includes hour as well, which is needed for ORDER BY at the outermost level. (I am assuming here that the order is important for the calling application.)

It only remains for me to mention one difference between the results of both methods. In the first one, AVG(current_temperatures) gives you a floating-point result. In the second one, SUM(total_temp) / SUM(event_count) gives you an integer. As your expected results show integer averages, I guess it should not be an issue. But if you later decide you want more precision for your averages, just keep in mind that you can replace the SUM function in either SUM(total_temp) or SUM(current_temperatures) with the TOTAL function, which returns the same value as SUM but the result is always a real. Dividing a real by an integer yields a real in SQLite, so using TOTAL you will get the same results as with AVG in the first method.

Community
  • 1
  • 1
Andriy M
  • 76,112
  • 17
  • 94
  • 154
  • Wow you are a boss. Thank you! Someone previously told me to try CTEs and I'm very new to SQL so I was trying to make it work. I really appreciate the wonderful answer. – M. Smith Jan 22 '16 at 14:31