0

Currently compiled a dataset with the following data.

Year, runner, total.

Grouped into the following result.

enter image description here

I would like to programatically determine if a runner (can be 1 - 30 or less) represented as 1-4 above, has a total value > 0 in every year with a max of two runners. Return the runner(s) or nothing if no matches found.

Therefore either a value greater than 0 in every year, or with a max of two runners added together has a combined total of greater than 0 in every year between 2015 and 2019. If the query returns more than one result I could write this to a table and return the max value afterwards.

I could write something like add runner 1+2, 1+3, 1+4 etc for every possible combination to a table but this seems overkill. I cannot sum up all years and then use the max funtion for the biggest value as this doesn't mean all the years is greater than zero. Is there a easier/better way of achieving this?

Can this be done with a MySQL query or another language that is worth learning for this kind of data analysis?

Would like to filter unwanted scenarios where either RUN1, or RUN2 has all negative values but still satisfies the requirement of RUN1 + run2 is greater than zero for all years.

Currenly working on where RUN1 is greater or equal to three years where the value is greater than zero but open to suggestions.

enter image description here

  • ***Can this be done with a MySQL query or another language that is worth learning for this kind of data analysis?*** Yes It can be done in SQL; though not simply with this design. 1st must unpivot the runners so it's year runner value. Then we'd cross join this data set so we get every runner to ever other runner combination. Then we'd write a check to see if all values for a given runner are > 0 and have 5 entries (15,16,17,18,19 yrs) OR that when we cross joined the Sum of runner and runner is < 0 for 5). so it can be done but it would take *some* work. – xQbert Mar 31 '22 at 21:33
  • How does one Unpivot? https://stackoverflow.com/questions/15184381/mysql-how-to-unpivot-columns-to-rows – xQbert Mar 31 '22 at 21:38
  • Thanks. I can create a new dataset rather than unpivit. A runner can have multiple results in one year so I can just create a new dataset with the same columns as above but with the sum of each runner total for each year. This pivot was just to make it easier to do it manually. I’ll try the cross join on this data to see if it can create a combination of all runner combinations. How would I limit it to a max two runner combinations as I wouldn’t want it trying to combine more than 2 runners and I would need to know what combination it used for that result. – cathalobrien Mar 31 '22 at 23:00
  • 1
    What would the expected output look like? – FanoFN Apr 01 '22 at 00:43
  • @cathalobrien "how would I limit it to a max two runner combinations" The cross join by it's nature will limit it to Just 2 runners. a Join of Runner to Runner will only be 2 runners; thus the nature of a cross join. all records to all records but only once. – xQbert Apr 01 '22 at 15:03
  • Thanks, that helps in case I need to expand analysis in the future. – cathalobrien Apr 02 '22 at 23:27

1 Answers1

1

Demo Unclear Expected Result

I think this is close.

  • note baseData is a Common table expression just mocking up sample data
    • it can be omitted and the next lines adjusted to reference your tablename.
  • I'm not exactly sure what output you're after
  • how to handle the conditions such as my "2" which when combined with another runner will cause multiple users to be "valid"
  • If it's Runner 1+ Runner 2 value > 0 or Runner 1 > 0 or Runner 2 > 0...
    • if the former need more test data staged to sample edge cases.

Cases I'm not sure how to handle

  • If a runner by themselves fulfills the requirement their existence means all combinations of "2" runners involving that runner would also meet the requirement
  • if two runners together meet the requirement of value > 0 for all years... then the inverse pair would as well (1,2), (2,1) I think I can eliminate this by ensuring matches only occur when Runner1<=Runner2 (Commented out below)

.

with BaseData as (
SELECT 1 Runner, 2015 YR, -8.29 value UNION ALL
SELECT 1 Runner, 2016 YR, -8.29 value UNION ALL
SELECT 1 Runner, 2017 YR, -8.29 value UNION ALL
SELECT 1 Runner, 2018 YR, -8.29 value UNION ALL
SELECT 1 Runner, 2019 YR, -8.29 value UNION ALL
SELECT 2 Runner, 2015 YR, 1 value UNION ALL
SELECT 2 Runner, 2016 YR, 1 value UNION ALL
SELECT 2 Runner, 2017 YR, 1 value UNION ALL
SELECT 2 Runner, 2018 YR, 1 value UNION ALL
SELECT 2 Runner, 2019 YR, 1 value UNION ALL
SELECT 3 Runner, 2015 YR, 1 value UNION ALL
SELECT 3 Runner, 2016 YR, -1 value UNION ALL
SELECT 3 Runner, 2017 YR, -1 value UNION ALL
SELECT 3 Runner, 2018 YR, -1 value UNION ALL
SELECT 3 Runner, 2019 YR, -1 value UNION ALL
SELECT 4 Runner, 2015 YR, -1 value UNION ALL
SELECT 4 Runner, 2016 YR, 1 value UNION ALL
SELECT 4 Runner, 2017 YR, 1 value UNION ALL
SELECT 4 Runner, 2018 YR, 1 value UNION ALL
SELECT 4 Runner, 2019 YR, 1 value),
CTE as (
SELECT A.Runner RUN1, B.Runner RUN2,
sum(case when A.value>0 and A.Runner = B.Runner then 1 else 0 end)  as isValidYear,
       count(A.YR) TotalYears,
       sum(case when A.Runner=B.Runner then A.value else 0 end) as TotalValue,
/*not sure if this should be (A.Value > 0 or B.Value > 0) OR (A.Value+B.Value>0)*/
       sum(case when A.runner<> B.Runner and (A.Value>0 OR  B.Value > 0) then 1 end) as isValid2PersonYR 
FROM BaseData A
CROSS JOIN BaseData B
WHERE A.YR=B.YR
/* AND A.Runner<=B.Runner*/
GROUP BY A.Runner, B.Runner
Order by A.Runner)
SELECT RUN1, Run2, sum(isValidYear) SumIsvalidYear, Sum(isValid2PersonYR) SumIsvalid2PersonYr, TotalYears
FROM CTE
GROUP BY Run1, Run2, TotalYears
HAVING SumisValidYear = Totalyears OR SumisValid2personYR=TotalYears
ORDER BY SumIsValid2PersonYr, Run1

Giving us:

Note the 1st line the runner 1 and 2 is the same, there is no 2nd person, so they did it on their own.

+------+------+----------------+---------------------+------------+
| RUN1 | Run2 | SumIsvalidYear | SumIsvalid2PersonYr | TotalYears |
+------+------+----------------+---------------------+------------+
|    2 |    2 |              5 |                     |          5 | 
|    1 |    2 |              0 |                   5 |          5 |
|    2 |    1 |              0 |                   5 |          5 |
|    2 |    3 |              0 |                   5 |          5 |
|    2 |    4 |              0 |                   5 |          5 |
|    3 |    2 |              0 |                   5 |          5 |
|    3 |    4 |              0 |                   5 |          5 |
|    4 |    2 |              0 |                   5 |          5 |
|    4 |    3 |              0 |                   5 |          5 |
+------+------+----------------+---------------------+------------+

Yes the combination 4,2 and 2,4 are the same, and we could eliminate them as needed: I think this gets you close but I need to understand how to handle situation for runner 2 or if we have a situation where multiple runners would be returned as they "complete" a runner's series, how do you want those?

We could use this result set to exclude all occurrences of a 2 except that of RUN1: 2 matching Run2:2 which would give us just 3,4 and 4,3 which do infect complete each other.

You can see this is "close" but not understanding how you want to handle these situations I stopped here.

If we include the commented out line in the SQL we get:

+------+------+----------------+---------------------+------------+
| RUN1 | Run2 | SumIsvalidYear | SumIsvalid2PersonYr | TotalYears |
+------+------+----------------+---------------------+------------+
|    2 |    2 |              5 |                     |          5 |
|    1 |    2 |              0 |                   5 |          5 |
|    2 |    3 |              0 |                   5 |          5 |
|    2 |    4 |              0 |                   5 |          5 |
|    3 |    4 |              0 |                   5 |          5 |
+------+------+----------------+---------------------+------------+

and these results make sense to me. the questions asked at the top of this response apply.

  • should (1,2),(2,3),(2,4) be in the results? (they only exist because (2) is complete by itself)
  • should results just be (2,2) since we had a runner meet the condition?
  • should results be (2,2) and (3,4) since you want to see all that met the conditions?
xQbert
  • 34,733
  • 2
  • 41
  • 62
  • Feel free to update/adjust values in demo or test edge cases and post updated link with an issue, or ask any questions you have about what this is doing! – xQbert Apr 01 '22 at 15:46
  • Thanks for this, works great. I'm using a.value+b.value as I just need the additional of both be greater than zero, instead of each one. Its works in my test case's to date, with the only issue to note that it returns a combination of RUN1 = 2, and RUN2 = 3 aswell as RUN1 = 3, and RUN2 =2 which is the same thing but I can look into filtering this out in the results. Will keep doing further testing but this is great help, thanks again. – cathalobrien Apr 02 '22 at 23:45
  • 1
    Note you can eliminate some of the duplicates by using < instead of = see 2nd commented out line. Wanted to prove to myself it would work. :) – xQbert Apr 03 '22 at 12:32
  • I want to also try filter unwanted combinations when using A.Value+B.Value>0 where one runner has all negative years but is small enough to still be valid when another runner is greater than zero for all years even after negating the other runners values. Edited my original post with sample scenario if you have any ideas. thanks. – cathalobrien Apr 04 '22 at 00:08
  • https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=d423aab20739afb9eeda419ee94781cf maybe? adding a having clause with conditional aggregration – xQbert Apr 04 '22 at 01:39
  • But I'm not really understanding what you're trying to achieve. – xQbert Apr 04 '22 at 01:48
  • That works, which just gives runner 2 as valid on its own, instead of runner 2, and runner 2 and 15 as valid. I'm wondering if runner two was all negative and runner 15 wasn't, would it still work. Just trying to understand the having statement. – cathalobrien Apr 04 '22 at 12:58
  • SQL isn't top down execution. It's more of an inside out, close to: FROM, Joins, Where, group by, select, Having, Order by (with some possible blurring near group by and select) and not all engies are the same; so this is an approximation. So the having takes place after grouping has occurred so totals can be evaluated/eliminated. – xQbert Apr 04 '22 at 13:24
  • In this case I wanted to eliminate any users who had a count of negative numbers matching the distinct count of years in the query I hard coded the years; but i'm pretty sure we could have used my TotalYears as well https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=4a0dbc01939b31244bc2d4db9aee4f72 You'll notice I added a distinct to the count in the CTE, and I then swapped out the hard coded 5 for the calculated `TotalYears` Test your edge cases, I'm far from perfect and I yet don't fully understand all the requirements. Love the questions! Don't be afraid to ask! – xQbert Apr 04 '22 at 13:26
  • to answer your direct question: substitute the data. In my mind it should work since it's just looking at one of the two users and saying, you can't be all negative for all years and still be valid. or... you can't be "HAVING" a count of all negative years for the results to include you. Those "case expressions" Case when... END are what's helping us get there. Now... there may be an "easier/better way" but I start with something, then make it work, then improve if needed. I think we're on step 3. – xQbert Apr 04 '22 at 13:29
  • If you change your runner 5 values to be -0.5 for all years in your example, you will notice that runner 2 (RUN1), and runner 5 (RUN2) appear as a valid combination. I don't want this to happen as all values for runner 5 has a negative value. SumIsvalid2PersonYr is equal to 5 which I'm happen for this to be something other than 5 or ideally the combination doesn't appear in the results at all. – cathalobrien Apr 04 '22 at 13:50
  • 1
    Gotcha: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=d3da6398e36de3dcbab76a1e7f5b84b3 needed to add an And to the having to account for the 2nd runners totals. So the having now looks like: `HAVING sum(case when A.value <0 then 1 else 0 end)< TotalYears and sum(case when B.Value <0 then 1 else 0 end) < TotalYears` individually ensuring RUN1 & Run2 don't all have negative numbers for all years; or they get tossed out. Let me know if this encompasses all the needs; I'll update my answer accordingly. – xQbert Apr 04 '22 at 14:40
  • 1
    Thanks, that change has been working in my tests. – cathalobrien Apr 06 '22 at 10:56