Good day!
I have a challenge/idea to improve the efficiency of one of the most widely used audio listening test formats – the “MUSHRA-like” multiple-stimulus grading task (all details about it are freely available in the ITU recommendation: https://www.itu.int/rec/R-REC-BS.1534-3-201510-I/en).
[Image: Hulti-Gen MUSHRA interface example – “Rate the basic audio quality of each stimulus”, one vertical slider per stimulus (one labelled Ref-sound), with current ratings of 50%, 75%, 100%, 35%, 5%, 50% and 35% shown underneath.]
Problem
MUSHRA test participants are presented with about 10 different sounds to compare with one another, in no particular order. While the interface is great for fine comparisons, it is terrible for the initial stage, when you just need to make sense of the presented alternatives and give them a reasonable initial grading relative to the references.
For example, a study comparing the sound of different headphones could present the following stimuli on a single page (randomized before presentation):
| Stimulus nr. | Ground-truth MUSHRA score we expect to measure |
| --- | --- |
| s1 | 95% |
| s2 | 90% (top-quality hidden-reference anchor) |
| s3 | 85% |
| s4 | 85% |
| s5 | 75% |
| s6 | 60% |
| s7 | 50% (main reference, the “benchmark to beat”) |
| s8 | 40% |
| s9 | 10% (low-quality hidden-reference anchor) |
Idea
Before throwing people into the “listen and rate how you want” MUSHRA interface, the idea is to introduce a preliminary paired-comparison step in which a simple algorithm guides the user to roughly rank the stimuli before the 2nd half of the process.
Preliminary paired comparisons could have the following reply options:
“Please compare A with B”
◦ A >> B (A much better than B)
◦ A > B (A somewhat better than B)
◦ A ~= B (about equal)
◦ A < B (A somewhat worse than B)
◦ A << B (A much worse than B)
The 2nd half of the process is the normal “MUSHRA-like” interface, except that all stimuli are already pre-sorted and pre-graded based on the preliminary results. Participants can then listen to the stimuli in sequence and adjust the automatically proposed scores as needed.
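To make the hand-off concrete, here is a rough Python sketch of the flow I picture (everything here is a placeholder, in particular `ask_listener`, which stands in for the actual A/B prompt): a naive binary insertion that both orders the stimuli and keeps every graded reply as a coarse distance signal for the pre-grading.

```python
# Sketch only. ask_listener(a, b) is a placeholder for the real UI
# prompt; it must return one of the five reply options above, always
# read as "a compared to b".
GRADE = {"A>>B": 2, "A>B": 1, "A~=B": 0, "A<B": -1, "A<<B": -2}

def rough_rank(stimuli, ask_listener):
    """Order stimuli worst-to-best by binary insertion, collecting the
    graded outcomes as a coarse distance signal for later pre-grading."""
    ranked = [stimuli[0]]
    grades = []                      # (a, b, grade) triples
    for s in stimuli[1:]:
        lo, hi = 0, len(ranked)
        while lo < hi:               # binary search for s's slot
            mid = (lo + hi) // 2
            g = GRADE[ask_listener(s, ranked[mid])]
            grades.append((s, ranked[mid], g))
            if g > 0:                # s judged better: search higher
                lo = mid + 1
            else:                    # worse or about equal: search lower
                hi = mid
        ranked.insert(lo, s)
    return ranked, grades
```

A convenient side effect: while a stimulus is being inserted it stays fixed on one side of every comparison, which already satisfies the first point in the list below.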
Some specifics for the initial paired comparisons:
- Only one of the two A/B stimuli should change from one comparison to the next. This reduces listening fatigue compared to drawing both stimuli at random for every comparison. Any biases introduced by this logic can be ignored.
- The ranking results will be scaled into MUSHRA score values relative to the included low-, mid- and high-quality anchors, which have pre-defined ratings (see the sketch after this list).
- Consequently, it is important to know how different one stimulus is from another (not only the ranking but also the effect size).
- Most lapses of confidence, biases, undecided cases and logical errors can be accepted in favor of the fewest possible paired comparisons prior to the 2nd half of the process.
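As a sketch of the scaling step (assuming, per the example table, that s2, s7 and s9 are the anchors with pre-defined scores of 90, 50 and 10), a piecewise-linear interpolation pinned at the anchors could turn any raw rank position into a proposed MUSHRA score:

```python
from bisect import bisect_left

def anchor_scale(raw, anchor_scores):
    """Piecewise-linear map from raw rank positions to MUSHRA scores,
    pinned at the anchors' pre-defined scores and clamped to 0..100."""
    pts = sorted((raw[a], score) for a, score in anchor_scores.items())
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]

    def to_score(x):
        i = min(max(bisect_left(xs, x), 1), len(xs) - 1)  # segment index
        t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
        return max(0.0, min(100.0, ys[i - 1] + t * (ys[i] - ys[i - 1])))

    return {stim: to_score(x) for stim, x in raw.items()}

# Hypothetical raw positions from the preliminary step (rank index
# stretched by the accumulated >>/> gaps), with s2/s7/s9 as anchors:
raw = {"s9": 0, "s7": 3, "s5": 5, "s2": 8}
print(anchor_scale(raw, {"s2": 90, "s7": 50, "s9": 10}))
# -> {'s9': 10.0, 's7': 50.0, 's5': 66.0, 's2': 90.0}
```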
Help
The idea is relatively simple, but I am still struggling to define an efficient algorithm. I would appreciate any suggestions for a simple state machine that produces a ROUGH ranking of the stimuli in the fewest paired comparisons. Most methods I looked at are made for more complex scenarios and often require quite advanced algorithms that are hard to re-implement from scratch without libraries. For example, I looked into ranking from pairwise comparisons; https://en.wikipedia.org/wiki/Elo_rating_system ; https://en.wikipedia.org/wiki/Scale_(social_sciences)#Comparative_scaling_techniques ; https://en.wikipedia.org/wiki/Ranked_pairs ; and many more pages.
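For a sense of scale (my own back-of-the-envelope figure, not taken from any of the pages above): even the naive binary insertion sketched earlier needs up to 25 comparisons for 10 stimuli, which is the budget a smarter state machine would have to match or beat.

```python
import math

# Worst case for binary insertion: placing the k-th stimulus into an
# already-sorted list of k - 1 costs ceil(log2(k)) comparisons.
n = 10
print(sum(math.ceil(math.log2(k)) for k in range(2, n + 1)))  # -> 25
```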
If such a rough paired-comparison stage is added to MUSHRA, we can significantly speed up testing: it ensures all participants use an efficient strategy to arrive at their results, makes the whole process more fun, and mitigates scaling biases by suggesting initial scores based on a common algorithm.
(so far my concepts are too broken to present for feedback)