Lies, Damned Lies, and Statistics

Original Post
Brian Adzima · · San Francisco · Joined Sep 2006 · Points: 560

About a year and a half ago I was wondering what the standard deviation was of the grades people assign to climbing routes. This was mostly motivated by curiosity about how this varied with region, climbing style, climbing grade, and source website.

One sleepless night I pulled the suggested ratings for 40 different routes from UT, CO, WY, WV, and KY off of MP.com. These routes ranged from 5.10a to 5.13a and had between 7 and 99 suggested ratings each (1260 ratings total). I then found the mode of each route's ratings, and looked at the distribution of ratings in one letter grade increments above and below the mode. For example, by this approach a rating of "-4" could either be a route with a mode rating of 5.11a assigned a 5.10b/c (or 5.10 on the -/+ YDS ranking) or a 5.13a assigned a 5.12b/c (or 5.12). This approach gave the top series of histograms showing the distribution of suggested ratings relative to the mode.
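
The mode-and-offset bookkeeping described above can be sketched in Python. This is only an illustration, not the tool actually used for the analysis: the scale (slash grades counted as their own steps, which is what makes 5.11a to 5.10b/c come out as -4), the function name, and the sample ratings are all assumptions.

```python
from collections import Counter

# Assumed ordinal scale: each number grade has letters a-d, with slash
# grades (a/b, b/c, c/d) counted as intermediate steps.
STEPS = ["a", "a/b", "b", "b/c", "c", "c/d", "d"]
SCALE = [f"5.{n}{s}" for n in (10, 11, 12, 13) for s in STEPS]
INDEX = {grade: i for i, grade in enumerate(SCALE)}

def deviations_from_mode(ratings):
    """Return each rating's offset, in scale steps, from the route's modal rating."""
    idx = [INDEX[r] for r in ratings]
    mode = Counter(idx).most_common(1)[0][0]  # ties go to the first value seen
    return [i - mode for i in idx]

# Hypothetical ratings for a single route (not data from this thread)
ratings = ["5.11a"] * 5 + ["5.10d"] * 2 + ["5.11b", "5.10b/c"]
print(deviations_from_mode(ratings))
```

Pooling these offsets across routes gives the kind of histogram described above.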

Grades of routes normalized from the mode

Overall the data appears symmetric. However, there may be a slight skew to the data when the routes are broken down further. Trad climbs appear to be rated with a slight skew towards higher grades, and sport climbs with a skew towards lower ratings. Western climbs appear to be rated with a slight skew towards higher grades, and eastern climbs with a skew towards lower ratings. 5.11's seem skewed towards lower grades, while everything else is skewed towards higher grades. However, I am not sure how to test for the significance of the skew, so I won't draw any firm conclusions here.
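
One simple way to put a number on the skew question is an exact sign test: count the ratings above the mode and the ratings below it, and ask how surprising that split would be if both directions were equally likely. The sketch below is a swapped-in suggestion, not something from the original analysis. It ignores ratings that sit on the mode, ignores how far off each rating is, and treats every rating as independent, which is a strong assumption when many ratings come from the same route. The counts in the example are made up.

```python
from math import comb

def sign_test_p(n_above, n_below):
    """Two-sided exact binomial p-value for H0: above and below the mode are equally likely."""
    n = n_above + n_below
    k = min(n_above, n_below)
    # Probability of a split at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts of ratings above and below the mode
print(sign_test_p(60, 40))
```

A small p-value would suggest the asymmetry is unlikely to be chance alone, under those assumptions.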

Next, I prepared normal probability plots of the data, and realized the data was very non-normal. Compared to a normal distribution the deviation decreases much faster. There appear to be three regions to the data: the mode, grades within 2 letter grades of the mode, and grades deviating more than 2 letter grades from the mode. This behavior appears even when routes are separated by their grade, type, or location.

I can see a couple of explanations. One is that the non-normal distribution results from three different groups of people rating routes. 1) The Parrots. This is a large group of people suggesting route ratings from guidebooks, MP.com, and other climbers. This would account for the large number of ratings of the mode grade. 2) The people actually thinking about the rating and suggesting a grade they personally think is correct. This group has a standard deviation of 0.9 letter grades, which seems reasonable. I can accept that 95% of the time the route "feels" within 2 letter grades of the accepted grade. 3) The clueless. This is probably 5.14 climbers warming up on 5.10's, newb's on anything "wicked" hard, and people not rating the correct route/pitch when they are lost, off route, or forgetting the upper pitches (Genesis in Eldorado). I know I have had days when I have sent a 5.12 only to fall on a 5.10, or accidentally sandbagged a partner on a 5.10c (well it felt like 5.8!).
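
The three-group idea can be played with in a quick simulation. This is a rough sketch under stated assumptions: only the 0.9 letter grade standard deviation for the second group comes from the post; the group proportions, the spread of the "clueless" group, and all function names are invented for illustration.

```python
import random

def simulate_mixture(n, p_parrot, p_clueless, sd_thinker=0.9, sd_clueless=3.0, seed=0):
    """Draw n rounded deviations from the mode from a three-group mixture."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.random()
        if u < p_parrot:                      # parrots repeat the accepted grade
            out.append(0)
        elif u < p_parrot + p_clueless:       # clueless: wide scatter (assumed sd)
            out.append(round(rng.gauss(0, sd_clueless)))
        else:                                 # thinkers: sd of 0.9 letter grades
            out.append(round(rng.gauss(0, sd_thinker)))
    return out

def excess_kurtosis(xs):
    """Population excess kurtosis; 0 for a normal distribution."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

# Hypothetical proportions: 40% parrots, 5% clueless, the rest thinkers
sample = simulate_mixture(1260, p_parrot=0.4, p_clueless=0.05)
print(excess_kurtosis(sample))
```

A clearly positive excess kurtosis means a sharper peak and heavier tails than a single normal distribution would give, which is the shape this explanation predicts.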

A second explanation is that climbing grades may be more like a log scale than an arithmetic scale, and this causes the distributions to be non-normal. I would suspect this would cause the data to be skewed towards lower grades, but maybe I am missing something here; neither mathematics nor statistics is my specialty.

MP.com, it has been a while since I have been here, but I'd be curious for your comments, detailed technical analyses, non-relevant asides, animated .gifs, or general internet hate.

Dan Bachen · · Helena, MT · Joined Mar 2010 · Points: 1,083

Coming from a field (ecology) where data can be a bit noisy, I would say that your distributions of deviation from the mode look normal, especially for a sample size of 1200. An exercise I have found useful to get an idea of whether data are approximated by a distribution is to simulate data with R or a similar stats program using your parameters (n= 40, sd=?) and assumed distribution (Gaussian) and look at histograms to see if they are similar.
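
A minimal sketch of that simulation exercise, written in Python rather than R. The parameters are assumptions: n=1260 is the total number of ratings from the original post, and sd=0.9 is borrowed from the post's estimate for people who actually think about the grade; swap in whatever values fit your data.

```python
import random
from collections import Counter

def simulate_rounded_gaussian(n, sd, seed=0):
    """Draw n Gaussian deviations and round each to the nearest whole step."""
    rng = random.Random(seed)
    return Counter(round(rng.gauss(0, sd)) for _ in range(n))

counts = simulate_rounded_gaussian(n=1260, sd=0.9)

# Crude text histogram: one '#' per 10 simulated ratings
for step in sorted(counts):
    print(f"{step:+d} {'#' * (counts[step] // 10)}")
```

Comparing this histogram by eye to the real one shows whether the real data has more weight on the mode or in the tails than a plain Gaussian would.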

I am having a little trouble wrapping my head around your second set of plots.
What are your axis labels?

Could you report your SDs for the plots?

I would be curious to see residual vs. fitted plots for a model of your data with state and climb type as covariates and deviation from expected grade as your response. Not sure how comfortable you are with stats, but a mixed effects model with route as a random effect may be a good way to handle repeated measures on route. Something like:
DEG ~ State + ClimbType + StatedGrade + (1|RouteName)

Really cool analysis! Also if you are willing it would be great to see the raw data.

SteveF · · Fort Collins, CO · Joined Aug 2007 · Points: 32

I agree with Dylan that 40 routes is kind of a small sample and you're not likely capturing the variability in the population. There can be a fair amount of variation in grading depending not just on location but also on the age of a route. For instance a lot of 'classic' routes in Colorado, like those established by Layton Kor, are pretty sandbagged (at least in my opinion) when compared to the accepted rating of many recently established routes. If you're going to try any kind of regression modelling then I'd suggest including route age as an independent variable.

I think your interpretation is pretty reasonable for the data that you have. The parrots, the thinkers, the clueless. Very nice.

Another aspect that would be interesting to look at is people's ratings outside of their home state. If a 5.11 sport climber comes to Vedauwoo for the first time they may be sorely disappointed at their inability to climb 5.8.

It's too bad that MP doesn't have an API or allow web scraping. We could do more analyses such as this. If you're going to do any more of these analyses could you add some more labels? I found these a bit difficult to interpret at first. Try using something like R or python in a jupyter notebook instead of JMP.

Tom Sherman · · Austin, TX · Joined Feb 2013 · Points: 433

You forgot:

4. The people who have no idea on grades, but have a range of comparable experiences. I.E. this 5.10 felt easier than other 5.10's I've been on, it's 5.9/ This 5.10 felt really easier, 5.8+

Really interesting write-up.

Reggie Pawle · · Boston, MA · Joined Nov 2010 · Points: 5

yeah I'm not following the normal distribution plot either.

I'd recommend converting the grades to a numerical scale (aussies have it figured out, man), that way it's easy to do transformations as necessary.

also 40 routes is a good start, but considering the number of dimensions you're looking at (grade, sport v trad, location), you need more to actually draw inferences. or, just stick with one dimension. one big problem I see is that across different grade levels, it's a mighty strong assumption that variance will be constant.

it's also way too soon to jump to explanations :-) start with a simple hypothesis first and do some actual tests. the simplest one I can think of is that SD for trad route grades > SD sport route grades. before that though, you need to transform your data to fit some kind of distribution; doesn't have to be normal to do a test!
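
One way to test "SD trad > SD sport" without committing to any particular distribution is a permutation test. To be clear, this is a swapped-in technique, not something anyone in the thread ran, and the function name and sample numbers are hypothetical. It assumes both groups are centered at about the same place (reasonable for deviations from the mode) and it treats ratings as exchangeable, which ignores the clustering of ratings within routes.

```python
import random
from statistics import pstdev

def perm_test_sd(a, b, n_perm=2000, seed=0):
    """One-sided permutation p-value for H1: sd(a) > sd(b)."""
    rng = random.Random(seed)
    observed = pstdev(a) - pstdev(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = pstdev(pooled[:len(a)]) - pstdev(pooled[len(a):])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical deviations from the mode, in letter-grade steps
trad = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2, 3]
sport = [-1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(perm_test_sd(trad, sport))
```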

can you post your data? might be fun for others to wrangle with.

Stagg54 Taggart · · Unknown Hometown · Joined Dec 2006 · Points: 10
Tom Sherman wrote:You forgot: 4. The people who have no idea on grades, but have a range of comparable experiences. I.E. this 5.10 felt easier than other 5.10's I've been on, it's 5.9/ This 5.10 felt really easier, 5.8+ Really interesting write-up.
Isn't it all really based on a comparison anyway? And isn't the standard Yosemite?- hence YDS.
michael s · · Denver, CO · Joined Apr 2012 · Points: 80

Big ups for combining statistics with climbing. How did you decide which routes to pick? Was your sample selected according to some scientific methodology?

teece303 · · Highlands Ranch, CO · Joined Dec 2012 · Points: 596

This is awesome.

Anchoring bias plays a big part in reducing deviation.

Look it up if you've never thought about it. Once a route is assigned a grade, people will, in general, have a very hard time suggesting a grade very far from the assigned grade, for fundamental psychological reasons. And that's true even if they think the assigned grade is just plain wrong.

I've seen this in action. A climber thinks she is warming up on a 5.9+. The route is VERY hard. Halfway up, she and the belayer think this is more like 10+ or 11-. Pretty stiff for a 9.

When they figure out what route they were on, it's a 12a. This climber climbs 12a, so she knows what that feels like. But because she was anchored to the belief that the route was a 9+, she was unwilling to de-anchor and stray that far from the assigned rating.

It's a fascinating psychological problem.

So not only are grades somewhat subjective (although, they are nowhere near as subjective as many climbers imply), and a grade will vary by climber (some routes exploit our weakness or play to our strengths, and thus are legitimately different to different climbers), but ALSO, the very act of assigning a number to them makes it difficult to get an "accurate" grade of the route, if that initial number is somehow "wrong."

Dan Bachen · · Helena, MT · Joined Mar 2010 · Points: 1,083

So I've been thinking more about the potential analysis.
It is helpful to explicitly state what hypothesis you want to test prior to starting.

If you are interested in the variation of ratings around the mode to get at how variable ratings are, you may consider calculating the coefficient of variation for each climb (after standardizing ratings around the mode, or converting to a numeric scale, as mentioned: OZ grading or even 11a to 11.25). Then run a linear model to get at/account for the effect of climb type, state, grade, etc. This also gets around the problem of repeated measures on individual climbs. For interpretation you will have to be cautious, since I assume that this is not a random sample of climbs or ratings of these climbs and results are observational with no random assignment. At best you could say that any effect observed is valid only for climbs tested.
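
A rough sketch of the numeric conversion and per-climb coefficient of variation. Only the "11a to 11.25" mapping comes from the post; the offsets for b, c, and d, the midpoint rule for slash grades, and the function names are assumptions. The CV here is computed on the numeric grade itself, because a CV on values centered on the mode would divide by a mean near zero.

```python
from statistics import mean, stdev

# Assumed equal spacing, extrapolated from "11a to 11.25"
LETTER_OFFSET = {"a": 0.25, "b": 0.50, "c": 0.75, "d": 1.00}

def yds_to_number(grade):
    """Convert e.g. '5.11a' to 11.25; a slash grade takes the midpoint of its letters.

    Only handles two-digit letter grades such as 5.10a through 5.15d.
    """
    number = int(grade[2:4])
    letters = grade[4:].split("/")
    return number + mean(LETTER_OFFSET[letter] for letter in letters)

def coefficient_of_variation(ratings):
    """Sample standard deviation divided by the mean, for one climb's ratings."""
    values = [yds_to_number(r) for r in ratings]
    return stdev(values) / mean(values)

# Hypothetical ratings for one climb
print(coefficient_of_variation(["5.11a", "5.11a", "5.11b", "5.10d", "5.11a/b"]))
```

One CV per climb then becomes a single response value per route for the linear model.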

I think that the sample size of 40 could be enough to draw conclusions from, given how tight the data are in the provided figures, although it's hard to tell with all categories lumped together, and a lack of a figure showing the distribution of data within each climb.

Kyle Edmondson · · Unknown Hometown · Joined Nov 2012 · Points: 250

A couple of comments. Your data is non-continuous, and in a very small range (your numbers rarely rose above +/- 2, and never above +/-5) which contributes to the shape of the distribution.
More importantly, I am unclear on what you are trying to show. Simply characterizing the distribution of people's opinions on grades doesn't seem to answer a question. Do you have any specific questions that you are hoping to resolve (or does this set of data raise any from other readers)?

reboot · · . · Joined Jul 2006 · Points: 125
Brian Adzima wrote:I then found the mode of each route's ratings...
Why mode?
michael s · · Denver, CO · Joined Apr 2012 · Points: 80

I will again state that if your sample wasn't scientific, any results are meaningless. This seems to be the biggest problem I have noticed with people doing stats in the "real world".

Dan Bachen · · Helena, MT · Joined Mar 2010 · Points: 1,083

"I will again state that if your sample wasn't scientific, any results are meaningless. This seems to be the biggest problem I have noticed with people doing stats in the "real world"."

Well there goes a majority of the biological/ ecological studies...
I believe (and would be supported by a majority in my field) that observational studies have a place in science, as long as the analysis is interpreted within the context of the methods. For example, if a study randomly selects subjects and randomly assigns a treatment, causal inference can be drawn between effect and treatment across a larger population. If subject selection is non-random but treatment assignment is random, the results are interpreted only within the population of study subjects, but the effect of the treatment can still be interpreted as causal. If treatment is not randomly assigned to a randomly selected population, the reverse holds true.
In this case, where subjects and treatment are observed but not assigned, the results of an analysis could establish a correlation between variables within the population of study subjects. If there is good evidence that the population (climbs used) is representative of a larger population, the investigator may choose to postulate that this correlation may be valid across this larger population.

Unfortunately not all questions can be answered in an experimental context, but as long as the investigator correctly identifies the limitations on inference dictated by the study design, the results of observational studies can be useful in a real world context.

SteveF · · Fort Collins, CO · Joined Aug 2007 · Points: 32
Kyle Edmondson wrote:Simply characterizing the distribution of people's opinions on grades doesn't seem to answer a question. Do you have any specific questions that you are hoping to resolve (or does this set of data raise any from other readers)?
I think the interesting questions that can be addressed by this type of analysis are:
1. What is the variability in people's opinion for most climbing routes? This is given that the selected climbs are representative of most routes.

2. How is that variability different for trad vs sport climbs?

3. How does that variability change by the accepted level of difficulty?

4. Is there a spatial pattern to the variability?

I can think of a handful of other related questions, but it seems like these questions are what the original poster was probing.
SteveF · · Fort Collins, CO · Joined Aug 2007 · Points: 32

Also, you always hear people say "Grades are so subjective." But how subjective are they really? Based on these graphs I'd say there is a pretty strong consensus for most routes. If you think a route is much harder or easier than it's rated then maybe you don't have the right beta or you used poor technique.

I often think glassy slabs should be rated harder, but it's likely that if I spent enough time climbing these types of routes then I'd agree with the rating. Route grading is supposed to be more indicative of the physical (i.e. strength) challenge given that you have the proper technique, beta, pro, etc. Right?

michael s · · Denver, CO · Joined Apr 2012 · Points: 80
Dan Bachen wrote:Well there goes a majority of the biological/ ecological studies... I believe (and would be supported by a majority in my field) that observational studies have a place in science
I am not referring to an observational study vs. an experiment.

I am referring to a sample that was collected scientifically vs. a sample that was selected because it was easy (aka a convenience sample).

Statistical analysis requires data, and how the particular data points were chosen makes a huge difference. IMHO: Lots of people overlook how important the way you collect your data is. You can't take a poorly gathered sample and make valid statistical conclusions with it.

@I know I sound like a dick here. This is not personal.
Anonymous · · Unknown Hometown · Joined unknown · Points: 0

All climbing is subjective... there is no physical way to truly grade climbing routes.

At best, to get a toughness grade on a route you could take, say, 5000 random people and have them try to climb the route on the same day (temp / humidity etc.) and then just log how many people could onsight vs. make it to the top after falling vs. not make it up vs. how high up they could get, etc. From there you could chart a relative grade that would get something reasonable (aka what we pretty much do).

Reggie Pawle · · Boston, MA · Joined Nov 2010 · Points: 5
michael s... wrote: I am not referring to an observational study vs. an experiment. I am referring to a sample that was collected scientifically vs. a sample that was selected because it was easy (aka a convenience sample). Statistical analysis requires data, and how the particular data points were chosen makes a huge difference. IMHO: Lots of people overlook how important the way you collect your data is. You can't take a poorly gathered sample and make valid statistical conclusions with it. @I know I sound like a dick here. This is not personal.
the question is how the SD of reported climbing grades varies across different dimensions. you haven't stated any reasons the data is unfit for this question, only that any data that isn't scientifically collected is unfit for consideration. I can think of a handful of companies that have made billions off of statistical conclusions drawn from messy data. the problem isn't messy data, it's failing to account for systematic bias. what potential problems do you see, with respect to the question at hand?

my own guess is that there isn't a lot of systematic bias going on. the population in question is all suggested grades for climbs in the universe. how does the sample population of suggested grades for climbs on mountainproject vary with respect to the population of all suggested grades? I'm gonna guess not a lot. there might be a problem with the 40 routes he chose, 40 isn't a huge number. but generally I don't see much wrong with using mountainproject as a source.
Brian Adzima · · San Francisco · Joined Sep 2006 · Points: 560

I appreciate the positive response; I suspected MP.com had a lot of closet statisticians. Here are answers to a lot of the questions I read above.

THE DATA
I'd appreciate it if anyone had suggestions for where I could upload it. In the meantime, if you PM me with your address I can send the file.

THE ROUTES CHOSEN
I intentionally chose routes with a large number of ratings. It has been my experience climbing that some of the routes with only one or two ratings can be very different from other routes (a number grade or more). I could not think of any way to search for routes with a large number of ratings. You might think looking at routes with a high star number across the US would work, but in practice it did not. The best method I found was to search for routes I knew would have a lot of ratings. Hence, the data is highly skewed towards routes I have climbed, or at least longingly stared at. It is also why the data is only from a handful of states. Finding lots of 5.10's is not too hard, but most of the routes with grades 5.12 and higher do not get a lot of suggested ratings (especially trad routes). Most of the routes are also 1 pitch, to avoid any confusion such as I mentioned with the route Genesis.

WHY THE MODE
When I started on this analysis I was not certain that the data was interval data. Interval data would imply the difference between a 5.10a and a 5.10b is the same as between a 5.12a and a 5.12b. When the data is only ordinal, addition/subtraction becomes meaningless (e.g., 1 jumbo egg + 1 petite egg does not equal two eggs for a recipe). As such, I took the conservative course and used the mode rather than the mean for further analysis. This is also why I did not think anything could be gained from translating the YDS to a simpler system (like the Aussie grade).

WHY NO SD'S, and WHAT'S A NORMAL PROBABILITY PLOT
The lower graphs are normal probability plots. If the data were Gaussian, the points would fall approximately on a straight line. As they do not, the standard deviation of the whole set of data is meaningless.
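
For anyone who wants to rebuild this kind of plot, here is a minimal sketch of how the points of a normal probability plot can be computed. It is not necessarily what JMP does internally: the plotting positions (i - 0.5) / n are one common convention among several, and the function name and sample data are hypothetical.

```python
from statistics import NormalDist

def normal_probability_points(data):
    """Pair each sorted observation with the standard normal quantile for its rank."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(xs, start=1)
    ]

# Hypothetical deviations from the mode
for quantile, value in normal_probability_points([0, 0, 0, -1, 1, 0, 2, -3, 0, 1]):
    print(f"{quantile:6.2f}  {value:+d}")
```

Plotting value against quantile gives a roughly straight line for Gaussian data; a flat shelf at zero with steep ends is the non-normal pattern described above.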

WHY WAS I DOING THIS
I originally set out to compare the standard deviations (or an equivalent parameter from the correct distribution) of the different population sets. Before I could do that I was left with the interesting observation that the data is not normal, and stuck trying to come up with an explanation as to why. The two explanations that come to mind are that there are two (or more) distributions mixed together, or that a transformation of the data is needed because of the nature of climbing ratings. I sat on this data for a while hoping I would find a solution, but I did not come up with one. Now I am asking what techniques are potentially useful.

Brian Adzima · · San Francisco · Joined Sep 2006 · Points: 560

The routes:

CO/Trad/5.10 Crack/10a
WY/Trad/Beefeater/10b
UT/Trad/3AM Crack/10b/c
UT/Trad/Supercrack (UT)/10b/c
CO/Trad/Over the hill/10b/c
KY/Sport/Breakfast burrito/10c
CO/Trad/Superslab/10c/d
KY/Sport/Fire and brimstone/10d
KY/Sport/Air ride equipped/11a
CO/Sport/Enchanted porkfist/11a
CO/Trad/Rincon/11a
UT/Trad/Scarface/11a/b
CO/Trad/Center Route/11a/b
CO/Trad/Vertigo/11b
CO/Sport/Feline/11b
CO/Sport/80 Feet of Meat/11b
WY/Trad/Hung like a horse/11b
CO/Trad/Climb off the Century/11b/c
WV/Sport/Aesthetica/11c
WY/Trad/Spectraman/11c
UT/Trad/King Cat/11c/d
WV/Sport/Under the milky way/11d
CO/Sport/Lats don’t have feelings/11d
CO/Sport/Rehabilitator/11d
WV/Sport/Narcissus/12a
KY/Sport/Ro shampo/12a
UT/Trad/Coyne Crack/12a
CO/Sport/Defenseless Betty/12a
CO/Sport/Easy Skankin/12b
KY/Sport/Mercy the Huff/12b
UT/Trad/Slice and Dice/12b/c
UT/Trad/Way Rambo/12b/c
CO/Trad/The Evictor/12c
CO/Sport/Pretty Hate Machine/12c
CO/Sport/Psychatomic/12d
WV/Sport/Apollo Reed/13a
KY/Sport/Twinkie/13a
CO/Sport/Sonic Youth/13a
CO/Sport/Pump-o-rama/13a
UT/Trad/Ruby's Café/13a/b

Sumbit · · My house · Joined Aug 2008 · Points: 0

It may be a typo but Twinkie is 12a. If it's not a typo your data might be off.
