Analysis

Analyzing Consistency of Athletes

April 16, 2016

Update (April 16th): Since originally posting I have added an age-weighted component to the numbers, so that newer results have a larger influence than older ones. As you can see from the changes in the numbers, this adds another interesting dimension to the numbers in this post.

I love getting feedback on my analysis and predictions – very often, they trigger some new, interesting way of looking at the data. For example, Linsey Corbin made the following remark to me:

I wish there was a way that your predictions could show consistency. One thing I pride myself on is being fairly consistent across the board.

Thanks for the suggestion, Linsey (and great to see you back to racing)! I have been looking at different ways of attacking this question, and here is what I was able to come up with. I will continue to monitor these numbers for upcoming races, ~~maybe~~ and I’ll include them in future predictions.

Deviation

In statistics, there are a number of ways to measure how “consistent” a set of data is. The most common way to express variability in data sets is the “Standard Deviation”. StdDev basically measures the distance of data points from the average value – the more “outliers” there are and the further off they are, the higher the standard deviation.

This was my first try at analyzing consistency. The data analysis part is pretty simple, as the function is built into all kinds of programs. However, the results were not very helpful: In essence it helped identify athletes that had one or more sub-standard results, e.g. because of walking large parts of the marathon in a race. For example, Lucy Gossage showed up as an inconsistent athlete with a large deviation, but that was almost exclusively a result of her marathon walk resulting in an 11:32 finish in Kona 2014. It also didn’t value “good” results: The difference between a good result and an average one – maybe 30 minutes or so – is much smaller than that of a bad result – walking easily adds an hour to the overall time.
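To illustrate how a single slow race dominates the standard deviation, here is a minimal Python sketch. The finishing times are made up for illustration (they are not real race data), and the population standard deviation (`pstdev`) is an assumption, since the post does not say which variant was used.

```python
from statistics import mean, pstdev

# Hypothetical finishing times in minutes (not real race data).
steady = [545, 550, 548, 552, 547]
# Same athlete plus one race with a long marathon walk (11:32 = 692 minutes).
with_walk = steady + [692]

def spread(times):
    """Return (average, population standard deviation) in minutes."""
    return mean(times), pstdev(times)

print(spread(steady))     # small deviation: all results within a few minutes
print(spread(with_walk))  # one outlier inflates the deviation dramatically
```

With these made-up numbers, one bad day turns an athlete with a deviation of a few minutes into one with a deviation of close to an hour, which is why the raw deviation says little about how the athlete usually performs.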

Identifying Non-standard Results and Quantifying Consistency

Even though looking at the deviation of each athlete’s results did not lead to a good measure, it formed the basis for another way of looking at the data. In the familiar “bell shape” curve of the normal distribution, 68% of results fall within one standard deviation around the average. When looking at the difference between an athlete’s “expected time” and their actual finishing time, roughly 68% of the results are within 20 minutes of the expected time. Based on this I classify results within 20 minutes of the expected finishing time as “normal”, any quicker result as “better”, and anything slower (as well as DNFs) as “sub-par”.
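The classification rule can be sketched in a few lines of Python. The function name, the use of minutes, representing a DNF as `None`, and treating a difference of exactly 20 minutes as “normal” are all assumptions made for this illustration.

```python
NORMAL_BAND_MIN = 20  # results within 20 minutes of the expected time are "normal"

def classify(expected_min, actual_min):
    """Classify one result; actual_min is None for a DNF (assumed convention)."""
    if actual_min is None:
        return "sub-par"  # DNFs count as sub-par results
    diff = actual_min - expected_min  # positive: slower than expected
    if abs(diff) <= NORMAL_BAND_MIN:  # boundary treated as normal (assumption)
        return "normal"
    return "better" if diff < 0 else "sub-par"

print(classify(540, 550))   # normal
print(classify(540, 515))   # better
print(classify(540, 600))   # sub-par
print(classify(540, None))  # sub-par
```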

I can then aggregate all the results of an athlete into a figure like this:

Linsey Corbin: 83% +17% -0% (18)

Older results are less meaningful than newer ones, so adding an aging component gives the following numbers:

Linsey Corbin: 79% +21% -0% (18)

Each part has the following meaning:

Linsey Corbin: Name of the athlete
79%: Fraction of normal race results
+21%: Fraction of “better than expected” race results
-0%: Fraction of “sub-par” race results (including DNFs)
(Note: Technically, Linsey has at least one DNF in her Ironman races – she didn’t finish IM Texas in 2011. This is a limitation in my data – I have only been including DNF’s since 2014.)
(18): Total number of Ironman-distance results (including DNFs)
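The aggregation into this figure, with and without age weighting, might look like the following Python sketch. The post does not describe the actual weighting scheme, so the exponential decay with a `half_life_years` parameter is purely an assumption for illustration, as are the function name and the input format.

```python
def summarize(name, results, half_life_years=None):
    """results: list of (label, age_in_years) with label in
    'normal', 'better', 'sub-par'. Returns the summary string."""
    totals = {"normal": 0.0, "better": 0.0, "sub-par": 0.0}
    for label, age in results:
        # Assumed weighting: a result loses half its weight every
        # half_life_years; without a half-life all results count equally.
        weight = 1.0 if half_life_years is None else 0.5 ** (age / half_life_years)
        totals[label] += weight
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("no results to summarize")
    share = {k: round(100 * v / total_weight) for k, v in totals.items()}
    return (f"{name}: {share['normal']}% +{share['better']}% "
            f"-{share['sub-par']}% ({len(results)})")

# Hypothetical athlete: 15 normal and 3 better results, no weighting.
results = [("normal", 1)] * 15 + [("better", 1)] * 3
print(summarize("Example Athlete", results))  # Example Athlete: 83% +17% -0% (18)
```

Because each percentage is rounded separately, the three parts do not always add up to exactly 100%. With weighting switched on, recent results shift the percentages while the count in parentheses stays the plain number of results.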

Average numbers are about 68% of normal results and roughly 15-20% each of better and sub-par results, but these numbers vary wildly between athletes.

Examples

Here are some more numbers from well known athletes – put into different groups. (As I have updated my algorithm a bit since posting for the first time, I am also including the originally posted numbers in [square brackets].)

Stable Athletes

Andy Potts: 100% +0% -0% (13) [originally posted: 100% +0% -0% (13)]
Yvonne Van Vlerken: 84% +0% -16% (23) [originally posted: 91% +0% -9% (23)]
Lucy Gossage: 92% +0% -8% (12) [originally posted: 91% +0% -9% (11)]
Sebastian Kienle: 85% +12% -3% (11) [originally posted: 82% +9% -9% (11)]

These are athletes where predictions are a very good indicator of how they’ll perform on race day – they usually perform on a very similar level from race to race.

Normal Stability

Jodie Swallow: 55% +0% -45% (10) [originally posted: 78% +0% -22% (9) – she has since DNF’d in South Africa]
Caroline Steffen: 92% +8% -0% (20) [originally posted: 75% +25% -0% (20)]
Meredith Kessler: 65% +14% -20% (23) [originally posted: 70% +17% -13% (23)]
Andreas Raelert: 48% +0% -52% (19) [originally posted: 63% +0% -37% (19)]
Luke McKenzie: 51% +30% -19% (26) [originally posted: 62% +23% -15% (26)]

For these athletes predictions give a good indication, but it is also interesting whether there is a higher potential for an “up-side”, better-than-expected result (larger percentage of faster results, e.g. Caroline Steffen) or for a “down-side” result (larger percentage of sub-par results, e.g. Jodie Swallow or Andreas Raelert). For other athletes, the day could go either way (e.g. Meredith Kessler or Luke McKenzie).

Lower Stability

Sarah Piampiano: 41% +47% -12% (14) [originally posted: 50% +43% -7% (14)]
Luke Bell: 23% +5% -72% (26) [originally posted: 38% +12% -50% (26)]
Dede Griesbauer: 41% +18% -40% (26) [originally posted: 32% +32% -36% (25)]
Tim O’Donnell: 14% +63% -23% (11) [originally posted: 27% +45% -27% (11)]
Pete Jacobs: 5% +16% -79% (26) [originally posted: 15% +42% -42% (26)]

Then there are athletes that have a lower fraction of “normal” results. Here it’s also interesting to look at the upside (e.g. Sarah Piampiano, Tim O’Donnell) or downside potential (e.g. Luke Bell). Some athletes’ results are very hard to predict from previous numbers – for example Dede Griesbauer and Pete Jacobs have had a good fraction of great results but also slower, disappointing results.

Continental and National Fastest Times

March 8, 2016

After the fast times at the end of 2015 there has been some discussion about continental and national “records” over the Ironman-distance. Because of doubts about the accuracy of courses, comparing times from different courses is always a bit tricky, but here is an overview of the data I was able to compile.

Please let me know if I missed some older results that are better than the continental and national records in this list!

Continental Records

Female Athletes

Continent	Athlete	Nation	Time	Date	Race
Africa	McEwan, Dianne	ZAF	09:37:45	14.04.13	IM South Africa
Asia/Pacific	Shiono, Emi	JPN	09:23:26	01.03.08	IM New Zealand
Australia	Carfrae, Mirinda	AUS	08:38:53	20.07.14	Challenge Roth
Europe	Wellington, Chrissie	GBR	08:18:13	10.07.11	Challenge Roth
North America	Corbin, Linsey	USA	08:42:42	29.06.14	IM Austria
South America	Monticeli, Ariane	BRA	08:59:08	31.05.15	IM Brasil

Male Athletes

Continent	Athlete	Nation	Time	Date	Race
Africa	Cunnama, James	ZAF	07:59:59	08.07.12	Challenge Roth
Asia/Pacific	Vernay, Patrick	NCL	08:03:46	12.07.09	Challenge Roth
Australia	McCormack, Chris	AUS	07:54:23	24.06.07	Challenge Roth
Europe	Raelert, Andreas	GER	07:41:33	10.07.11	Challenge Roth
North America	Starykowicz, Andrew	USA	07:55:22	02.11.13	IM Florida
South America	Amorelli, Igor	BRA	07:59:36	31.05.15	IM Brasil

National Records

Sometimes, the nation of an athlete is not clear – often athletes are listed with their country of residence (e.g. foreign athletes staying in Boulder), and some athletes have dual citizenships. Please let me know if I have mis-attributed a fast result by an athlete to the wrong country!

Female Athletes

Nation	Athlete	Time	Date	Race
AUS	Carfrae, Mirinda	08:38:53	20.07.14	Challenge Roth
AUT	Wutti, Eva	08:37:36	18.08.13	IM Copenhagen
BEL	Goos, Sofie	08:57:08	29.06.14	IM Austria
BRA	Monticeli, Ariane	08:59:08	31.05.15	IM Brasil
CAN	Naeth, Angela	08:54:55	28.09.14	IM Chattanooga
CZE	Reed, Lucie	08:57:34	06.10.13	Challenge Barcelona
DEN	Pedersen, Camilla	08:56:01	07.07.13	IM Germany
FIN	Lehtonen, Kaisa	08:48:40	04.10.15	IM Barcelona
FRA	Collonge, Jeanne	09:20:51	23.06.13	IM France
GBR	Wellington, Chrissie	08:18:13	10.07.11	Challenge Roth
GER	Wallenhorst, Sandra	08:47:26	13.07.08	IM Austria
HUN	Csomor, Erika	08:47:05	13.07.08	Challenge Roth
IRL	Mullan, Eimear	08:56:51	04.10.15	IM Barcelona
ITA	Niederfriniger, Edith	08:59:45	13.07.08	IM Austria
NED	Van Vlerken, Yvonne	08:43:07	02.11.13	IM Florida
NZL	Martin, Britta	08:56:34	07.12.14	IM Western Australia
SUI	Steffen, Caroline	08:34:51	24.03.12	IM Melbourne
SWE	Lundstroem, Asa	09:02:49	22.03.15	IM Melbourne
UKR	Kozulina, Tamara	09:06:42	13.07.08	IM Austria
USA	Corbin, Linsey	08:42:42	29.06.14	IM Austria
ZAF	McEwan, Dianne	09:37:45	14.04.13	IM South Africa

Male Athletes

Nation	Athlete	Time	Date	Race
AUS	McCormack, Chris	07:54:23	24.06.07	Challenge Roth
AUT	Weiss, Michael	07:57:39	03.07.11	IM Austria
BEL	Vanhoenacker, Marino	07:45:58	03.07.11	IM Austria
BMU	Butterfield, Tyler	08:05:22	31.05.15	IM Brasil
BRA	Amorelli, Igor	07:59:36	31.05.15	IM Brasil
CAN	McMahon, Brent	07:55:48	16.11.14	IM Arizona
CZE	Ospaly, Filip	07:58:44	02.11.13	IM Florida
DEN	Henning, Rasmus	07:52:36	18.07.10	Challenge Roth
ESP	Rana, Ivan	07:48:43	29.06.14	IM Austria
EST	Albert, Marko	08:08:17	03.07.11	IM Austria
FRA	Chevrot, Denis	08:05:58	07.12.14	IM Western Australia
GBR	Amey, Paul	08:01:29	19.11.11	IM Arizona
GER	Raelert, Andreas	07:41:33	10.07.11	Challenge Roth
LUX	Bockel, Dirk	07:52:01	14.07.13	Challenge Roth
NCL	Vernay, Patrick	08:03:46	12.07.09	Challenge Roth
NED	Van der Marel, Jan	07:57:46	04.09.1999	Almere Triathlon
~~NED~~	~~Diederen, Bas~~	~~08:05:36~~	~~05.07.15~~	~~IM Germany~~
NZL	Brown, Cameron	08:00:12	24.03.12	IM Melbourne
POR	Marques, Sergio	08:05:21	06.10.13	Challenge Barcelona
SLO	Plese, David	08:02:20	04.10.15	IM Barcelona
SUI	Schildknecht, Ronnie	07:59:42	05.11.11	IM Florida
SWE	Nilsson, Patrik	08:08:05	15.08.15	IM Sweden
USA	Starykowicz, Andrew	07:55:22	02.11.13	IM Florida
ZAF	Cunnama, James	07:59:59	08.07.12	Challenge Roth

Notes

There are a few records that need some explanations.

Female African & South African Record

I have listed Dianne McEwan (now Dianne Emery, who became a mom in January) as the African record holder, but she herself considers Annah Watkinson’s 9:31 from IM Austria 2015 as the record. Annah was racing as an age-grouper then, and because of the different race dynamics from the Pro race I’m not counting her result. But Annah has turned Pro this season, so there’s a good chance we will see a new African record this year, maybe as early as IM South Africa!

Male Canadian Record

Lionel Sanders sent me the following tweet after I posted the fastest times:

Lionel is right that Peter finished in 7:51:56 in 1999. However, it is accepted that the marathon in Klagenfurt was short (Peter ran a 2:35:21!) – probably by more than 1k. With Peter being a great athlete and fast runner, one could speculate if he could have finished faster than Brent’s 7:55:48, but I’ve decided to not accept his time as a record.

Male Dutch Record

Jefry Visier, the Operational Director of Challenge Almere, was going through older Almere results and found four times (three by Jan Van der Marel and one by Frank Heldoorn) that were quicker than the one I had from Bas Diederen.

2015 Money Lists

February 16, 2016

This is an excerpt from my free “2015 TriRating Report”. If you’re interested in more information about the 2015 long-distance Triathlon season, you should definitely check it out!

Overall Money List

First, here is an overview of the races I have included in my money list:

Type	Description	Total Prize Money	# of Athletes
Kona	Ironman World Championship (Kona)	$ 650.000	20
Ironman	Full-distance WTC races (not including Kona)	$ 2.271.000	318
70.3 Champs	70.3 World Championship (Zell am See)	$ 250.000	20
70.3s	70.3 races (not including Champs)	$ 2.177.500	400
Challenge	Full-distance Challenge races (including Roth)	$ 360.750	83
Sum	All included races	$ 5.709.250	582

This does not include the $1 million prize that Daniela Ryf collected for winning the “Triple Crown”.

The next table shows the Top 20 athletes – both from the men and women – that have earned the most prize money in the 2015 calendar year from all the races listed above:

#	Name	Sex	Total Money
1	Ryf, Daniela	F	$223.000
2	Frodeno, Jan	M	$213.000
3	Kessler, Meredith	F	$86.000
4	Blatchford, Liz	F	$79.750
5	Raelert, Andreas	M	$77.750
6	Potts, Andy	M	$75.500
7	Joyce, Rachel	F	$73.250
8	O’Donnell, Timothy	M	$67.500
9	Sanders, Lionel	M	$66.500
10	Wurtele, Heather	F	$64.500
11	Van Vlerken, Yvonne	F	$63.250
12	Steffen, Caroline	F	$60.750
13	Don, Tim	M	$58.000
14	Jackson, Heather	F	$57.750
15	Swallow, Jodie	F	$57.500
16	Naeth, Angela	F	$55.000
17	Kienle, Sebastian	M	$52.500
18	Pedersen, Camilla	F	$51.750
19	Piampiano, Sarah	F	$49.250
20	Gossage, Lucy	F	$47.000

Meredith Kessler has made it into third spot without any money from the “big money races” in Kona or Zell Am See.

Ironman Money List

Here are the Top 15 money earners from Ironman races (excluding Kona):

#	Name	Sex	Ironman	Total	Overall Rank
1	Vanhoenacker, Marino	M	$44.000	$44.000	24
2	Van Lierde, Frederik	M	$35.000	$40.250	26
3	Kessler, Meredith	F	$34.000	$86.000	3
4	Van Vlerken, Yvonne	F	$33.250	$63.250	11
5	Hanson, Matt	M	$31.000	$32.000	40
6	Ryf, Daniela	F	$30.000	$223.000	1
6	Frodeno, Jan	M	$30.000	$213.000	2
6	Blatchford, Liz	F	$30.000	$79.750	4
6	Swallow, Jodie	F	$30.000	$57.500	15
6	Naeth, Angela	F	$30.000	$55.000	16
6	McKenzie, Luke	M	$30.000	$36.750	33
6	Monticeli, Ariane	F	$30.000	$35.750	35
6	Symonds, Jeff	M	$30.000	$31.250	43
6	Hauschildt, Melissa	F	$30.000	$30.000	44
15	Sanders, Lionel	M	$29.500	$66.500	9

70.3 Money List

Here are the Top 15 money earners from 70.3 races (including the 70.3 Champs):

#	Name	Sex	70.3 Total	Zell Am See	Other 70.3s	Overall Rank
1	Ryf, Daniela	F	$73.000	$45.000	$28.000	1
2	Frodeno, Jan	M	$63.000	$45.000	$18.000	2
3	Wurtele, Heather	F	$58.000	$20.000	$38.000	10
3	Don, Tim	M	$58.000	$-	$58.000	13
5	Kessler, Meredith	F	$52.000	$-	$52.000	3
6	Sanders, Lionel	M	$37.000	$-	$37.000	9
6	Tisseyre, Magali	F	$37.000	$10.000	$27.000	32
8	Steffen, Caroline	F	$33.750	$-	$33.750	12
9	Aernouts, Bart	M	$33.250	$10.000	$23.250	25
10	Potts, Andy	M	$33.000	$-	$33.000	6
11	Kaye, Alicia	F	$32.500	$7.500	$25.000	39
12	Goss, Lauren	F	$31.500	$-	$31.500	41
13	Swallow, Jodie	F	$27.500	$-	$27.500	15
13	Reed, Tim	M	$27.500	$-	$27.500	51
15	Boecherer, Andi	M	$27.250	$6.500	$20.750	36

It’s interesting to note that Tim Don and Meredith Kessler have almost made it to the top of the list without any money from the 70.3 Championships.

Validating 2015 Predictions

February 15, 2016

I have been publishing Race Predictions for a few years now, so it’s about time to have a look at how “good” my predictions are. When I started in 2011, there were quite a few changes in the algorithm and the parameters to deal with a number of edge cases. During 2015 there have not been any changes, so this is a good data pool for validation.

Data Used

In 2015 I have published predictions for 36 Professional Ironman-distance races, 31 Ironman-branded races by WTC and 5 more Challenge races. There have been a total of 1098 finishes, 688 by male athletes (62.7%) and 410 by females (37.3%). These were posted by 600 different athletes, 382 males (63.7%) and 218 females (36.3%). In addition there were 349 DNFs, 244 by males (70%) and 105 by females (30%).

Using my algorithm and the available start lists, I have seeded the participants in each of the races, and predicted 930 finishing times (84.7% of all finishers). There are some cases when I didn’t predict the finishing times, for example when an athlete didn’t have any prior IM-distance finishes or when there was a late entry (and the athlete was therefore not included in the start list).

Predicting the Winners

Here’s a look at the places the eventual race winners have been seeded based on previous results and the start lists:

[Figure: WinnersSeeded]

With 36 IM-distance races, there are 72 winners (one each for the male and female race). My algorithm has correctly predicted the winner in 26 races (36%), and another 26 winners were seeded in #2 or #3 (winning frequency of an athlete seeded on the podium: 72%). Only three winners were seeded lower than 8th: Kirill Kotshegarov was seeded 10th at IM Chattanooga, Mel Hauschildt was seeded 11th at IM Melbourne, and Matt Hanson was seeded 12th at IM Texas. There was also one unrated (and therefore unseeded) winner in 2015: Jesse Thomas won IM Wales in his debut Ironman.
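The two percentages can be reproduced with a short Python sketch. Only the counts stated above (72 winners, 26 seeded #1, 26 more seeded #2 or #3, three seeded 10th to 12th, one unseeded) come from the text; the split between #2 and #3 and the seeds of the remaining winners are made up for illustration.

```python
def seeding_hit_rates(winner_seeds):
    """winner_seeds: seed of each race winner, None if unseeded.
    Returns (% of winners seeded #1, % of winners seeded #1 to #3)."""
    total = len(winner_seeds)
    top_pick = sum(1 for s in winner_seeds if s == 1)
    podium = sum(1 for s in winner_seeds if s is not None and s <= 3)
    return round(100 * top_pick / total), round(100 * podium / total)

# 26 top seeds and 26 seeds #2/#3 as in the text; the other values
# are placeholders that only serve to bring the total to 72.
seeds = [1] * 26 + [2] * 13 + [3] * 13 + [5] * 16 + [10, 11, 12] + [None]
print(seeding_hit_rates(seeds))  # (36, 72)
```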

The numbers would be even better when only considering the athletes that actually raced. Only including athletes that actually started increases the frequency of picking the right winner to 39% (and of one of the podium picks winning the race to 80%); also discarding athletes that did not finish would have yielded 42% and 83%.

Time Predictions

In my pre-race posts, the finish times are predicted for each athlete that has raced an Ironman race before. The algorithm considers the previous finishing times of an athlete and the course that the race is going to be held on.

The following graph compares the actual finish times to the predicted finish times (each data point is one dot on the graph). Dots towards the upper left are results where the actual finish was faster than predicted, dots towards the lower right are results that are slower than predicted.

The graph shows actual and predicted times between 8 and 12 hours (only 11 faster results/predictions and 15 slower ones are missing).

[Figure: ActualVsPredicted]

I have added a “trend line” that shows the best fit of all the data points, highlighting the fact that most of the data points are pretty close to the “diagonal” (where actual = predicted). Between 8 hours and 10 hours the algorithmic predictions are pretty good on average (maybe predictions are a bit too fast around 8 hours). Towards 10 hours finishing time and especially over 10 hours the predictions are too fast: This is caused by “explosions” that lead to very slow times even for athletes that have been predicted to be relatively fast. To put it another way: Finishing times over 10 hours are most often bad races that are pretty much unpredictable using only data.

Here is another way of looking at how far off the time predictions have been from the actual results:

[Figure: Difference]

The graph shows the number of results in one-minute bins of the difference between predicted and actual finishing times. Data points towards the right are faster than predicted; those towards the left are slower than predicted. Again a trend line smooths out the statistical “noise”.
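A minimal Python sketch of this binning, assuming times in minutes and the sign convention from the text (difference = predicted minus actual, so positive values are faster than predicted). The function name and the sample values are hypothetical.

```python
from collections import Counter
import math

def bin_differences(predicted_min, actual_min):
    """Count results per one-minute bin of (predicted - actual)."""
    bins = Counter()
    for predicted, actual in zip(predicted_min, actual_min):
        # floor() puts e.g. 4.0 <= diff < 5.0 into bin 4
        bins[math.floor(predicted - actual)] += 1
    return bins

# Hypothetical values: two results about 4-5 minutes faster than predicted,
# one result 20 minutes slower than predicted.
bins = bin_differences([540.0, 540.0, 540.0], [535.5, 535.2, 560.0])
print(sorted(bins.items()))  # [(-20, 1), (4, 2)]
```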

A few observations:

In a range roughly between -40 minutes and +40 minutes the graph is pretty symmetric and is very close to the normal distribution.
As noted above, there is a relatively large number of “explosions” with large negative differences, resulting in a non-symmetrical distribution on the edges of the graph. (There are 49 results that are more than 60 minutes slower than predicted, but only 10 that are more than 60 minutes faster.)

On average, the predictions are -4.7 minutes off the actual finish time (i.e. the actual finish is slower by close to five minutes). An average close to 0 means that on average the predictions are closer to the actual finish. The standard deviation is 31.8 minutes, which means that roughly 68% of the time differences are between -36.5 and 27.1 minutes (-4.7 +/- 31.8 minutes). Usually, a smaller deviation corresponds to a “better” prediction.
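The one-standard-deviation band is simple arithmetic, sketched here in Python. The helper `one_sigma_band` is a hypothetical name, and using the population standard deviation is an assumption; the 68% reading additionally relies on the differences being roughly normally distributed.

```python
from statistics import mean, pstdev

def one_sigma_band(differences):
    """Return (average, standard deviation, (low, high)) for a list of
    prediction differences in minutes."""
    avg = mean(differences)
    sd = pstdev(differences)  # population std dev (assumption)
    return avg, sd, (avg - sd, avg + sd)

# Check of the figures quoted in the text: -4.7 +/- 31.8 minutes.
avg, sd = -4.7, 31.8
print(round(avg - sd, 1), round(avg + sd, 1))  # -36.5 27.1
```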

The standard “statistical” way of measuring the dependence between two data sets is correlation. Correlation is +1 in the case of a perfect linear relationship, −1 in the case of a perfect inverse relationship, and some value between −1 and 1 in all other cases, indicating the degree of linear dependence between the variables. A value around zero indicates that there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. The predictions and actual finishing times have a correlation of 0.77, indicating a pretty strong dependence between the two data sets.
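For reference, the Pearson correlation coefficient described here can be computed as in the following Python sketch. This is the textbook formula, not necessarily the tool used for the numbers in this post, and the sample values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical predicted vs. actual times in minutes.
predicted = [500, 520, 540, 560, 580]
actual = [505, 515, 560, 555, 600]
print(round(pearson(predicted, actual), 3))
```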

Comparing Other Prediction Strategies

When trying to predict a completely random event (the classical example is a “perfect dice”), the correlation between the actual events and the predictions won’t be very high. When only working off previous results, anything that happens before a race – for example a good block of training, a slight injury – will influence how fast an athlete is able to go on race day, as will random events during the race (e.g. getting punched during the swim, technical issues on the bike). Therefore a “perfect prediction” (resulting in a correlation of 1) is impossible, and in order to determine whether a correlation of 0.77 indicates a “good predictor” or not, one has to compare the results of my algorithm to other predictions.

I am not aware of anyone other than me publishing time predictions for Ironman races on a regular basis. (Please let me know if there is!) Therefore, I am comparing my predictions to a few much simpler strategies:

Last Finish: “You are only as good as your last race” (prediction = last IM-distance finish)
Best Finish: “My best time is a sub-x” (prediction = fastest IM-distance finish)
Average Finish: “I usually finish around y” (prediction = average IM-distance finish)
Average Last Year Finish: “This season is going great” (prediction = average IM-distance finish in the last twelve months)
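The four simple strategies can be sketched in Python as follows. The data layout (a list of day numbers and finish times in minutes), the function name and the 365-day window are assumptions made for this illustration.

```python
from statistics import mean

def baseline_predictions(history, race_day, window_days=365):
    """history: list of (day_number, finish_minutes) for earlier
    IM-distance finishes, oldest first. Returns one prediction per strategy."""
    times = [t for _, t in history]
    # Only finishes within the last twelve months before the race
    recent = [t for day, t in history if race_day - day <= window_days]
    return {
        "last": times[-1],
        "best": min(times),
        "average": mean(times),
        # No prediction possible without a finish in the last year
        "average_last_year": mean(recent) if recent else None,
    }

# Hypothetical athlete with three earlier finishes.
history = [(0, 540), (300, 520), (700, 530)]
print(baseline_predictions(history, race_day=800))
```

The last strategy returns no prediction for athletes without a finish in the window, which means it applies to fewer results than the other three.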

Here’s a comparison of the correlation of these different methods and my comments:

	Number of Data Points	Average Difference	Standard Deviation	Correlation to Actual Finish	Comments
Last Finish	943	-0.75	41.14	0.649	Good on average, but wide deviation and lower correlation
Best Finish	943	-21.48	39.59	0.659	Slightly better deviation and correlation, but large average difference
Average Finish	943	-0.84	35.77	0.706	Good on average, but wider deviation and lower correlation than TTR Predictions (still better than last/best finish)
Average Last Year Finish	877	-1.89	35.08	0.706	Almost the same as the Average Finish, but applicable for fewer cases
TTR Predictions	930	-4.70	31.81	0.770	Lowest deviation, highest correlation

Summary

The Prediction Algorithm I use to calculate the expected times in my pre-race posts provides better predictions than simpler prediction strategies. My model certainly has limitations, but the large number of “successful” winner predictions and the high correlation show that the time predictions and the conclusions drawn from them are pretty much valid. I think my analysis is quite good at telling the “data part of the story”.

While the “data part” is an important (and impartial) part of the story, it is still only a part of the story. A coach or teammate that has been able to observe an athlete getting ready for a race has additional (and more current) information available – even if that is not always fully objective.

The tension between past performances, the uncertainty of a future performance, the challenges athletes face in their training and the hard work they put in to be better in their next race ... that’s why I still love following the races!
