November 3rd, 2009

Will the real statistics junkie please stand up

This is in response to http://nicolas.barcet.com/drupal/en/oct-ubuntu-server-stats

To get detailed stats on the OS breakdown from Netcraft, and put that 1.4 million Ubuntu web servers in the correct context... you have to purchase their data product. Problem is... they also restrict how you can use that data, so even if Canonical purchased it for you to look over... you probably couldn't comment on it publicly. The full Netcraft survey data is problematic in that regard because you really can't have a public discussion about it.

But we can have a useful discussion about the 2009 Purchasing Survey, because they publish their methodology AND the raw survey data.

The Purchasing Survey is a really mixed bag of news when you read the whole article and look at the raw data.  The article really raises the question: where is that stated growth of Ubuntu server deployments coming from?  The article specifically claims that Windows-to-Linux migrations are stalling and that people are less likely to dump Windows for Linux.  So where is the Ubuntu server deployment growth being generated?  Virtualization maybe? Not according to the raw survey results.

If you dig into the raw data... you'll see that exactly one survey respondent (out of 459) said they were using Ubuntu/Debian-based KVM for virtualization.  And more sobering, only one respondent (out of 449) said they planned to deploy Ubuntu/Debian-based KVM in the next 12 months.  That should raise some eyebrows inside the Canonical fenceline.  Doesn't that survey result run counter to pretty much everything Canonical and its virtualization partners have been saying?  Hopefully they'll repeat these virtualization usage and intent-to-deploy questions in next year's survey, after the next Ubuntu LTS is out and both Canonical and Eucalyptus Systems are pushing Ubuntu server for private deployments.
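To get a feel for how little a single respondent actually tells you, here's a quick back-of-the-envelope sketch. It uses a Wilson score interval (my choice of method, not anything from the survey) to put a 95% confidence interval on a 1-out-of-459 response rate:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 1 of 459 respondents reported using Ubuntu/Debian-based KVM
lo, hi = wilson_interval(1, 459)
print(f"{lo:.2%} to {hi:.2%}")
```

The interval runs from a few hundredths of a percent up past one percent, so the honest reading is "somewhere between essentially nobody and a small handful," not a precise usage figure.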

But more generally speaking, I'm not sure the Purchasing Survey results are self-consistent enough to be reliable.  For example, look at questions 22 and 67.

question 22:   Which server operating systems do you currently have installed? (Select all that apply.)

question 67: Which of the following Linux distributions/operating systems do you currently use on your servers? (Select all that apply.)

The numbers don't compare well across those two questions. There is at best a 10 percentage point discrepancy in the Red Hat deployment percentages between those two questions, and a similar discrepancy in the CentOS numbers. That's not a good sign for survey accuracy.  If there really is a 10 point error, it potentially wipes out the implied Ubuntu growth in the summary article.

And I'm not saying that the Ubuntu growth does not exist. What I am saying is that when you look really closely at the survey data... the survey does not appear to be accurate enough to say anything statistically significant about Ubuntu growth if the noise floor in the survey really is 10 percentage points.  The survey summary article consistently overreaches in its conclusions without once commenting on the inherent accuracy limitations of the survey.
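For context, here's the worst-case sampling margin of error you'd expect at 95% confidence for a sample of this size, assuming simple random sampling (which an opt-in reader survey almost certainly isn't, so the true error is likely worse):

```python
import math

# Worst-case 95% sampling margin of error for a proportion estimated
# from n respondents: half-width = z * sqrt(p*(1-p)/n), maximized at p = 0.5.
def margin_of_error(n, z=1.96):
    return z * math.sqrt(0.25 / n)

print(f"{margin_of_error(459):.1%}")  # for ~459 respondents
```

That works out to roughly ±4.6%. A 10 point gap between two questions on the same survey is about double what sampling noise alone would allow, which points at a methodology problem rather than bad luck.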

The point I'm trying to make... to everyone... is that you can't just throw numbers up without considering the accuracy of the methodology.  For this survey in particular: if they can't get the Red Hat deployment numbers, from the Linux distribution with the most respondents and therefore the best statistical accuracy, consistent to within 10 percentage points between questions 22 and 67... then you can't really expect the other Linux distribution numbers in the survey results to be any more accurate than that.

But thankfully the Purchasing Survey does make its methodology and survey data available, so its conclusions can be transparently discussed.  I wish everyone who publishes deployment numbers would at least go that far instead of just throwing numbers out as a PR stunt.

-jef