Giving accurate Fedora client counting the 115% effort it deserves. - Jef"I am the pusher robot"Spaleta
ramblings of the self-elected Fedora party whip
If you are not familiar with the Fedora Client statistics effort take a moment and read:

I'd like to take a moment and talk specifically about how to do a better job at interpreting the total unique IP connections listed here:

There are two competing factors which influence how unique IP counts can be interpreted as client counts.  On the one hand, there is the effect of private subnets, which map multiple clients to a single IP address. This would make the unique IP address count an undercount of the actual number of clients.  On the other hand, we know we have clients which roam across networks, and those clients could easily be counted multiple times in the unique IP logs, making the unique IP count an overestimate of the actual number of clients.

So which is it in reality? Is the count of 14 million+ unique IPs sitting in the Fedora MirrorManager logs an overcount or an undercount of reality?

I'm here to tell you, friends, that it's an undercount... by about 15%.  There are probably about 16 million Fedora clients in the wild in reality. How do I get that?

Easy, I had my buddy Mike "Chops" McGrath do a little data mining of the Smolt logs and come up with an aggregate ratio of Smolt UUIDs to unique IPs.  That ratio can be taken as a scaling factor to convert unique IP counts to unique client counts given the following assumptions.

1) The smolt userbase represents a sampling of the overall client base which is no more likely to be on a private network than the average Fedora client.
2) The smolt userbase represents a sampling of the overall client base which is no more likely to have a dynamic IP address than the average Fedora client.
3) The ratio is reasonably stable over a release cycle timescale, but may be subject to a slowly varying drift.

If those three assumptions hold, the ratio of UUIDs to IPs is an adequate scaling factor.  We looked over the last 16 months of aggregate Smolt logging data, and here is what we found:
Mean Ratio: 1.16
Ratio Stdev: 0.0263
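
As a rough sketch of how the scaling factor turns an IP count into a client estimate, here is the arithmetic in Python. The 14,000,000 figure is the "14 million+" rounded down for illustration, and using one standard deviation on either side as a band is my assumption, not a formal confidence interval.

```python
# Figures quoted in the post.
MEAN_RATIO = 1.16        # mean monthly UUID/IP ratio
RATIO_STDEV = 0.0263     # standard deviation of the monthly ratio
UNIQUE_IPS = 14_000_000  # "14 million+" unique IPs, rounded down (assumption)


def estimate_clients(unique_ips, ratio):
    """Scale a unique-IP count by a UUID/IP ratio to estimate clients."""
    return unique_ips * ratio


central = estimate_clients(UNIQUE_IPS, MEAN_RATIO)
low = estimate_clients(UNIQUE_IPS, MEAN_RATIO - RATIO_STDEV)
high = estimate_clients(UNIQUE_IPS, MEAN_RATIO + RATIO_STDEV)

print(f"low {low / 1e6:.1f}M, central {central / 1e6:.1f}M, high {high / 1e6:.1f}M")
```

The central value comes out a little over 16 million, which is where the "about 16 million" above comes from.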

Here is a graph of the Smolt ratios calculated monthly.

Smolt Correction Graph

I'm pretty confident in the validity of the scaling factor. I'm also very pleased to see that the number is greater than 1.  This means that the unique IP address statistics we are currently showing are a conservative estimate of the actual client numbers.  No caveats, no soft-selling.
There are 14 million+ Fedora clients out in the wild and it's time we start making that point loudly and confidently.

-jef"Measurement methodology matters"spaleta



From: ext_30327 Date: May 20th, 2009 07:10 am (UTC)


I know smolt isn't tied to Fedora, but when Anaconda does an install, does it record whether it wiped a previous Fedora installation? Smolt could then pick up that information at firstboot, which might help gauge how many new installs are *really* new.

I know I'll have clocked up several smolt IDs that way.
From: mmcgrath Date: May 20th, 2009 01:50 pm (UTC)

Re: Re-installs

If you upgrade, yes. If you fresh install, no. Also, these numbers would not have come from the smolt database but from HTTP log activity. That means that for the one month you switched you may or may not have been counted twice, but for all previous and future months you would have been counted accurately.
From: jspaleta Date: May 20th, 2009 03:27 pm (UTC)

Re: Re-installs

Another reason why the monthly average is useful as a trending metric instead of just taking the average over a large time span. We could probably go back and do it weekly as well to see if there is significant week to week variability.

From: wb8rcr Date: May 20th, 2009 02:32 pm (UTC)


Jef, I like that you are trying to add a little more objective interpretation to the numbers, but I do question assumption 1.

I expect that folks who have a local LAN with a number of Fedora boxes are more likely to submit smolt profiles than the average user. Just a suspicion, no data. But then, what data do we have that says users who submit smolt profiles are representative?

From: jspaleta Date: May 20th, 2009 03:13 pm (UTC)

Re: Assumptions

Any data to suggest that smolt isn't representative?

I think the fact that the running monthly calculation of Smolt's ratio of UUIDs to IPs is extremely flat makes a strong case that it's representative, on average, of the entire Fedora client base. The data set even covers the F10 release. What's interesting is that the raw numbers of UUIDs and IPs jump sharply around the F10 release, roughly doubling between 10/2008 and 12/2008, but the ratio is very, very flat in comparison.

If you were right and home users were more likely consumers of Smolt for some reason, I would have expected to see a spike in the ratio near the F10 release as early-adopter home users jumped in and did fresh installs. I don't see that. The raw UUIDs go up, the raw unique IPs go up, but the ratio across the release date boundary is flat. I'm very confident it's a solid representation.

LANs are everywhere... they are like Elvis Presley in a sense. How many Fedora installations actually have a global IP address? Or even a corporate-wide address inside a corporate LAN? LANs within LANs within LANs.

From: bill_mcgonigle Date: May 26th, 2009 10:19 am (UTC)

Re: Assumptions

When I read the grandparent comment, I didn't get 'home users'. I was thinking 'large corporate LANs' with NAT boxes.

The reason I know about smolt is because I read Fedora Weekly News. I usually yum upgrade my systems, but I did just do a fresh anaconda install of an F11 machine and nothing asked me to turn on smolt (boo, BTW). On the other hand, it's pretty trivial to turn on smolt across a cluster you manage.

So, the question becomes, "how does a user know to turn on smolt?" From my experience it's more likely to be people who follow Fedora very closely, and those tend to be IT guys, who would tend to be on corporate LANs behind NAT boxes. And their machines at home, no doubt, but to a much smaller degree.

It's entirely possible I just don't know any home users who are on Fedora and fit the above profile, and that there really is a large population of them out there. So take my selection bias for what you will.

In theory smolt could do a mDNS advertisement and take note of its neighbors to help elucidate the matter.
From: jspaleta Date: May 27th, 2009 12:18 am (UTC)

Re: Assumptions

Hmm, the mDNS thing is interesting, but smolt devs try very, very hard not to leak information that was not explicitly granted by the user. I'm not sure that would be something they would turn on without a lot of debate. Which is why I'm only looking at an aggregate ratio and not doing analysis by network segment block.

From the log information I could probably identify residential users by IP address for large residential service providers like Comcast or whatever, and resample their statistics as a subset of the larger population. Or identify corporate subnets by DNS lookups by IP address and resample them as a population subset. But I am wary of doing that, as it may cross the line in terms of what I feel comfortable doing ethically without explicit permission. I'm trying to avoid doing any analysis where I'm holding any information that gives me personal knowledge about any particular IP address.

From: (Anonymous) Date: May 28th, 2009 12:30 am (UTC)

Re: Assumptions

You're wise to err on the side of caution.

I think most people know that their IP addresses are public information and that they can be queried. Personally, I'd feel that publishing such a map would cross a line, but if you did the analysis and posted only the summary results, few would feel slighted. Of course, then your analysis data isn't public for scrutiny. So, a tricky balancing act to be sure.

I think the work you're doing is great.
From: (Anonymous) Date: May 26th, 2009 08:16 am (UTC)

assumptions !!

Hi Jef,
What I understand is that you assumed a linear relation between the number of smolt UUIDs and the number of unique IPs, got the slope (or scaling factor) from statistics, and then used this scaling factor to turn the IP count into a number that expresses the real number of users better. Am I right?
I think, as you said, this depends on the validity of your assumptions. So how could you make sure that they're valid, especially 1 and 2?

Fedora deployment in companies that use private networks is just behind that of many other well-known distributions, either commercial or free. As for other users, yes, most of them do use several IPs to connect to the internet.
I just see that they (the assumptions) are not realistic. What do you think?
From: (Anonymous) Date: May 26th, 2009 08:27 am (UTC)

Re: assumptions !!

Sorry, I didn't read your last reply thoroughly.
But I think that we have to take data over a somewhat longer period to get a more accurate result. How many months do your results cover? And do the smolt and unique IP numbers get refreshed every month?
From: jspaleta Date: May 27th, 2009 12:11 am (UTC)

Re: assumptions !!

How long a period do you want to look over? The analysis in the blog covered 16 months; you can see that in the graph. Each month the IPs and smolt IDs are tallied from the HTTP logs and the ratio taken. We could go back and do it on a week-by-week basis too, if there was really a need.
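
For anyone who wants to picture the monthly tally, here is a minimal sketch in Python. The `(month, ip, uuid)` record layout and the sample records are made up for illustration; the real HTTP log format is not shown here, and this is not the actual script that was run.

```python
from collections import defaultdict


def monthly_ratios(records):
    """Return {month: (n_uuids, n_ips, ratio)} from (month, ip, uuid) records."""
    uuids = defaultdict(set)
    ips = defaultdict(set)
    for month, ip, uuid in records:
        uuids[month].add(uuid)  # sets drop repeat check-ins by the same machine
        ips[month].add(ip)      # and repeat hits from the same address
    return {
        month: (len(uuids[month]), len(ips[month]),
                len(uuids[month]) / len(ips[month]))
        for month in sorted(uuids)
    }


# Hypothetical records, not real log data.
records = [
    ("2009-01", "192.0.2.1", "uuid-a"),     # three machines behind one NAT address
    ("2009-01", "192.0.2.1", "uuid-b"),
    ("2009-01", "192.0.2.1", "uuid-d"),
    ("2009-01", "192.0.2.7", "uuid-c"),     # one machine roaming across two addresses
    ("2009-01", "198.51.100.4", "uuid-c"),
    ("2009-02", "192.0.2.1", "uuid-a"),
]

result = monthly_ratios(records)
for month, (n_uuids, n_ips, ratio) in result.items():
    print(month, n_uuids, n_ips, f"{ratio:.3f}")
```

The NAT rows push the ratio up and the roaming rows pull it down, which is the pair of competing effects the ratio is meant to net out.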

From: ext_190193 Date: May 27th, 2009 06:54 am (UTC)

Re: assumptions !!

I didn't notice the time legend on the graph. However, I've noticed something interesting.
The ratio jumps at the time of new Fedora releases, then decays gradually until the next release, when it jumps again, which does make sense.
So the question is: is smolt's data still representative, especially with the fast release rate of Fedora? What I know is that smolt data is sent when a new installation is made. Am I right? If so, the high jumps in smolt IDs at release times raise the ratio, and eventually the mean for the next 6 months, but don't guarantee full-time use of Fedora during the same period. And if Fedora had a slower release cycle, would we see a lower mean and ratio?
Maybe.
From: jspaleta Date: May 27th, 2009 08:14 am (UTC)

Re: assumptions !!

Smolt phones home as a service; it's not just an install-time action.

The "jumps" in the ratio are like 3 percent month to month... the ratio is extremely flat, especially when you consider that the jumps in actual UUID and actual IP counts during the month of the F10 release were much, much larger.

Here's an example from the data set to show how flat the ratio actually is across months, relatively speaking.

month     UUIDs    IPs      ratio
2008-10   77822    67310    1.156
2008-12   154358   128506   1.201

2008-12 as a % change from 2008-10: [(2008-12) - (2008-10)] / (2008-10) * 100
%change   98%      90%      3.8%
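
A quick sketch of that percent-change formula in Python, using the two months quoted above. The function and variable names are just for illustration. Computed at full precision, the results land within about a percentage point of the figures in the table, which look truncated rather than rounded.

```python
def percent_change(old, new):
    """[(new) - (old)] / (old) * 100, as in the formula above."""
    return (new - old) / old * 100


# Monthly counts quoted in the comment.
oct_uuids, oct_ips = 77822, 67310
dec_uuids, dec_ips = 154358, 128506

uuid_change = percent_change(oct_uuids, dec_uuids)
ip_change = percent_change(oct_ips, dec_ips)
ratio_change = percent_change(oct_uuids / oct_ips, dec_uuids / dec_ips)

print(f"UUIDs {uuid_change:.1f}%  IPs {ip_change:.1f}%  ratio {ratio_change:.1f}%")
```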

Dude, we can't even get political polling results better than 3%.

There is a reason I gave both the mean and standard deviation of the monthly data. The variation in the ratio is small, much smaller than the variation in the raw UUID and IP counts. I could do a graph of the raw UUID and IP data showing how variable it is in comparison to the ratio. But for now I'll just show you some numbers to make the point.

2008-03 is the lowest UUID count at 54835
2008-12 is the highest UUID count at 154358
percent change in UUID counts 180%
that's a big swing in monthly UUID counts in that 9 month span.

2008-03 is also lowest IP count at 48003
2008-12 is also highest IP count at 128506
percent change in IP counts 168%
that's a big swing in monthly IP counts in that 9 month span.

The corresponding UUID/IP ratio however only has a percent change of 5.2%

180% and 168% change in the scale of the raw numbers and only a 5.2% change in the ratio of those numbers..across a 9 month span. The ratio is highly insensitive to what the raw UUID and IP counts are doing month to month.

So what about the 13 month release cycle effect? Taking 2008-03 as a low point.

month     UUIDs    IPs      ratio
2008-03   54836    48003    1.142
2009-04   115003   97480    1.179
%change   109%     103%     3.2%

One release cycle later and there is only a 3% upward drift in the ratio compared to a 100% upward movement in the UUID and IP counts.
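
To put the four months quoted in this comment side by side, here is a small Python sketch that compares the spread of the raw counts with the spread of the ratio. The "relative spread" measure, (max - min) / min, is my choice for illustration. The 2008-03 UUID count appears above as both 54835 and 54836; the sketch assumes 54836, and the difference does not change the picture.

```python
# (UUIDs, IPs) for the four months quoted in this comment.
months = {
    "2008-03": (54836, 48003),   # UUID count assumed to be 54836
    "2008-10": (77822, 67310),
    "2008-12": (154358, 128506),
    "2009-04": (115003, 97480),
}


def relative_spread(values):
    """(max - min) / min, as a percentage."""
    lowest, highest = min(values), max(values)
    return (highest - lowest) / lowest * 100


uuid_counts = [uuids for uuids, _ in months.values()]
ip_counts = [ips for _, ips in months.values()]
ratios = [uuids / ips for uuids, ips in months.values()]

uuid_spread = relative_spread(uuid_counts)
ip_spread = relative_spread(ip_counts)
ratio_spread = relative_spread(ratios)

print(f"UUIDs {uuid_spread:.0f}%  IPs {ip_spread:.0f}%  ratio {ratio_spread:.1f}%")
```

The raw counts swing by well over 100% across these months while the ratio moves by only a few percent.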

If you want to talk worst case as supported by the data..then worst case we are talking a ratio of 1.14 and that is calculated from 2008-03 data.

We could in fact produce a weekly ratio and put it side by side with the weekly yum stats if people felt it was a worthwhile number to track week by week.
