Sat 15 October 2011
GitHub in Numbers
Some times ago, I had got an idea — to create a new service around the GitHub. In the first place it will be useful to those users who have many repositories or following/watching many users and repositories.
My basic GitHub usage pattern is: to follow other coders to see in the News Feed which other coders they are following and which projects they are watching. This way I'm able to discover interesting people and projects on the GitHub. But this approach works only to a paritcular moment, while amount of data is not very big. When I had got a lot of items in my News Feed, I stoped to read it.
The idea is to aggregate items from the news feed and pop out interesting projects. For example, people who you are following instest can be used or people from the 2nd circle.
Moreover, this service should be useful for watching on the changes in your repositories forks. This feature will be useful for those guys who have tens public repositories. I already implemented this functionality for myself as a simple script. This script generates a RSS feed with new commits from the forks.
Also, there will be a funny medals and ratings. Push a dozen commits at the midnight and you'll become a "Midnight code warrior" :)
But before actual implementation of the idea, I decided to calculate some stats about GitHub's usage. I have to estimate how many people will be interested in such project, because I hate to create useless things.
Progress
At first, I wrote a GitHub's profiles and reps fetcher. It takes one login and downloads a his profile and add all users who he follows to the queue. Then it repeats the process for every login in the queue.
Because the rate limit in 5000 reqeusts a hour, my script worked about 2 days.
Totally about 57 thousand profiles were downloaded and 500 thousand repositories were fetched. I expected that there are much more users, but probably these numbers are correct. After all, if my script didn't download somebody then nowbody follows hib and his account probably abandoned.
This graph depicts a fetching progress. At this graph is well seen the moment when queue stopped to grow and started to fall down.
Watch & Follow
One of the most impotent stats for my project is portion of users who are watching a large amount of repositories or following many other coders:
|
Here we see that about 50% of users are watching more than 10 repositories and 7% (~ 4000) — are watching more than 100. Also I'm interested in those 20% (~11000), who are following more than 10 people, definitely they are unable to read through all their news feeds. And certanly those 410 users who follow more then hungred other coders will fall in love with aggregation feature. Myself is in the latter category as I'm following about 331 users.
A number of repositories
Average amount of public repositories is 8, 3 of them are forks of someone's else repository.
|
This graph shows that 60% of users have less than 10 public repositories and 15% does not have repositories at all. But about a quarter of users have from 10 to 100 public repositories. They will appreciate a fork watching feature of my project.
Also, I calculated a part of active repositories. Only 10% of repositories had been pushed at last month.
And more…
As I said before, totally about 500 thousand repositories were downloaded and 40% of them are forks. It is amazing! I thought this number should be much much bigger.
In addition, I estimated how much a 2nd cirle is. Average GitHub user follows 9 people and watches at 33 repositories. But his 2nd circle contains 230 people and 800 repositories. This is average number, but for geeks like me they are much bigger. I have in 2nd cirle 11548 people and 42882 repositories. It is about 1/5 of all GitHub!
And here is how many organisations relative to users:
And some tops, I know, you like them!
Top 20 Companies
| Company | Users |
|---|---|
| 37965 | |
| ThoughtWorks | 75 |
| 65 | |
| Mozilla | 61 |
| Red Hat | 58 |
| Freelance | 56 |
| 40 | |
| Japan | 40 |
| Yandex | 39 |
| Freelancer | 36 |
| Globo.com | 35 |
| Yahoo! | 33 |
| Intridea | 31 |
| 30 | |
| GitHub | 29 |
| Student | 26 |
| Emergya | 26 |
| Pivotal Labs | 24 |
| Microsoft | 23 |
| Engine Yard | 23 |
Top 20 Cities
| City | Users |
|---|---|
| 23657 | |
| San Francisco | 1441 |
| London | 962 |
| New York | 578 |
| Paris | 474 |
| Chicago | 458 |
| Seattle | 457 |
| Tokyo | 430 |
| Berlin | 423 |
| Germany | 417 |
| Portland | 346 |
| Toronto | 317 |
| Boston | 288 |
| Austin | 280 |
| Sydney | 272 |
| Stockholm | 261 |
| Japan | 244 |
| Los Angeles | 230 |
| Brooklyn | 226 |
| Melbourne | 221 |
Top 20 "followers"
| Login | Follows |
|---|---|
| snytkine | 3242 |
| mtsoerin | 1983 |
| webiest | 1903 |
| superfeedr | 1710 |
| charlenopires | 1236 |
| stonegao | 1205 |
| Marak | 1068 |
| speedygonzalez | 1059 |
| tyru | 1022 |
| esneko | 867 |
| josegonzalez | 640 |
| c9s | 556 |
| kanzure | 555 |
| take-cheeze | 517 |
| elliottcable | 495 |
| Sannis | 475 |
| mattn | 462 |
| j2labs | 453 |
| dpree | 446 |
| rkh | 444 |
Top 20 "who followed"
| Login | Followers |
|---|---|
| defunkt | 4005 |
| torvalds | 3803 |
| jeresig | 3466 |
| mojombo | 3248 |
| ryanb | 2737 |
| schacon | 2429 |
| paulirish | 2316 |
| dhh | 2170 |
| wycats | 2044 |
| ry | 2032 |
| rails | 1946 |
| 1802 | |
| jquery | 1767 |
| technoweenie | 1572 |
| pjhyett | 1563 |
| visionmedia | 1554 |
| cyanogen | 1410 |
| douglascrockford | 1380 |
| tpope | 1369 |
| android | 1317 |
Top 20 repository owners
| Login | Repositories |
|---|---|
| gitpan | 21976 |
| vim-scripts | 3735 |
| emacsmirror | 3101 |
| Epictetus | 911 |
| panega | 612 |
| jenkinsci | 602 |
| dev2dev | 504 |
| wave2future | 411 |
| CyanogenMod | 342 |
| MechanisM | 329 |
| rjbs | 325 |
| tokuhirom | 297 |
| rwldrn | 297 |
| aculich | 287 |
| rainly | 282 |
| albertobraschi | 278 |
| idega | 272 |
| rafl | 266 |
| apache | 258 |
| kristianmandrup | 244 |
Top 20 "watchers"
| Login | Watches |
|---|---|
| gitpan | 21976 |
| vim-scripts | 3736 |
| emacsmirror | 3588 |
| stonegao | 2789 |
| abecciu | 2474 |
| igrigorik | 2415 |
| charlenopires | 2339 |
| stan | 2318 |
| matagus | 2160 |
| smtlaissezfaire | 1955 |
| rmetzler | 1916 |
| shanlalit | 1897 |
| willi | 1896 |
| Epictetus | 1821 |
| filipeamoreira | 1812 |
| arden | 1783 |
| andrew | 1746 |
| methodmissing | 1665 |
| rkh | 1571 |
| lgs | 1511 |
The end
Certanly, some other interesting metrics could be calculated, using my database. If you have any ideas feel comment this post or send me an email.
P.S. — I think that my project have a chance to take off and will be useful for few thousand of people around the world. One more thing to think about is how to monetize it to pay rent for servers.
Comments !