Sunday, November 6, 2011

StackOverflow's Programming Language Bias

Using StackOverflow.com, I've always had the impression that it's biased towards C#, and .NET in general. That might be because C# is number one on the SO tags page, with the most questions tagged, or because much of the site itself is built with C# and ASP.NET MVC. So I thought it would be interesting to compare the tag popularity rankings on StackOverflow with a leading language popularity index (the TIOBE index).

Method
The "Stack Overflow Representation" is the ratio of a language's SO tag count (as a percent of total questions) to its TIOBE language index percent. A representation of 100% means that the language's share of SO questions matches its TIOBE share exactly. An "over-representation", greater than 100%, might mean there are more questions on SO than we'd expect. An "under-representation", lower than 100%, might mean there are fewer questions on SO than we'd expect.
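As a rough illustration of the ratio described above, here is a minimal Python sketch. The function name `representation` and the input figures are hypothetical, chosen only to show the arithmetic; they are not the data behind this post.

```python
def representation(tag_count: int, total_questions: int, tiobe_percent: float) -> float:
    """Return the SO representation of a language as a percentage.

    100.0 means the language's share of SO questions equals its TIOBE share.
    """
    if total_questions <= 0 or tiobe_percent <= 0:
        raise ValueError("total_questions and tiobe_percent must be positive")
    # Share of all SO questions carrying this language's tag, in percent.
    so_percent = 100.0 * tag_count / total_questions
    # Ratio of the SO share to the TIOBE share, expressed as a percentage.
    return 100.0 * so_percent / tiobe_percent


# Hypothetical figures, for illustration only (not the post's data):
# a tag on 5% of SO questions for a language with a 10% TIOBE rating.
example = representation(tag_count=50_000, total_questions=1_000_000, tiobe_percent=10.0)
print(f"{example:.0f}%")  # prints "50%"
```

In this made-up case the language has half as many SO questions as its TIOBE rating would suggest, so it would count as under-represented.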

Results
Surprisingly, JavaScript came out as the most "over-represented" language on SO, by quite a long way, at 294%. Could this be because programming in JavaScript is generally quite difficult, so people seek help more often? Following this was C# (which I had expected to be number 1), at 153%. After this, PHP, Ruby and Python were fairly balanced at around 100%. The most "under-represented" major language was C, at 11%. Three other major languages that seemed to be somewhat under-represented, below 50%, were C++, Java and Objective-C.
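One thing to keep in mind when reading these percentages is that a ratio is asymmetric around 100%: over-representation is unbounded, while under-representation is squeezed between 0% and 100%. A minimal Python sketch of one way to compare the two directions on equal footing is to take the base-2 logarithm of the ratio, so that each doubling or halving counts as one step. The helper name `log2_ratio` is hypothetical, and only the three figures quoted in the results are used.

```python
import math

# Representation figures quoted in the results, in percent.
reported = {"JavaScript": 294, "C#": 153, "C": 11}


def log2_ratio(percent: float) -> float:
    """Doublings (+) or halvings (-) relative to 100% parity."""
    return math.log2(percent / 100.0)


for language, percent in reported.items():
    print(f"{language:<10} {percent:>4}%  {log2_ratio(percent):+.2f}")
```

On this scale C at 11% (roughly a ninefold shortfall) sits further from parity than JavaScript at 294% (roughly a threefold excess), which is easy to miss on a linear percentage axis.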


Data Sources:

38 comments:

  1. I'd probably suggest that the over-representation of languages like javascript and c# on stackoverflow is due to them being (i) crap languages, so people need more help understanding them, or (ii) most often used by people who are novice programmers. C++ programmers, imo, usually know what they're doing, for example, and so don't need SO's help. The only surprise for me is why PHP isn't featured more highly on SO, since (i) it's the crappiest of the crap, and (ii) used a lot by people who don't know what they're doing.

    Replies
    1. PHP ?= Crap! No way, dude. It's awesome.
      When I develop with C++, I usually don't know what I'm doing. I don't go to stackoverflow.com, because I find it to be a bit stuck-up. I'll go to cplusplus.com, which is friendlier. I think that applies to a lot of novice C++ programmers, hence the under representation.

  2. Simon: Are you retarded? PHP is the crappiest of the crap?
    You are aware that Multi-billion dollar companies have based their entire product on it?

    Please Die in a fire :D

  3. "c# is over-represented because it is a crap language!"

    Right...

  4. Insanemal, why are you being rude and talking about violence because someone disagrees with you about a programming language?

  5. While PHP is extremely popular, and used by extremely large companies like Facebook and Zynga, it's a relatively inferior language. Its type coercion is terrible, the concepts it melds are relatively non-contiguous, the associativity of its ternary operator is backwards, and its runtime performance is abysmal to the point that Facebook wrote a PHP to C++ compiler in order to improve the performance.

    PHP is the epitome of the "worse is better" aesthetic. Please don't try to twist that into an argument that it's in some way good simply because of its popularity. To do so falls into the logically fallacious realm of argumentum ad populum.

  6. "The only surprise for me is why PHP isn't featured more highly on SO, since (i) it's the crappiest of the crap"

    Because there is no strong correlation between crap languages and over-representation of a language on SO?

  7. You can base multi-billion dollar companies on crap languages. It doesn't mean the language isn't crap, just that they've learned to work effectively within the limitations.

  8. Ratios over 100% are possibly overemphasized in your first graph since ratios under 100% are sandwiched between 0% and 100%.

    See also example 10 in :
    http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

  9. It would be interesting to factor "Language Mastery" into this analysis, but I have no idea where you'd gather the statistics from. At a guess I would say that there are not that many weak programmers noodling around with:

    Ada - they have been rejected as 'unfit' by the Department of Defense
    Assembly - they have always been too frightened to learn about their CPU
    C - they have long since migrated to Java or are trying to write iPhone Apps
    Lisp - they were alienated by the ( )s, leaving it to be used by AI researchers
    Lua - they can't really make mistakes with this as it is virtually idiot-proof

    So, without any real evidence... Lua must be the least difficult language.

  10. I would guess that JavaScript is overrepresented because it is (basically) a monopoly on web UIs. Web-based front ends are extremely popular in all fields these days, and while the back end might be written in any number of the other languages, chances are relatively high that you're also going to have to do some front-end development, and web UIs are only getting more popular. Consider all of the traditionally desktop applications that are moving their UIs to web-based (SabNZBD+ comes to mind immediately.)

    Therefore, the Ruby on Rails programmers, the Python/Django stackers, and the poor, poor Java/JSP devs are probably also going to be bringing JavaScript questions to the board.

    And with the advent of node.js and competitors, JavaScript is also starting to encroach on the server-side realm...

  11. @Simon

    According to your logic, Visual Basic should be the number one represented language.

  12. A couple of widely used, but often hated, languages are missing: cobol and fortran. Both of them are actually more represented than ada or rpg

  13. Actually, I propose that the issue isn't languages at all, it's libraries. Javascript and C# have the largest and most complex libraries that ordinary programmers encounter in a given day (namely, the DOM and .NET). These have accreted without competition from day one and, in the case of Javascript, have multiple independently derived implementations, so they tend to be difficult.

    Calling Javascript "a crap language" doesn't help. It's a fairly ordinary implementation of Scheme-with-labels, and its core is easily grasped by any programmer within a day. It's the lack of tooling and the complexity of the environment that makes it challenging.

    Javascript and C# also have the headache that they're User Interface languages. People who program in them are programming for other people, not for operating systems and servers. People are more complicated and challenging than computers.

  14. The reason FORTRAN is not included is that Real Programmers don't ask questions on SO, they consult the JCL manual.

  15. Leave it up to the .net evangelists to turn a perfectly good blog post into "here's some vague reason php sucks." You know, of the list, people are having more trouble with C# than any other language other than Javascript (which we expect). Maybe if you idiots spent more time writing code, and less time complaining about languages you don't even understand the basic concepts in, the numbers would look a little different.

  16. The number of questions correlates with the lack of contemporary and sufficient documentation. C, for example, is more than established and well-known. Hence many jobs, but fewer open questions.

    There is some question bias, because some languages are more popular and precious to spare time developers, like anything not Java.

  17. Javascript is probably overrepresented as there's likely a lot of questions like "this js works in Firefox but not IE 6" and the N different implementations (across versions) of javascript.

    There should be some way of collapsing those types of questions to reduce the total number of questions.

  18. Given that the TIOBE index is garbage, I think judging the ratio of real projects on a public open source repo by it is completely backwards. You might better ask the question: based on the GitHub distribution, where is the TIOBE index totally out of whack?

    Or you could concede that they are likely both not representative of anything general and thus comparing them tells you nothing.

  19. Looks like you missed R. With over 7600 tagged questions on stackoverflow.com, it should be ahead of Lua and after C.

  20. @Ariel F - yes, but this current data was only looking at the top 20 TIOBE index languages, and on TIOBE, R didn't feature in its top 20.

  21. I agree with Alex. The TIOBE index's methodology is garbage. There's no evidence it has any relationship to the real world number of programmers.

    What your numbers show is not SO's programming bias, but simply that SO's ratios are different from TIOBE's. There's no anchor to the real world here. One cannot say who is over nor under representing.

    It's still interesting data, but there's nothing to support the conclusion. Throw in some more cross-language sites like Github and the scattering of language use would be interesting to look at, but without an anchor to reality there's no way to judge who is biased.

    NOTE: Github's language detection algorithms are not always correct, and sometimes things get a bit overrepresented... https://github.com/gitpan

  22. Nice.

    My punt at explaining your findings: I would hazard a guess that SO over representation is inversely correlated with quantity and quality of (official) documentation.

    I'd also be a little wary that you don't do anything with framework tags. There are many frameworks which imply a language, and SO only allows five.

  23. @willr, interesting theory ... goes some way to explain why well established languages like RPG, COBOL, FORTRAN, C, etc. don't feature much on StackOverflow, given the significant number of developers using them. Could also be that these devs tend to be older and maybe just don't head straight online for a solution to their problem as many younger devs do?

  24. JavaScript is probably over-represented because a lot of Web Designers try to use it without the most basic programming experience or training.

  25. PHP, as a language, is crap even though large companies are using it. People working with PHP in large companies agree on this too.

  26. @Lars, hehe - unfortunately you could say the same about things like COBOL and VB - they're not pretty, but they get the job done and they have carved out a niche for themselves in certain classes of applications.

  27. PHP has several flaws, but it was there where other languages weren't. And I wish there were other languages, because there are so many websites done in PHP that are very difficult to maintain today (even if you get paid)!!!

    There is also the question of how a programming language is used: web programming, desktop programming, cross-platform, cross-OS, or single platform or single OS ...

  28. I have about a dozen Java technology related forums in my bookmark list, ranging from JavaRanch to the JBoss and Spring sites. The fact that SO is lacking in Java is no shock. Java had a number of solid forums before SO.

    And I wouldn't be surprised if there were many C programmers on usenet or on mailing lists. When I was writing C about 15 years ago, Usenet and email lists were the best ways to get help. I wonder if that is the same now.

  29. Did you consider posts tagged with .NET (but not a language tag)?

    Of course, the majority of questions on SO are about libraries (often 3rd-party libraries) and tools, regardless of language tags. A lot of questions tagged C# should really be tagged .NET since they are questions about libraries or tools that any .NET language can use.

  30. The measure of "over-representation" is certainly a combination of many things but I suspect the two primary things it is measuring are language growth of fresh developers (rather than growth of usage of the language which TIOBE measures) and also the lack of other sources of really good online documentation.

    For example:

    Javascript is not any harder than many of the other languages on the list, but for Javascript, StackOverflow is one of the best sources of documentation online. There is currently a flood of new developers joining the Javascript community.

  31. So what if they ask more questions? Does that make them any less programmers? I bet you never posted a single question about your language.

    Isn't that what the web is for? Making information more accessible. You tend to forget that sometimes even a great developer needs others' opinions or advice.

    And to correct you, .NET, or more specifically C#, is one of the best general-purpose languages, and if you can use MSDN efficiently you'll find your name less online.

    I believe that, regardless of anything, a great developer will write a good program using any language. Perhaps if you spent more time developing software than trying to engage in language politics you'd be great too.

  32. A multi-billion company doesn't have to use the best tools to make a lot of dough.

    Just ask anyone on Wall Street or any currently unemployed person.

    Sure Facebook uses PHP, but is facebook rock solid, stable enough to run a power plant? Yeah, I thought so.

  33. I'm surprised that a lot of people say JavaScript isn't hard ... that may be so for just the core language by itself, but I've found that when you try and do anything significant with it involving the web/DOM/browser-compatibility or other libraries such as jQuery you can quickly run into problems.

    The other thing to consider is that the majority of questions on SO tagged with JavaScript would probably be relating to an associated library or browser issue, so that may have inflated the total number of JS questions asked.

  34. @Qwertie, the language from TIOBE was just matched with the nearest tag on StackOverflow. I only used one tag per language, and didn't want to get into the complexity of (C# + .NET - VB.NET), etc.

  35. Javascript is the most used language on the planet, so obviously more people are going to have questions about it, and some parts of it don't make much sense unless someone explains them to you.

  36. @climboid, yeah, JavaScript has also become known as "the duct tape of the internet" (or was that perl?). Most if not all web application development includes some bits written in it. That and the fact that the people using it are often not experts in JavaScript.
