My goal for OSCON today is simple: find ways to skip out on my grad school stats class. See, I've already spent years in lecture halls laboriously working out proofs of bell curves representing the birthday distributions of rabbit populations, and the population trends of those rabbits when decimated by a nearby pack of wolves with frickin' laser cannons strapped to their heads. Throw an equation with sigmas and epsilons in front of me, and my brain clicks off in a Pavlovian reaction and starts mechanically droning through a MATLAB computation. However, (1) my PhD program says I need stats to do research, and (2) my team at work wants to analyze metrics on the open source communities we work with, so this seems unavoidable.

It's not that I dislike stats; I just need to be able to play with real numbers and scenarios, and eventually I'll peel back layer after layer and end up with my nose in a stats book muttering excitedly about kernel density estimates like the math geek I am. So I'm mentally assembling the syllabus of what I actually want to learn about metrics and statistics as I go around OSCON. Basically, I'd like to be able to answer questions about a community with some known level of certainty by collecting and analyzing (typically fairly large) data sets from and about them. I suspect this will involve math, software tools, and visual design and presentation. I've broken it into five points.

  1. Turning a community into a set of questions. What things can be measured, and which are most helpful to measure? How do you gauge what the most useful questions would be for a given community, weighed against how much it would cost and how long it would take to get those answers? This likely involves a good deal of reading of prior research studies and metrics initiatives to see what others have measured and what methodologies they've used, along with community interviews and observations to see what questions haven't been answered.
  2. Turning a set of questions into a set of numbers. This is mostly a tool question. How do you instrument communities to (automatically, one hopes) collect the data that you need? What APIs, bots, scrapers, queries, etc. are out there, how do you use and hack them, and are there emerging best practices on architectures, designs, and formats for all this? Tools will come from both open source development practitioners (release managers who run Bugzilla queries to monitor code quality and development activity, documentation leads who want to know which MediaWiki pages are most frequently viewed and edited, etc.) and academics doing larger-scale, longer-term studies on the activities of software communities in general, who've built homebrew open source tools to collect their data. This is where I expect the two worlds (open source and academic) to overlap the most. (The first sketch after this list shows a toy example of such instrumentation.)
  3. Turning a set of numbers into a set of meaningful calculations. Basically, math. What mathematical tools are out there for analysis, what are the tradeoffs between them, and how do you apply and verify them? Insert your standard graduate-level statistics-for-researchers class here. Or perhaps statistics-for-businesspeople-who-like-math; that would make a fascinating flavor that might actually be more down-to-earth. Recommendations on texts and/or classes are welcome for this! (The second sketch after the list shows the flavor of analysis I mean.)
  4. Turning a set of meaningful calculations into compelling visuals. Again, tools. Animations, visualizations, how to put things in a browser. JavaScript, Processing, HTML5 and Canvas; GNUPlot, Octave, R? (Okay, perhaps those last three belong in the #3 category.) Maybe there are other things out there, though. Maps? Clouds? Sparklines? NFL-style drawn commentaries over realtime video? Clearly I'll need to narrow this category down considerably, and clearly it will be driven by the sort of data and results I'm trying to visualize... but there's a lot of fascinating work in data visualization these days, and I'm eager to explore it and to see how much of that sphere is built on open tools (I will prepare to be somewhat depressed). (The third sketch after the list is one tiny example.)
  5. Bringing meaningful calculations, compelling visuals, and the story they tell back to the community. How to present these things in actionable ways to both hacker and suit audiences, how to make them "shiny" enough to look good but not so shiny as to make them look untouchable. How to design metrics work so that others get involved in it; how to make it hackable, how to make it continuous and itself a public, open, collaborative process instead of a magic report o' numbers that pops out every so often from behind closed doors. It's not necessarily a linear steps-one-through-five process, just as "making an open source project" is not... well, if you make the whole dang thing and then announce "I'll put the GPL on it!" you're Doing It Wrong. This storytelling and participation outreach needs to be threaded into the project from Day One. (And this post is an attempt to start down that road, in case I do end up picking up on this during the school year.)
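
To make point 2 above a bit more concrete, here's a minimal sketch (in Python, assuming the `requests` library and a hypothetical wiki URL) of one kind of instrumentation: pulling recent-changes data from a MediaWiki API and counting edits per page.

```python
# Toy instrumentation sketch: count recent edits per page on a MediaWiki
# instance via its standard API. The URL below is a placeholder; point it
# at your community's wiki ("/w/api.php" on most installs).
import collections

import requests

API_URL = "https://wiki.example.org/w/api.php"  # hypothetical endpoint

params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|user|timestamp",
    "rclimit": 500,  # the per-request maximum for anonymous clients
    "format": "json",
}
data = requests.get(API_URL, params=params, timeout=30).json()

edits_per_page = collections.Counter(
    change["title"] for change in data["query"]["recentchanges"]
)
for title, count in edits_per_page.most_common(10):
    print(f"{count:4d}  {title}")
```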
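
For point 3, a taste of the flavor of analysis I mean: take a made-up set of per-contributor commit counts and put an honest uncertainty estimate on the average via a simple bootstrap, rather than quoting a bare mean. The numbers below are invented for illustration.

```python
# Toy analysis sketch: per-contributor commit counts (made-up numbers),
# with a bootstrap confidence interval on the mean so the answer carries
# a stated level of certainty instead of a bare average.
import numpy as np

rng = np.random.default_rng(0)
commits = np.array([1, 1, 2, 3, 3, 5, 8, 12, 14, 40, 90])  # hypothetical data

# Resample with replacement many times and look at the spread of the means.
boot_means = np.array([
    rng.choice(commits, size=len(commits), replace=True).mean()
    for _ in range(10_000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean commits/contributor: {commits.mean():.1f} "
      f"(95% bootstrap CI: {low:.1f} to {high:.1f})")
```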
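
And for point 4, one tiny rendering of that same toy data: a histogram with a kernel density estimate (there's that phrase again) drawn over it. matplotlib and SciPy here are just stand-ins, not a commitment.

```python
# Toy visualization sketch: the same made-up commit counts as a histogram
# with a kernel density estimate overlaid. matplotlib and SciPy stand in
# for whichever plotting stack (R, GNUPlot, Processing, ...) wins out.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

commits = np.array([1, 1, 2, 3, 3, 5, 8, 12, 14, 40, 90])  # same toy data

kde = gaussian_kde(commits)
xs = np.linspace(0, commits.max(), 200)

plt.hist(commits, bins=10, density=True, alpha=0.4, label="histogram")
plt.plot(xs, kde(xs), label="kernel density estimate")
plt.xlabel("commits per contributor")
plt.ylabel("density")
plt.legend()
plt.savefig("commit_distribution.png")
```

None of these is the real pipeline, of course; the point is just that each step in the list above bottoms out in something small and runnable.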

I realize these are all vast in scope and anything I do in a semester will be but a surface pass, and I've accepted that. If I do an overview now and learn how to work with each of these tools (mental, software, and otherwise), I'll be able to dive deeper when needed, and I'll have a better view of what I don't know.

Ideas, thoughts, suggestions, and so forth are very, very welcome. I'm quite new to the metrics/data/stats/research space, other than forced marches through Abstract Proofs Go Here territory in high school and again in engineering college, and I know that many people reading this post have far more relevant experience, just applied in different domains. What books should I read, what tools should I check out, and what sorts of starter questions and data sets would be interesting to capture and analyze? (I have some thoughts, but will withhold them for a bit so as not to overly bias initial comments.)