A Project Puppy update is long overdue: our first public data was released last week after working through privacy/licensing/permission issues. That journey warrants its own storytelling down the line -- but right now I'm exhausted enough that I'm just going to put it out there in all its unexplained messy glory, and profusely thank Jon Stolk for his extraordinary courage in offering to be the first person to go through this. I know I'm putting out something that raises more questions than it answers, so... please ask them, because we'd love to answer.

So, that journey of making the data transparent... I'm trying to think about the best way to explain it. I think what I'll do is try to reverse-engineer it into clear steps that I'm attempting to apply to my own small "radical transparency research" project (more on this as it comes up; I've mostly not had time to truly start it). The main purpose of this second experiment, which I'll call "Project Kitten" for amusement purposes, is to take more time to experiment with, clarify, and walk through the copyright, publishing, and ethics/IRB concerns of doing radically transparent research.

Here's the raw crazy idea, which I expect to have all sorts of things wrong with it that I'll find out by doing.

If there are cultures that operate with radical transparency, and individuals within those cultures consent to (or even ask for) a completely open research process... why can't I, as a researcher, abide by that request? If I...

  1. Got permission to interview someone and use their anonymized, locked-in-a-vault transcript in full for research (basically, "normal research procedure"), and then
  2. interviewed them -- at this point we're still in the middle of a normal research procedure. Okay. Let's keep going.
  3. Next, I'd assign copyright of the resulting transcript to them, while maintaining "normal research rights" to use the "anon/locked" versions of the data in the stuff I'm working on for publication. This is a bit unorthodox, but outwardly not a huge deal; if the interviewee decides to stop here, we've still done things the "normal research way," it's just that transcript copyright ownership is actually clear (as opposed to rather fuzzy, which seems to be the case for most research interview transcripts).
  4. Now, if the interviewee wants to go further... things start looking weird. As the copyright holder, the interviewee now has the choice to release their interview data under an open license. They can choose to release only a subset of it, they can choose to edit it before release, even to the point of anonymizing it to their satisfaction before putting it out there... it's their call. (All sorts of questions should be surfacing, and alarm bells going off, for any researcher who's read this far. What if it's anonymized but someone guesses correctly? How can we be sure that these people really understand what they're getting into?)
  5. Okay. Still with me? We now have an unambiguously public fork of the data (sketched as code after this list), complete with paper trail, that can be used by anyone for anything (after the relevant IRB deems it a public/exempt data set): by interviewees for project marketing (ding ding concern bell!), by researchers (original team or not) who can then perform their coding/analysis/etc. in public, and so on. The latter is what I am hoping for; I want to "open up" the black box of how "engineering education research is done," to let people listen in on conversations of this type and see what it looks like to think that way. Discourse exposure. Interesting conversations.
  6. And to throw even more interestingness into the mix: the original research team has the original private version of the data (which may be anywhere from "bitwise identical" to "only vaguely reminiscent" of the public version, and may include things completely excluded from the public dataset). Now. Research group: what can you do with that mix of information?
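
For the version-control-minded, here's a toy sketch of the fork I have in my head. To be clear: these names and this code are entirely hypothetical, a picture of the idea rather than any actual research tooling, and the real process is legal/social, not software.

```python
# Toy sketch only -- hypothetical names, no real infrastructure implied.
# It just makes the "fork" metaphor concrete: one private research copy,
# one interviewee-controlled public copy that may diverge from it.
from dataclasses import dataclass

@dataclass
class Transcript:
    copyright_holder: str   # step 3: the interviewee, not the research team
    license: str            # stays closed unless the holder opens it (step 4)
    text: str

def release_public_fork(private: Transcript, edited_text: str,
                        open_license: str = "CC BY-SA") -> Transcript:
    """Step 4: the copyright holder releases an edited/anonymized fork
    under an open license. The private original is left untouched, so
    the two copies can diverge arbitrarily (step 6)."""
    return Transcript(
        copyright_holder=private.copyright_holder,
        license=open_license,
        text=edited_text,
    )

# The research team keeps `private` (anonymized, locked down, IRB-standard);
# once the IRB deems the fork public/exempt, anyone can work with it.
private = Transcript("interviewee", "all rights reserved",
                     "full raw transcript ...")
public = release_public_fork(private,
                             "interviewee-edited, self-anonymized version ...")
```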

All sorts of really interesting, deep, chewy issues here: how this process affects what people say, questions of coercion, what risk prediction and mitigation look like in this case, how to make sure the public data doesn't re-identify the anonymized data if an interviewee doesn't want it to, and so on. Fun to consider, interesting to work through, probably a painful set of IRB convos... but I've got time, and I am learning to be patient.

I'm talking with my advisor and slowly (very slowly) moving through conversations with Purdue's IRB, as well as with other researchers who've studied open communities; this blog post is partly a way for me to have something to point them to when I email them (hi!). Naive as this may be, it looks to me right now like most of the other researchers who've done this have followed standard procedure -- anonymized and locked-down data, working the way the IRB expects researchers to work. The rationale I sense is "because I didn't realize there might be a different possibility earlier on, and don't have time to look into it now because I want to finish my thesis" -- but I could be wrong. I hope I am wrong, very wrong, and wish to be corrected! That is also the reason I'm hurtling into this process in my first year, while it's a side project that does me no harm if it fails.

This post is poorly structured, poorly written, and nowhere near as well explained as I would like. But: release early, release often. Hopefully someone, someday, can help me turn this into a clearer writeup. For now, though, I think the messiness is to be expected; we're exploring unfamiliar terrain, and this sort of terrain is usually shrouded in fog.

Welcome to the fog.