On the relative openness of text/document formats: .txt and .csv

My friend and research colleague Todd Fernandez writes: I know the ODF (Open Document Format) is generally the preferred format from an open documents standpoint. My question is whether you [and other Free/Libre groups] would consider .txt files in Unicode and .csv (comma-separated value) data files equivalently open?

First of all, I can't answer for FSF or other groups -- so what you're getting is the Mel answer to "are .txt and .csv files open?" This email reply is getting so long that I'll turn it into a blog post.

The Mel answer is "yes." I consider .txt files in Unicode and .csv to be "open." In fact, I personally prefer to use them over ODF, which I'm rarely able to open with the software pre-installed on most computers I encounter.

In the paragraphs that follow, I'll get into more detail behind my reasoning. To define "open" and "free," I'll loosely use the Open Source Definition (OSD) and the Free Software Definition, noting that these were created for software and need to be adapted for a discussion on file formats.

In discussing these file formats, I'll cover .txt first, since .csv builds on top of .txt. The file extension ".txt" can mean a lot of things, and each of those things has a different level of open-ness (from a more legal-ish open standards standpoint, see definitions above) and accessibility (from a "how many people are able to read/write them on their computers with their current software" standpoint). I personally care about both.

ASCII and ANSI

You asked specifically about Unicode, but ".txt" can also mean ASCII, which is/was an American-developed standard for text information. I'll start here, since ASCII was where Unicode began (in fact, ASCII's 128 characters are Unicode's first 128 characters).
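The "first 128 characters" claim is easy to check for yourself. Here's a minimal sketch in Python (my choice of language for illustration; the helper name is_plain_ascii is made up for this example). It builds all 128 ASCII characters, encodes them as ASCII and as UTF-8, and compares the results.

```python
# Build a string containing all 128 ASCII characters (code points 0..127).
ascii_text = "".join(chr(i) for i in range(128))

# Encode the same text two ways.
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# For these characters, the ASCII bytes and the UTF-8 bytes are identical.
same_bytes = ascii_bytes == utf8_bytes

# Each character's Unicode code point equals its ASCII byte value.
same_code_points = all(ord(ch) == b for ch, b in zip(ascii_text, ascii_bytes))


def is_plain_ascii(text: str) -> bool:
    """Hypothetical helper: True if text fits entirely within ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(same_bytes, same_code_points)
print(is_plain_ascii("hello"), is_plain_ascii("h\u00e9llo"))
```

In practice, this means a pure-ASCII .txt file is already a valid UTF-8 file, byte for byte.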

ANSI X3.4-1968 is the document describing this encoding. It is a standard that's widely used and published, lots of programs can access it, and you can use it in your programs without too much effort and without licensing costs (as far as I can tell; I can't easily find the exact legal status). Basically, if I translate the Open Source Definition from software to file formats, I can't find evidence that ANSI X3.4-1968 itself contradicts any of its criteria. I know this is different than proving that it absolutely meets all these criteria, but this is good enough for me.

As an interesting side note: ASCII is an ANSI standard (American National Standards Institute, hence the ANSI- prefix on its standards document). ANSI's definition of "open" is about "do stakeholders have access to the consensus decisionmaking process that forms the standards?" and "are licensing fees and getting permission to use the standard at a reasonable and not overly burdensome level?" rather than "are there no licensing fees and permissions needed at all?" The latter corresponds to part of the 4 requirements for "freedom" according to the FSF, so it's possible for something to be "open" according to ANSI but not "free" according to the FSF.

It would be interesting to take a more rigorous look at whether ASCII's legal/licensing terms meet the Four Freedoms. I'm not a lawyer, but I'd be interested in what a lawyer would say.

UTF/Unicode

".txt" can also mean UTF, or Unicode Transformation Format; this is what your email asked about. The "A" in ASCII stands for "American," and "American" in this case meant "monolingual," meaning that if you wanted to type something outside ASCII's 128-character, heavily-biased-towards-American-English set, you were flat out of luck. Unicode took the first 128 characters of ASCII, then... kept on going. Unicode is a more internationally-savvy superset of and successor to ASCII.

UTF-8 and UTF-16 are two Unicode encodings in common use. The numbers refer to the size in bits of the basic unit each encoding works in, not to the size of the alphabet: both can represent every character in Unicode's gigantic international meta-alphabet. They just spend different numbers of bytes doing it (UTF-8 uses one to four 8-bit units per character, UTF-16 uses one or two 16-bit units). UTF standards are developed by the Unicode Consortium, which I keep mistakenly typing as the "Unicorn Consortium" (which would be kinda awesome).
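A quick way to see what the 8 and the 16 mean in practice is to encode a few characters both ways and count the bytes. This is a minimal sketch in Python; the sample characters and the helper name encoded_sizes are my own picks for illustration. I use "utf-16-le" so that Python doesn't prepend a byte-order mark and throw off the counts.

```python
# Sample characters, written as escapes so the source stays plain ASCII:
# "A", e-acute, the euro sign, and an emoji outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_sizes(ch: str) -> tuple[int, int]:
    """Hypothetical helper: (bytes in UTF-8, bytes in UTF-16) for one character."""
    # "utf-16-le" fixes the byte order and omits the byte-order mark.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} byte(s) in UTF-16")
```

Both encodings handle all four characters. They differ only in how many bytes each character costs, which is why text that is mostly ASCII tends to be smaller in UTF-8.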

Unicode's copyright permissions language seems very, very similar to the 4 requirements of the FSF for freedom. Since ASCII is a subset of Unicode, this makes me even more comfortable saying that ASCII is also "open." However, I am not a lawyer, nor am I using "open" in a legally rigorous sense here -- remember, I am a non-legally-trained engineer going "yeah, I don't see anything that contradicts the definitions made for Open and Free software, if we were to translate it to file-formats-land."

CSV

CSV is a format that's layered atop plaintext (ASCII, UTF, whatever). In other words, you use plaintext to write a CSV document. CSV itself has no single binding specification (RFC 4180 documents a common form, but it's informational rather than a formal standard), which means it's a free-for-all, and... you can use it for whatever, because you're pretty much making it up. It's just that you're making it up in the same way lots of other people have made it up.
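To make the "layered atop plaintext" point concrete, here's a small sketch using Python's standard csv module. The rows are invented for illustration, and the quoting rules shown are just that module's defaults (one of the common conventions), not an official definition of CSV.

```python
import csv
import io

# Made-up example data; the second row contains a comma and double quotes.
rows = [
    ["name", "note"],
    ["Todd", 'said "hi", then left'],
    ["Mel", "caf\u00e9"],
]

# Write the rows as CSV into an in-memory text buffer.
# newline="" leaves the writer's own line endings untouched.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()  # an ordinary string, nothing more

# Turning that text into file bytes is a separate step: pick a text encoding.
csv_bytes = csv_text.encode("utf-8")

# Reading reverses the layers: decode bytes to text, then parse the text.
decoded = io.StringIO(csv_bytes.decode("utf-8"), newline="")
parsed = list(csv.reader(decoded))

print(csv_text)
print(parsed == rows)
```

The CSV layer only decides where commas, quotes, and line breaks go. Whether the bytes on disk are ASCII or UTF-8 is decided entirely by the text encoding underneath.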

Then again, "official standards" are just a group of people who have made things up and have agreed to stamp the label of "official" on their work; it's still a social construct that depends on how many other people agree with them. (I can make something an "official" standard according to Mel, but if nobody else agrees with me, my standard is useless.)

Anyway, I'm not sure if that qualifies CSV as "open," but it's certainly not "closed." To me, CSV is just as open as whatever underlying plaintext (.txt) format it's using. But again, I'm not a lawyer, don't work for the FSF, etc. This is just one hacker's opinion, and I'd love to hear what others think.