A brief survey of data languages

For a while now I’ve been trying to find a general, human-readable data format for C2. I’ve created some simple languages in the past, such as NASL [PDF 69Kb], but maintaining a custom language is too time-consuming for me to consider anymore. So I’ve been shopping around for a new language to use for data storage and wanted to share some of my findings. For my examples I am using the geometry data of a textured square.

Requirements
A good data language satisfies a few requirements.

1) Has a C/C++ parsing library
2) Has a Python library, which makes writing tools easy
3) The language is simple, without features that complicate its use as a data language
4) The language is easily human readable and writable

XML
First up is XML, everyone’s go-to human-readable data format.

<geometry>
  <vertexbuffers>
    <Position>
      (-1.0 -1.0 0.0 1.0)
      (-1.0  1.0 0.0 1.0)
      ( 1.0 -1.0 0.0 1.0)
      ( 1.0  1.0 1.0 1.0)
    </Position>
    <TexCoord0>
      (0.0 0.0)
      (0.0 1.0)
      (1.0 0.0)
      (1.0 1.0)
    </TexCoord0>
  </vertexbuffers>
  <indexbuffers>
    <mesh0 mode="triangle_strip">
      0
      1
      2
      3
    </mesh0>
  </indexbuffers>
</geometry>

XML (24 lines, 430 chars)

It seems that everyone uses XML for everything, so why shouldn’t I? XML libraries exist for virtually every programming language. As a data language, however, I find it much too complicated. There are too many ways to specify the same data (as attributes, as text between tags, as nested elements). The syntax is verbose, and walking the DOM to get at the data is no treat either.
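To make the complaint concrete, here is a minimal sketch of reading the square with Python's standard xml.etree.ElementTree module. The helper name parse_vectors is my own invention for illustration, and the closing tag is written as </mesh0> so the document is well-formed. Note how the mode comes from an attribute, the indices come from element text, and the vertex data still has to be split and converted by hand.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<geometry>
  <vertexbuffers>
    <Position>
      (-1.0 -1.0 0.0 1.0)
      (-1.0  1.0 0.0 1.0)
      ( 1.0 -1.0 0.0 1.0)
      ( 1.0  1.0 1.0 1.0)
    </Position>
  </vertexbuffers>
  <indexbuffers>
    <mesh0 mode="triangle_strip">
      0
      1
      2
      3
    </mesh0>
  </indexbuffers>
</geometry>
"""


def parse_vectors(text):
    """Hypothetical helper: turn lines like '(-1.0 1.0 0.0 1.0)' into float tuples."""
    vectors = []
    for line in text.strip().splitlines():
        fields = line.strip().strip("()").split()
        vectors.append(tuple(float(field) for field in fields))
    return vectors


root = ET.fromstring(XML_TEXT.strip())

# Vertex data lives in element text and needs its own mini-parser.
positions = parse_vectors(root.find("vertexbuffers/Position").text)

# The draw mode lives in an attribute, the indices in element text.
mesh = root.find("indexbuffers/mesh0")
mode = mesh.get("mode")
indices = [int(token) for token in mesh.text.split()]

print(positions)
print(mode, indices)
```

XML hands back nothing but strings, so every numeric value needs a conversion step like the ones above.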

JSON
Next is JSON, a data serialization format that originated in JavaScript.

{
  "vertexbuffers": {
    "Position": [
      [-1.0, -1.0, 0.0, 1.0],
      [-1.0,  1.0, 0.0, 1.0],
      [ 1.0, -1.0, 0.0, 1.0],
      [ 1.0,  1.0, 1.0, 1.0]
    ],
    "TexCoord0": [
      [0.0, 0.0],
      [0.0, 1.0],
      [1.0, 0.0],
      [1.0, 1.0]
    ]
  },
  "indexbuffers": {
    "mesh0": {
      "mode": "triangle_strip",
      "data": [
        0,
        1,
        2,
        3
      ]
    }
  }
}

JSON (27 lines, 441 chars)

Many JSON parsing libraries exist for both C/C++ and Python. Python’s json module conveniently parses directly into a Python dictionary, making the data easily accessible. JSON is also much more human readable than XML. While it’s definitely a step up from XML, it isn’t perfect. When writing JSON files by hand, I found that the double-quoted value names and the commas required between values made mistakes easy. Special characters inside strings must also be escaped, which is a nuisance. If I didn’t need to write data files by hand, JSON would be a good solution, but as it is it doesn’t quite fit my needs.
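A short sketch of both sides of that trade-off, using only Python's standard json module. The trimmed-down document and the "shader" key are made up for illustration. Loading is a single call, but a single stray comma from hand editing is enough to reject the whole file, and embedded text has to be escaped.

```python
import json

JSON_TEXT = """
{
  "indexbuffers": {
    "mesh0": {
      "mode": "triangle_strip",
      "data": [0, 1, 2, 3]
    }
  }
}
"""

# Parsing yields plain dictionaries and lists with numbers already converted.
geometry = json.loads(JSON_TEXT)
mesh = geometry["indexbuffers"]["mesh0"]
print(mesh["mode"], mesh["data"])

# A typical hand-editing slip: a trailing comma after the last element.
BROKEN_TEXT = '{"data": [0, 1, 2, 3,]}'
try:
    json.loads(BROKEN_TEXT)
    trailing_comma_accepted = True
except json.JSONDecodeError as error:
    trailing_comma_accepted = False
    print("rejected:", error)

# Quotes and newlines inside a string must be escaped in the file.
escaped = json.dumps({"shader": 'return tex2D(s, uv); // "diffuse"\n'})
print(escaped)
```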

DS Object Template
The next language is inspired by the object templates used in Dungeon Siege, created by Gas Powered Games.

[geometry]
{
  [vertexbuffers] {
    Position = [
      [-1.0, -1.0, 0.0, 1.0],
      [-1.0,  1.0, 0.0, 1.0],
      [ 1.0, -1.0, 0.0, 1.0],
      [ 1.0,  1.0, 1.0, 1.0]
    ];
    TexCoord0 = [
      [0.0, 0.0],
      [0.0, 1.0],
      [1.0, 0.0],
      [1.0, 1.0]
    ];
  }
  [indexbuffers] {
    [mesh0] {
      mode = "triangle_strip";
      data = [
        0,
        1,
        2,
        3
      ];
    }
  }
}

Object Template (28 lines, 447 chars)

This language solves many of the problems I have with JSON: value names aren’t quoted, semicolons end every value statement, and separators aren’t required after subsections. As a bonus, section names are set off in brackets, which makes the file a bit easier to scan visually. The major con is that this is not a standard format, so there are no existing C/C++ or Python libraries that will parse these files. I wrote a parser using Boost Spirit to see if the format would be worth developing further, but it didn’t seem to be.

YAML
The last language I looked at was YAML. Billed as “…a human friendly data serialization standard for all programming languages,” it did not disappoint.

vertexbuffers:
  Position: [
    (-1.0 -1.0 0.0 1.0),
    (-1.0  1.0 0.0 1.0),
    ( 1.0 -1.0 0.0 1.0),
    ( 1.0  1.0 1.0 1.0)
  ]
  TexCoord0: [
    (0.0 0.0),
    (0.0 1.0),
    (1.0 0.0),
    (1.0 1.0)
  ]
indexbuffers:
  mesh0:
    mode: triangle_strip
    data: [
      0,
      1,
      2,
      3
    ]

YAML (22 lines, 333 chars)


YAML has libraries for both C and Python. The Python YAML library even parses data directly into a Python dictionary, just like JSON. But by far the biggest benefit of YAML is its syntax. Indentation is used to define blocks, so all those spurious braces are gone. String quoting is rarely required, and block scalars take text verbatim, so escaping characters is largely a non-issue. I could write XML, JSON, Nvidia Cg, or just about anything inside a YAML block and not have to worry about escaping it. JSON and XML can’t touch these features. All of this makes writing a YAML file by hand a breeze, making YAML the only language that satisfies all of my requirements.
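Here is a rough sketch of what loading looks like, assuming the third-party PyYAML package is installed (imported as yaml). The "snippet" key and the parse_vector helper are hypothetical additions for illustration. One caveat worth knowing: the parenthesized vectors in my example are not a YAML type, so they load as plain strings and still need a small conversion step, while the index list and the literal block need no extra work.

```python
import yaml  # PyYAML, a third-party package (assumed to be installed)

# Raw string so the backslashes below reach the YAML parser untouched.
YAML_TEXT = r"""
vertexbuffers:
  TexCoord0: [(0.0 0.0), (0.0 1.0), (1.0 0.0), (1.0 1.0)]
indexbuffers:
  mesh0:
    mode: triangle_strip
    data: [0, 1, 2, 3]
snippet: |
  <a href="x">{"k": "v\n"}</a>
"""


def parse_vector(text):
    """Hypothetical helper: turn '(0.0 1.0)' into a tuple of floats."""
    return tuple(float(field) for field in text.strip("()").split())


doc = yaml.safe_load(YAML_TEXT)

# Unquoted scalars: the mode is a string, the indices are integers.
mesh = doc["indexbuffers"]["mesh0"]
print(mesh["mode"], mesh["data"])

# The parenthesized vectors arrive as strings and are converted by hand.
raw_coords = doc["vertexbuffers"]["TexCoord0"]
tex_coords = [parse_vector(item) for item in raw_coords]
print(tex_coords)

# The literal block (introduced by '|') keeps quotes and backslashes verbatim.
print(doc["snippet"])
```

Writing the vectors as nested flow sequences, for example [0.0, 1.0], would make them load as lists of floats directly, at the cost of the extra commas.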

Conclusion
If you’re using XML or another human-readable format for data specification, YAML is definitely worth a look. It’s easy to read and write, easy to parse, and has just enough features to define any human-readable data you need.
