Yep, it’s another binary format decoding project. I just about had my fill with nVault, but decided to give this a whack, since there doesn’t seem to be a working script for decoding the NBT format using PHP.
As with the nVault project, if all you care about is the resulting code, it can be found in my Subversion repository for Minecraft, along with my other in-progress code.
July 30th, 2011: I’ve moved this project to a github repository. All future updates will happen there.
NBT is the format used by Minecraft to store data such as world chunks, level information, etc. It’s a fairly simple format in itself, but decoding it in PHP turned out to be much more complicated than expected.
An NBT file is, to summarize, a GZIP-compressed binary data format which stores information in the binary form of data primitives. It supports a number of different types, as well as three distinct types of list. While not the most elegant format, it gets the job done, and offers a reasonable degree of flexibility.
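To make the layering concrete, here is a minimal sketch of peeling off the GZIP layer and reading the first tag header. It is not the code from the repository. It assumes the usual NBT layout as I understand it (a one-byte tag type, a two-byte big-endian name length, then the name), and it builds its own tiny payload in memory instead of reading a real file. The variable names are made up for illustration.

```php
<?php
// Build a tiny NBT-style payload so the sketch is self-contained:
// tag type 10 (TAG_Compound), a 2-byte big-endian name length, then the name.
$name = 'Level';
$raw  = chr(10) . pack('n', strlen($name)) . $name;
$file = gzencode($raw);              // stands in for the bytes of a .dat file

// Decoding: undo the GZIP layer, then walk the bytes by hand.
$data    = gzdecode($file);          // needs the zlib extension (PHP 5.4+)
$type    = ord($data[0]);            // 1 byte: tag type
$header  = unpack('nlength', substr($data, 1, 2));
$tagName = substr($data, 3, $header['length']);

printf("type=%d name=%s\n", $type, $tagName);   // type=10 name=Level
```

For a real file, the `compress.zlib://` stream wrapper or `gzopen` would do the same job without loading the compressed bytes first.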
Rather than describing my process step-by-step, I’ll just point out the major hurdles I encountered along the way. Hopefully this information will justify some of the seemingly-inelegant or inefficient aspects of my code, as well as teach you a little something about PHP’s strengths and limitations.
Pack & Unpack
PHP’s standard library includes two related functions called pack and unpack. They are two sides of the same coin: pack takes in scalar values and packs them into a binary string according to a format, while unpack does the opposite, reading in binary data and interpreting it as values of a particular type.
These functions are (as the underlying source code indicates) stolen from Perl, in terms of concept and implementation. However, they aren’t complete. A few finer-grained packing codes are absent, such as the ones that would let you pack and unpack a signed short with a specific endianness.
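A quick round trip shows how the two functions mirror each other. This is only an illustrative sketch. The field names `short` and `int` are arbitrary labels chosen here, since unpack wants a name for each value it returns.

```php
<?php
// 'n' = unsigned 16-bit, big-endian; 'N' = unsigned 32-bit, big-endian.
$binary = pack('nN', 513, 16909060);
echo bin2hex($binary), "\n";         // 020101020304

// unpack() takes "code + name" pairs separated by '/' and
// returns an associative array keyed by those names.
$values = unpack('nshort/Nint', $binary);
print_r($values);
```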
To get around this absence of functionality, I had to learn the way that these types were represented. Being a bit spoiled by loosely-typed scalar languages like PHP, I hadn’t had to understand how the data is actually laid out in memory. It took me a while to wrap my head around endianness representation and such.
When storing a value in memory which is larger than a byte, there are two major byte orders, or “endiannesses”, which different processor architectures use. The first is Big-Endian, which means that the most significant byte is stored first; Little-Endian is the opposite, storing the most significant byte last.
Most systems in use today are Little-Endian, but the NBT format specifically uses Big-Endian for all of its values. This means the byte order has to be specified explicitly for each unpacking and packing operation.
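The difference is easy to see by packing the same 16-bit value with each byte order and dumping the bytes as hex. A small sketch, using only standard pack codes:

```php
<?php
$value = 0x1234;

$big     = bin2hex(pack('n', $value));   // always big-endian:    "1234"
$little  = bin2hex(pack('v', $value));   // always little-endian: "3412"
$machine = bin2hex(pack('S', $value));   // whatever this machine uses

echo "big=$big little=$little machine=$machine\n";
```

On a typical x86 machine the `machine` value matches the little-endian one, which is exactly why the machine-order codes can’t be used as-is for NBT.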
For signed shorts, ints and longs, I decided to largely sidestep this problem by unpacking the values as unsigned shorts, ints and longs, which unpack can handle with a specific endianness, and then calculating the signed value from the unsigned one.
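The unsigned-to-signed step is just two’s complement arithmetic: if the top bit is set, subtract 2 to the power of the bit width. The sketch below shows the idea for shorts and ints. The helper names `be_signed_short` and `be_signed_int` are hypothetical and not taken from the project’s code.

```php
<?php
// Read a big-endian signed 16-bit value from a 2-byte string.
function be_signed_short($bytes) {
    $u = unpack('nvalue', $bytes);       // unsigned, big-endian
    $v = $u['value'];
    return $v >= 0x8000 ? $v - 0x10000 : $v;
}

// Read a big-endian signed 32-bit value from a 4-byte string.
function be_signed_int($bytes) {
    $u = unpack('Nvalue', $bytes);       // unsigned, big-endian
    $v = $u['value'];
    // On builds with 32-bit integers, unpack already hands back a negative
    // number for large values, so this branch only fires with 64-bit integers.
    if ($v >= 0x80000000) {
        $v -= 0x100000000;
    }
    return $v;
}

var_dump(be_signed_short("\xFF\xFE"));          // int(-2)
var_dump(be_signed_short("\x7F\xFF"));          // int(32767)
var_dump(be_signed_int("\xFF\xFF\xFF\xFE"));    // int(-2)
```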
For floats, I simply detected the system’s endianness, and reversed the byte order if necessary.
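One way to do that detection and reversal is sketched below. It assumes the machine stores a float as a 4-byte IEEE 754 value, which is what the `f` pack code reads on common platforms. The function names are made up for this example.

```php
<?php
// True if this machine stores the least significant byte first.
function machine_is_little_endian() {
    // pack('S', 1) writes a 16-bit 1 in machine byte order.
    return pack('S', 1) === "\x01\x00";
}

// Read a big-endian 32-bit float from a 4-byte string.
function be_float($bytes) {
    if (machine_is_little_endian()) {
        $bytes = strrev($bytes);         // flip to machine order
    }
    $u = unpack('fvalue', $bytes);       // 'f' = float, machine byte order
    return $u['value'];
}

var_dump(be_float("\x3F\x80\x00\x00"));  // float(1)
var_dump(be_float("\xC0\x20\x00\x00"));  // float(-2.5)
```

Doubles work the same way with 8 bytes and the `d` code.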
In PHP, the size of an integer depends on the build: 32-bit builds are limited to 32-bit integers, and so are some 64-bit ones (Windows builds, I think). PHP_INT_SIZE will tell you which you have. This issue is expected to be resolved in PHP 6, but that doesn’t help us much now. What this means is that NBT’s long integers, which are signed 64-bit values, don’t fit in a native integer on such builds; anything outside the 32-bit range overflows into a float and can lose precision.
To get around this, I was forced to use GMP, which allows one to perform arithmetic on arbitrary-precision integers (very large positive or negative values), passing them in and out as strings. What this means for us is that we can store integers which are larger than 32 bits can represent, and perform math on them.
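As a rough illustration of the idea (not necessarily how the repository code does it), the 8 bytes of a long can be handed to GMP as a hex string and then adjusted for the sign. This requires the GMP extension, and `be_signed_long` is a hypothetical helper name.

```php
<?php
// Read a big-endian signed 64-bit value from an 8-byte string and
// return it as a decimal string, so it survives on 32-bit builds.
function be_signed_long($bytes) {
    // Treat the 8 bytes as one unsigned big-endian number via its hex form.
    $value = gmp_init(bin2hex($bytes), 16);

    // Values of 2^63 and above are negatives in two's complement.
    if (gmp_cmp($value, gmp_pow(2, 63)) >= 0) {
        $value = gmp_sub($value, gmp_pow(2, 64));
    }
    return gmp_strval($value);
}

echo be_signed_long("\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFE"), "\n";  // -2
echo be_signed_long("\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF"), "\n";  // 9223372036854775807
```

Returning a string keeps the value exact no matter how wide the native integer is; the caller can pass it straight back into other gmp_* functions.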
While this all seems very obvious now that I lay it out, there isn’t much documentation to be had on the subject, which I intend to remedy (being on the PHP doc team). It has taken me weeks of frustration, testing and dumb luck to come up with the solution you see here, which is probably why nobody has come out with anything similar yet (to my knowledge).
Right now I’m not going to license the code, and I’ll simply say that it’s public domain. If someone’s project can benefit from the code, have at ’er. I don’t care to add any restrictions.