Comparing the vorbisfile and opusfile decoding APIs

2023-08-11

Hi!

The Vorbisfile library is a layer on top of libogg and libvorbis that is intended to provide an easy, high-level API for working with audio encoded as ogg/vorbis. It's like 20 years old, and was written by very cool people at Xiph.org.

The opusfile library is a layer on top of libogg and libopus that is intended to provide an easy, high-level API for working with audio encoded as ogg/opus. It's like 10 years old, and was written by very cool people at Xiph.org.

That's kind of interesting, isn't it? Broadly the same folks, working on broadly the same problem, but with a decade's difference in experience.

For the Tangara open hardware music player (coming soon i promise you will hear when im selling them), I recently implemented vorbis and opus decoding using these two libraries, and I thought the differences between them were really instructive. Let's look at the differences I found notable!

If you'd like to follow along in the docs, here are the links:

- libvorbisfile

- libopusfile

bytes vs. frames

Let's look at the basic read function: the one you're probably most interested in.

For vorbis, we have:

long ov_read(OggVorbis_File *vf,
             char           *buffer,
             int             length,
             int             bigendianp,
             int             word,
             int             sgned,
             int            *bitstream);

Parameters

vf A pointer to the OggVorbis_File structure--this is used for ALL the externally visible libvorbisfile functions.

buffer A pointer to an output buffer. The decoded output is inserted into this buffer.

length Number of bytes to be read into the buffer. Should be the same size as the buffer. A typical value is 4096.

bigendianp Specifies big or little endian byte packing. 0 for little endian, 1 for big endian. Typical value is 0.

word Specifies word size. Possible arguments are 1 for 8-bit samples, or 2 for 16-bit samples. Typical value is 2.

sgned Signed or unsigned data. 0 for unsigned, 1 for signed. Typically 1.

bitstream A pointer to the number of the current logical bitstream.

Return Values

[i removed the error codes bc they're boring -- jacqueline]

0 indicates EOF

n indicates actual number of bytes read. ov_read() will decode at most one vorbis packet per invocation, so the value returned will generally be less than length.
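
For reference, here's the shape of a typical ov_read decode loop. This is a minimal sketch of mine rather than anything from the docs; the decode_all wrapper and the empty loop body are stand-ins for your own plumbing:

#include <vorbis/vorbisfile.h>

int decode_all(const char *path) {
    OggVorbis_File vf;
    if (ov_fopen(path, &vf) < 0) return -1;

    char pcm[4096];  /* vorbisfile thinks in BYTES */
    int bitstream = 0;
    long n;
    while ((n = ov_read(&vf, pcm, sizeof(pcm),
                        0 /* little endian */, 2 /* 16-bit words */,
                        1 /* signed */, &bitstream)) > 0) {
        /* n is a byte count; hand the bytes onward here */
    }
    ov_clear(&vf);
    return (n < 0) ? -1 : 0;  /* negative n is a decode error */
}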

For opus, we have:

int op_read(OggOpusFile *_of,
            opus_int16  *_pcm,
            int          _buf_size,
            int         *_li)

Parameters

_of The OggOpusFile from which to read.

_pcm A buffer in which to store the output PCM samples, as signed native-endian 16-bit values at 48 kHz with a nominal range of [-32768,32767). Multiple channels are interleaved using the Vorbis channel ordering. This must have room for at least _buf_size values.

_buf_size The number of values that can be stored in _pcm. It is recommended that this be large enough for at least 120 ms of data at 48 kHz per channel (5760 values per channel). Smaller buffers will simply return less data, possibly consuming more memory to buffer the data internally. libopusfile may return less data than requested. If so, there is no guarantee that the remaining data in _pcm will be unmodified.

_li The index of the link this data was decoded from. You may pass NULL if you do not need this information. If this function fails (returning a negative value), this parameter is left unset.

Returns

The number of samples read per channel on success, or a negative value on failure. The channel count can be retrieved on success by calling op_head(_of,*_li). The number of samples returned may be 0 if the buffer was too small to store even a single sample for all channels, or if end-of-file was reached. The list of possible failure codes follows. Most of them can only be returned by unseekable, chained streams that encounter a new link.
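
And the equivalent loop for op_read -- again a sketch of mine, with error handling trimmed:

#include <opus/opusfile.h>

int decode_all(const char *path) {
    int err = 0;
    OggOpusFile *of = op_open_file(path, &err);
    if (of == NULL) return err;

    opus_int16 pcm[5760 * 2];  /* 120 ms of stereo at 48 kHz */
    int li = 0;
    int n;
    while ((n = op_read(of, pcm, sizeof(pcm) / sizeof(*pcm), &li)) > 0) {
        int channels = op_channel_count(of, li);
        /* n is samples PER CHANNEL: n * channels values are in pcm */
        (void)channels;
    }
    op_free(of);
    return n;  /* 0 on EOF, negative on error */
}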

Now right off the top, there's so much to say about these. Obviously the docs are way nicer, and opus now just says "hey this is the endianness i want, deal with it yourself". But let's look at the input and output buffers specifically.

libvorbisfile wants you to give it a buffer in bytes, and returns how many bytes it put in there.

libopusfile wants you to give it a buffer in samples, and tells you how many samples per channel it put in there.

This difference is huge. I'm kind of mad that I only saw this after I'd already rewritten most of Tangara's audio pipeline to operate in terms of 'frames' (one sample per channel) instead of bytes.

Let's take a brief diversion to understand why it's such a big deal...

Streaming audio FAST

Whether you are streaming audio to an external DAC over I2S, or to an on-chip Bluetooth controller, there is a common pattern you will likely follow. You will have some number of buffers, and a DMA controller that is copying bytes out of whatever the 'active' buffer is. You will need to refill each buffer after the DMA controller is done with it, and you will need to do this fast enough that the DMA reads don't overtake you.

On Tangara, I've implemented this as a FreeRTOS StreamBuffer containing PCM samples, and an ISR, triggered by the DMA controller, that moves samples from this StreamBuffer into the next available DMA buffer.
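
In outline, it looks something like this. A loose sketch rather than Tangara's actual code: the ISR name, the double buffer, and how you learn which buffer the DMA just released are all invented here:

#include <stdint.h>
#include "esp_attr.h"
#include "freertos/FreeRTOS.h"
#include "freertos/stream_buffer.h"

#define DMA_BUF_BYTES 1024

/* created elsewhere with xStreamBufferCreate(); filled by the decoder task */
static StreamBufferHandle_t s_pcm_stream;
static uint8_t s_dma_buf[2][DMA_BUF_BYTES];

/* Hypothetical ISR, fired when the DMA controller finishes a buffer. */
static void IRAM_ATTR dma_done_isr(void *arg) {
    (void)arg;
    int idx = 0;  /* in reality: whichever buffer the DMA just released */
    BaseType_t woken = pdFALSE;
    size_t got = xStreamBufferReceiveFromISR(s_pcm_stream, s_dma_buf[idx],
                                             DMA_BUF_BYTES, &woken);
    if (got < DMA_BUF_BYTES) {
        /* underrun: there weren't enough bytes ready. see below... */
    }
    portYIELD_FROM_ISR(woken);
}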

This works great! But what happens if, god forbid, the DMA controller does catch up to us? What if there aren't enough bytes in our StreamBuffer to fill a DMA buffer?

There are three possibilities: one terrible, one bad, one fine.

Option 1: We're short by a fraction of a sample. Maybe I made my StreamBuffer 1000 bytes long because that's a nice round number and I've never used a computer before. With 16-bit samples, that's room for 500 whole samples -- but nothing about a byte buffer stops it holding an odd number of bytes, say 999, which is 499 and a half samples.

In this case, we will send half a sample to our audio device, and from that point on every write to the DMA buffer will be misaligned by 1 byte. This damages the ears.

Option 2: We're short by a whole sample while trying to send stereo audio. We end up flipping our two channels!

This is obviously bad, but at least we don't damage anyone's hearing.

Option 3: We're short by one sample per channel -- a 'frame'. In this case (and the previous one), the user will hear an unpleasant burp, but if our decoder catches up, the audio stream will resume as normal.

So what?

Now you might, quite rightly, be thinking that most of these examples aren't plausible. If you're normal, you'll be allocating your buffers as powers of two, e.g. 1024 bytes instead of 1000 bytes. You can come up with your own justification for why you've done this; mine is "powers of two have good vibes for programming".

Because of the way maths works (a power of two is divisible by every smaller power of two), except for very small buffers, you should always end up being able to fit a whole-sized stereo frame into such a buffer. So what's the problem?
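
(Before we get to the problem: if you want the compiler to hold you to the whole-frames property, that's a one-liner. kBufferSize here is a stand-in for whatever constant your own pipeline uses.)

enum { kBufferSize = 1024 };  /* bytes; a power of two */
/* a 16-bit stereo frame is 4 bytes, and 4 divides any power of two >= 4 */
_Static_assert(kBufferSize % 4 == 0, "buffer must hold whole frames");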

The problem is that the vorbisfile version of this API makes it look like there's a bunch of really confusing edge cases you need to worry about, which can flow on to affect how you handle your audio pipeline. If I think my codec could output half a sample, or an uneven number of samples per channel, then all of a sudden I need to work out where I'm buffering that data between calls to ov_read.

Now of course it probably won't do this. But according to the function signature it could. And let me tell you, you don't last long writing C code with the attitude that edge cases involving byte buffers are okay to ignore.
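
If you take the signature at its word, you end up writing defensive plumbing like this. A hypothetical sketch, assuming 16-bit stereo, where emit_frames() stands in for the rest of the pipeline:

#include <stddef.h>
#include <string.h>

enum { kFrameSize = 4 };  /* 2 bytes per sample * 2 channels */

static char carry[kFrameSize];  /* partial frame left over from last time */
static size_t carry_len = 0;

extern void emit_frames(const char *buf, size_t len);

void handle_decoded_bytes(const char *buf, size_t len) {
    /* First, top up any partial frame carried over from the last call. */
    if (carry_len > 0) {
        size_t need = kFrameSize - carry_len;
        size_t take = (len < need) ? len : need;
        memcpy(carry + carry_len, buf, take);
        carry_len += take;
        buf += take;
        len -= take;
        if (carry_len < kFrameSize) return;  /* still not a whole frame */
        emit_frames(carry, kFrameSize);
        carry_len = 0;
    }
    /* Then emit whole frames, and stash any stragglers for next time. */
    size_t whole = len - (len % kFrameSize);
    if (whole > 0) emit_frames(buf, whole);
    carry_len = len % kFrameSize;
    memcpy(carry, buf + whole, carry_len);
}

None of this needs to exist when the decoder can only ever hand you whole frames.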

Compare this with the opusfile version, which is extremely explicit that it will only ever give you whole frames, and so whole frames are the only thing you need to think or worry about. This in turn nudges you to design the rest of your system in terms of whole frames (rather than in terms of bytes that merely happen to line up with frames because of a single power-of-two-sized kBufferSize constant you pulled out of your head).

Now that I've explained it, maybe it doesn't sound like such a big deal. Maybe it's actually kind of obvious. Oh well.

Seeking

libvorbisfile has many different ways of getting your position within the current stream, and seeking to a new position.

ov_raw_tell() // Returns the current offset in raw compressed bytes
ov_pcm_tell() // Returns the current offset in samples
ov_time_tell() // Returns the current offset in seconds

ov_raw_seek()
ov_pcm_seek()
ov_time_seek()

ov_raw_total()
ov_pcm_total()
ov_time_total()

libopusfile... has fewer ways!

op_raw_tell()
op_pcm_tell()

op_raw_seek()
op_pcm_seek()

op_raw_total()
op_pcm_total()

Where'd seeking in terms of seconds go? The docs for op_pcm_total explain:

Users looking for op_time_total() should use op_pcm_total() instead. Because timestamps in Opus are fixed at 48 kHz, there is no need for a separate function to convert this to seconds (and leaving it out avoids introducing floating point to the API, for those that wish to avoid it).

I know that the rationale given is that it's trivial to calculate the total time yourself given the fixed sample rate, but to me this also feels like yet another nudge towards thinking about audio streams in a more effective way.

For Tangara, I've mostly been implementing the notion of position within a stream in terms of samples (in the UI we turn it into a real time, of course). The main advantage of this is that it allows us to track duration and position within a stream entirely in fixed point (the ESP32's FPU is... limited[0]), which is faster and more accurate than the obvious alternative of trying to accumulate some kind of floating point timer.
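
Everything the UI needs falls out of integer division. A sketch, assuming an already-open OggOpusFile (the print_position helper is mine, not part of the API):

#include <stdio.h>
#include <opus/opusfile.h>

/* Hypothetical UI helper: Opus positions are always 48 kHz samples. */
void print_position(OggOpusFile *of) {
    unsigned pos   = (unsigned)(op_pcm_tell(of) / 48000);
    unsigned total = (unsigned)(op_pcm_total(of, -1) / 48000);  /* -1: all links */
    printf("%u:%02u / %u:%02u\n", pos / 60, pos % 60, total / 60, total % 60);
}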

API design that uses its limitations to prompt you to write better code... it's good, folks!

Nobody cares about multichannel streams

Both the vorbis and opus codecs support streams with a very large (to me this is more than 2) number of channels. libvorbisfile leaves dealing with that mostly up to the application developer.

libopusfile makes the (surprising?) decision to include downmixing to stereo as a part of the API:

int op_read_stereo(OggOpusFile *_of,
                   opus_int16  *_pcm,
                   int          _buf_size)

This function is intended for simple players that want a uniform output format, even if the channel count changes between links in a chained stream.

This is so, so smart.

See, as a user of the codec, I cannot express how much I don't give a shit about channels > 2. I am not going to bother to deal with them correctly. I am going to ignore them, or refuse to play the stream outright, or my code is going to break in a new and exciting way if you give it a multichannel file.

But you, the hypothetical codec author: you do care about them. Because you thought they were a valuable addition to your codec, and you want people to use your codec for multistream use cases.

Providing a function like this in your API lets us both be happy, and makes opus that much easier to recommend as a general-purpose audio codec. Even if you have some weird multichannel use case, you can have some confidence that most players are going to downmix your extra channels reasonably if they don't have explicit support.
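
Using it is as boring as you'd hope. A sketch, with the output handling elided:

#include <opus/opusfile.h>

void decode_as_stereo(OggOpusFile *of) {
    opus_int16 pcm[5760 * 2];  /* 120 ms of stereo at 48 kHz */
    int n;
    while ((n = op_read_stereo(of, pcm, sizeof(pcm) / sizeof(*pcm))) > 0) {
        /* n frames; 2 * n interleaved L/R values in pcm, even if the
         * underlying link was mono or 5.1 */
    }
}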

That's all the big things I spotted

Thank you for reading this. I hope you had a nice time. Maybe I will write another blog post, or maybe I won't. Who knows.

Want to leave a comment? Too bad.


[0]: It's fast enough! The main issue is that only one of your two cores can use the FPU at a time (because there's only one FPU). This makes performance tuning a little awkward, because tasks you'd otherwise want on opposite cores might end up getting forced together due to both using floats.