Comparing the vorbisfile and opusfile decoding APIs
2023-08-11
Hi!
The Vorbisfile library is a layer on top of libogg and libvorbis that is intended to provide an easy, high-level API for working with audio encoded as ogg/vorbis. It’s like 20 years old, and was written by very cool people at Xiph.org.
The opusfile library is a layer on top of libogg and libopus that is intended to provide an easy, high-level API for working with audio encoded as ogg/opus. It’s like 10 years old, and was written by very cool people at Xiph.org.
That’s kind of interesting, isn’t it? Broadly the same folks, working on broadly the same problem, but with a decade-long difference in experience.
For the Tangara open hardware music player (coming soon i promise you will hear when im selling them), I recently implemented vorbis and opus decoding using these two libraries, and I thought the differences between them were really instructive. Let’s look at the ones I found notable!
If you’d like to follow along in the docs, here are the links:
bytes vs. frames
Let’s look at the basic read function; the function you’re probably most interested in.
long ov_read(OggVorbis_File *vf,
char *buffer,
int length,
int bigendianp,
int word,
int sgned,
int *bitstream);
Parameters
vf
A pointer to the OggVorbis_File structure–this is used for ALL the externally visible libvorbisfile functions.
buffer
A pointer to an output buffer. The decoded output is inserted into this buffer.
length
Number of bytes to be read into the buffer. Should be the same size as the buffer. A typical value is 4096.
bigendianp
Specifies big or little endian byte packing. 0 for little endian, 1 for big endian. Typical value is 0.
word
Specifies word size. Possible arguments are 1 for 8-bit samples, or 2 for 16-bit samples. Typical value is 2.
sgned
Signed or unsigned data. 0 for unsigned, 1 for signed. Typically 1.
bitstream
A pointer to the number of the current logical bitstream.
Return Values
[i removed the error codes bc they’re boring – jacqueline]
0
indicates EOF
n
indicates actual number of bytes read. ov_read() will decode at most one vorbis packet per invocation, so the value returned will generally be less than length.
For opus, we have:
int op_read(OggOpusFile *_of,
opus_int16 *_pcm,
int _buf_size,
int *_li)
Parameters
_of
The OggOpusFile from which to read.
_pcm
A buffer in which to store the output PCM samples, as signed native-endian 16-bit values at 48 kHz with a nominal range of [-32768,32767). Multiple channels are interleaved using the Vorbis channel ordering. This must have room for at least _buf_size values.
_buf_size
The number of values that can be stored in _pcm. It is recommended that this be large enough for at least 120 ms of data at 48 kHz per channel (5760 values per channel). Smaller buffers will simply return less data, possibly consuming more memory to buffer the data internally. libopusfile may return less data than requested. If so, there is no guarantee that the remaining data in _pcm will be unmodified.
_li
The index of the link this data was decoded from. You may pass NULL if you do not need this information. If this function fails (returning a negative value), this parameter is left unset.
Returns
The number of samples read per channel on success, or a negative value on failure. The channel count can be retrieved on success by calling op_head(_of,*_li). The number of samples returned may be 0 if the buffer was too small to store even a single sample for all channels, or if end-of-file was reached. The list of possible failure codes follows. Most of them can only be returned by unseekable, chained streams that encounter a new link.
Now right off the top, there’s so much to say about these. Obviously the docs are way nicer, and opus now just says “hey this is the endianness i want, deal with it yourself”. But let’s look at the input and output buffers specifically.
libvorbisfile wants you to give it a buffer in bytes, and returns how many bytes it put in there.
libopusfile wants you to give it a buffer in samples, and tells you how many samples per channel it put in there.
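To make that concrete, here’s roughly what a single read looks like through each library. This is only a sketch (the helper names are made up, error handling is skipped, and the file handles are assumed to have been opened elsewhere), not Tangara’s actual pipeline code:

#include <vorbis/vorbisfile.h>
#include <opusfile.h>   /* include path usually comes from pkg-config opusfile */

long read_some_vorbis(OggVorbis_File *vf, char *buf, int buf_bytes) {
  int bitstream = 0;
  /* Byte-oriented: we have to spell out endianness, word size and signedness,
   * and the return value is a number of BYTES. */
  return ov_read(vf, buf, buf_bytes,
                 0 /* little endian */, 2 /* 16-bit words */, 1 /* signed */,
                 &bitstream);
}

int read_some_opus(OggOpusFile *of, opus_int16 *buf, int buf_samples) {
  /* Sample-oriented: always signed, native-endian, 16-bit, 48 kHz, and the
   * return value is a number of SAMPLES PER CHANNEL. */
  return op_read(of, buf, buf_samples, NULL);
}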
This difference is huge. I’m kind of mad that I only saw this after I’d already rewritten most of tangara’s audio pipeline to operate in terms of ‘frames’ (one sample per channel) instead of bytes.
Let’s take a brief diversion to understand why it’s such a big deal…
Streaming audio FAST
Whether you are streaming audio to an external DAC over I2S, or to an on-chip Bluetooth controller, there is a common pattern you will likely follow. You will have some number of buffers, and a DMA controller that is copying bytes out of whatever the ‘active’ buffer is. You will need to refill each buffer after the DMA controller is done with it, and you will need to do this fast enough that the DMA reads don’t overtake you.
On Tangara, I’ve implemented this as a FreeRTOS StreamBuffer containing PCM samples, and an ISR, triggered by the DMA controller, that moves samples from this StreamBuffer into the next available DMA buffer.
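The shape of that ISR is roughly the sketch below. This is a simplified illustration rather than Tangara’s actual implementation; all of the names (sPcmStream, sDmaBuffers, pcm_dma_isr) are invented for the example.

#include <stdint.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/stream_buffer.h"

/* Hypothetical names; this just shows the pattern. */
static StreamBufferHandle_t sPcmStream;   /* filled by the decoder task */
static int16_t sDmaBuffers[2][256];       /* ping-pong buffers the DMA reads from */

/* Called when the DMA controller has finished with buffer `idx`. */
void pcm_dma_isr(int idx) {
  BaseType_t woke = pdFALSE;
  size_t got = xStreamBufferReceiveFromISR(
      sPcmStream, sDmaBuffers[idx], sizeof(sDmaBuffers[idx]), &woke);
  if (got < sizeof(sDmaBuffers[idx])) {
    /* Underrun: the decoder didn't keep up. How bad this is depends on exactly
     * how many bytes short we are, which is what the rest of this section is
     * about. Padding with silence at least avoids replaying stale data. */
    memset((uint8_t *)sDmaBuffers[idx] + got, 0, sizeof(sDmaBuffers[idx]) - got);
  }
  portYIELD_FROM_ISR(woke);
}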
This works great! But what happens if, god forbid, the DMA buffer does catch up to us? What if there aren’t enough bytes in our StreamBuffer to fill a DMA buffer?
There are three possibilities: one terrible, one bad, one fine.
Option 1: We’re short by a fraction of a sample. Maybe I made my StreamBuffer 1000 bytes long because that’s a nice round number and I’ve never used a computer before. And then I’ve got 16-bit samples, so the buffer holds 500 of them; but because everything is counted in bytes, there’s nothing stopping it from ending up holding an odd number of bytes, i.e. half a sample.
In this case, we will send half a sample to our audio device, and then from that point on every write to the DMA buffer will be misaligned by 1 byte. This damages the ears.
Option 2: We’re short by a whole sample, trying to send stereo audio. We end up flipping our two channels!
This is obviously bad, but at least we don’t damage anyone’s hearing.
Option 3: We’re short by one sample per channel – a ‘frame’. In this case (and the previous one), the user will hear an unpleasant burp, but if our decoder catches up then the audio stream will resume as normal.
So what?
Now you might, quite rightly, be thinking that most of these examples aren’t plausible. If you’re normal, you’ll be allocating your buffers as powers of two, e.g. 1024 bytes instead of 1000 bytes. You can come up with your own justification for why you’ve done this; mine is “powers of two have good vibes for programming”.
Because of the way maths works (a power of two divided by two is still divisible by two), except for very small buffers, you should always end up being able to fit a whole-sized stereo frame into such a buffer. So what’s the problem?
The problem is that the vorbisfile version of this API makes it look like there’s a bunch of really confusing edge cases you need to worry about, which can flow on to affect how you handle your audio pipeline. If I think my codec could output half a sample, or an uneven number of samples per channel, then all of a sudden I need to work out where I’m buffering that data between calls to ov_read.
Now of course it probably won’t do this. But according to the function signature it could. And let me tell you, you don’t make it long writing C code with the attitude that edge cases involving byte buffers are okay to ignore.
Compare this with the opusfile version, which is extremely explicit that it will only ever give you whole frames, and so you only need to think about and worry about whole frames. This in turn nudges you to design the rest of your system in terms of whole frames (rather than in terms of bytes that just happen to also work out to whole frames because of a single power-of-two-sized kBufferSize constant you pulled out of your head).
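As a tiny illustration (these constants are invented for the example, not Tangara’s), it’s the unit that one magic constant is written in that quietly shapes everything downstream:

#include <stdint.h>

/* Byte-oriented: whole frames only fit because 4096 happens to divide nicely
 * by the channel count and sample size. */
#define kBufferSize 4096                       /* bytes, pulled out of your head */

/* Frame-oriented: the unit op_read() actually reports back in. */
#define kBufferFrames 1024                     /* samples per channel */
#define kChannels 2
static int16_t sPcm[kBufferFrames * kChannels];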
Now that I’ve explained it, maybe it doesn’t sound like such a big deal. Maybe it’s actually kind of obvious. Oh well.
Seeking
libvorbisfile has many different ways of getting your position within the current stream, and seeking to a new position.
ov_raw_tell() // Returns the current offset in raw compressed bytes
ov_pcm_tell() // Returns the current offset in samples
ov_time_tell() // Returns the current offset in seconds
ov_raw_seek()  // Seeks to an offset in raw compressed bytes
ov_pcm_seek()  // Seeks to an offset in samples
ov_time_seek() // Seeks to an offset in seconds
ov_raw_total() // Returns the total length in raw compressed bytes
ov_pcm_total() // Returns the total length in samples
ov_time_total() // Returns the total length in seconds
libopusfile… has fewer ways!
op_raw_tell() // Returns the current offset in raw compressed bytes
op_pcm_tell() // Returns the current offset in samples (always at 48 kHz)
op_raw_seek() // Seeks to an offset in raw compressed bytes
op_pcm_seek() // Seeks to an offset in samples
op_raw_total() // Returns the total length in raw compressed bytes
op_pcm_total() // Returns the total length in samples
Where’d seeking in terms of seconds go? The docs for op_pcm_total explain:
Users looking for op_time_total() should use op_pcm_total() instead. Because timestamps in Opus are fixed at 48 kHz, there is no need for a separate function to convert this to seconds (and leaving it out avoids introducing floating point to the API, for those that wish to avoid it).
I know that the rationale given is that it’s trivial to calculate the total time yourself given the fixed sample rate, but to me this also feels like yet another nudge towards thinking about audio streams in a more effective way.
For Tangara, I’ve mostly been implementing the notion of position within a stream in terms of samples (in the UI we turn it into a real time, of course). The main advantage of this is that it allows us to track duration and position within a stream entirely in fixed point (the ESP32’s FPU is… limited[0]), which is faster and more accurate than the obvious alternative approach of trying to accumulate some kind of floating point timer.
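As a rough sketch of what that looks like (the helper here is hypothetical, not Tangara’s actual code): because Opus positions are always in 48 kHz samples, turning op_pcm_tell() into a display time needs nothing but integer maths.

#include <stdint.h>
#include <opusfile.h>

/* Hypothetical helper: sample position -> minutes:seconds, no floats. */
typedef struct {
  uint32_t minutes;
  uint32_t seconds;
} display_time_t;

static display_time_t position_to_display(ogg_int64_t samples) {
  int64_t total_seconds = samples / 48000;  /* Opus timestamps are fixed at 48 kHz */
  display_time_t t = {
      .minutes = (uint32_t)(total_seconds / 60),
      .seconds = (uint32_t)(total_seconds % 60),
  };
  return t;
}

/* Usage, given an already-opened OggOpusFile *of:
 *   display_time_t now = position_to_display(op_pcm_tell(of));
 */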
API design that uses its limitations to prompt you to write better code… it’s good, folks!
Nobody cares about multichannel streams
Both the vorbis and opus codecs support streams with a very large (to me this is more than 2) number of channels. libvorbisfile leaves dealing with that mostly up to the application developer.
libopusfile makes the (surprising?) decision to include downmixing to stereo as a part of the API:
int op_read_stereo(OggOpusFile *_of,
                   opus_int16 *_pcm,
                   int _buf_size)
This function is intended for simple players that want a uniform output format, even if the channel count changes between links in a chained stream.
This is so, so smart.
See, as a user of the codec, I cannot express how much I don’t give a shit about channels > 2. I am not going to bother to deal with them correctly. I am going to ignore them, or refuse to play the stream outright, or my code is going to break in a new and exciting way if you give it a multichannel file.
But you, the hypotethetical codec author; you do care about them. Because you thought they were a valuable addition to your codec, and you want people to use your codec for multistream use cases.
Providing a function like this in your API lets us both be happy, and makes opus that much easier to recommend as a general-purpose audio codec. Even if you have some weird multichannel use case, you can have some confidence that most players are going to downmix your extra channels reasonably if they don’t have explicit support.
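Here’s a sketch of how pleasant this makes the simple-player case (the helper is made up; it assumes an already-opened OggOpusFile and omits error handling):

#include <opusfile.h>

/* Drain a stream as stereo, no matter how many channels the file has. */
void drain_as_stereo(OggOpusFile *of) {
  opus_int16 pcm[5760 * 2];  /* 120 ms of stereo at 48 kHz */
  for (;;) {
    int frames = op_read_stereo(of, pcm, sizeof(pcm) / sizeof(pcm[0]));
    if (frames <= 0) break;  /* 0 is end of stream, negative is an error */
    /* pcm now holds `frames` interleaved stereo frames, downmixed for us. */
    /* ...hand them to the audio pipeline here... */
  }
}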
That is all the big things I spotted
Thank you for reading this. I hope you had a nice time. Maybe I will write another blog post, or maybe I won’t. Who knows.
Want to leave a comment? Too bad.
[0]: It’s fast enough! The main issue is that only one of your two cores can use the FPU at a time (because there’s only one FPU). This makes performance tuning a little awkward, because tasks you’d otherwise want on opposite cores might end up getting forced together due to both using floats.