
Generate Audio Waveforms - Optimizing Performance with Summary Files

by Daniel Elmalem
#Waveform #Audio #FFmpeg

While a video stream lets you pick a frame to get a snapshot at a specific time, that isn’t the case with audio formats. Sound is the oscillation/vibration of molecules over time; it is by definition a sensation and therefore not something easily visualized. In this article I’ll cover one of the main representations of an audio file: the waveform.

An audio waveform is a graphical representation of the amplitude (volume) of a signal over time. It typically appears as a series of peaks and troughs on a graph, with the horizontal axis representing time and the vertical axis representing amplitude. The waveform provides a visual depiction of the sound’s intensity, allowing you to see the variations in volume, frequency, and other characteristics of the audio signal. It’s a fundamental tool for analyzing and visualizing audio data, commonly used in music production, sound editing, and other audio-related applications.

It is notably used in applications such as Audacity, Swound, Logic Pro, Ableton Live, and others.

The essence of audio signals

When dealing with a waveform, we're essentially examining a sound source across a specific duration. Typically, this data is sourced from an audio file or resides within an in-memory buffer. Audio content is commonly stored in two primary forms: uncompressed, often referred to as PCM, and compressed. Uncompressed file formats like .wav or .aiff store the signal's amplitude values directly. Conversely, formats like .mp3, .flac, or .m4a employ compression algorithms to package the content efficiently, similar to how one would zip a regular file. In summary, digital audio recording involves sampling the signal a fixed number of times per second (e.g., 44,100 times for CD quality) and preserving this information in the file for subsequent use in driving speaker membranes. The data in an audio file encapsulates the signal's amplitude, which is precisely what we use to generate a waveform.
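To make this concrete, here is a minimal TypeScript sketch of extracting normalized amplitude values. It assumes a canonical 16-bit PCM WAV whose sample data starts right after the common 44-byte header; production code should walk the RIFF chunks instead.

    import { readFileSync } from "node:fs";

    // Sketch: read amplitudes from a canonical 16-bit PCM WAV file.
    function readPcmAmplitudes(path: string): Float32Array {
      const buf = readFileSync(path);
      const sampleCount = Math.max(0, (buf.length - 44) >> 1); // 2 bytes per sample
      const amplitudes = new Float32Array(sampleCount);
      for (let i = 0; i < sampleCount; i++) {
        // Each little-endian 16-bit integer becomes a float in -1..+1.
        amplitudes[i] = buf.readInt16LE(44 + i * 2) / 32768;
      }
      return amplitudes;
    }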

[Figure: waveform of an audio signal]

However, we still face two problems:

The case of compressed audio files

When dealing with a compressed audio file, the initial step is to decompress its content to access the amplitude data. This process presents certain challenges, primarily in terms of CPU usage and memory management, especially with large files. A straightforward solution is to convert the MP3 file into a WAV file and save it to disk for subsequent processing. This approach, while effective, requires careful code to clean up the temporary files. Additionally, if the same file needs to be reopened or rendered in the future, the resource-intensive decompression will have to be repeated.
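With FFmpeg, for instance, this decompression step is a single command (the output format is inferred from the .wav extension):

$ ffmpeg -i input.mp3 output.wav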

Coping with too much information

Creating a waveform for visualization purposes doesn't require as many samples as playing the audio back. Processing a large volume of data is both resource-intensive and time-consuming, especially for sizable files. This becomes particularly apparent when we consider that we have far more audio samples than pixels available for drawing the waveform. This is why software applications such as Ableton Live and Audacity often generate summary files to address this challenge. To learn more, I invite you to read about Audacity BlockFiles.

“If Audacity is asked to display a four hour long recording on screen it is not acceptable for it to process the entire audio each time it redraws the screen. Instead it uses summary information which gives the maximum and minimum audio amplitude over ranges of time. When zoomed in, Audacity is drawing using actual samples. When zoomed out, Audacity is drawing using summary information.” -- Audacity / James Crook
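To put rough numbers on the quote above, here is a back-of-the-envelope calculation (with an assumed 2,000-pixel canvas width):

    const sampleRate = 44_100;             // CD-quality samples per second
    const durationSeconds = 4 * 60 * 60;   // Audacity's four-hour example
    const totalSamples = sampleRate * durationSeconds; // 635,040,000 samples
    const canvasWidth = 2_000;             // a generous screen width in pixels
    const samplesPerPixel = totalSamples / canvasWidth; // ~317,520 samples per pixel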

How to generate an audio summary information file?

For well-established platforms like the BBC or SoundCloud, the need to render audio waveforms efficiently is pervasive. This typically involves pre-computing a summarized representation on the server side as soon as the audio file is first encountered. This pre-calculated summary can then be readily loaded by the client, often before the actual download and playback of the audio file.

The techniques employed to create these summary files and manage their data may vary, but the underlying principle remains consistent: the objective is to reduce data volume by grouping samples together and deriving a value that encapsulates the characteristics of a specific time window. One approach is to compute the average value of that window: we sum the values of a defined number of samples (e.g., 256) and then divide by that number to obtain the average amplitude over that time segment. An alternative method, inspired by Audacity, is to capture both the minimum and maximum values of each window. While this approach is less storage-efficient (we end up with twice as much data), it offers better resolution, enabling more detailed waveforms.
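Here is a sketch of both strategies, assuming the samples are already normalized floats. Note that the average variant uses absolute values, since averaging raw signed samples would cancel out around zero:

    // Audacity-style: keep the min and max of each window.
    function summarizeMinMax(samples: Float32Array, windowSize = 256): { min: number; max: number }[] {
      const out: { min: number; max: number }[] = [];
      for (let start = 0; start < samples.length; start += windowSize) {
        const end = Math.min(start + windowSize, samples.length);
        let min = samples[start];
        let max = samples[start];
        for (let i = start + 1; i < end; i++) {
          if (samples[i] < min) min = samples[i];
          if (samples[i] > max) max = samples[i];
        }
        out.push({ min, max });
      }
      return out;
    }

    // Average variant: one mean absolute amplitude per window.
    function summarizeAverage(samples: Float32Array, windowSize = 256): number[] {
      const out: number[] = [];
      for (let start = 0; start < samples.length; start += windowSize) {
        const end = Math.min(start + windowSize, samples.length);
        let sum = 0;
        for (let i = start; i < end; i++) sum += Math.abs(samples[i]);
        out.push(sum / (end - start));
      }
      return out;
    }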

Fortunately, the BBC's R&D group provides a free and open-source command-line tool that generates summary files in both binary and JSON formats. While it can also create waveform images, I won't cover that feature here, as it may not align well with modern responsive designs.

bbc/audiowaveform — github.com/bbc/audiowaveform

Many libraries exist to achieve the same result. I picked this one because it's well documented, open source, free, and actively maintained. Also, even though it's written in C++, it's easy to wrap for use from another programming language. Note that the BBC also released a few JS tools for consuming the data which, even if not employed directly, can serve as valuable references for understanding the underlying mechanisms.

To follow along, I would advise setting up a web service capable of receiving audio files and storing them (in GCS or Amazon S3, for example). These files can then be copied locally and summarized with the audiowaveform tool into JSON or binary data:

$ audiowaveform -i input.mp3 -o result.json
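For instance, a hypothetical Node.js wrapper could shell out to the tool once the upload has been copied locally (this assumes the audiowaveform binary is installed and on the PATH):

    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    // Hypothetical helper: summarize a locally copied upload.
    async function generateSummary(inputPath: string, outputPath: string): Promise<void> {
      await run("audiowaveform", ["-i", inputPath, "-o", outputPath]);
    }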

The choice of window size, typically set at 256 samples by default, depends significantly on the intended use of the waveform and on the average duration and sample rates of the audio files in question. Grouping 256 samples together in the context of a 44.1 kHz audio file still yields over 172 data points per second. However, for very brief audio snippets, this window size may prove insufficient. To illustrate the problem, let's take the example of a short cymbal sample, where the entire file is under 2 seconds and the cymbal hit itself lasts around 500 milliseconds. After rendering the waveform in a Flutter app (it is also possible to render it in a web canvas), it's evident that adaptability in window size is essential.

As you can see, the resolution is not good. Let's use a smaller window of 128 samples now:

That's better, but still too low for applications that require zooming in. Let's now use a window of 64 samples and notice the improved resolution:
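With audiowaveform, this window size is what the zoom option controls (samples per pixel, 256 by default), so the last rendering above would come from something like:

$ audiowaveform -i input.mp3 -o result.json -z 64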

Please note that while the BBC tool is a user-friendly and demonstrative choice, it's just one of many freely available tools at your disposal. Your selection might depend on the specific requirements of your use case. For instance, if you prefer normalized averages, other existing tools like SoX or FFmpeg can be suitable alternatives. If zooming functionality isn't a concern, I would suggest you opt for a fixed resolution that accommodates the majority of your files, such as 800 points per file, and compute averages dynamically based on the source length.
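A sketch of that dynamic computation, assuming the source's total sample count is known:

    // Derive the window size for a fixed 800-point waveform,
    // regardless of the source duration.
    function windowSizeFor(totalSamples: number, points = 800): number {
      return Math.max(1, Math.ceil(totalSamples / points));
    }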

Additionally, it's recommended, as the BBC tool does, to store the data as integer values rather than floating-point numbers. Record the bit depth alongside your data so that you can convert back to a consistent -1/+1 range during rendering. This approach significantly reduces file size and enhances parsing speed. Whether you choose a binary format or not, make sure to gzip the file to minimize network transfer time and improve efficiency.
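At render time, the conversion back to floats is a single division by 2^(bits-1); a sketch:

    // Integer summary values at a known bit depth are scaled
    // back into a consistent -1..+1 range.
    function toFloats(data: Int8Array | Int16Array, bits: 8 | 16): Float32Array {
      const scale = 1 << (bits - 1); // 128 for 8-bit, 32768 for 16-bit
      const out = new Float32Array(data.length);
      for (let i = 0; i < data.length; i++) out[i] = data[i] / scale;
      return out;
    }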

Conclusion

Here are the main points to keep in mind, which I learned while developing swound.com:

  • Pre-calculate waveforms on the server side with open-source tools such as the BBC’s audiowaveform, SoX, or FFmpeg.
  • Tweak the sampling window based on your use cases or the file duration.
  • Use modern cloud solutions such as Pub/Sub, SQS, and on-demand cloud functions to seamlessly handle large volumes of files while maintaining cost-effectiveness.