Hearing, one of the five human senses, plays an important role in transferring information from one person to another.
Sound has many applications in engineering. By transcribing an audio recording into computer-recognizable data, one can further process the data with various computer-aided tools.
For example, Shazam, one of the main providers of music identification tools, has announced that the monthly active user count of its service has surpassed 100 million. Shazam also recently formed a partnership with Apple, and now offers its service through Siri. (Try asking Siri: “What song is this?”)
In this posting, I will present the project I did in ECE462 – Sensory Communications, a course at the University of Toronto. You can expect to learn about the note separation technique (extracting individual instances of notes from the input recording) I used in the project, and how my team analyzed the extracted notes in the frequency domain to determine their note names.
The objective of the project was to transcribe input audio recordings of a digital piano into their corresponding note sequences. Although it won’t be covered in this post, the transcribed note sequence was then compared against a database to return the name of the melody from the input recording, if the melody existed within the database.
Decaying instruments include the families of chordophones (e.g. guitars, harps, pianos) and percussion instruments (e.g. xylophones). These instruments have their highest sound intensity at the moment the string is plucked or the key is struck, and the intensity decays exponentially over time.
Common non-decaying instruments include the woodwinds (e.g. saxophone, flute, clarinet). When played with constant breath, these instruments’ sound intensities stay roughly constant over time.
Decaying instruments depend on the player to pluck strings (for chordophones) or strike keys (for percussion) to produce sound. In this posting, a strike will refer to the moment when the player begins to make a new sound. It is important to note that the player is free to press more than one key in a single strike.
For simplicity, we will work with the few constraints below:
Two techniques will be introduced in this posting: note separation and frequency detection.
Since a recording may contain multiple instances of sound from numerous strikes, we rely on the note separation algorithm to extract the individual instances of the strikes.
The note separation will be done in the time domain to utilize the exponentially decaying shape of a decaying instrument’s signal. We will create an envelope that represents the original signal, and locate the starting points of the strikes by finding local maxima of the envelope.
By transforming the input recording to the frequency domain using the Fast Fourier Transform, we can analyze the frequencies present in the recording.
Since each note name has a unique fundamental frequency, we can look for the frequency bin (range of frequencies) with the highest energy to find the name of the note.
You may find this table from Wikipedia useful for matching the frequencies to their relative names.
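To make the idea concrete, here is a minimal sketch of FFT-based note detection in Python/NumPy (the original project used Matlab). The mapping from frequency to note name follows the standard A440 equal-temperament table mentioned above; the function name and parameters are my own, not the project’s.

```python
import numpy as np

def detect_note(segment, sample_rate):
    """Return the note name whose fundamental frequency is closest
    to the strongest FFT bin of the segment."""
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    peak_freq = freqs[np.argmax(spectrum)]

    # Map the peak frequency to the nearest equal-temperament note.
    # MIDI note number 69 corresponds to A4 = 440 Hz.
    names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    midi = int(round(69 + 12 * np.log2(peak_freq / 440.0)))
    return names[midi % 12] + str(midi // 12 - 1)

# Example: a synthetic 440 Hz sine should be detected as A4.
fs = 8000
t = np.arange(fs) / fs            # one second of samples
a4 = np.sin(2 * np.pi * 440 * t)
print(detect_note(a4, fs))        # A4
```

Note that this only picks the single strongest bin; a real recording’s harmonics can be stronger than the fundamental, which is one of the complications the full algorithm has to deal with.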
As mentioned above, the sound signal from a decaying instrument has an exponentially decaying shape over time.
Figure 1: Plot of the sound signal. (A4 note played on a digital keyboard)
Figure 1 shows a Matlab plot of the note A4 played on a digital piano (a decaying instrument). As one can observe, the signal has its highest peaks at the beginning of the note, and the peaks decrease exponentially over time.
The y-axis of the plot represents the sound intensity of the signal at time t. This indicates that the sound from the instrument is loudest at the time of the strike. This behavior is expected, as we hear the sound fade out over time after pressing a key on the piano.
Since we did not set a maximum length for the input recording, we expect it to contain multiple strikes. To be able to detect individual notes, we must first separate the strikes from one another.
Let’s start with the base case: a recording with a single strike.
Figure 2: Magnitude plot of A4 sound signal.
Consider Figure 2, which shows the magnitude plot of the sound sample from Figure 1. As humans, we can easily detect the starting point of the strike: our eyes immediately pick out the sudden amplitude jump at the marked location.
Can we use the same approach in the algorithm?
We can detect the sudden amplitude jumps by finding local maxima in the plot. However, even though the plot in Figure 2 might look like a bar graph, sound has a wave-like property and its amplitude oscillates very quickly over time. One can see this more clearly by zooming into the plot.
Figure 3: Zoom Box - Area where the plot will zoom into.
Figure 4: Zoomed-in view of the magnitude plot of note A4. Six of the peaks have been marked.
Figure 3 shows the area where I have zoomed in, and Figure 4 shows the zoomed-in plot.
As shown, the plot is a high-frequency sinusoid. If we were to find local maxima directly, we would end up with many false data points. In Figure 4, I have labelled a few of the peaks that would be falsely detected as the starting point of the strike.
It is apparent that we need to simplify the data we are given. If we can somehow average the peaks of the sinusoid and find a smooth curve that ‘envelopes’ the original plot, we can use it to capture the overall behavior of the original signal. You can see an example of such an envelope in Figure 5 below.
Figure 5: Envelope of the sound signal. The envelope is shown as a red line, located above the original signal.
We can use a Gaussian filter to find the envelope. By convolving a Gaussian curve with our signal, we can filter out its high-frequency components. Although the details of the convolution operation will not be covered in this post, the following points should be enough to understand it:
This means that if we want the envelope to keep the overall shape of the plot, we must keep the low-frequency components of the plot. We can do so by convolving the plot with a Gaussian curve that has a large standard deviation.
Figure 6: The 8000-point Gaussian window and the convolution operation code, in Matlab.
The Gaussian curve I used is shown in Figure 6. Its size was determined by trial and error to produce the best envelope.
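For readers without Matlab, here is an equivalent sketch of the envelope computation in Python/NumPy. The window length matches the post’s 8000-point window, but the standard deviation here is an illustrative value, not the project’s exact parameter.

```python
import numpy as np
from scipy.signal.windows import gaussian

def envelope(signal, win_len=8000, std=1000):
    """Smooth the rectified signal with a Gaussian window to obtain
    a slowly varying envelope of its amplitude."""
    win = gaussian(win_len, std)
    win /= win.sum()              # normalize so the amplitude scale is preserved
    # Convolving |signal| with a wide Gaussian keeps only the
    # low-frequency behavior, i.e. the overall decay shape.
    return np.convolve(np.abs(signal), win, mode='same')

# Example: the envelope of a decaying 440 Hz tone should itself decay.
fs = 8000
t = np.arange(2 * fs) / fs
tone = np.exp(-t / 0.5) * np.sin(2 * np.pi * 440 * t)
env = envelope(tone)
```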
Figure 7: Result of the convolution of the Gaussian window and the original signal.
The red curve in Figure 7 shows the computed envelope obtained by convolving the Gaussian curve with the signal. After a down-sampling step, we can now find the starting point of the strike by locating local maxima.
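The peak-finding step can be sketched as follows, again in Python rather than the project’s Matlab. The down-sampling factor and the minimum gap between strikes are illustrative values I chose for the example, not the project’s tuned parameters.

```python
import numpy as np
from scipy.signal import find_peaks

def strike_starts(env, sample_rate, downsample=100, min_gap_s=0.2):
    """Locate local maxima of the down-sampled envelope and return
    their positions as sample indices in the original signal."""
    coarse = env[::downsample]
    # Require peaks to be at least min_gap_s seconds apart, so the
    # small ripples of one strike are not counted as new strikes.
    min_dist = int(min_gap_s * sample_rate / downsample)
    peaks, _ = find_peaks(coarse, distance=min_dist)
    return peaks * downsample

# Example: a synthetic envelope with three exponential decays,
# separated by short silences.
fs = 8000
burst = np.exp(-np.arange(fs) / 2000.0)     # one-second decay
silence = np.zeros(2000)
env = np.concatenate([np.zeros(4000), burst, silence, burst, silence, burst])
starts = strike_starts(env, fs)
print(list(starts))     # [4000, 14000, 24000]
```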
The same procedure can be applied to an input recording with multiple strikes. Figures 8 to 10 below show the algorithm applied to a recording with multiple strikes.
Figure 8: Matlab plot of input signal with 11 strikes.
Figure 9: Input signal in its magnitude form.
Figure 10: Computed envelope using the 8000-point Gaussian window from Figure 6.
Now that we’re able to find the starting point of each strike, we will find the ending points in order to extract the individual strike instances from the recording.
For the ending points, we can simply take the midpoint of consecutive starting points. For the last strike in the recording, we set its ending point to the end of the recording.
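This midpoint rule is simple enough to sketch directly; the function below is my own illustration of it, working on sample indices. Each strike ends at the midpoint to the next start, and the last strike ends at the end of the recording.

```python
def segment_bounds(starts, total_len):
    """Turn a sorted list of strike start indices into
    (start, end) segment boundaries."""
    bounds = []
    for i, s in enumerate(starts):
        if i + 1 < len(starts):
            end = (s + starts[i + 1]) // 2   # midpoint to the next strike
        else:
            end = total_len                  # last strike runs to the end
        bounds.append((s, end))
    return bounds

# Example: three strikes in a 30000-sample recording.
print(segment_bounds([1000, 11000, 21000], 30000))
# [(1000, 6000), (11000, 16000), (21000, 30000)]
```

One consequence of the midpoint rule is that a segment may cut off the tail of a long decay, but since the energy of a decaying instrument is concentrated near the strike, the lost tail matters little for frequency detection.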
Figure 11: The input recording will be split into several segments. Each segment spans from the starting point to the ending point of one strike. In the end, the number of segments the algorithm returns should equal the number of strikes present in the input recording.
Figure 11 shows the overall process of finding duration of the first note.
Figure 12 below shows the segments extracted from the input recording. Each of these segments will be passed to the frequency detection algorithm.
Figure 12: The extracted segments from the input recording. The segments are shown in green.