The high-level flow is to listen to nearby audio, determine tempo and musical scale, and then synthesize some melodies to match. The first step is to analyse the microphone audio to extract useful information using the Fast Fourier Transform.

The FFT lets us determine which frequencies exist in a raw audio waveform. The result is a set of *linearly* spaced frequency bins which tell us how much of each frequency range is present in the waveform.

The frequencies the FFT can detect are inherently limited by the duration and sample rate of the waveform data being processed. For example, if we read 20ms of waveform data we can resolve frequencies into ((20 / 1000) * SampleRate) discrete bins. With a sample rate of 44100 samples per second (44.1khz), this gives us 882 bins of data. Each bin represents data within a frequency range of 50hz (44100 / 882). Due to the Nyquist Limit, only the lower half of these bins is usable, giving us 441 bins of usable data. We can use these bins to determine which musical notes are present based on how much of their frequency we detect.
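The bin arithmetic above can be sketched in a few lines (the numbers are illustrative, matching the 20ms / 44.1khz example rather than any particular implementation; real FFT sizes are usually powers of two):

```python
# Bin arithmetic for a 20ms analysis window at 44.1khz.
sample_rate = 44100                              # samples per second
window_ms = 20                                   # duration of audio we analyse
num_samples = sample_rate * window_ms // 1000    # 882 samples -> 882 bins
bin_spacing_hz = sample_rate / num_samples       # 50hz per bin
usable_bins = num_samples // 2                   # Nyquist: only the lower half
print(num_samples, bin_spacing_hz, usable_bins)  # 882 50.0 441
```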

*(figure: frequency bins)*

Humans perceive sound on a *logarithmic* scale, which is not cleanly represented in our linearly spaced frequency bins from the FFT. Musical notes in low octaves are closer together in frequency than musical notes in higher octaves, which means we lose accuracy when estimating low notes.

For example, to determine C7 (2093hz), we examine the bin containing 2050-2100hz data. For the adjacent note C#7 (2217hz), we can check the bin for 2200-2250hz (reference). In our linearly spaced 50hz bins, these are several bins apart, so each note is cleanly stored in a single bin.

When we examine lower octaves, we can see the frequency difference between musical notes is smaller. For example, A2 and A#2 are 110hz and 116hz respectively, but with a bin spacing of 50hz, both these notes will end up mostly in the same frequency bin - we cannot distinguish them.
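We can see this collision directly with the standard equal-temperament formula (A4 = MIDI note 69 = 440hz). The helper names here are illustrative, not from the original implementation:

```python
def note_to_freq(midi_note):
    # Standard equal-temperament / MIDI note formula: A4 (note 69) = 440hz.
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def freq_to_bin(freq_hz, bin_spacing_hz=50.0):
    # Which linearly spaced FFT bin a frequency falls into.
    return int(freq_hz // bin_spacing_hz)

# A2 (MIDI 45) and A#2 (MIDI 46) land in the same 50hz bin...
print(freq_to_bin(note_to_freq(45)), freq_to_bin(note_to_freq(46)))  # 2 2
# ...while C7 (MIDI 96) and C#7 (MIDI 97) sit several bins apart.
print(freq_to_bin(note_to_freq(96)), freq_to_bin(note_to_freq(97)))  # 41 44
```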

*(figure: spacing of notes across frequency bins)*

With our precision of 50hz bins, it would seem that anything below A5 (880hz) and G#5 (830hz) - the first pair of adjacent notes less than 50hz apart - is not accurately detectable. However, any given note also contributes a lesser amount to adjacent bins, which means we can still detect lower notes if we interpolate between several adjacent bins. With a 50hz bin spacing, notes can still be accurately detected down to steps of 15-20hz or so (around C4).

We can increase the number of frequency bins (and thus reduce the spacing, giving increased resolution) by increasing the length of sample data. This will decrease responsiveness since we need to wait longer for data before processing, so a good balance of parameters is necessary.

There are a few details worth paying attention to when implementing your waveform processing. Apply a Hanning Window to your sample data to prevent Spectral Leakage. Take care that your data at all steps is within value ranges that you expect. Work with input data of -1.0f to 1.0f, and use log() to display your FFT results in a more natural form for visualization/debugging. Retrieve local maxima of curves from the data of several adjacent bins for better estimation of low-frequency notes.
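The windowing and log-display steps above might look roughly like this (a minimal sketch using the textbook Hann window formula; input samples assumed in the -1.0 to 1.0 range as described):

```python
import math

def hann_window(samples):
    # Hanning window: taper the samples to zero at both ends so the
    # FFT sees a smoothly repeating signal, reducing spectral leakage.
    n = len(samples)
    return [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
            for i, s in enumerate(samples)]

def to_db(magnitude, floor=1e-12):
    # Log-scale a magnitude for visualization/debugging; the floor
    # avoids log(0) on silent bins.
    return 20.0 * math.log10(max(magnitude, floor))
```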

To determine the musical scale from this data, tracking which notes occur and checking them against candidate scales for the best match is sufficient. Tracking notes over a much longer period than your sample window lets you converge on the correct musical scale within a reasonable time. As always, the parameters need tuning to balance responsiveness to scale changes against accuracy of scale detection.
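A hypothetical sketch of that best-match check (the scoring scheme is my assumption; here each of the 12 major scales is scored by how many observed pitch classes it contains):

```python
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of a major scale

def best_major_scale(observed_pitch_classes):
    # observed_pitch_classes: counts per pitch class 0-11 accumulated
    # over a long window, e.g. {0: 12, 2: 8, ...}.
    best_root, best_score = 0, -1
    for root in range(12):
        members = {(root + step) % 12 for step in MAJOR_STEPS}
        # Score = how many observed notes belong to this candidate scale.
        score = sum(count for pc, count in observed_pitch_classes.items()
                    if pc in members)
        if score > best_score:
            best_root, best_score = root, score
    return best_root  # pitch class of the best-matching root (0 = C)
```

Accumulating counts rather than a plain set of notes means a single misdetected note won't flip the result, which is why the longer tracking window helps.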

So that's the general overview of how things are put together. The current musical note detection is fairly accurate for clean notes within a limited octave range. Human whistling and ocarina-style instruments give clean note detection. Guitars and pianos produce Harmonics over multiple frequencies, which need to be accounted for.

Next time I'll cover some of the trickery involved in generating synthesized audio data to play specific notes and melodies!