Early adventures in audio processing

I like pretty much any vaguely pop or rock song with a piano in it. Marianne Scholem (a friend of mine) sings songs while playing the piano, and occasionally she uploads recordings to her Soundcloud for those of us in the non-Canberra community.^* One of these songs is really, really good:

^*There's also an album to be released soon, though Coming Down won't be on it.

Look, I know that few people admit to sharing my taste in music. I still think the song is great. The opening bars of the intro are full-on piano goodness, and apart from a couple of (deliberate?) stutters, the song stays at that high quality start to finish. It should catapult Mezz to stardom.

Anyway, I pestered her for a copy of the song because the Soundcloud tag is "#rough recording" and... it would be nicer to listen to without all the breath noises. I wanted to try to iron it out a little, using (and maybe learning a bit of) Audacity.

After a couple of weeks of working at it on and off, I'm pretty happy with the results. It's obviously not at a level of "ready to be put on an album", but to my untrained ear, and on my cheap laptop speakers and headphones, it's an improvement on the original:

Your mileage may vary. For those who don't think I warped the piano sound too much, the rest of this post explains what I did. I had essentially zero previous experience in audio processing, and what follows is partly a bad introduction to audio, and partly a thrilling narrative of how I learned about Audacity function parameters and Nyquist programming to solve a niche problem that none of you are likely to encounter.

For people^* who want to play at home, I've created two downloadable excerpts.

^*The empty set takes the plural.

Excerpt 1:

Excerpt 2:

Things that don't work

The obvious plan of attack to remove the breath sounds is to separate the track into vocal and piano tracks, silence the vocal track wherever you hear breathing, then mix them back together. Can we isolate the vocal track? The most promising hope, apparently, is that the vocals are centre-panned in the stereo track, i.e., they are common (and equal) in both left and right channels, while the piano is a little off-centre. I don't know the recording process well enough to understand how this works, but if (say) there are two microphones, then you can imagine that sound coming from different parts of the piano reaches the two microphones at different times.

We can represent sound as a wave. Letting \( L(t) \) and \( R(t) \) be the left and right channels, we could denote by \( V(t) \) a common vocal part, and \( P_L (t) \) and \( P_R (t) \) the hoped-for asymmetric piano parts in left and right channels. The hope is that (not writing the time-dependence explicitly)

\[ L = V + P_L, \qquad R = V + P_R \]

is a good approximation to what's in the actual track. If the approximation is good, then we can eliminate the vocals by creating a new mono track

\[ M = L - R = P_L - P_R. \]

Does this work? Audacity has this feature ready-to-run in the Effect menu, "Vocal Remover (for center-panned vocals)". Running it on Excerpt 1, it does indeed remove the main vocal track, leaving just a ghostly choir of echoes which bounced off different walls of the recording room to hit the microphone(s) a little late and off-centre:

The quality of the piano sound has deteriorated (this is not so clear with my laptop speakers, but clear in my headphones), and there are a couple of static-y noises, but it's still recognisably the piano part of the song. Your brain may then leap enthusiastically to the idea that if we have the full track, and have a good approximation to the piano track, then we should be able to subtract the latter from the former to approximately isolate the vocals.

Alas, the maths is not kind to this intuition. Subtracting the mono track \( M \) from either channel of the original, or an average of the two channels, or both, always leaves something in addition to the common vocal part \( V \). You can try writing

\[ L - M = L - (L - R) = R = V + P_R \]

or some such, but nothing isolates \( V \): there are only two known signals, \( L \) and \( R \), and three unknowns, \( V \), \( P_L \) and \( P_R \). This is a dead-end.
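Both halves of this argument can be checked numerically. The sketch below is only an illustration under stated assumptions: the "vocal" and "piano" are made-up sine waves, the piano differs between channels by an arbitrary phase shift, and the mono track M is taken to be the left channel minus the right.

```r
# Made-up signals for illustration: a "vocal" common to both channels and a
# "piano" note that differs between channels by an arbitrary phase shift.
sr <- 8000                                   # sample rate (Hz)
t  <- (0:(sr - 1)) / sr                      # one second of sample times

V   <- 0.5 * sin(2 * pi * 440.0 * t)
P_L <- 0.3 * sin(2 * pi * 261.6 * t)
P_R <- 0.3 * sin(2 * pi * 261.6 * t + 0.4)

L <- V + P_L
R <- V + P_R

# Vocal removal: anything common to both channels cancels.
M <- L - R
vocal_left_in_M <- max(abs(M - (P_L - P_R)))   # ~0, up to rounding error

# Attempts to get the vocal back by subtracting M from the full signal:
candidates <- list(
  left_minus_M  = L - M,
  right_minus_M = R - M,
  mid_minus_M   = (L + R) / 2 - M
)
# Largest deviation of each candidate from the true vocal V (none is near 0):
residue <- sapply(candidates, function(x) max(abs(x - V)))

print(vocal_left_in_M)
print(round(residue, 3))
print(isTRUE(all.equal(L - M, R)))           # L - M is just the right channel
```

Note that M here is the difference of the two piano signals rather than the piano itself, which may be part of why the piano sounds degraded after vocal removal.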

There are commercial software packages that purport to isolate vocals, but I'm cautious about them and haven't risked my money. (It does seem to me like a problem that could be solved with enough careful study of the waveforms typical of a human voice and a piano, so I don't completely write off the possibility that these programs work. I just haven't tried them.)

Low-pass filters

Now we get on to the main tool that I used to reduce the breath sounds: the low-pass filter. The idea is that the unwanted noise has a fairly high frequency. Or, it has different frequency components, most of which are much higher than the notes being played on the piano. Middle C is around 261.6 Hz, and I usually set the cutoff for the low-pass filter at Audacity's default of 1000 Hz. (That the default value was actually what I wanted might be a coincidence, or perhaps I've stumbled onto a common problem and the Audacity writers know roughly what frequency is often used for this sort of thing. I honestly don't know.)

Trying to write my own filter

The low-pass filter is not how I (never having learned any signal processing) imagined it. It's not the case that all frequency components above cutoff (in the "stopband") are removed and all frequency components below cutoff (in the "passband") are left alone. My imagining would go something like: break the sound into short intervals, take a Fourier transform, remove unwanted frequency components, inverse Fourier transform. I tried writing some code to do this; the results are weirdly interesting in a "not actually what we want" kind of way:

There must be better implementations of an FFT-based filter out there, but I abandoned this path and just went with what Audacity does. For completeness, my (not particularly tidy) R code follows, but you can skip over it without missing anything that's assumed in the rest of the post.

library(tuneR)

smoothstep = function(x) {
  return(x*x*(3 - 2*x))
}

fftshift = function(v) {
  n = length(v)
  if (n %% 2 == 0) {
    shift_v = v[c((n/2 + 1):n, 1:(n/2))]
  } else {
    shift_v = v[c((ceiling(n/2)+1):n, 1:ceiling(n/2))]
  }
  return(shift_v)
}

ifftshift = function(v) {
  n = length(v)
  if (n %% 2 == 0) {
    shift_v = v[c((n/2 + 1):n, 1:(n/2))]
  } else {
    shift_v = v[c(ceiling(n/2):n, 1:floor(n/2))]
  }
  return(shift_v)
}

sound = readMP3("excerpt1.mp3")
# I'm only using the left channel here, but everything can be
# duplicated if you want stereo.
y = sound@left
y = y/max(abs(y))
sr = sound@samp.rate

dt = 1/sr

# In seconds:
t_interval = 0.2
t_crossfade = 0.02

# In samples:
N_crossfade = floor(sr * t_crossfade)
N = floor(sr * (t_interval + t_crossfade))

# The frequency resolution in Fourier space:
df = 1/(dt*N)

if (N %% 2 == 0) {
  f = seq(-N/2 * df, (N/2 - 1)*df, by=df)
} else {
  f = seq(-floor(N/2)*df, floor(N/2)*df, by=df)
}

# Consecutive slices overlap by N_crossfade samples, so that each crossfade
# blends two filtered versions of the same stretch of audio.
hop = N - N_crossfade
num_slices = floor((length(y) - N_crossfade) / hop)

# Vector which will hold the newly-created waveform:
y_filtered = numeric()

for (i in 1:num_slices) {
  # My terrible naming here is that y2 is the current interval
  # being analysed.
  slice_start = (i - 1)*hop + 1
  y2 = y[slice_start:(slice_start + N - 1)]

  fft_y2 = fftshift(fft(y2))
  mag_fft_y2 = abs(fft_y2)
  phase_fft_y2 = Arg(fft_y2)

  # Frequency components below f1 are retained; components above f2
  # are totally removed; smoothstep decay between f1 and f2.
  f1 = 1200
  f2 = 1800

  filter_freqs = numeric(length(f))
  filter_freqs[which(abs(f) < f1)] = 1
  indices_2 = which((abs(f) >= f1) & (abs(f) < f2))
  temp_f = f[indices_2]
  filter_freqs[indices_2] = 1 - smoothstep((abs(temp_f) - f1)/(f2 - f1))

  # Apply the filter to the magnitudes of the FFT and inverse-transform:
  new_mag_fft_y2 = filter_freqs * mag_fft_y2
  fft_y2_filtered = ifftshift(new_mag_fft_y2 * exp(1i*phase_fft_y2))
  y2_filtered = Re(fft(fft_y2_filtered, inverse=TRUE)/length(fft_y2_filtered))

  if (i == 1) {
    y_filtered = y2_filtered
  } else {
    # Crossfade with prev_y2_filtered
    start_cross = length(prev_y2_filtered) - N_crossfade + 1
    end_cross = length(prev_y2_filtered)

    cross_t = (1:N_crossfade) / N_crossfade

    crossed_y = cross_t * y2_filtered[1:N_crossfade] + (1 - cross_t) * prev_y2_filtered[start_cross:end_cross]

    # Indices refer to the ever-growing y_filtered:
    before_cross_indices = 1:(length(y_filtered) - N_crossfade)

    # Indices refer to the newly created segment y2_filtered:
    after_cross_indices = (N_crossfade + 1):length(y2_filtered)

    y_filtered = c(y_filtered[before_cross_indices], crossed_y, y2_filtered[after_cross_indices])
  }

  prev_y2_filtered = y2_filtered
}

# The overlapping slices above never run past the end of y, so no NA
# samples should arise; drop any just in case. (Logical indexing is safe
# when there are none, unlike y_filtered[-which(...)].)
y_filtered = y_filtered[!is.na(y_filtered)]

y_filtered = y_filtered / max(abs(y_filtered))

filtered_sound = Wave(y_filtered, samp.rate=sr, bit=32)
filtered_sound = normalize(filtered_sound, unit="32")
writeWave(filtered_sound, "excerpt1_fft.wav")

Audacity filters

The actual low-pass filters that Audacity applies are a sort of simulation of what a real filter made out of resistors and inductors and capacitors would do. Instead of breaking the signal into discrete chunks, the filter operates continuously. I haven't tried learning how it works (another time!), but the rough idea is that with the right combination of circuit components that work with alternating currents, the output:input amplitude ratio of the signal will vary with the frequency. The first-order Butterworth filter has amplitudes decaying at 6dB per octave above the cutoff frequency. (It is not a discontinuously sloping curve – at the cutoff frequency itself, the signal is reduced by 3dB.) When I learn about RLC circuits I will have to try to understand why it's 6dB and not something else.
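A rough sketch of where those figures come from, assuming the textbook (analogue) Butterworth magnitude response rather than whatever Audacity does internally: for a filter of order n, the squared gain is 1/(1 + (f/fc)^(2n)). Well above the cutoff a first-order filter's amplitude is roughly proportional to 1/f, so doubling the frequency halves the amplitude, and 20 log10(2) is about 6.02dB. The function name and the test frequencies below are made up for illustration.

```r
# Magnitude response of an ideal analogue Butterworth low-pass filter, in dB.
# f: frequency (Hz), fc: cutoff frequency (Hz), order: filter order.
butterworth_gain_db <- function(f, fc, order = 1) {
  -10 * log10(1 + (f / fc)^(2 * order))
}

fc <- 1000

# Gain at the cutoff frequency itself (the same for every order):
at_cutoff <- butterworth_gain_db(fc, fc)

# Roll-off per octave, measured well above the cutoff (16 kHz to 32 kHz),
# for several filter orders:
orders <- c(1, 2, 4, 6, 8)
rolloff_db_per_octave <- sapply(orders, function(n) {
  butterworth_gain_db(32000, fc, n) - butterworth_gain_db(16000, fc, n)
})

print(round(at_cutoff, 2))
print(data.frame(order = orders, rolloff = round(rolloff_db_per_octave, 1)))
```

Under this model each extra order adds roughly another 6dB per octave of roll-off.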

Higher-order Butterworth filters can have a roll-off of 12dB, 18dB, 24dB, ... per octave, of which Audacity gives us the options of 6, 12, 24, 36, and 48. Forty-eight decibels per octave is a pretty steep drop, so, returning to the goal of this whole post, the idea is to apply such a filter during the time the breath noise is audible. The selection in Audacity:

And the result, post-filter:

The breath sound has been successfully removed, but otherwise this is pretty terrible. Firstly, the piano sound changes dramatically when the filter is applied (and it changes back afterwards), and this is very jarring to the ear. Secondly, there's a very audible click artefact when the filter kicks in.

The click artefacts are the easiest to deal with. Rather than having a filter apply immediately, thus risking discontinuities in the waveform, we introduce a short crossfade between the original track and the filtered track. Implementing these crossfades is a solved problem, and all we need to do is follow the instructions at the link to add an extra Nyquist plug-in to Audacity. (This is not as intimidating as it might sound: download the .ny file, copy into the plug-ins directory, and Audacity will automatically add it to the Effect menu when it next loads.) The author of this plug-in says that "around 20 milliseconds is usually sufficient to avoid clicks." Here is the same filter applied as earlier, this time with the 20ms crossfade:

Part of the problem is that the breath starts soon after the piano keys are struck. While the sound feels quite odd while the filter is on, perhaps it won't be quite so bad when it doesn't start mid-note. The start of each note is usually quite clear in the waveform: it's where the amplitude starts to rise. From the earlier screenshot, I only need to have the selection start a few pixels to the left to get the start of the piano note. The result:

This is starting to sound promising. OK, the piano sounds weird for a beat, but (at least to me) it's a big improvement on starting to sound weird in the middle of a beat.

Keeping the piano sounding like it should is a challenge, and not one that I solved completely. When an instrument's string vibrates, we hear the note as the fundamental frequency, but various harmonics are also generated, and these harmonics are part of what makes any instrument sound more interesting than a tuning fork. If we try to damp the high frequencies to remove the breath sound, we damp some of these harmonics, and this is an unavoidable trade-off using a low-pass filter.
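To put rough numbers on that trade-off, here is a small sketch. It assumes an idealised string whose harmonics are exact integer multiples of the fundamental (real piano strings are slightly off), the textbook Butterworth response rather than Audacity's actual implementation, and that a 48dB/octave roll-off corresponds to an eighth-order filter. The names are made up for illustration.

```r
# Ideal analogue Butterworth low-pass gain in dB (an assumed model).
butterworth_gain_db <- function(f, fc, order = 1) {
  -10 * log10(1 + (f / fc)^(2 * order))
}

fundamental <- 261.6                 # middle C, in Hz
harmonics   <- fundamental * 1:8     # idealised harmonic series
cutoff      <- 1000                  # filter cutoff, in Hz

attenuation <- data.frame(
  harmonic      = 1:8,
  frequency_hz  = harmonics,
  gain_6db_oct  = round(butterworth_gain_db(harmonics, cutoff, order = 1), 1),
  gain_48db_oct = round(butterworth_gain_db(harmonics, cutoff, order = 8), 1)
)
print(attenuation)
```

In this model the gentle filter trims the upper harmonics by a few decibels, while the steep one all but removes everything above the third harmonic.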

(One more involved possibility which I haven't tried: work out, either by ear or by Fourier transform, the note being played on the piano. Then instead of a low-pass filter, apply multiple band filters that only let the fundamental and harmonic frequencies from the piano note pass.)

The approach I adopted was to use Audacity's default roll-off of 6dB per octave. This is not enough to remove the breath sounds entirely, but it is enough to make them less prominent and less distracting to me, while not breaking the piano sound too badly:

But we can do better! The breath sound in this excerpt covers most but not all of two eighth-notes on the piano. The second of those eighth-notes will need to be sacrificed to the filter, but the first can be partially salvaged. In one of the earlier filter attempts, I applied the filter mid-note, and the change was dramatic and jarring to the ear. In general, this jarring is less severe with a 6dB/octave rather than a 48dB/octave roll-off, but it's still noticeable and undesirable.

A better thing to try is to increase the length of the crossfade in. (The crossfade out stays at 20ms – the breath usually ends just before the next note, so we want to go back to the original sound as quickly as possible.) The idea is that the start of the piano note – the loudest part – remains relatively unfiltered, but by the time the breath noise starts, the filter is being applied fully. The long crossfade in means that the transition to filtered-piano is not so jarring. In the excerpt I've been working with, I can apply a 70ms crossfade in, though elsewhere in the full track I used up to a 500ms crossfade.
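The asymmetric crossfade can be sketched as a piecewise-linear envelope giving the weight of the original sound at each sample. This is an illustration only: the function name is hypothetical, and the two constant vectors stand in for the real original and filtered audio.

```r
# Weight given to the ORIGINAL sound at each sample of the selection:
# 1 at the start, ramping down to 0 over fade_in seconds, held at 0, then
# ramping back up to 1 over the final fade_out seconds.
crossfade_envelope <- function(n, sr, fade_in, fade_out) {
  d <- (n - 1) / sr                          # time of the last sample (s)
  stopifnot(fade_in > 0, fade_out > 0, fade_in + fade_out < d)
  t <- (0:(n - 1)) / sr
  approx(x = c(0, fade_in, d - fade_out, d),
         y = c(1, 0, 0, 1),
         xout = t)$y
}

sr <- 44100
n  <- sr                                     # a one-second selection
env <- crossfade_envelope(n, sr, fade_in = 0.070, fade_out = 0.020)

original <- rep(1.0, n)                      # stand-ins for real audio
filtered <- rep(0.5, n)

# Original at the edges of the selection, filtered in the middle:
mixed <- env * original + (1 - env) * filtered
```

A longer fade_in simply stretches the first ramp, so the start of the note is left mostly untouched while the filter is fully applied by the time the breath begins.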

(And one extra step: to try to keep the volume constant, immediately following the filter, I amplify the selection by a little bit. Around 1.2dB amplification seemed to work pretty well, though if you listen to my full edit near the top of this page, I think I needed a bit more amplification near, e.g., the 3:05 mark.)
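For a sense of scale, decibels of amplification convert to an amplitude multiplier via 10^(dB/20), so 1.2dB is a boost of about 15%. A tiny sketch, with made-up helper names:

```r
# Convert between decibels and a linear amplitude multiplier.
db_to_linear <- function(db) 10^(db / 20)
linear_to_db <- function(x) 20 * log10(x)

gain <- db_to_linear(1.2)      # amplitude multiplier for +1.2 dB
print(round(gain, 3))
```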

Here is my final version of excerpt 1:

I'm happy with this. The first of the two breath-overlapping piano notes sounds almost right to my ear, and the only thing properly wrong is the second of those piano notes. But it's only a little eighth-note and I don't notice the filter if I'm not paying attention.

The plug-in I linked to earlier doesn't handle differing in and out crossfade times, so I learned some Nyquist programming to implement this feature. Apparently Nyquist also supports a more conventional-looking syntax (SAL), but all the examples I saw were in the older Lisp style, and so I learnt a little Lisp. What follows is a terribly written plug-in, which assumes that you enter sensible parameters and doesn't bother checking them for compatibility with one another and with the length of the current selection.

The basic Lisp syntax is usually (operation thing1 thing2 ...). So to calculate 2 + 3, you'd write (+ 2 3). The variable s refers to the sound in the current selection; all the various functions can be looked up in the Nyquist Reference Manual. To run this plug-in, copy the text, save to a .ny file in the Audacity/Plug-Ins directory, and launch Audacity.


;nyquist plug-in
;version 3
;type process
;name "Low-pass, amplify..."
;action "Applying filter..."
;info "David Barry, based on plugin by Steve Daulton\n"

;control cutoff "Cutoff freq (Hz)" real "" 1000 0 10000
;control amp "Amplification" real "dB" 0.0 -24.0 24.0
;control xfade-in "Crossfade filter in (ms)" real "" 0 0 500
;control xfade-out "Crossfade filter out (ms)" real "" 0 0 500

;; lowpass_then_amplify.ny by David Barry, based on work by Steve Daulton.
;; Released under terms of the GNU General Public License version 2:
;; http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
;;
;; For information about writing and modifying Nyquist plug-ins:
;; http://wiki.audacityteam.org/wiki/Nyquist_Plug-ins_Reference

;;; envelope for crossfading
(defun envelope (dur1 dur2)
  (let ((d (get-duration 1)))
    (abs-env 
      (control-srate-abs *sound-srate*
        (pwlv 1 dur1 0 (- d dur2) 0 d 1)))))

(setq xfade-in (max (/ xfade-in 1000.0) 0))
(setq xfade-out (max (/ xfade-out 1000.0) 0))

(sim (mult (envelope xfade-in xfade-out) s)
    (mult (diff 1 (envelope xfade-in xfade-out))
        (mult (db-to-linear amp) (lp s cutoff))))

The above plug-in is what I used in its final form, and the next part of this post is just applying it to Excerpt 2, which has four breaths to remove/reduce:

In all cases I use the 1000 Hz cutoff, 1.2 dB amplification, and 20ms crossfade out. The only parameter being varied is the crossfade in.

The first breath starts near the end of the first piano note in this selection:

The breath occurs late in the note, so I used a 500ms crossfade in.

The second breath occurs soon after the start of a piano note; 100ms crossfade in, and since the breath ends before the next piano note, the filter is quite seamless:

The third breath starts just before the loud piano note in this selection:

This is not ideal, since it means that the loud piano note is fully filtered. I used a 400ms crossfade in.

The final breath occurs a little after the piano note in the following selection; I used a 180ms crossfade in, but even so I can hear the distortion of the piano:

Generally speaking, I think it's harder to make the filtering subtle when the piano is loud, but even in these cases I prefer the filtered sound to the original. Excerpt 2, post-filtering:

Good enough for me not to be embarrassed by putting it on the web.

(Working through this excerpt carefully for this post, I think it's actually an improvement on what's in the corresponding part of the full track in the embedded Soundcloud near the top of this page. But I'm not much of an audiophile, and I can live with the blemishes that I uploaded.)

Miscellaneous comments

Careful listeners will have noticed that I made two cuts to the original track (the first is pretty subtle, around 2:08; the second is more obvious at 4:01). When cutting audio in Audacity, highlight the selection you want to remove, then hit the Z key, then do the cut. The extra step makes Audacity adjust the bounds of the selection so that the waveform is near zero on both channels, thus avoiding a click artefact from a discontinuity. (There's a click at 2:07, which almost matches up with my cut, but it is present in the original audio.)

Sometimes I get annoying sounds (not clicks) at the crossfade out. I don't know why, and sometimes I have to fiddle with the end of the selection quite a while before I get something that sounds sufficiently un-annoying to move on.

When editing a track, I just sit down, work my way through, and 30 minutes or an hour later I'm done. But perhaps when listening to it back afterwards, I find a mistake: the most common one was me not realising that I'd filtered out some vocals. In such cases I don't want to undo a dozen carefully-worked filters, but instead just paste over the bad edit with the corresponding part of the original track.

I figured I could do this in R: load the original track and the edited track, and just copy over the relevant part of the wave vector (with a crossfade to avoid clicks). Since the original and edited tracks are of different lengths (I made a couple of cuts; and sometimes an export to MP3 seemed to produce a bit of extra silence at the start), this requires doing a search through the vector to find which indices you want to copy over. I wrote a simple loop over a range of possible start points, comparing each proposed interval with the interval I wanted to match. I thought R would take ages to perform this loop (44100 samples per second!), but it is actually very fast by R standards, running in a couple of seconds.
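A search of this kind might look something like the sketch below. The matching criterion (sum of squared differences), the function name and the toy data are all assumptions for illustration, not necessarily what the original loop did.

```r
# Find the start index in `track` where `pattern` (a short run of samples
# taken from the other version of the track) matches best, trying only the
# start positions listed in `candidates`.
find_offset <- function(track, pattern, candidates) {
  m <- length(pattern)
  # Keep only start positions where the whole pattern fits inside the track.
  candidates <- candidates[candidates >= 1 &
                           candidates + m - 1 <= length(track)]
  stopifnot(length(candidates) > 0)
  errors <- sapply(candidates, function(s) {
    sum((track[s:(s + m - 1)] - pattern)^2)
  })
  candidates[which.min(errors)]
}

# Toy data: a "track" of noise, and a pattern copied from position 1234.
set.seed(1)
track   <- rnorm(5000)
pattern <- track[1234:(1234 + 199)]

offset <- find_offset(track, pattern, candidates = 1000:1500)
print(offset)
```

With real audio the two versions won't match exactly (filters, MP3 encoding), so the minimum error would be small rather than zero.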
I'd never paid close attention to the sound coming out of my computer before this exercise, and I'd never noticed a difference between MP3 files at 128kbps and 256kbps. But I eventually noticed that my early efforts, which I exported at 128kbps, were introducing weird artefacts to the sound, almost like someone was rustling paper (very regularly!) during some of the high notes. Subsequently I saved WAV files and exported to MP3 at 256kbps.

I don't have much of a conclusion. Sound is quite complicated and sometimes unintuitive; Coming Down is a really good song.

Posted 2014-08-13.