What methods can be used to identify and remove echo from an audio system?

Background

I am designing a system that will have a single small microphone and speakers for use in a phone type setting. Easiest example I can give is a Skype conversation where you are using your computers speakers and a desktop microphone.

I am worried about the audio from the speakers getting picked up by the microphone and sent back to the original person. I used to hear this happen all the time in the early days of VoIP conversations, but hardly hear it any more.

My assumption is that groups have come up with ways to cancel out the echo, but how do they do it?

Approaches

My first thought was to just simply subtract the signal being sent to the speakers from the microphone signal, except with this method you have to be concerned with the delay. I am not sure how to determine what the delay is with out some sort of pre-calibration, which I would like to avoid. There is also the issue of how much to scale the signal by before subtracting it.

I next thought about doing some sort of correlation between the speaker signal and the mic signal in order to determine the likelihood of the mic signal being an echo as well as being able to determine the actual delay. This method was able to work alright when I was playing with some recorded signals, but there seemed to be far to large of latency in computing the correlation to be useful in real-time system. Also the adjustable volume on the speakers made it difficult to determine if something was actually correlated or not.

My next thought that there must be someone on the internet who has done this before with success, but didn't find any great examples. So I come here to see what methods can be used to solve this type of issue.

Answer

You are correct. Many methods of echo cancellation exist, but none of them are exactly trivial. The most generic and popular method is echo cancellation via an adaptive filter. In one sentence, the adaptive filter's job is to alter the signal that it's playing back by minimizing the amount of information coming from the input.

Adaptive filters

An adaptive (digital) filter is a filter that changes its coefficients and eventually converges to some optimal configuration. The mechanism for this adaptation works by comparing the output of the filter to some desired output. Below is a diagram of a generic adaptive filter:

enter image description here

As you can see from the diagram, the signal $x[n]$ is filtered by (convolved with) $\vec{w}_n$ to produce output signal $\hat{d}[n]$ . We then subtract $\hat{d}[n]$ from the desired signal $d[n]$ to produce the error signal $e[n]$ . Note that $\vec{w}_n$ is a vector of coefficients, not a number (hence we don't write $w[n]$ ). Because it changes every iteration (every sample), we subscript the current collection of these coefficients with $n$ . Once $e[n]$ is obtained we use it to update $\vec{w}_n$ by an update algorithm of choice (more on that later). If input and output satisfy a linear relationship that does not change over time and given a well-designed update algorithm, $\vec{w}_n$ will eventually converge to the optimal filter and $\hat{d}[n]$ will be closely following $d[n]$ .

Echo cancellation

The problem of echo cancellation can be presented in terms of a adaptive filter problem where we're trying to produce some known ideal output given an input by finding the optimal filter satisfying the input-output relationship. In particular, when you grab your headset and say "hello", it's received on the other end of the network, altered by acoustic response of a room (if it's being played back out loud), and fed back into the network to go back to you as an echo. However, because the system knows what the initial "hello" sounded like and now it knows what the reverberated and delayed "hello" sounds like, we can try and guess what that room response is using an adaptive filter. Then we can use that estimate, convolve all incoming signals with that impulse response (which would give us the estimate of the echo signal) and subtract it from what goes into the microphone of the person you called. The diagram below shows an adaptive echo canceller.

enter image description here

In this diagram, your “hello” signal is $x[n]$ . After being played out of a loudspeaker, bouncing off the walls and getting picked up by the device’s microphone it becomes an echoed signal $d[n]$ . The adaptive filter $\vec{w}_n$ takes in $x[n]$ and produces output $y[n]$ which after convergence should be ideally tracking echoed signal $d[n]$ . Therefore $e[n]=d[n]-y[n]$ should eventually go to zero, given that nobody is talking on the other end of the line, which is usually the case when you’ve just picked up the headset and said “hello”. This is not always true, and some non-ideal case consideration will be discussed later.

Mathematically, the NLMS (normalized least mean square) adaptive filter is implemented as follows. We update $\vec{w}_n$ every step using the error signal of the previous step. Namely, let

$\vec{x}_n = \left ( x[n], x[n-1], \ldots , x[n-N+1] \right)^T$

where $N$ is the number of taps (samples) in $\vec{w}_n$ . Notice what samples of $x$ are in reverse order. And let

$\vec{w}_n = \left ( w[0], w[1], \ldots , x[N-1] \right )^T$

Then we calculate $y[n]$ via (by convolution) finding the inner product (dot product if both signals are real) of $= \vec{x}_n$ and $= \vec{w}_n$ :

$y[n] = \vec{x}_n^T \vec{w}_n = \vec{x}_n \cdot \vec{w}_n$

Now that we can calculate the error, we’re using a normalized gradient descent method for minimizing it. We get the following update rule for $\vec{w}$ :

$\vec{w}_{n+1} = \vec{w}_n + \mu \vec{x}_n \frac{e[n]}{ \vec{x}_n^T \vec{x}_n}= \vec{w}_n + \mu \vec{x}_n \frac{\vec{x}_n^T \vec{w}_n - d[n]}{ \vec{x}_n^T \vec{x}_n}$

where $\mu$ is the adaptation step size such that $0 \leq \mu \leq 2$ .

Real life applications and challenges

Several things can present difficulty with this method of echo cancellation. First of all, like mentioned before, it is not always true that the other person is silent whilst they receive your “hello” signal. It can be shown (but is beyond the scope of this reply) that in some cases it can still be useful to estimate the impulse response while there is a significant amount of input present on the other end of the line because input signal and echo are assumed to be statistically independent; therefore, minimizing the error will still be a valid procedure. In general, a more sophisticated system is needed to detect good time intervals for echo estimation.

On the other hand, think of what happens when you’re trying to estimate echo when the received signal is approximately silence (noise, actually). In absence of a meaningful input signal, the adaptive algorithm will diverge and quickly start producing meaningless results, culminating eventually in a random echo patter. This means that we also need to take into consideration speech detection. Modern echo cancellers look more like the figure below, but above description is the jist of it.

enter image description here

There are plenty of literature on both adaptive filters and echo cancellation out there, as well as some open source libraries you can tap into.

Notes

Thursday, May 25, 2017