Non-negative matrix factorization for audio separation

Thursday, April 6, 2017

Non-negative matrix factorization for audio separation - why does it work?

Non-Negative Matrix Factorization aims to factorize a matrix $\mathbf V$ into the product of two matrices, $\mathbf V = \mathbf W\mathbf H$, where $\mathbf W$ represents a set of basis vectors and $\mathbf H$ their activations or weights. Applied to audio source separation, $\mathbf V$ is the spectrogram (which is known to us), and $\mathbf W$ is the basis non-negative spectra, $\mathbf H$ is their temporal activations.

This seems to be me to be able to express a spectrogram as a sum of different frequency bins and their magnitudes - which would help to filter or separate different pitches (seemingly doing the job of a normal band-pass filter?). I do not understand how this would actually separate, say, a piano and a singer singing the same or similar pitch.

Answer

Recall that the columns of $\mathbf{W}$ can be thought of as "basis" vectors (or elements of a dictionary - the building blocks of any signal) and elements in each column in $\mathbf{H}$ give the corresponding weights (that vary over time). This allows us to decompose the spectrogram based on not just frequency components but also temporal onset information; so it does more than what a simple bandpass filter or comb filter would do.

Non-negative matrix factorization of a spectrogram will not magically separate a piano and a singer singing in the same pitch. However it gives a useful approximation in terms of a sum of weighted basis vectors which (hopefully) can be split up into contributions from different sources because it's unlikely for each source to occupy exactly the same frequency bin at exactly the same time instant.

More concretely, let $\mathbf{V}$ be $M\times N$, $\mathbf{W}$ be $M \times K$ and $\mathbf{H}$ be $K \times N$. So we have $M$ frequency bins, $N$ time samples and $K$ decomposed components. If $\mathbf{w}_i$ are the columns of $\mathbf{W}$ and $\mathbf{h}_i$ are the rows of $\mathbf{H}$ we can write: $$ \mathbf{V} \approx \sum_{i=1}^K \mathbf{w}_i \mathbf{h}_i^T. $$ If we know there are only two sources in the recording (piano and singer) we can try to split up their contributions by choosing subsets of the columns of $\mathbf{W}$ and corresponding subset of the rows of $\mathbf{H}$. So if $S \subset \{1,\ldots,K\}$ then the part contributed by the piano can be denoted by: $$ \mathbf{V}_{\mbox{piano}} = \sum_{i \in S} \mathbf{w}_i \mathbf{h}_i^T $$ and the part contributed by the singer is: $$ \mathbf{V}_{\mbox{singer}} = \sum_{i \in \{1,\ldots,K\}\setminus S} \mathbf{w}_i \mathbf{h}_i^T $$

In reality, we will likely end up with a decomposition which never achieves this separation exactly. That is, there will be $\mathbf{w}_i$'s that have contributions from both the singer and the piano making it difficult to separate the two.

Here's a Python notebook showing this procedure for a mixture of drums and guitar: http://nbviewer.jupyter.org/gist/ingle/93de575aac6a4c7fe9ee5f3d5adab98f (Or if that doesn't work, here: https://gist.github.com/ingle/93de575aac6a4c7fe9ee5f3d5adab98f)

Note that the NMF algorithm only generates a decomposition. It cannot select subsets of $\{\mathbf{w}_i\}$ corresponding to each source. In the Python example, there's a manual step of figuring out which $\mathbf{w}_i$'s sound most like the guitar v/s drum. One can perhaps automate this step by noting that the drum's $\mathbf{w}_i$ vectors have more stuff in lower frequencies.

Analyzing each column (frame) of the spectrogram using a bank of bandpass filters is another way of decomposing the spectrogram. However, note that the decomposition generated by NMF is low rank i.e. parsimonious. In the Python example, it was much easier to manually select subsets of 16 columns of $\mathbf{W}$ corresponding to the two sources. With a bank of bandpass filters we would have had to turn many more knobs (# filters, locations of pass-bands for each frame) and the number of parameter combinations can grow very quickly.

References:

https://ccrma.stanford.edu/~njb/teaching/sstutorial/part2.pdf

http://musicinformationretrieval.com/nmf.html

Notes

Thursday, April 6, 2017

Non-negative matrix factorization for audio separation - why does it work?

No comments:

Post a Comment

periodic trends - Comparing radii in lithium, beryllium, magnesium, aluminium and sodium ions