PESCE: A system for isolating the singing voice in polyphonic recordings


My thesis is titled Singer Identification in Polyphonic Music.  Part of the thesis will be published in the IEEE Transactions on Speech and Audio Processing as "Singing voice identification using spectral envelope estimation."  In this paper, we present the composite transfer function as a method for identifying singers.  On a database of twelve singers performing without accompaniment, we achieve 95% classification accuracy in a baseline case.

The next step is to address the "polyphonic" part of the thesis title.  One of my hypotheses is that separating the singing voice from other instruments will improve the performance of a singing voice classifier.  To this end, I have developed PESCE, a system for isolating the singing voice in polyphonic recordings.  In short, PESCE identifies the fundamental frequency of the singing voice by relying on the common frequency modulation of the voice partials.  From this estimate, it is straightforward to identify the instantaneous amplitude and frequency of the partials and, if desired, to resynthesize the voice.  PESCE stands for "Peak-Edge-Strand-Complex Extractor," which names the sequence of steps that lead to the final fundamental frequency estimate.

PESCE was designed to work rather than to embody some set of unified theoretical principles.  If approached as an optimization problem, the complexity rapidly becomes astronomical.  PESCE is therefore composed of a series of (often greedy) heuristics for finding a reasonable solution to the voice separation problem.

PESCE was designed for a somewhat constrained set of signals.  Namely, it works best on a classically trained voice accompanied by piano.  Still, we expect that it will work in broader contexts that involve relatively simple accompaniment, like guitar-accompanied folk music.  The class of signals on which PESCE works well is still under investigation.

Let's look at an example of how PESCE works.  In this example, we'll apply PESCE to a sound clip of Thomas Hampson singing a phrase from Schubert's Winterreise.  The first step is the computation of the spectrogram of the signal.  The spectrogram is shown in the following figure.
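As a rough illustration of this first step, here is a minimal sketch of a magnitude spectrogram computed with NumPy.  The window length, hop size, Hann window, and the function name `spectrogram` are assumptions made for the example; they are not PESCE's actual analysis parameters.

```python
import numpy as np

def spectrogram(x, sr, win_len=1024, hop=256):
    """Magnitude short-time Fourier transform in dB.

    Returns (S_db, freqs_hz, times_s), with S_db shaped (freq bins, frames).
    win_len and hop are illustrative choices, not PESCE's settings.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([x[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T
    S_db = 20 * np.log10(mag + 1e-10)  # small offset avoids log(0)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / sr)
    times = (np.arange(n_frames) * hop + win_len / 2) / sr
    return S_db, freqs, times

# Usage: a one-second 440 Hz test tone sampled at 8 kHz.
sr = 8000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
S_db, freqs, times = spectrogram(x, sr)
```

The later sketches on this page assume the same layout: rows are frequency bins and columns are time frames.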

Once we've computed the spectrogram, we then compute estimates of the instantaneous amplitude, frequency, and bandwidth for every sample of the spectrogram.  We use these estimates, in conjunction with the spectrogram itself, to identify some set of peaks.  Generally, we want peaks which are not too close (in frequency), and which are likely to correspond to an actual partial in the signal. We use a bandwidth-based peak elimination method to remove many of our excess peaks.  On the following figure, the peaks are displayed as black dots.
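The peak-picking idea can be sketched for a single spectrogram frame as follows.  This is a simplified stand-in: it keeps local maxima above a floor and greedily enforces a minimum spacing in frequency, strongest peak first.  It does not reproduce the bandwidth-based elimination described above, and `min_sep_bins` and `floor_db` are hypothetical parameters.

```python
import numpy as np

def pick_peaks(frame_db, min_sep_bins=3, floor_db=-60.0):
    """Return sorted bin indices of well-separated local maxima in one frame."""
    frame_db = np.asarray(frame_db, dtype=float)
    interior = np.arange(1, len(frame_db) - 1)
    # A candidate is higher than its left neighbour and not lower than its right.
    is_max = ((frame_db[interior] > frame_db[interior - 1])
              & (frame_db[interior] >= frame_db[interior + 1]))
    cand = interior[is_max & (frame_db[interior] > floor_db)]
    kept = []
    # Visit candidates from strongest to weakest; drop any that crowd a kept peak.
    for k in cand[np.argsort(-frame_db[cand])]:
        if all(abs(k - j) >= min_sep_bins for j in kept):
            kept.append(int(k))
    return sorted(kept)

# Usage: three local maxima, two of them only two bins apart.
frame = [-80, -20, -80, -30, -80, -80, -80, -10, -80]
peaks = pick_peaks(frame)
```

In this toy frame the weaker of the two neighbouring maxima is discarded, which is the "not too close in frequency" requirement in its simplest form.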

At this point, most traditional speech processors would perform McAulay-Quatieri peak tracking to isolate individual partials.  However, this method typically fails for signals that have significant frequency modulation, like vibrato in the singing voice.  Thus, we adopt a graph-based method for identifying partials.  For each peak, we identify a set of candidate "next peaks" and hypothesize the existence of edges between the present peak and each of these candidates.  Then we score each edge by determining the minimum (interpolated) value on the spectrogram along a line between the two peaks.  Edges are then removed according to a series of heuristic rules designed to leave at most one incoming and one outgoing edge at each peak.  The following figure displays the selected set of edges along with their scores.  Lower scores correspond to "stronger" edges.
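The quantity underlying the edge score, the minimum interpolated spectrogram value along the line joining two peaks, can be sketched as below.  The use of bilinear interpolation, the number of samples along the line, and the function names are assumptions.  The text does not say how this minimum is mapped to the final score (for which lower means stronger), so the sketch stops at the minimum itself.

```python
import numpy as np

def bilinear(S, f, t):
    """Interpolate S (freq bins x frames) at fractional coordinates (f, t)."""
    f0, t0 = int(np.floor(f)), int(np.floor(t))
    f1 = min(f0 + 1, S.shape[0] - 1)
    t1 = min(t0 + 1, S.shape[1] - 1)
    af, at = f - f0, t - t0
    return ((1 - af) * (1 - at) * S[f0, t0] + af * (1 - at) * S[f1, t0]
            + (1 - af) * at * S[f0, t1] + af * at * S[f1, t1])

def min_along_edge(S, peak_a, peak_b, n_samples=16):
    """Minimum interpolated value of S on the segment from peak_a to peak_b.

    Each peak is a (freq_bin, frame) pair.
    """
    (fa, ta), (fb, tb) = peak_a, peak_b
    return min(bilinear(S, fa + u * (fb - fa), ta + u * (tb - ta))
               for u in np.linspace(0.0, 1.0, n_samples))

# Usage: a flat ridge along bin 2, and the same ridge with a gap at frame 1.
S = np.zeros((5, 4))
S[2, :] = 10.0
S_gap = S.copy()
S_gap[2, 1] = 0.0
ridge_min = min_along_edge(S, (2, 0), (2, 3))
gap_min = min_along_edge(S_gap, (2, 0), (2, 3))
```

An edge that follows a continuous partial keeps a high minimum, while an edge that crosses a gap between unrelated peaks sees the minimum drop sharply.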

Edges are then connected into strands.  Following the notion of a synchrony strand as introduced by Martin Cooke, we define a strand as a sinusoid with time-varying amplitude and frequency.  The edge selection performed in the previous step makes strand identification easy: any series of peaks connected by edges forms a strand.  The following figure displays the strands colored by instantaneous amplitude.
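Because each peak has at most one incoming and one outgoing edge, strands are simple chains, and they can be recovered by walking forward from every peak that has no incoming edge.  The sketch below assumes peaks are identified by hypothetical integer ids and that edges always point forward in time, so no cycles occur.

```python
def build_strands(edges):
    """Group peaks into strands.

    edges: iterable of (src, dst) peak ids, with at most one outgoing and
    one incoming edge per peak.  Returns a list of strands, each a list of
    peak ids in time order.
    """
    nxt = dict(edges)                # src -> dst
    has_incoming = set(nxt.values())
    strands = []
    for start in nxt:
        if start in has_incoming:
            continue                 # not the first peak of its strand
        strand = [start]
        while strand[-1] in nxt:     # follow outgoing edges to the end
            strand.append(nxt[strand[-1]])
        strands.append(strand)
    return strands

# Usage: two separate chains of peaks.
strands = build_strands([(0, 1), (1, 2), (5, 6)])
```

Peaks with no edges at all do not appear in the output; whether PESCE keeps such isolated peaks as one-point strands is not stated in the text.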

The final step in PESCE is to combine strands into harmonic complexes, henceforth "complexes."  To do this, we rely on the common frequency modulation of harmonics from a voice signal.  For each pair of overlapping strands, we compute a subset correlation score.  Our correlation score is computed as a function of the normalized correlation between the log frequency of the strands (higher is better) and the length of the strands (longer is better).  This score is then computed for all connected subsets of the overlap between the two strands, and the subset with maximum score is chosen.  Using a subset-based score such as this allows the system to "correct" for errors in the edge selection and strand formation steps.  Strands are combined sequentially by iteratively choosing the highest-scoring pair and combining them to form a complex.  The formed complex has some computed fundamental frequency, and it can be combined with strands and other complexes in the same way.  This iterative combination proceeds until a score threshold is reached.  The following figure displays the resulting complexes.  The bright red lines correspond to the fundamental frequency of each complex, while the darker lines show the strands which were combined to produce those complexes.
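A subset correlation score of this kind might look like the following sketch.  The text only says that the score grows with the normalized correlation of the log frequencies and with the length of the subset; the specific combination used here (correlation multiplied by the logarithm of the length) and the minimum subset length are assumptions made for illustration.  "Connected subsets" are taken to mean contiguous runs of frames within the overlap.

```python
import numpy as np

def normalized_corr(a, b):
    """Pearson correlation of two equal-length sequences (0 if either is flat)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0

def best_subset_score(f1, f2, min_len=4):
    """Best score over all contiguous runs of the overlap.

    f1, f2: frequencies in Hz of two strands on their common frames.
    Returns (score, (start, stop)) for the winning run.
    The weighting corr * log(length) is a hypothetical choice.
    """
    l1, l2 = np.log(f1), np.log(f2)
    best = (-np.inf, None)
    n = len(l1)
    for i in range(n - min_len + 1):
        for j in range(i + min_len, n + 1):
            score = normalized_corr(l1[i:j], l2[i:j]) * np.log(j - i)
            if score > best[0]:
                best = (score, (i, j))
    return best

# Usage: a partial with vibrato and its second harmonic share the same modulation.
t = np.arange(40)
f1 = 220.0 * 2 ** (0.05 * np.sin(2 * np.pi * t / 10))
f2 = 2.0 * f1
score, window = best_subset_score(f1, f2)
```

Working in log frequency means harmonics of the same voice differ only by a constant offset, so their correlation is high regardless of harmonic number.  This brute-force search is cubic in the overlap length, which is acceptable for a sketch but not necessarily how PESCE evaluates the subsets.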

The calculated fundamental frequency of the resulting complexes allows us to estimate the instantaneous amplitude and frequency of the voice signal from the original spectrogram.  These estimates can then be resynthesized to produce a rendition of the singing voice without the accompaniment.  The original and resynthesized sounds can be heard below.
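One common way to resynthesize from such estimates is additive sinusoidal synthesis: interpolate the frame-rate tracks up to the sample rate, integrate the frequency to get phase, and sum one cosine per harmonic.  The sketch below assumes this approach and assumes that partials sit at exact integer multiples of the fundamental; the text does not state which resynthesis method PESCE uses, and the function and parameter names are hypothetical.

```python
import numpy as np

def resynthesize(f0, amps, frame_times, sr):
    """Additive resynthesis of a harmonic signal.

    f0:          (frames,) fundamental frequency in Hz
    amps:        (partials, frames) linear amplitude of each harmonic
    frame_times: (frames,) increasing frame times in seconds
    """
    n = int(round(frame_times[-1] * sr)) + 1
    t = np.arange(n) / sr
    f0_s = np.interp(t, frame_times, f0)          # per-sample frequency
    phase = 2 * np.pi * np.cumsum(f0_s) / sr      # integrate frequency
    y = np.zeros(n)
    for k, a in enumerate(amps, start=1):
        a_s = np.interp(t, frame_times, a)        # per-sample amplitude
        a_s[k * f0_s >= sr / 2] = 0.0             # mute partials above Nyquist
        y += a_s * np.cos(k * phase)
    return y

# Usage: a single steady partial at 200 Hz lasting one second.
sr = 8000
y = resynthesize(np.full(3, 200.0), np.ones((1, 3)),
                 np.array([0.0, 0.5, 1.0]), sr)
```

Because only the voice partials are synthesized, anything in the original recording that was not assigned to a voice complex is absent from the output.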


hampson.wav


hampson_resynth.wav

 

