
Background

Identifying and differentiating the voices of individual singers is a task that comes naturally to humans. Consider the following recordings of two singers:

Singer 1

Singer 2


Clearly, Singer 1 and Singer 2 are not the same person. Even after hearing only a short excerpt of each singer, we can distinguish between the two. We can deduce information about the singers, like their sexes or perhaps even their voice types. The task is somewhat more challenging with shorter samples given out of context, like the three below.

Mystery Singer 1

Mystery Singer 2

Mystery Singer 3

Even out of context, humans can usually differentiate between singers' voices by the timbre, or quality, of the voice. One of the most important components of what a human listener perceives as timbre is the spectral envelope of the singer's voice. Using the peaks of that envelope to identify a singer is particularly useful because, as Khine, Nwe, and Li explain, "studies suggest that timbre is invariant with an individual singer."
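To make the envelope idea concrete, the following is a minimal MATLAB sketch of estimating a spectral envelope with linear prediction; the peaks of the envelope approximate the formants. The filename and LPC order here are illustrative assumptions, not part of our pipeline.

    % Estimate the spectral envelope of a sung vowel with linear prediction.
    % 'lpc' and 'freqz' are in the Signal Processing Toolbox; the filename
    % and model order are hypothetical.
    [x, fs] = audioread('singer1_vowel_a.wav');  % hypothetical sample file
    x = x(:, 1);                                 % use the first channel

    p = 18;                        % LPC order; roughly fs/1000 plus a few
    a = lpc(x, p);                 % all-pole model of the vocal tract
    [h, f] = freqz(1, a, 1024, fs);

    plot(f, 20*log10(abs(h)));     % envelope peaks approximate the formants
    xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');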

This project explores the application of machine learning to develop an automated singer identifier, largely following the methodology of Wakefield and Bartsch. There are many applications for this kind of automated singer identification, including:

  • Source differentiation in audio recordings
  • Identification of a singer's presence on an unlabeled track

One of the problems in developing such a classifier is how to quantify the timbral features of an audio recording for training and classification. In this project, we compare the performance of composite transfer functions (CTFs), as proposed by Wakefield and Bartsch, and Mel-frequency cepstral coefficients (MFCCs), as used by Logan.

Data Set

We recruited volunteer singers to be recorded and included in our data set: a total of 11 classically trained singers from Northwestern University's Bienen School of Music, consisting of 3 sopranos, 4 mezzo-sopranos, 2 tenors, and 2 baritones.

Each singer was asked to sing:

  • The first 5 notes of a major scale,
  • repeated in the low, middle, and high regions of the singer's range,
  • repeated for each of the 5 common Italian vowels [a], [e], [i], [o], and [u].

A single scale sounded like this (recorded at 44.1 kHz, 16 bits/sample):

These scales were then manually split into single-pitch, 1-second samples, for a total of 825 samples (75 per singer).
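For readers who want to reproduce the segmentation, the sketch below shows one way to cut a scale recording into 1-second clips in MATLAB. In our project the splitting was done by hand, so the filename and note onsets here are hypothetical placeholders.

    % Cut a recorded scale into 1-second, single-pitch samples.
    % The filename and onset times are hypothetical; we segmented by hand.
    [x, fs] = audioread('soprano1_scale_a.wav');

    onsets = [0.2 1.5 2.8 4.1 5.4];          % note onsets in seconds (made up)
    for k = 1:numel(onsets)
        i0 = round(onsets(k) * fs) + 1;
        seg = x(i0 : i0 + fs - 1, :);        % exactly one second of audio
        audiowrite(sprintf('soprano1_a_note%d.wav', k), seg, fs);
    end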

Download our data set (66 MB)

Workflow

Before machine learning can occur, the necessary features must be extracted from the raw audio files.

  1. The Matlab Signal Processing Toolbox was used to calculate a spectrogram for each signal.
  2. The spectral envelope was then characterized using either an approximation of the CTF or code from Ellis for generating MFCCs.
  3. Principal component analysis was performed on the outputs to generate data readable by LibSVM (see the sketch below).
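The sketch below illustrates these steps for a single sample, using Ellis's melfcc function (from his rastamat package) for the MFCC branch. The filename, argument values, and the choice to average coefficients over frames are our illustrative assumptions, not a specification of the exact pipeline.

    % Feature extraction for one 1-second sample (illustrative values).
    [x, fs] = audioread('sample001.wav');           % hypothetical filename

    % 1. Spectrogram via the Signal Processing Toolbox.
    [S, F, T] = spectrogram(x, hamming(1024), 512, 1024, fs);

    % 2. MFCCs via Ellis's melfcc (argument names assumed from rastamat).
    C = melfcc(x, fs, 'numcep', 13);                % one column per frame
    feat = mean(C, 2)';                             % one row per sample

    % Collect 'feat' rows for all 825 samples into a matrix X, then:
    % 3. Principal component analysis (Statistics Toolbox) before LibSVM.
    % [coeff, score] = pca(X);                      % 'score' feeds the SVM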

Learning Method

  • The WEKA software package was used to perform the learning tasks.
  • LibSVM was used with a quadratic (degree-2 polynomial) kernel.
  • Ten-fold cross-validation was used to evaluate performance.
  • Samples were classified on three different levels:
    1. Individual singers
    2. Voice Type
    3. Sex
  • Performance was measured as the percentage of correctly classified samples (see the sketch below).
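We ran the learning tasks through WEKA, but an equivalent experiment can be sketched directly with LibSVM's MATLAB interface. Here X (the PCA-reduced feature matrix) and y (numeric labels at one of the three classification levels) are assumed to exist.

    % Ten-fold cross-validation with a quadratic (degree-2 polynomial) kernel.
    % 'svmtrain' is LibSVM's MATLAB interface (not MATLAB's built-in);
    % '-t 1 -d 2' selects the degree-2 polynomial kernel, '-v 10' requests
    % ten-fold cross-validation, which returns accuracy as a scalar.
    % y: double column vector of class labels; X: double feature matrix.
    acc = svmtrain(y, X, '-t 1 -d 2 -v 10');
    fprintf('Correctly classified: %.1f%%\n', acc);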

Results

Composite Transfer Functions vs. Mel Frequency Cepstral Coefficients

The plots above show confusion matrices for the CTF approximation method (left) and the MFCC method (right). It is clear that MFCCs were more successful at classifying our samples.
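A confusion matrix like the ones shown can be computed from the cross-validated predictions, as in this small sketch; 'confusionmat' is in the Statistics Toolbox, and yTrue/yPred are assumed vectors of true and predicted labels.

    % Rows are true classes, columns are predicted classes.
    CM = confusionmat(yTrue, yPred);
    imagesc(CM); colorbar;
    xlabel('Predicted class'); ylabel('True class');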

Classification Levels

The box plots at right summarize our learner's ability to classify data samples at one of three levels: individuals (top left), voice type (top right), and sex (bottom left). As the classification becomes more general, the learner is more successful. There are several interesting conclusions to be drawn:

  • The learner is very good at identifying sex.
  • Voice type and individual identification are more challenging.

Further investigation revealed that the majority of voice type and individual errors occurred amongst the female samples, as can be seen in the individual confusion matrix and the voice type confusion matrix (bottom right). The reasons for this are of great interest to us.

Discussion

Sparse spectra

As Wakefield and Bartsch explain, the female voice typically has a higher fundamental frequency, so its harmonics are spaced farther apart and sample the spectral envelope at fewer points; the envelopes generated by the female singers are therefore sparser than those of the male singers. These sparse spectra likely contribute to our difficulty in classifying female voices. The plot below shows the CTFs of two of the audio samples given above, one from a soprano and one from a mezzo-soprano, and demonstrates how their similarity could cause confusion.

Soprano1 in red; Mezzo1 in blue
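A plot along these lines can be approximated by overlaying the magnitude spectra of one soprano and one mezzo-soprano sample, as in the sketch below; the filenames and display range are assumptions, and this raw FFT view is a stand-in for our actual CTF computation.

    % Overlay magnitude spectra of two samples to visualize spectral sparsity:
    % the soprano's more widely spaced harmonics leave more of the envelope
    % unsampled. Filenames are hypothetical.
    [s1, fs] = audioread('soprano1_note.wav');
    [s2, ~]  = audioread('mezzo1_note.wav');

    nfft = 8192;
    P1 = 20*log10(abs(fft(s1(:,1), nfft)));
    P2 = 20*log10(abs(fft(s2(:,1), nfft)));
    f  = (0:nfft/2-1) * fs / nfft;

    plot(f, P1(1:nfft/2), 'r', f, P2(1:nfft/2), 'b');
    xlim([0 5000]); xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
    legend('Soprano1', 'Mezzo1');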

Voice maturity

The age and maturity of our sample set must also be considered in evaluating the results. A young woman's voice can continue to change and develop into her 30s, so it is not uncommon for young female singers to change voice types early in their careers. Our vocalists are all under the age of 30, and our youngest singer, Mezzo3, is currently considering transitioning to soprano repertoire. Our results are therefore more vulnerable to voice-type discrepancies than a sample set of fully developed vocalists would be.

Future research opportunities

  1. Could machines trained on mature singers help voice teachers identify the voice types of young students, particularly young women whose voices are still developing?
  2. How would a learner perform if the sample set contained more specific fachs, such as coloratura soprano, lyric soprano, spinto soprano, and soubrette?