Reading · Speaker comparison

Voice on Trial: Forensic Speaker Comparison

A threat left on a voicemail, a ransom call, a confession caught on a jail phone, and a voice the prosecution says is the defendant’s. It feels like the most human evidence in the room, because you can hear it. That is the danger. These six moves are how a cross-examiner tests a voice identification, and how a careful examiner hands the court a number it can actually use.

15 min read · Based on the forensic voice-comparison literature
I

The voiceprint was never a print

In 1966 Lawrence Kersta walked into a courtroom as the first person ever to testify as a voice identification expert, and he told the court that his "voiceprints" were as reliable as fingerprints. Poza and Begault record the foundation he stood on: a study published in Nature built on novice subjects, high school girls, who scored a 0 to 3 percent error rate matching single words spoken in isolation from twelve speakers. They spell out why that number means almost nothing. The set was closed, the subjects could roam freely among the exemplars until they found the best matches, and there was no forced choice from an open set where you do not know how many matches exist. Kersta also manufactured and sold the spectrographs and profited from certifying the examiners who used them. The word "voiceprint" did the real work. It smuggled in fingerprints: a unique, fixed, infallible pattern. Voice is none of those things.

The science never caught up to the brand. Oscar Tosi's Michigan State research, funded by the Justice Department in 1968 to settle the controversy Kersta started, reported false identification rates of 2 to 6 percent and false elimination rates of 5 to 12 percent under laboratory conditions. Tosi then argued that real cases would do better, the "Tosi Extrapolation," because experienced examiners decline the hard ones. Poza and Begault take that apart: the decision to proceed rests on a subjective read of data quality, with no objective criteria, so no amount of experience guarantees fewer false identifications. Real conditions cut the other way, with imperfect channels and higher intra-speaker variability driving error up. When the National Academy of Sciences reviewed the whole field in the late 1970s, it concluded that available error-rate estimates "do not constitute a generally adequate basis" for a court to judge the reliability of aural-visual voice identification. Poza and Begault, writing in 2005, say that statement still holds.

In darkness, a glowing magenta spectrogram of a voice on one side and a faintly glowing concentric ridge pattern on the other, a wide black gap of empty air holding them apart so they never merge.
Fig. 1 · A spectrogram beside a print-like ridge pattern. The word "voiceprint" tried to fuse them; the science, and even the court, kept them apart.

Set United States v Williams (2d Cir. 1978) beside that and feel the gap. The court affirmed spectrographic voice evidence, relying on Tosi's 6.3 percent false identification rate and on an analogy to handwriting and gun-barrel striations. It called the critical step "the simple step of visual pattern-matching, a step easily comprehended and evaluated by a jury." That is precisely the move Poza and Begault warn against: pattern matching by eye is a subjective gestalt judgement, and there is no scientific data showing that a trained examiner aurally discriminates voices better than a layman. The expert, Lundgren, claimed thirteen matches against a ten-match standard set by his own trade association. The one thing the Williams court got right appears in its footnotes. A spectrogram, it wrote, "has been often called a voiceprint. We avoid the term as potentially leading to an unwarranted association with fingerprint evidence."

In the box, "I matched the voiceprints" is the sentence that ends a career, because the method behind it was validated on schoolgirls in a closed set and the only word that ever made it sound like science was the one the court itself refused to use.

A spectrogram has been often called a "voiceprint." We avoid the term as potentially leading to an unwarranted association with fingerprint evidence.
United States v Williams, 583 F.2d 1194 (2d Cir. 1978), n.5
Challenge 01 · Put it to the test

Show me the study

Counsel holds up the witness’s report, where the spectrograms are said to "match," and asks for the evidence behind the method.

The question

"You testified that the spectrograms ‘match.’ Can you point me to a single published study, on case-realistic recordings, that establishes the error rate of your eye-matching method, or did the National Academy of Sciences conclude in the 1970s that no such adequate basis exists?"

II

Your number answers the evidence, not the verdict

In 2009 Geoffrey Stewart Morrison argued in Science and Justice that forensic voice comparison is mid-paradigm-shift, the same one DNA went through in the 1990s, away from the expert who declares a match and toward a likelihood ratio. The likelihood ratio answers one question: how much more probable are the differences between the suspect recording and the offender recording if the same person spoke both than if two different people did. An LR of 100 says the evidence is 100 times more probable under the same-speaker hypothesis; an LR of 1 over 1000 says it is 1000 times more probable under different speakers. The numerator is similarity, the denominator typicality, and similarity on a common feature means little, because a random pair would match just as well.
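
Written out, the definition above looks like this. The symbols are shorthand chosen here rather than taken from Morrison: E is the measured evidence from the two recordings, and H_s and H_d are the same-speaker and different-speaker hypotheses.

```latex
\mathrm{LR} = \frac{p(E \mid H_{s})}{p(E \mid H_{d})}
\qquad
\mathrm{LR} = 100 \;\Rightarrow\; p(E \mid H_{s}) = 100 \, p(E \mid H_{d})
\qquad
\mathrm{LR} = \tfrac{1}{1000} \;\Rightarrow\; p(E \mid H_{d}) = 1000 \, p(E \mid H_{s})
```

Both conditionals run from hypothesis to evidence. Nothing in the ratio says how probable either hypothesis is.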

The danger is the next sentence out of your mouth. An LR is the probability of the evidence given the hypotheses, not the probability of the hypotheses given the evidence; saying it is 100 times more likely they are the same speaker inverts the conditional and commits the prosecutor's fallacy. Reaching guilt needs Bayes' theorem and a prior that belongs to the court, not to you. Morrison's 2013 calibration tutorial shows early systems produce a score, not an interpretable LR, until calibration shifts and scales it; in one Chinese-vowel example calibration dropped the cost Cllr from 1.802 to 0.750, where lower is better, rescuing a badly overconfident system. Counsel will also ask: human or computer. Morrison splits the field into auditory listeners, acoustic-phonetic measurers, and fully automatic systems, and in Cambier-Langeveld's 2004 to 2005 exercise, twelve analysts examined the same samples and only four reported a likelihood ratio. Know which camp your number came from, and state clearly what it is and is not.
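
A minimal sketch of why the prior matters, using the odds form of Bayes' theorem (posterior odds = likelihood ratio × prior odds). The function name and the three priors are hypothetical, picked only to show the spread; in a real case the prior belongs to the court.

```python
def posterior_probability(likelihood_ratio: float, prior_probability: float) -> float:
    """Same-speaker posterior from an LR and a prior, via the odds form of Bayes' theorem."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical priors, chosen only to show how much the answer depends on them.
LR = 100.0
for prior in (0.5, 0.01, 0.0001):
    print(f"prior={prior:<7} posterior={posterior_probability(LR, prior):.4f}")
```

The same LR of 100 gives a posterior of about 0.99 when the prior is even, about 0.50 when the prior is 1 in 100, and about 0.01 when it is 1 in 10,000. The LR never changed; only the prior did.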

An LR is the probability of the evidence given the hypotheses, not the probability of the hypotheses given the evidence; saying it is 100 times more likely they are the same speaker inverts the conditional and commits the prosecutor’s fallacy.
Morrison (2009), on the likelihood-ratio paradigm
In darkness, a single glowing violet arrow of light points cleanly one way while a hot-magenta arrow points back against it, the reversed arrow brighter and wrong, illustrating an inverted conditional.
Fig. 2 · The likelihood ratio runs one way: evidence given the hypothesis. Flip the arrow and you have the prosecutor’s fallacy.
Challenge 02 · Put it to the test

Does 100 mean it’s him?

Counsel reads the likelihood ratio back from the report and offers a tidy, wrong restatement of it.

The question

"You testified to a likelihood ratio of 100, so it is 100 times more likely my client is the speaker on the recording, correct?"

III

Ask for the Cllr on a test set that looks like your case

In a 2012 New South Wales fraud trial, a forensic practitioner told the jury there was "much to support a hypothesis that the defendant is the speaker," based on what she heard and saw in spectrograms of "no" tokens. She offered no number for how often she got cases like this right. Enzinger and Morrison went back and built a statistical system using the very same features she relied on, second-formant trajectories in /o/ and mean fundamental frequency, then tested it under simulated case conditions: a GSM mobile call degraded through an AMR codec on one side, a reverberant police interview room on the other. The result was a Cllr of 0.834. A system that ignores the recordings entirely and always returns a likelihood ratio of 1 scores a Cllr of exactly 1. Her chosen features, run properly, were barely better than saying nothing.

That number is the whole argument. Morrison's 2011 paper sets out why Cllr, the log-likelihood-ratio cost, is the right yardstick for a likelihood-ratio system. Lower is better, 1 is the score of a useless system, and it is gradient: a likelihood ratio of a million in favour of a contrary-to-fact hypothesis is penalised far more harshly than a 10 in the wrong direction, because the confident-and-wrong answer does more damage to a verdict. Crucially, Morrison hammers one point: the measured validity "depends both on the system and the test set." His worked example: if the offender recording is a one-minute mobile call and the suspect recording is five minutes of microphone-recorded interview answers, every test pair must be a one-minute mobile call against a five-minute interview recording. A Cllr from pristine studio audio tells you nothing about a case fought over a noisy phone call.
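
A minimal sketch of that yardstick, assuming the standard definition of Cllr from the speaker-recognition literature. The function name and the toy likelihood ratios are illustrative and do not come from Morrison's paper.

```python
import math
from typing import Sequence


def cllr(same_speaker_lrs: Sequence[float], different_speaker_lrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost: lower is better, and always answering LR = 1 scores 1."""
    # Same-speaker pairs are penalised when their LR is small.
    ss_penalty = sum(math.log2(1.0 + 1.0 / lr) for lr in same_speaker_lrs) / len(same_speaker_lrs)
    # Different-speaker pairs are penalised when their LR is large.
    ds_penalty = sum(math.log2(1.0 + lr) for lr in different_speaker_lrs) / len(different_speaker_lrs)
    return 0.5 * (ss_penalty + ds_penalty)


# Toy values, for illustration only.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])   # system that ignores the recordings
mildly_wrong = cllr([100.0], [10.0])           # different-speaker pair given LR 10
confidently_wrong = cllr([100.0], [1e6])       # different-speaker pair given LR one million
```

The uninformative system scores exactly 1. The mildly wrong answer costs about 1.7, and the confidently wrong one costs close to 10, which is the gradient penalty at work.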

The same Enzinger and Morrison study shows how much the system itself matters once conditions are fixed. On identical test data, the automatic MFCC system (a GMM-UBM) scored a Cllr of 0.332, vastly better than the 0.834 from the acoustic-phonetic system built on the practitioner's features. Performance is not a property of "voice comparison." It is a property of this method on this kind of recording, and you cannot read it off a CV.

A dark monitor in a recording booth glowing with two smooth cumulative curves crossing near the centre, a Tippett plot, one curve wrapped in a softly glowing magenta credible-interval band, the only light in the frame.
Fig. 3 · The Tippett plot, the number made visible: two cumulative curves and the credible-interval band that says how far the system can be trusted.

Kelly and colleagues in 2019 ran the commercial VOCALISE system through forensic_eval_01, a publicly released case-like dataset (223 test recordings, 61 speakers, 111 same-speaker and 9720 different-speaker comparisons). Same software, six configurations, and the Cllr-pooled ranged from 0.246 for a condition-adapted x-vector setup to 0.462 for the built-in versions. Feeding the system case-relevant data to adapt it nearly halved the cost. "Off the shelf" and "tuned to the case" were not the same tool.

So the cross-examination has a single anchor. Ask the examiner: what is the Cllr of your system, measured on a validation set whose recordings match the duration, channel, language, and speaking style of the recordings in this case? Ask to see the Tippett plot, which shows the cumulative proportion of same- and different-speaker test comparisons against log likelihood ratio, with the credible-interval bands Morrison insists on. If the answer is experience, training, or a "trained eye," you have a witness describing a method whose error rate, in the words of the Angleton ruling Enzinger and Morrison quote, "is unknown and may vary considerably, depending on the conditions of the particular application." Daubert's question was whether the technique "can be (and has been) tested." A number on case-realistic data is the only answer that survives scrutiny.
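
To make the Tippett plot less of a black box, here is a sketch of the numbers behind its two curves. It assumes one common plotting convention (same-speaker curve as the share of results at or below each log10 LR, different-speaker curve as the share at or above it) and leaves out the credible-interval bands. The function name and the validation results are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def tippett_points(
    same_speaker_lrs: Sequence[float],
    different_speaker_lrs: Sequence[float],
    grid: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each log10(LR) value x on the grid, return
    (x, share of same-speaker LRs <= x, share of different-speaker LRs >= x)."""
    ss = [math.log10(lr) for lr in same_speaker_lrs]
    ds = [math.log10(lr) for lr in different_speaker_lrs]
    points = []
    for x in grid:
        ss_share = sum(v <= x for v in ss) / len(ss)
        ds_share = sum(v >= x for v in ds) / len(ds)
        points.append((x, ss_share, ds_share))
    return points


# Made-up validation results, for illustration only.
ss_lrs = [0.5, 8.0, 40.0, 300.0]
ds_lrs = [0.001, 0.02, 0.3, 6.0]
curve = tippett_points(ss_lrs, ds_lrs, grid=[-3, -2, -1, 0, 1, 2, 3])
```

At log10(LR) = 0 both shares are 0.25 in this toy data: a quarter of the same-speaker comparisons fell below an LR of 1, and a quarter of the different-speaker comparisons fell above it. Those are the misleading results a jury should be shown.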

A system which always responded with a likelihood ratio of 1, and therefore gave no information to assist the trier of fact in making their decision, would result in a Cllr-pooled value of 1.
Enzinger & Morrison (2017), on the acoustic-phonetic system’s 0.834
Challenge 03 · Put it to the test

What’s your Cllr?

The witness has concluded the recordings offer "substantial support" for the prosecution. Counsel asks what that conclusion is worth.

The question

"You concluded the recordings offer ‘substantial support’ that my client is the speaker. What is the Cllr of your method, measured on test recordings that match the duration, channel, and speaking style of the two recordings in this case, and if you have never measured it, how is the jury to know whether your method performs better than a system that ignores the recordings and always answers with a likelihood ratio of 1?"

IV

Three places the number falls apart

A likelihood ratio looks like a fact. It is the product of three judgements, and a cross-examiner who understands that can take your number apart in front of the jury without ever disputing your maths.

The first judgement is the relevant population. The denominator of your LR asks how typical the offender's voice is among "other speakers," and you have to decide who those speakers are. Hughes and Foulkes showed in 2015 just how much that choice moves the result. They ran the same /eɪ/ vowel data from New Zealand English through systems built on different reference populations: one Matched to the offender's social class or age, one Mismatched, one Mixed. Validity tracked the choice. For class, the Matched system gave a Cllr of 0.513 while the Mismatched gave 0.659, the worst of the set; for age, Matched 0.561 against Mismatched 0.712. Worse for the witness box, individual comparisons swung hard. The root-mean-square difference between Matched and Mismatched different-speaker LRs reached 1.118 on a log10 scale, more than an order of magnitude on a single comparison. And the bind they name is exactly this: "one cannot know for certain the sociolinguistic community to which the offender belongs." You picked a population without knowing who the offender was, because his identity is the question. Counsel will ask you to defend that pick, and the truthful answer is that you guessed, in good faith, from a voice.
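
To see what 1.118 on a log10 scale means in plain likelihood-ratio terms, undo the logarithm. This is a back-of-envelope conversion of the root-mean-square figure, so it describes the typical size of the shift, with individual comparisons moving by more or less.

```latex
10^{1.118} \approx 13.1
```

A reported LR of 100 could, on that scale, as easily have been about 8 or about 1,300, depending only on which reference population was chosen.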

The second judgement is calibration and condition match. Morrison's 2017 paper dissects a real 2017 New South Wales case. A practitioner ran a GMM-UBM system, trained the numerator on the known-speaker recording (a jail landline call, reverberant, full of other voices, saved as MP3) and trained the denominator on pristine AusTalk recordings made with a head-mounted microphone. The questioned recording was a mobile call from a phone wedged under a mattress. Three different kinds of audio, three different mismatches, and no calibration step at all. When Morrison replicated the system, its Cllr came out at 0.957. A useless system, one that always says "1," scores 1. This one gave almost no information and it was biased: its "likelihood ratios" ran on average 18 percent higher than the unbiased variant. Fixing the reference condition and adding calibration pulled Cllr down to 0.674. The per-utterance numbers the practitioner reported, between 1.6 and 5.5, looked like evidence and were mostly noise.
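
What a calibration step does can be sketched in a few lines: it shifts and scales raw scores so they behave like log likelihood ratios. The sketch below assumes class-balanced logistic regression fitted by plain gradient descent. The function names, the toy scores, and the choice to fit and evaluate on the same data are simplifications made here for brevity, not Morrison's procedure. A real system would calibrate on separate recordings that match the case conditions.

```python
import math
from typing import Sequence, Tuple


def cllr_from_ln_lrs(ss_ln_lrs: Sequence[float], ds_ln_lrs: Sequence[float]) -> float:
    """Cllr computed from natural-log likelihood ratios."""
    ss = sum(math.log2(1.0 + math.exp(-x)) for x in ss_ln_lrs) / len(ss_ln_lrs)
    ds = sum(math.log2(1.0 + math.exp(x)) for x in ds_ln_lrs) / len(ds_ln_lrs)
    return 0.5 * (ss + ds)


def fit_calibration(
    ss_scores: Sequence[float],
    ds_scores: Sequence[float],
    steps: int = 20000,
    rate: float = 0.01,
) -> Tuple[float, float]:
    """Fit ln(LR) = shift + scale * score by class-balanced logistic regression."""
    shift, scale = 0.0, 1.0
    for _ in range(steps):
        grad_shift = grad_scale = 0.0
        for scores, target in ((ss_scores, 1.0), (ds_scores, 0.0)):
            weight = 0.5 / len(scores)  # each class carries half the total weight
            for s in scores:
                p = 1.0 / (1.0 + math.exp(-(shift + scale * s)))
                grad_shift += weight * (p - target)
                grad_scale += weight * (p - target) * s
        shift -= rate * grad_shift
        scale -= rate * grad_scale
    return shift, scale


# Toy overconfident scores (hypothetical): large magnitudes, one error on each side.
ss_scores = [8.0, 6.0, -4.0, 10.0]
ds_scores = [-9.0, -7.0, 5.0, -11.0]

shift, scale = fit_calibration(ss_scores, ds_scores)
before = cllr_from_ln_lrs(ss_scores, ds_scores)
after = cllr_from_ln_lrs(
    [shift + scale * s for s in ss_scores],
    [shift + scale * s for s in ds_scores],
)
```

Read as natural-log likelihood ratios, the raw toy scores cost more than 1, which is worse than saying nothing. The fitted scale comes out below 1, pulling the overconfident values toward zero, and the cost drops below 1.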

The third soft spot is that experts diverge and some overstate. Cambier-Langeveld's 2007 collaborative exercise sent one constructed case to twelve experts in ten countries. On Q10, an 18-second clip from a different speaker, three participants concluded it matched the reference. The automatic systems, asked for numbers, produced LRs for true matches between roughly 40 and 40,000, and for true mismatches from 4.36 × 10⁻²³ up to about 6. That last figure is weak support for the wrong answer, delivered with a straight face. Same recordings, same question, answers all over the map.

A three-legged microphone stand glowing faintly violet in the dark, two legs solid but the third buckling and flaring hot magenta where it fails, the whole stand and its mic tilting toward collapse.
Fig. 4 · Three judgements hold the number up: population, calibration, condition match. Let one fail and the whole likelihood ratio tips.

The cross will be direct. What relevant population did you use, and how did you know the offender belonged to it. Was your system calibrated. Did your validation data come from a telephone, a codec, a noise floor, and a duration that match this case, or from a pristine studio. If any answer is soft, the number is soft.

A system which gives no useful information, a system that always outputs a likelihood ratio of 1 irrespective of the input, will have a Cllr of 1. The first variant of the system had a Cllr very close to 1.
Morrison (2017), on the replicated NSW practitioner’s system (Cllr 0.957)
Challenge 04 · Put it to the test

How did you pick the population?

The witness reported a likelihood ratio in the hundreds. Counsel goes straight at the typicality term it depends on.

The question

"You reported a likelihood ratio in the hundreds. Hughes and Foulkes showed that simply changing the reference population shifted an individual comparison by more than an order of magnitude, and they say you cannot know for certain which community the offender belongs to. So how did you choose the population whose typicality your denominator depends on, and what does my client’s number become if you chose wrong?"

V

Bias, and the witness who never heard the suspect

Two human problems lie behind the spectrograms and the likelihood ratios, and counsel can attack both. They are not the same problem, and a witness who lets them blur together hands the cross-examiner a gift.

The first lives inside the expert. Kukucka, Kassin, Zapf and Dror surveyed 403 forensic examiners from 21 countries in 2017 and asked them, among other things, how accurate their own judgements were. The mean estimate was 96 percent, and 148 of them, 37 percent of the sample, said their own work was 100 percent accurate. That is not how human perception works under irrelevant context. The same survey found a tidy staircase of denial: 71 percent agreed cognitive bias is a concern in forensic science generally, only 52 percent saw it as a concern in their own domain, and just 26 percent thought their own judgements were affected. That gap is the bias blind spot, recognising the flaw in your peers while exempting yourself. Worse for the courtroom, 71 percent believed an examiner can reduce bias by simply trying to set expectations aside, and examiners split 49 to 31 on whether they should even be blinded to irrelevant context. The authors call willpower the wrong cure, because bias operates automatically and without awareness. The fix they point to is procedural: Linear Sequential Unmasking, controlling what information reaches the examiner and when, which adopting labs reported was neither onerous nor expensive.

In the box, that science cuts one way. If a voice examiner knew the police theory, the suspect's record, or that a confession existed before forming a view, counsel can ask what was done to wall that off. "I set it aside" is, on this literature, the answer of someone who has not understood the threat.

The second problem belongs to lay listeners, and it is a separate and far weaker thing. Yarmey's 1995 review of earwitness research opens with Isaac in Genesis, who heard Jacob's voice correctly, distrusted it, and identified the wrong son. Memory for an unfamiliar voice is fragile. Yarmey and colleagues in 1994 ran the only direct comparison of one-person and six-person voice line-ups: hits were poor in both, and innocent foils were falsely picked more often in the one-person showup. Clifford and Denot found voice identification falling from 50 percent to 9 percent accuracy across one to three weeks of delay. Whisper disguise, which hides pitch and intonation, gutted accuracy in Orchard and Yarmey's 1995 study, and the more confident witnesses were about a distinctive voice, the less accurate they were. Across Yarmey's own studies the accuracy-confidence correlation hovered near .25, too low to postdict anything. Voices, Yarmey notes, are not faces: people who could pick a high-school classmate's face out years later could not reliably sort familiar from unfamiliar voices even after seven speech samples.

In the dark, a single pair of studio headphones with the clean violet voice-signal flowing into one ear cup while a second, unwanted stream of hot-magenta sound-light bleeds in from outside and contaminates the same ear, two signals merging where only one should reach the listener.
Fig. 5 · Bias is contamination of the listening channel: an extra signal reaching the ear that should never have entered it. Controlling what you are told, and when, is the fix willpower is not.

Keep the two worlds apart. The danger is letting opposing counsel collapse your instrumented, calibrated, system-based comparison into the same bucket as a frightened bystander who heard a masked man for fifteen seconds, or letting them suggest your expertise immunises you from the bias the earwitness suffers. Different mechanisms, different remedies, different reliability.

The mean estimate was 96 percent, and 148 of them, 37 percent of the sample, said their own work was 100 percent accurate. That is not how human perception works under irrelevant context.
Kukucka, Kassin, Zapf & Dror (2017)
Challenge 05 · Put it to the test

What did you know before you listened?

Counsel turns from the method to the mind running it, and asks what the witness knew before forming a view.

The question

"Doctor, you told the jury your method is rigorous, yet a survey of 403 forensic examiners found most thought willpower alone could neutralise bias and 37 percent rated themselves 100 percent accurate. Before you formed your opinion in this case, did you know the police theory, the suspect’s record, or that there was a confession, and what specific procedure walled that information off from your analysis?"

VI

Is the recording even of a real person?

For years the threshold question in a voice case was "who is speaking?" A second question now comes ahead of it: is anyone speaking at all, or is this a clone? Liu and colleagues, writing up the ASVspoof 2021 challenge, report submissions from 54 teams across three tasks, including a new deepfake task aimed at fabricated speech "in the voice of a target speaker," the kind posted to social media to harm a reputation or spread disinformation. The headline result for the witness box is not that detectors work. It is how badly they break when the attack is unfamiliar. On the deepfake progress data, 23 of 33 systems scored equal error rates under 10 percent, and the best dipped below 1 percent. On the evaluation data, every single system exceeded 15 percent. Same systems, harder material, and the floor falls out.
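
The equal error rate behind those figures is the operating point where a detector wrongly rejects genuine speech as often as it wrongly accepts fakes. A minimal sketch, assuming higher scores mean "more likely bona fide" and using a simple threshold sweep rather than the challenge's official scoring code. The function name and all scores are invented for illustration.

```python
from typing import Sequence


def equal_error_rate(bona_fide_scores: Sequence[float], spoof_scores: Sequence[float]) -> float:
    """Approximate EER: sweep thresholds over the observed scores and return the
    error rate where false rejection and false acceptance are closest."""
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(bona_fide_scores) | set(spoof_scores)):
        false_reject = sum(s < threshold for s in bona_fide_scores) / len(bona_fide_scores)
        false_accept = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(false_reject - false_accept)
        if gap < best_gap:
            best_gap, eer = gap, (false_reject + false_accept) / 2.0
    return eer


# Invented detector scores, for illustration only.
bona_fide = [0.9, 0.8, 0.7, 0.4]
familiar_spoofs = [0.1, 0.2, 0.3, 0.35]    # attacks resembling the training data
unfamiliar_spoofs = [0.2, 0.5, 0.75, 0.85]  # attacks the detector never saw

eer_familiar = equal_error_rate(bona_fide, familiar_spoofs)
eer_unfamiliar = equal_error_rate(bona_fide, unfamiliar_spoofs)
```

On the familiar attacks the toy detector separates the classes perfectly, an EER of 0. On the unfamiliar ones the same detector lands at 0.5, no better than chance. The real challenge figures are less extreme, but the shape of the failure is the same.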

The reason matters because counsel will press on it. The 2021 organisers found that some detectors had inadvertently learned the wrong thing: the length of silence at the start and end of a clip, an artifact of how the training corpus was built, rather than any acoustic signature of synthesis. Strip the non-speech with a voice-activity detector and performance "is substantially degraded." A countermeasure that keys on a database quirk is, as the paper warns, unlikely to "lead to reliable detection in the wild." So when you say a recording is human speech, you are relying on a tool whose generalisation to a novel cloning method, the one your opponent will name, is truly unknown. The candid answer to "could a spoofing detector miss a clone built with a 2024-era tool it never saw?" is yes, and the literature says so.

There is a subtler hazard a voice expert may be dragged into: language analysis for the determination of origin, or LADO, where an analyst is asked to judge an asylum seeker's nationality or home region from how they speak. Patrick and Fraser have led a sustained critique, and the core problem is structural. Language varieties do not respect national borders, people accommodate and shift across a lifetime of displacement, and the "expert" is often a native-speaker informant with no training in linguistics. A confident verdict that someone "is not from country X" can decide a deportation, and the evidentiary base for that verdict is thin. If you are asked to opine in that arena, the defensible position is usually refusal, or a tightly bounded statement about features rather than a claim about origin.

Two long glowing violet waveforms stacked in the dark, nearly mirror-identical, one real and one a synthetic copy, with a single fabricated peak in the lower copy flaring hot magenta, hiding among hundreds of honest spikes.
Fig. 6 · A real utterance and a cloned one, laid side by side. The forged peak hides in plain sight, and the detector may have learned the wrong tell.

Pull these strands together and the admissions a competent witness should be ready to make come into focus. You cannot guarantee a recording is a real human utterance rather than a clone. Your spoofing detector was validated against known attacks and may not generalise to new ones. Your speaker-comparison method assumes the questioned sample is real, and that assumption is now contestable. You do not infer nationality from speech. The partial spoof is the sharpest version of the threat: the 2021 authors note that swapping a single phrase, "I won the election" for "I lost the election," can flip a recording's meaning while leaving most of it bona fide and most detectors blind. Concede what you cannot defend, and the things you can defend will carry more weight in the record.

While 23 (out of 33) systems have EERs of less than 10% for the progress subset, and while the best performing system even has an EER of less than 1%, all have EERs exceeding 15% for the evaluation set.
Liu et al. (2023), ASVspoof 2021, Sec. III-C
How firmly does each voice-evidence move actually stand? (Chart, ranked from strongest to weakest footing.)
  • Calibrated LR with a Cllr on case-matched recordings
  • Automatic system, but validated on mismatched audio
  • Acoustic-phonetic features without calibration
  • Aural-spectrographic "voiceprint" eye-matching
  • Lay earwitness identification of an unfamiliar voice
  • Inferring nationality from speech (LADO)
Legend: Validated · Instrument-based · Subjective comparison

Relative scientific footing across the six moves in this reading, from a calibrated likelihood ratio on case-matched recordings down to claims the literature cannot support. Bar length is illustrative, not a measured metric.

What to carry into the witness box
  • 01 · "I matched the voiceprints" is the sentence that ends a career. Spectrogram eye-matching was never validated, and even the court that admitted it refused the fingerprint analogy the word smuggles in.
  • 02 · A likelihood ratio answers the evidence, not the verdict. It is not the probability your suspect is guilty, and saying so is the prosecutor’s fallacy. It means nothing until the system is calibrated.
  • 03 · A method’s accuracy is a measured number, the Cllr, on test recordings that match this case’s channel, duration, and speaking style. A useless system scores 1, and "off the shelf" is not "tuned to the case." Ask for the number and the Tippett plot.
  • 04 · The number rests on three judgements: the relevant population, calibration, and condition match. Change the population alone and a single comparison can swing more than an order of magnitude. Be ready to defend each choice, including the one you made without knowing who the offender was.
  • 05 · Keep the expert and the earwitness apart. Your instrumented, calibrated comparison is not a frightened bystander’s memory of a masked voice, and your expertise does not exempt you from bias. Willpower is not a debiasing procedure; controlling what you were told, and when, is.
  • 06 · You cannot guarantee the recording is a living human and not a clone. Your spoofing detector was validated on known attacks and may not catch a new one. And you do not infer nationality from speech.
Challenge 06 · Put it to the test

Is it even a human?

The recording in this case is of unknown provenance. Counsel asks the question that now comes before "who is speaking."

The question

"Your countermeasure was tuned against the spoofing attacks in the challenge data. The recording in this case was, you concede, of unknown provenance. Can you tell this jury, to any stated degree of confidence, that it was produced by a living human and not by a voice-cloning tool released after your detector was validated?"
