
Gait, Body & Clothing Comparison: Identifying a Person from an Image
A figure on CCTV, and four claims stacked into one: the walk, the build, the clothes, the height, all said to match the accused. To a jury that sounds like a pile of independent confirmation. It is four feature comparisons, and not one of them has a measured error rate behind it. These six moves are how a cross-examiner pulls them apart, and how a careful examiner says what an image of a moving body can and cannot carry.
Four matches, no error rate
A figure crosses a car park on CCTV. The footage is dark, the frame rate is low, the feet drop out of shot for half the stride. From this an examiner tells the court four things. The way this person walks matches the accused. Their build matches. Their clothing matches. Their height, measured off the frame, matches. Four matches stacked into one confident conclusion. Then defence counsel asks the question the discipline still cannot answer: when you say "match," how often are you wrong? What is your error rate?
There isn't one. That is the finding Ioana Macoveciuc, Carolyn Rando and Hervé Borrion set out in their 2019 review from University College London. Almost two decades after a forensic gait expert first testified to a perpetrator's identity in a UK court, they write that the methods "remain insufficiently robust for use in court." No validation against ground truth. No standardised list of features applied the same way by every examiner. No measured frequency of how common any given feature is in the population. No code of practice had been finalised when they wrote. The conclusion lands hard: courts should treat gait evidence "with caution, as they should any other form of evidence originating from disciplines without fully established codes of practice, error rates, and demonstrable applications in forensic scenarios."
Look at how it actually entered law. The first case, R v Saunders in 2000, brought gait evidence into a courtroom that had never seen it. Rather than earning its place through validation studies and peer review, the authors note, "forensic gait analysis appears to have become a forensic science field because of this first case in which it was applied." When admissibility was finally tested on appeal in R v Otway in 2011, the evidence survived "despite being qualitative in nature, with no empirical support for the applied method and for the conclusions drawn." Experience stood in for an error rate. It still does.
Treat this section as the frame for everything that follows. Every "match" you offer (gait, body build, clothing, height) is a feature comparison delivered without a measured error rate behind it. The Royal Society's own 2017 Primer for Courts puts it directly: on a verbal scale of likelihood, the data from current gait methods would classify as "weak," partly because there is no large UK population database to anchor it. The companion critique by van Mastrigt and Larsen (2018) reaches the same place from the validity side. This is not an attack on any examiner's competence. It is a question about what happens to a verdict when an unvalidated method carries the weight.
The Danish bank robbery case shows the stakes. Police thought a disguised robber "had a unique gait." Larsen and colleagues analysed short footage, hands jammed in pockets, feet partly hidden, and concluded a limp matched, relying on variability drawn from a study of only eleven people. They told the court the analysis "cannot establish the identity of the individual." The defendant was convicted anyway. State your limits clearly and the conviction can still rest on the evidence you tried to qualify. So the work before you is simple to say and hard to do. For each of the four matches, know exactly what you can defend when counsel asks for the number, and what you must concede does not yet exist.
“Forensic gait analysis appears to have become a forensic science field because of this first case in which it was applied.”

Four matches, where is the number?
Counsel lists the four claims back to the witness, slowly, then asks for the one thing none of them carries.
"You told this jury the walk, the build, the clothing, and the height all 'match' the accused. For each of those four matches, what is your method's error rate, and where is it published?"
A feature that will not hold still
For a feature to put a name to a person, three things have to be true. You have to be able to measure it. It has to stay roughly the same within one person across occasions. And it has to differ enough between people to tell them apart. Gait, body build, height and clothing are all sold to juries as if they clear that bar. Walk through what the research actually shows and the middle requirement collapses first.
Start with the camera, because everything else rides on it. Birch and his colleagues (Science and Justice, 2014) took a single man fitted with an ankle foot orthosis, filmed him at 25 frames per second alongside lab-grade Qualisys motion capture, then edited that same walk down to nine frame rates, as low as one frame every four seconds. Twelve experienced podiatrists scored the gait features they could see. The relationship between frame rate and correct identifications was strong and linear: r = 0.868, p = 0.002, with frame rate explaining about three quarters of the variation in scores. Their point about why is the one to keep. Below a certain frame rate you are no longer watching movement. The brain stitches still frames into apparent motion through persistence of vision and the phi phenomenon, so "all perception of movement from CCTV footage is therefore illusionary." Custody-suite footage submitted for analysis has been found as low as 2 fps. At that rate a swinging arm or a flexing knee can happen entirely between frames. The feature you describe as dynamic may exist only in the gaps the camera never recorded.
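To put rough numbers on that last point, here is a minimal sketch of how many frames a camera records during one walking stride. The stride duration of about 1.1 seconds is an illustrative assumption for ordinary walking, not a figure from Birch's study, and the function name is hypothetical.

```python
def frames_per_stride(fps: float, stride_seconds: float = 1.1) -> float:
    """Frames recorded during one full stride (gait cycle).

    ASSUMPTION: stride_seconds = 1.1 is an illustrative value for ordinary
    walking; it does not come from the study discussed above.
    """
    return fps * stride_seconds


# 25 fps is the original recording rate, 2 fps the custody-suite example,
# and 0.25 fps is one frame every four seconds (the lowest rate tested).
for fps in (25, 5, 2, 0.25):
    gap_seconds = 1.0 / fps
    print(
        f"{fps:>5} fps: {gap_seconds:.2f} s between frames, "
        f"about {frames_per_stride(fps):.1f} frames per stride"
    )
```

Under that assumption, 2 fps leaves roughly two frames for a whole stride, and one frame every four seconds leaves less than one. Any arm swing or knee flexion has to be inferred from whatever those few stills happen to catch.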
The body fares no better. Scoleri, Lucas and Henneberg (Forensic Science International, 2014) had three assessors estimate stature from images of the same men shirtless, in a black shirt, in a striped shirt and in a padded leather jacket. The garment alone moved the answer. Comparing the shirtless estimate to the padded-jacket estimate produced a bias of 37.8 mm, and the striped shirt 36.4 mm, on the same body. Posture made it worse: when they ran real airport footage, the technical error against truth ran past 60 mm for the weaker assessors. And stature itself is not fixed, dropping by up to 28.1 mm across a single day. Same person, different shirt, different slouch, different number.
Height and weight taken from a photo are no safer. Thakkar, Pavlakos and Farid (CVPR Workshops, 2022) built a state-of-the-art 3D body model and tested it on simulated figures where the truth was known exactly. Even with a perfect scale estimate, a person of 175 cm and 90 kg could only be pinned to a 95% range of 170 to 177 cm and 69 to 103 kg. With a realistic scale estimate it was no better than guessing the gender average. Their own framing of the problem lists the killers: spinal compression shifts height by up to 2 cm a day, weight fluctuates by up to 2.25 kg, and slouching or walking changes apparent height by up to 6 cm. Pose alone tanked their body-shape classifier from 95% in a neutral stance to roughly 46%.
So the witness-box line writes itself. This is a feature that will not hold still. It moves with the shirt, the posture, the time of day and the frame rate, before you ever compare two people. Say that directly, and own which parts of your opinion describe movement and which merely describe a position frozen in a still.
“All perception of movement from CCTV footage is therefore illusionary, the brain making a series of assumptions as to the way in which an object or any part thereof gets from one location to another.”

Movement, or a frozen posture?
Counsel has already established the footage runs at 2 frames per second, and turns that fact on the gait opinion itself.
"You called this a gait feature, but you have agreed the footage runs at 2 frames per second. At that rate, the movement you described to the jury happens between the frames the camera recorded, so isn't your opinion really a description of a frozen posture passed off as motion?"
Two examiners, one walk, two answers
Birch and his colleagues built the Sheffield Features of Gait Tool to fix exactly the problem counsel will press you on: two examiners watching the same footage, writing down different things. The tool is no back-of-envelope checklist. It runs to 113 features of gait and variances across 14 sections, drawn from a review of 51 real forensic gait reports. In their 2019 study they handed it to 14 experienced gait analysts, all podiatrists or clinicians with at least a year of post-graduation practice, and had them score computer-generated avatars whose walks were known and fixed. The avatars never varied. The clothing never varied. The footage was high frame rate, good resolution, good lighting, the kind of footage you almost never get from a car park camera at 2am.
Even under those ideal conditions, the examiners did not agree with each other or with themselves. Repeatability, the same person scoring the same walk on different occasions, ranged from 94.65% down to 68.35%, mean 79.54%. Reproducibility across examiners came in lower, a mean of 73.45%. Take that the way opposing counsel will. Roughly a quarter of the read is not stable. One of your own colleagues, given the same six seconds, would have ticked different boxes. The authors framed this as "good" levels for a first-of-its-kind tool, and in the clinical literature it stands alongside the Salford Gait Tool and Edinburgh Visual Gait Score. But "good for a subjective observational tool" is not the same as "reliable," and a jury hears the second word, not the first.
Experience does not rescue you. In a 2020 study, Birch and colleagues compared 11 trained gait analysts against 19 members of the public on the same CCTV identification task. The experts got 71.64% correct, the lay people 64.42%, and the difference was not statistically significant (p = 0.29). Worse for the witness box: the inexperienced participants were significantly more confident than the experts (p < 0.05), on both their right and their wrong answers. The authors named it the Dunning-Kruger effect. Confidence is not a reliable signal of accuracy, and the person most sure of the match may be the one least equipped to qualify it.
Then there is what you were told before you looked. Nakhaeizadeh, Dror and Morgan (2014) gave 41 forensic anthropologists one real skeleton and three different stories. With no context, 31% called the remains male. Told by "DNA" the remains were male, 72% called them male. Told the remains were female, not one examiner, 0%, called them male. Same bones, opposite conclusions, driven entirely by a sentence read out before the assessment. Context, they wrote, can override "the actual physical evidence present."
Gait is the softest of the visual reads, and counsel knows it. The lesson carries straight across. The assessment is subjective, it reproduces poorly between examiners even with a structured tool, and it bends toward whatever the investigating officer told you the answer was.
“In the group given female context, 0% of the participants concluded that the remains were male.”

Told the answer before you looked
Counsel ties together what the officer said beforehand and the absence of any reliability figure, and asks for a number.
"Mr Examiner, before you ever looked at the questioned footage, the officer told you the suspect had an 'asymmetrical limp,' didn't he? And you found one. Can you tell this jury, with a number, how reliably a second qualified examiner with no knowledge of the case would reach the same conclusion you did?"
Height done right versus the "about 180, same as the accused"
A robbery of a cinema, caught on CCTV, a person standing in an oval drawn on the questioned image. Ivo Alberink and Annabel Bolck, working at the Netherlands Forensic Institute, walk you through exactly what serious height estimation costs. Six test persons of known height are taken back to the crime scene and positioned in front of the same camera, in as close to the perpetrator's pose as possible. A 3D model of the room is built from photographs and fixed location points. Four operators place virtual cylinders over the bodies, feet to head, three times each, in randomized order. Only then does a number come out, and it comes out with a confidence band attached.
The mean measured height of the perpetrator was 166.4 cm. On the test persons, measurement ran systematically 6.3 cm below their true height, with a standard deviation of 1.7 cm. So the estimate of actual height, including head and footwear, is 166.4 plus 6.3, which is 172.7 cm. Then, because only six test persons were used, the band is set with a Student's t distribution, and the 95% interval comes out as 168 to 177.5 cm. The suspect stood 176 cm. He falls inside the interval, so the hypothesis that he is the perpetrator is not rejected. Notice what that sentence does not say. It does not say it is him.
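The arithmetic in that paragraph can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a textbook Student's t prediction interval for a single new observation, and the function and constant names are hypothetical. The critical value 2.571 is the standard two-sided 95% value for 5 degrees of freedom.

```python
import math

# Values reported in the text above for Alberink and Bolck's worked case.
MEASURED_CM = 166.4    # mean measured height of the perpetrator
BIAS_CM = 6.3          # mean shortfall of measurement on the test persons
SD_CM = 1.7            # standard deviation of that shortfall
N_TEST_PERSONS = 6

# Two-sided 95% critical value of Student's t with n - 1 = 5 degrees of freedom.
T_CRIT_95_DF5 = 2.571


def height_interval(measured, bias, sd, n, t_crit):
    """Return (estimate, low, high) for the corrected height in cm.

    ASSUMPTION: a textbook prediction interval for one new observation,
    half-width = t * sd * sqrt(1 + 1/n). The paper's own model may
    include further variance terms that this sketch leaves out.
    """
    estimate = measured + bias
    half_width = t_crit * sd * math.sqrt(1 + 1 / n)
    return estimate, estimate - half_width, estimate + half_width


estimate, low, high = height_interval(
    MEASURED_CM, BIAS_CM, SD_CM, N_TEST_PERSONS, T_CRIT_95_DF5
)
suspect_cm = 176.0
print(f"estimate {estimate:.1f} cm, 95% interval {low:.1f} to {high:.1f} cm")
print("suspect inside interval:", low <= suspect_cm <= high)
```

Under that assumption the band runs from roughly 168 to 177.4 cm, close to the 168 to 177.5 cm reported. The small difference may come from rounding or from terms in the authors' model that the sketch omits. Either way, the band is nearly 10 cm wide and all it tells you about the suspect is that he is not excluded.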
Alberink and Bolck go further and build a likelihood ratio, which weighs how rare the estimated height is in the population against how close it falls to the suspect. In their worked case the LR came out around 2. Their own words: "This constitutes very weak evidence against the suspect." For common suspect heights the obtainable LR tops out near 6. The method, properly applied, produces small numbers and wide bands. That is not a flaw. That is the measurement telling the truth about what a smear of pixels can support.
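To see why such a ratio stays small, here is a minimal sketch of a height likelihood ratio built from two normal densities. It is an illustration of the general idea, not the authors' model. The population mean and spread below are hypothetical placeholders, all function names are invented for the example, and the result is not expected to reproduce the paper's figures.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def height_lr(estimate, suspect, sd_meas, pop_mean, pop_sd):
    """Likelihood ratio for an estimated height (all values in cm).

    Numerator: density of the estimate if the suspect, whose height is
    known, is the person on camera.
    Denominator: density of the estimate if an unknown member of the
    population is on camera; population spread and measurement spread
    add in variance.
    """
    numerator = normal_pdf(estimate, suspect, sd_meas)
    denominator = normal_pdf(
        estimate, pop_mean, math.sqrt(pop_sd ** 2 + sd_meas ** 2)
    )
    return numerator / denominator


# Estimate, suspect height and measurement SD are taken from the text.
# HYPOTHETICAL: pop_mean and pop_sd are placeholders, not data from the paper.
lr = height_lr(estimate=172.7, suspect=176.0, sd_meas=1.7,
               pop_mean=180.0, pop_sd=7.0)
print(f"likelihood ratio under placeholder population values: {lr:.2f}")
```

One design point is worth noting. When the estimate, the suspect and the population mean all coincide, the ratio in this sketch reduces to the combined spread divided by the measurement spread. For a suspect of ordinary height, the support can therefore never exceed the ratio of those two spreads, however good the footage looks.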
The fragility shows next. Two years later, Gerda Edelman, Alberink and Bart Hoogeboom tested the same family of methods on four perpetrators from one fixed camera, where the camera had been nudged between the crime and the reconstruction. When operators reused the same camera match, both projective geometry and 3D modeling predicted accurately. But once new vanishing points or camera matches had to be drawn, projective geometry fell apart. Fourteen of sixteen test sets were rejected under the statistical model. The predicted perpetrator intervals across repetitions sometimes did not even overlap. The 3D modeling method held; only one of sixteen sets came back significant, which is what chance alone produces at a 5% level. Their conclusion was blunt: lens distortion and shifted camera orientations are "very regularly occurring," and you must run validation experiments rather than trust a single pass.
Set that against the casual courtroom line, "about 180 cm, the same as the accused." No reconstruction. No test persons. No interval. No likelihood ratio. No correction for the 6 cm a slouched pose can swallow, no account for footwear, no check on whether the camera even sat where it sat on the night. The rigorous version of this evidence is a range with a t distribution and a small LR. The loose version is a point estimate passed off as a match. When you are in the box, the gap between those two is the whole case.
“This constitutes very weak evidence against the suspect.”

Where is your confidence interval?
Counsel lays the rigorous photogrammetric method beside the witness's casual point estimate and asks for everything the loose version skipped.
"You told the jury the figure was 'about 180 centimetres, the same height as the accused.' In Alberink and Bolck's validated method, the very same kind of estimate carried a systematic correction of more than 6 centimetres for pose, a 95% range nearly 10 centimetres wide, and a likelihood ratio of only about 2, which the authors themselves called very weak evidence. Where is your reconstruction, where are your test persons, and what is the confidence interval around your number?"
When "unique" means "I searched a database and stopped"
One in 1.27 billion. That is the number forensic podiatrists reach for when they say a barefoot impression is individual to a person. It comes from a Royal Canadian Mounted Police study built by Robert Kennedy and colleagues: 5,755 pairs of inked barefoot impressions, with a statistical analysis run on just 19 of a possible 119 measurements, producing a random-match probability of one in 1.27 billion. It sounds like DNA. Read how it was made and it is something else entirely.
Kennedy's own account, written up by Massey and Kennedy, tells you how the matching worked. Foot impressions were scanned, measured, and searched against every other impression in the database. Using exact measurements, after only three to five measurements every foot came up new to the database, even prints from the same donor walking on a different day. So they widened the search. A 5 mm window on each measurement still returned only same-donor pairs. Only when the window opened to 15 mm did two different people ever surface as a "match," and a trained examiner could then tell them apart by eye. That is the engine behind "unique." It is a database that failed to find a second person inside a chosen tolerance. That is not an error rate. It is the absence of a hit in a finite, curated sample.
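The search logic described there can be sketched as follows. This is a toy illustration only: the measurements are invented, the function name is hypothetical, and treating the "window" as plus or minus the stated tolerance on every measurement is an assumption, since the source does not spell out how the window was applied.

```python
def matches_within(query, database, tolerance_mm):
    """Return ids of impressions whose every measurement lies within
    +/- tolerance_mm of the query.

    ASSUMPTION: the window is applied symmetrically to each measurement.
    """
    hits = []
    for impression_id, measurements in database.items():
        if all(abs(q - m) <= tolerance_mm for q, m in zip(query, measurements)):
            hits.append(impression_id)
    return hits


# HYPOTHETICAL toy data: three measurements (mm) per impression,
# invented purely for illustration.
database = {
    "donor_A_day1": (240.0, 95.0, 60.0),
    "donor_A_day2": (242.0, 96.5, 61.0),
    "donor_B": (250.0, 104.0, 68.0),
}

# A further impression from donor A, taken on another occasion.
query = (241.0, 95.5, 60.5)

for tolerance in (0.0, 5.0, 15.0):
    print(f"tolerance {tolerance:>4} mm:", matches_within(query, database, tolerance))
```

With exact matching the query finds nobody, not even its own donor. At 5 mm it finds only the same donor, and at 15 mm a second person appears. An empty result says only that nothing in this particular sample fell inside the chosen window. It gives no rate at which a source conclusion would be wrong.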
And the sample was pristine by design. Kennedy concedes the research used ideal impressions: whole, clear, distinct walking impressions on paper from cooperative donors. Smeared or unclear samples were not used. Crime scene impressions, as he admits, will not meet those criteria, which is why barefoot examiners drop the 1.27 billion figure in casework and switch to a conservative, non-numerical opinion. Counsel should make the witness say that out loud.
Then there is the harder problem, the one Arnold Hu, John Arnold, Ryan Causby and Sara Jones put on the table in their 2018 systematic review. A probability of individuality is only as good as the measurements feeding it, and those measurements have to be repeatable. Hu and colleagues screened 1,340 records, kept 11 studies, and rated overall methodological quality "Poor" to "Fair," with nine of eleven studies rated Poor. Crucially, the two studies describing the Kennedy method were caught in the search but excluded because the reliability of its measurements was never reported. The authors asked for the dataset to compute reliability themselves; it was not available. Their verdict on the technique driving the 1.27 billion claim: "additional testing is required to determine the reliability of foot impression measurements informing this technique, as to date this has not been reported."
Now stand that next to what happens in court. Michael Nirenberg's 2016 case study describes the first Daubert hearing where forensic podiatry was the primary subject, State of Wisconsin v. Travis Petersen. A sock-clad bloody footprint, 11 linear measurements, each within a recognised 5 mm error margin, and the judge ruled it admissible. Daubert is supposed to ask about the known or potential rate of error. The expert had a per-measurement tolerance, not a validated error rate for the conclusion that two prints share a source. The evidence still came in, and the defence, by its own admission, could not find a single article refuting it.
Follow the chain. "Unique" rests on a database that stopped finding matches, the measurements underneath were never shown to be reliable, and the method still clears the courtroom gate. When a witness says individual, ask which of those three things they actually mean.
“additional testing is required to determine the reliability of foot impression measurements informing this technique, as to date this has not been reported.”

A database that stopped, not an error rate
Counsel walks the witness back through how the "unique" number was actually produced, then asks for the real error rate.
"Your one-in-1.27-billion figure: that came from searching a curated database of cooperative donors and not finding a second match within a chosen tolerance, didn't it? So what is the measured error rate of your conclusion that this crime scene print and the defendant share a source?"
How "lends support" walks into the record unchecked
Daniel Aitken was convicted of first-degree murder in 2009 after a 41-day jury trial, preceded by 40 days of pre-trial argument. Part of the Crown's case was a Harley Street podiatrist, Haydn Kelly, who compared six seconds of night-time CCTV of a shooter in loose track pants and slip-on sandals against covert footage of Aitken, and told the jury the likeness was "very strong." On his own scale, that was the second-highest rung, one step below "extremely strong likeness," which he said was as close as a forensic podiatrist could come to identification. Cunliffe and Edmond, in their 2014 Canadian Bar Review paper "Gaitkeeping in Canada," reconstruct from the trial transcript exactly how that conclusion got in, and what the gatekeeping missed.
When defence counsel Firestone asked Kelly directly whether he had ever done a blind test, Kelly answered no. Asked whether that meant he had no idea what his error rate was, Kelly called the suggestion "ridiculous" and pointed instead to "20-odd years of examining people's gait." He had never published his method, never had a report verified by anyone "other than myself," and could not point to a database when he had been willing to say the features appeared in one percent of the population. Satanove J severed the frequency claim but admitted the rest, calling it high in probative value. The BC Court of Appeal upheld that, reclassifying the work as "specialized knowledge gained through experience and specialized training" so the Daubert indicia of peer review, error rate and validation were declared to have "limited relevance." Chin and Dallen later called this "a dangerously facile approach towards scientific evidence: admitting it without scrutiny by dressing it up as specialized knowledge." Nirenberg, Vernon and Birch (2018) record the parallel English ruling in R v Otway, where the Court of Appeal held that an expert who "spends years studying this kind of comparison can properly form a judgment," subjective experience and all.
The danger for the witness box is in the words themselves. Cunliffe and Edmond note that jurors interpret verbal expressions like "very strong" idiosyncratically, citing Martire and colleagues on the weak-evidence effect: lay people do not map words like "lends support" or "consistent with" onto any stable strength, and weak verbal evidence can even shift belief the wrong way. So a phrase you intend as cautious can land in a juror's head as near-certainty, or as nothing. The R v DD passage the Court itself quoted warns that jurors faced with impressive credentials "abdicate their role as fact-finders and simply attorn to the opinion of the expert."
The candid position a competent examiner states out loud: there is no validated casework error rate for gait, body or clothing comparison, so you cannot give yours. Nirenberg and colleagues concede the same gap, that the lack of population databases means inferences are "currently limited to the experience" of the analyst. Your conclusion has to carry that limitation on its face, expressed as a similarity you observed under stated conditions, not framed as an identification. If counsel hears "consistent with" and runs it toward "it was him," your job is to put the brake back on before the jury does the work you did not validate.
“a dangerously facile approach towards scientific evidence: admitting it without scrutiny by dressing it up as specialized knowledge”

- 01 "Match" on gait, build, clothing or height is a feature comparison with no validated casework error rate. Roughly two decades in, the discipline still cannot give you the number a court will ask for.
- 02 The feature will not hold still. Frame rate, garment, posture and time of day move it before you compare two people, and below a few frames a second the "movement" you describe is the brain filling gaps the camera never recorded.
- 03 The read does not reproduce. Even a 113-feature structured tool on pristine avatars left about a quarter of the score unstable, trained experts barely beat members of the public, and the most confident examiner is often the wrong one.
- 04 Context bends the answer. Tell an anthropologist the sex of a skeleton and the call swings from 0% to 72%. Know what you were told about the case, and when you were told it.
- 05 Height can be done rigorously, as a scene reconstruction with test persons, a wide confidence interval, and a small likelihood ratio. "About 180, the same as the accused" is a point estimate passed off as a match.
- 06 A "unique" footprint usually means "I searched a curated database and did not find a second match within a chosen tolerance," not a measured error rate, and the measurements underneath were never shown to be reliable.
- 07 Your words get misread. "Consistent with" can land in a juror's head as near-certainty. State the limitation on the face of the conclusion, and put the brake on before the jury runs it to identification.
What does "consistent with" mean?
You are on the stand. Counsel saves the hardest question about your own words for last.
"You told this jury the gait is 'consistent with' the accused. Tell us the error rate for that conclusion in casework of this kind, and if you cannot, explain how the jury is supposed to know whether 'consistent with' means near-certain or close to worthless."
- Macoveciuc, I., Rando, C. J., & Borrion, H. (2019). Forensic gait analysis and recognition: standards of evidence in forensic gait analysis. Journal of Forensic Sciences, 64(5), 1294-1303.
- Birch, I., Vernon, W., Walker, J., & Young, M. (2014). The development of a tool for assessing the quality of closed circuit camera footage for use in forensic gait analysis. Science and Justice, 54(2), 159-163.
- Scoleri, T., Lucas, T., & Henneberg, M. (2014). Effects of garments and posture on body measurements from photographs: implications for forensic stature estimation. Forensic Science International, 240, 21-28.
- Thakkar, K., Pavlakos, G., & Farid, H. (2022). The reliability of forensic body-shape identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 829-837.
- Birch, I., Vernon, W., Walker, J., Young, M., & Saxelby, J. (2019). The development of the Sheffield Features of Gait Tool for use in forensic gait analysis. Journal of Forensic and Legal Medicine, 67, 49-54.
- Birch, I., Raymond, L., Christou, A., Fernando, M. A., Harrison, N., & Paul, F. (2020). The identification of individuals by observational gait analysis using closed circuit television footage. Science and Justice, 60(3), 285-291.
- Nakhaeizadeh, S., Dror, I. E., & Morgan, R. M. (2014). Cognitive bias in forensic anthropology: visual assessment of skeletal remains is susceptible to confirmation bias. Science and Justice, 54(3), 208-214.
- Alberink, I., & Bolck, A. (2008). Obtaining confidence intervals and likelihood ratios for body height estimations in images. Forensic Science International, 177(2-3), 228-237.
- Edelman, G., Alberink, I., & Hoogeboom, B. (2010). Comparison of the performance of two methods for height estimation. Journal of Forensic Sciences, 55(2), 358-365.
- Hu, A., Arnold, J., Causby, R., & Jones, S. (2018). The reliability of measurements taken from podiatric data used in identification: a systematic review. Forensic Science International, 287, 71-81.
- Nirenberg, M. S. (2016). Forensic methods and the courts: a Daubert ruling on forensic gait analysis. Podiatry Management.
- Cunliffe, E., & Edmond, G. (2014). Gaitkeeping in Canada: mis-steps in assessing the reliability of expert testimony. Canadian Bar Review, 92(2), 327-368.
- Nirenberg, M., Vernon, W., & Birch, I. (2018). A review of the historical use and criticisms of gait analysis evidence. Science and Justice, 58(4), 292-298.
- Royal Society. (2017). Forensic gait analysis: a primer for courts. London: The Royal Society.