Precision of evaluation

If you were forced to compress your evaluation of a grant, paper, or applicant into a single numerical score, how precise could that score be?
How many bits of resolution can we reasonably expect in such a score, and how reproducible would the resulting rankings be across repeated, anonymized evaluations of the same set, by the same reviewer?
I pose these questions because I suspect that we are routinely asked to provide numerical evaluations to a degree of precision that is not supported by real-world experience or the available evidence.
One example: when I submit letters of recommendation to graduate schools for students, the school’s online system often requires me to rank the student in the “top 1%”, “top 2%”, “top 5%”, and so on. Professors interact with many students, and to varying degrees. How reliably can we distinguish among these categories?
When we assign a score from 1 to 5, we’re expressing at most about 2.3 bits of information (since log₂(5) ≈ 2.32). And that’s assuming we use the scale evenly and consistently, which we often don’t. In NIH review, the scale is technically 1 to 9, which would be 3.17 bits, but since most scores cluster around 3 ± 2, we’re back down to the two-bits-and-change range.
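To make the arithmetic concrete, here is a minimal Python sketch that computes the Shannon entropy of a score distribution. The uniform cases reproduce the upper bounds quoted above. The "clustered" weights are invented purely for illustration and are not NIH data; the function name entropy_bits is likewise just a label chosen for this sketch.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Upper bounds: every point on the scale is used equally often.
uniform_5 = [1 / 5] * 5
uniform_9 = [1 / 9] * 9

# Hypothetical usage of a 1-to-9 scale where most scores fall between 1 and 5.
# These weights are made up for illustration only.
clustered_9 = [0.10, 0.20, 0.30, 0.20, 0.10, 0.04, 0.03, 0.02, 0.01]

print(f"uniform 1-5:   {entropy_bits(uniform_5):.2f} bits")
print(f"uniform 1-9:   {entropy_bits(uniform_9):.2f} bits")
print(f"clustered 1-9: {entropy_bits(clustered_9):.2f} bits")
```

Under these assumed weights, the nine-point scale carries noticeably less than its nominal 3.17 bits, because the rarely used high scores contribute little. A different choice of weights would give a different number, so treat the output as a rough guide rather than a measurement.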
Now, how reproducible are these numbers? Vary the time of day, the reviewer’s focus, alertness, and mood, their familiarity with the field, their unconscious biases, and so on. Of those two-plus bits of information, probably fewer than two are reliable.
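One way to put a number on reproducibility is to ask how many bits two independent readings of the same item share, that is, their mutual information. The sketch below does this for a toy noise model that is entirely an assumption: the underlying score on a 1-to-5 scale is equally likely to be any value, and with probability slip the reviewer reports a score one step too high or too low. None of the function names or parameters come from any real review system.

```python
import math
from itertools import product

def mutual_information_bits(joint):
    """Mutual information I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def reported_score_probs(true_score, slip, lo=1, hi=5):
    """Hypothetical noise model: with probability `slip` the reported score
    is one step off (split evenly up/down, clamped to the ends of the scale)."""
    probs = {}
    for delta, p in ((0, 1 - slip), (-1, slip / 2), (1, slip / 2)):
        s = min(hi, max(lo, true_score + delta))
        probs[s] = probs.get(s, 0.0) + p
    return probs

def repeat_information(slip, lo=1, hi=5):
    """Bits shared by two independent readings of the same item."""
    scores = range(lo, hi + 1)
    prior = 1 / len(scores)  # assumption: every underlying score equally likely
    joint = {}
    for t in scores:
        probs = reported_score_probs(t, slip, lo, hi)
        for (a, p_a), (b, p_b) in product(probs.items(), repeat=2):
            joint[(a, b)] = joint.get((a, b), 0.0) + prior * p_a * p_b
    return mutual_information_bits(joint)

for slip in (0.0, 0.2, 0.4):
    print(f"slip={slip:.1f}: {repeat_information(slip):.2f} bits shared")
```

With no slips, the two readings share the full log₂(5) bits. As the assumed slip rate grows, the shared information falls well below that ceiling, which is the sense in which a nominally two-plus-bit score may be reliable to fewer bits. Real reviewer noise is surely messier than a one-step slip, so this is only a way to frame the question.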
That said, I have noticed that in NIH reviews, disparate reviewers often independently converge on similar scores. So I don’t intend to dismiss the value of the process, but we should scale the time, energy, and deference we give to such scoring to match the precision it can actually deliver.
In the end, a score provided by an evaluator is a low-bit metric. It’s a compressed signal of human judgment, and like any signal, it has noise, resolution limits, and a story behind it.