Rationally Speaking: Odds again: Bayes made usable

About Rationally Speaking

Rationally Speaking is a blog maintained by Prof. Massimo Pigliucci, a philosopher at the City University of New York. The blog reflects the Enlightenment figure Marquis de Condorcet's idea of what a public intellectual (yes, we know, that's such a bad word) ought to be: someone who devotes himself to "the tracking down of prejudices in the hiding places where priests, the schools, the government, and all long-established institutions had gathered and protected them." You're welcome. Please notice that the contents of this blog can be reprinted under the standard Creative Commons license.

Thursday, November 29, 2012

Odds again: Bayes made usable

by Ian Pollock

[Note: this post assumes basic familiarity with probability math, and also presupposes a subjectivist view in philosophy of probability.]

Readers of this blog, and of others a few Erdős numbers (Massimo numbers?) away from it, will by now be used to having Bayes’ theorem hammered into their heads all the time, as the Great Equation of Power and the Timeless Secret of the Universe.

I suspect that I am not the only one who has occasionally felt somewhat disingenuous when harping on Bayes. Even though I do actually think it’s the secret of the universe, memorizing the formula is liable to become little more than a signal of in-group identity (along the lines of being able to recite the Nicene Creed or the roster of Local Sports Team), unless people know what it means, and how to sometimes actually maybe possibly use it.

When I talk about “using” Bayes’ theorem, I have a different picture in mind than what you may think. I do not necessarily mean a textbook problem with all the needed information clearly specified and relevant numbers handed to you. What I tend to think of instead are problems like:
“The car in front of me just swerved halfway into my lane. How likely is the driver to be drunk?”
These underspecified problems are the meat of day-to-day probability judgments.

But let’s look at Bayes’ theorem as traditionally presented:

P(H|E) = P(H)•P(E|H) / ( P(E|H)•P(H) + P(E|¬H)•P(¬H) )

[Terminology: P(_) stands for “probability of _,” H stands for “hypothesis,” E stands for “evidence,” the vertical bar stands for “given,” e.g., P(E|H) is the “probability of E given that H is true”, and finally ¬ means “not.”]

This formula is hideous on at least two levels:

First, it has too many terms (some repeating) and too many operations. You end up performing 2 or 3 multiplications, 1 addition, 1 subtraction ( P(¬H) = 1 - P(H) ) and 1 division, in order to get the answer. This does not conduce to doing the arithmetic in your head in real time, unless you are unusually good at arithmetic and have good working memory (neither of which applies to me).

Second, and perhaps most importantly, it is conceptually opaque. You do not see the structure of reasoning when you look at Bayes’ theorem in that form; all you see is a porridge of symbols. The “prior” that Bayesians are always harping on about, P(H), appears three separate times: once in the numerator and twice in the denominator (once as P(H) itself and once disguised as P(¬H) = 1 - P(H)), all tangled up with P(E|H) and P(E|¬H), the “evidence terms.” Granted, the denominator is really just an expansion of P(E), which makes it a bit less opaque. But you can rarely calculate P(E) without doing the expansion.
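
To make the operation count concrete, here is a minimal sketch of the traditional form as a function. The function name and the input numbers are illustrative assumptions, not taken from any data.

```python
def posterior_probability(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem in its traditional (probability) form."""
    numerator = prior * p_e_given_h
    # The denominator is P(E), expanded over H and not-H.
    # Note the subtraction hiding inside P(not H) = 1 - P(H).
    denominator = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return numerator / denominator

# Illustrative inputs only: P(H) = 0.2, P(E|H) = 0.6, P(E|not H) = 0.1
print(posterior_probability(0.2, 0.6, 0.1))
```

The prior is used three times inside the function body, which is the tangle described above.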

Notice that when we speak of using Bayes’ theorem we are speaking of modifying (1) prior judgment in the light of (2) evidence to arrive at (3) a new judgment. Ideally, we would like a formula that looks more like:

posterior = prior [operation] evidence

Well, here is Bayes’ theorem in odds form:

O(H|E) = O(H) * P(E|H) / P(E|¬H)

As you can see, it consists of only one division and one multiplication. And lo, O(H) is just the prior odds, and the ratio P(E|H)/P(E|¬H) corresponds to “evidential strength,” although the literature usually calls it a likelihood ratio or a Bayes factor.
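
For readers who want to see where the odds form comes from, here is a short derivation sketch. It assumes only the traditional form with P(E) as the denominator and the usual definition of odds, O(X) = P(X)/P(¬X). Writing Bayes' theorem once for H and once for ¬H and dividing, the P(E) terms cancel:

```latex
\begin{align*}
P(H \mid E) &= \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad
P(\neg H \mid E) = \frac{P(E \mid \neg H)\,P(\neg H)}{P(E)} \\[4pt]
O(H \mid E) &= \frac{P(H \mid E)}{P(\neg H \mid E)}
  = \frac{P(H)}{P(\neg H)} \cdot \frac{P(E \mid H)}{P(E \mid \neg H)}
  = O(H) \cdot \frac{P(E \mid H)}{P(E \mid \neg H)}
\end{align*}
```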

If you’re not used to how odds work, now would be a good time to check out my old article on them, in which for some inscrutable reason I didn’t get round to talking about their advantage in re: Bayes’ theorem. The rest of this article assumes you are moderately comfortable with odds talk.
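
As a quick refresher, converting between probabilities and odds can be sketched as follows. The helper names are hypothetical, and odds are represented as a single number (odds in favor), so 1:4 becomes 1/4.

```python
from fractions import Fraction

def prob_to_odds(p):
    """Odds in favor, as a single number: P / (1 - P)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability from odds in favor: O / (1 + O)."""
    return odds / (1 + odds)

# A probability of 20% corresponds to odds of 1:4 in favor.
print(prob_to_odds(Fraction(1, 5)))   # 1/4
# Odds of 3:2 in favor correspond to a probability of 3/5.
print(odds_to_prob(Fraction(3, 2)))   # 3/5
```

Using `Fraction` keeps the arithmetic exact, which makes it easier to check against odds written as ratios.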

Let’s see how Bayes works with an example.

In the classic 1957 film “12 Angry Men” (one of my favorites), a young man is accused of killing his father. One of the pieces of evidence brought against him is the fact that he was identified by a store clerk as having recently purchased a switchblade knife with an unusual handle, and the same kind of knife had been found on the body (wiped of fingerprints). See a nice clip here of the jurors debating the relevance of this piece of evidence.

At first, the unusual character of the knife led the jurors to believe that it was, if not one of a kind, at least very rare. But they are led by the touch of Henry Fonda’s cool hand to modify that assessment and consider the knife a much more commonplace one than they had thought. One of the hawkish jurors then asks petulantly: “Maybe there are ten knives like that, so what?” So what indeed.

We are interested in estimating the odds that the boy is guilty, given that he had purchased a knife the same as the one found at the murder scene — O(guilty|knife). Let us assume that it is certain that the boy did indeed purchase the knife as the store clerk said (actually a very charitable interpretation in the prosecution’s favour).

The first thing we need to think about is our prior. This represents what we think the chance is that the boy committed the murder, before the knife evidence is considered at all. Different people will have different priors, but let us suppose that enough evidence had been presented at trial already to make you consider him 20% likely to be guilty, or odds of 1:4 in favor: O(guilty) = 1:4.

We still need to know two more things.

First, P(knife|guilty) — assuming the boy is guilty, how likely is the knife evidence?
Well, it is not beyond the realm of possibility that the boy could have stabbed his father and disposed of the knife altogether, so even if he is guilty, there is no guarantee of seeing the knife. However, since we know he did buy an identical knife, it is not very surprising to see it at the crime scene if he is guilty. Let us estimate this probability as P(knife|guilty) = 0.6.

We also need to know P(knife|¬guilty) — assuming the boy is innocent, how likely is the knife evidence?

If (as the jurors at first seem to assume) there is only one knife in the whole world that looks like the murder weapon, and we know that the boy bought it, then the only plausible way it could have been the murder weapon and yet the boy be innocent, is if somebody else acquired it from the boy, and then used it to kill the boy’s father. One can understand the hawkish jurors’ impatience with this “possibility.” It requires not only that the boy somehow lost possession of the knife, but that somebody else (coincidentally?) wanted to use it to kill his father in particular. This rates a very low probability, let us say roughly 1000:1 against, or P(knife|¬guilty) ≈ 0.001.

Now we have everything we need to figure out the odds of the boy being guilty, given this evidence. We already have the prior — 4:1 against or 1:4 in favor. The “evidential strength” is just the ratio of P(knife|guilty)/P(knife|¬guilty) = 0.6/0.001 = 600. We just multiply the prior by the evidence:

O(guilty|knife) = (1:4)*600 = 150:1 in favor of guilt.
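
The same arithmetic can be checked with a few lines of code. This is only a sketch: the function name is hypothetical, and the inputs are the rough estimates chosen above.

```python
from fractions import Fraction

def update_odds(prior_odds, p_e_given_h, p_e_given_not_h):
    """Odds form of Bayes: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * (p_e_given_h / p_e_given_not_h)

prior = Fraction(1, 4)   # O(guilty) = 1:4 in favor
posterior = update_odds(prior, Fraction(6, 10), Fraction(1, 1000))

print(posterior)                            # 150, i.e. 150:1 in favor
print(float(posterior / (1 + posterior)))   # as a probability, roughly 0.993
```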

So far so good, although the three numbers involved can all be quibbled with. But here is where Henry Fonda’s duplicate knife becomes important. It does not really change the top part of the evidence ratio: P(knife|guilty) is about the same. But suddenly that factor of 1000 that was making the boy look so guilty is going to drop, because now we know that the killer had access to lots of identical knives, not just the defendant’s. Now it looks like P(knife|¬guilty) is just the fraction of all knives available in the victim’s neighborhood that look like the murder weapon. We can guess that this is something like 1 in 10. So the evidence ratio becomes 0.6/0.1 = 6, and we multiply by the prior to get

O(guilty|knife) = (1:4)*6 = 3:2 in favor of guilt.

Thus, what Fonda showed is that although the knife is evidence of the boy’s guilt, it is much weaker evidence than the jurors had been led to believe. We do not convict criminals at odds of 3:2, or at least, we ought not to.
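
To see how strongly the conclusion depends on how common the knife is, one can vary P(knife|¬guilty) while holding the other two estimates fixed. This is a sketch: the values 0.01 and 0.3 are extra hypothetical points added for illustration, while 0.001 and 0.1 are the two cases discussed above.

```python
def posterior_odds(prior_odds, p_e_given_h, p_e_given_not_h):
    """Posterior odds in favor = prior odds * likelihood ratio."""
    return prior_odds * p_e_given_h / p_e_given_not_h

prior_odds = 1 / 4        # the prior used above, 1:4 in favor
p_knife_if_guilty = 0.6   # the estimate used above

# From "one of a kind" to "quite common".
for p_knife_if_innocent in (0.001, 0.01, 0.1, 0.3):
    odds = posterior_odds(prior_odds, p_knife_if_guilty, p_knife_if_innocent)
    prob = odds / (1 + odds)
    print(f"P(knife|not guilty)={p_knife_if_innocent}: "
          f"odds {odds:.2f}, P(guilty|knife) {prob:.2f}")
```

With P(knife|¬guilty) = 0.1 the odds come out at about 1.5, which is the 3:2 figure above.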

To address one objection I anticipate: yes, many of the numbers above are very rough guesses. Wherever possible, they should be improved upon by more objective data. But in my defense, notice how mapping out the underlying structure of the reasoning directs inquiry to where it needs to go, rather than to irrelevancies. You can challenge the prior I chose of 4:1 against guilt, by saying that the other evidence presented at trial makes him look a lot more guilty than that. You can challenge the drop in the evidence ratio by checking exactly how many of these knives are sold in nearby shops. These are exactly the questions juries should be thinking about.

Meanwhile, other questions, when seen in a Bayesian light, are obviously non-starters. A bigoted juror in the movie makes much of the boy’s poor background, as if that ought to weigh heavily in favor of his guilt. Unfortunately, while his fellow jurors express their disgust at this man’s prejudice, they fail to notice the obvious silliness of the underlying logic in this case. For if the boy is more likely to commit a crime by virtue of living in a bad neighborhood, so too are all the other people in the neighborhood, leaving the boy’s relative chances of having committed this particular crime approximately the same as they would have been if he had lived in a good neighborhood. Likewise, it is not much good emphasizing the victim’s bad relationship with his son, when he had bad relations with innumerable others.

To recap what we did in our example: we had a prior judgment about how likely the boy was to be guilty, not considering the knife evidence. Then, we considered the evidential strength of the knife evidence, which can be summarized with the phrase: “how much more likely was the evidence if he was guilty, than if he was innocent?”

This way of thinking about uncertainty, while normatively correct, departs from how humans automatically reason about these things in two important ways.

First, it gives equal weight to evidence and to prior. This is important because people constantly forget all about their priors as soon as they see evidence confirming a hypothesis. “I just met Sally. She is very adventurous, a real adrenaline junkie. Is Sally more likely to be a skydiving instructor, or an accountant?” Most people will answer that Sally is probably a skydiving instructor, forgetting that although all skydiving instructors are surely adventurous, there are way more accountants than skydiving instructors (and some accountants are adventurous too). The skeptical community usually sums up the insight that priors matter as much as evidence, with Carl Sagan’s excellent slogan “extraordinary claims require extraordinary evidence,” although they sometimes display a woeful lack of inclination to generalize this principle beyond Bigfoot.
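
The Sally example can be run through the odds form too. The odds form works for comparing any two hypotheses, here "skydiving instructor" against "accountant". All three numbers below are invented purely for illustration; only the structure of the calculation matters.

```python
from fractions import Fraction

# All three numbers below are made up for illustration.
prior_odds = Fraction(1, 500)           # instructors : accountants in the population
p_adv_if_instructor = Fraction(1)       # every skydiving instructor is adventurous
p_adv_if_accountant = Fraction(1, 20)   # 5% of accountants are adventurous

likelihood_ratio = p_adv_if_instructor / p_adv_if_accountant
posterior_odds = prior_odds * likelihood_ratio

print(likelihood_ratio)   # 20
print(posterior_odds)     # 1/25
```

Under these assumed numbers the evidence favors "skydiving instructor" by a factor of 20, yet Sally is still 25 times more likely to be the accountant, because the prior was so lopsided.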

Second, it emphasizes that what matters is not that evidence be consistent with some hypothesis, but that it be more likely if the hypothesis is true, than if it is false. This has the side effect of emphasizing the non-binary nature of evidence. Amanda Knox acted oddly (for example, doing a handstand) after the murder of her roommate Meredith Kercher, about which the prosecution made much hay. The question we now know to ask is, “How much more likely is a person to act oddly after the murder of their friend if they are guilty, as opposed to if they are innocent?”

Um... a little more likely? Maybe twice as likely, at most? Possibly even less likely, as a guilty person might be more careful not to stand out... If this is evidence of guilt at all, it is extremely weak and ambiguous evidence, an evidence ratio of close to 1.

Most of us will not serve on many juries, but the same logic applies, rather famously, to medical tests of various kinds. If I go in for random screening against bowel cancer, and test positive, I am liable to assume that I almost certainly have the disease. However, the questions that really need to be asked at this point are: (a) what’s the base rate in the population (aka, prior) and (b) how much more likely is a positive test if I have the disease than if I don’t?

Wikipedia tells us that Fecal Occult Blood screening for bowel cancer has 67% sensitivity (67% of people with the disease test positive) and 91% specificity (9% of people without the disease test positive anyway). This means the evidential strength of a positive test is P(pos_test|cancer)/P(pos_test|¬cancer) = 67/9 ≈ 7. So whatever the prior odds were, multiply them by ~10. [1]
The base rate for bowel cancer looks to be about 54 per 100,000 or around 2000:1 against, so O(cancer|pos_test) = (1:2000)*10 = 1:200 in favor = 200:1 against. As you can see, a positive test is cause for concern, but not panic. You probably don’t have the disease. In fact, you didn’t even need to look up the incidence in this case - all you needed to do was realize that unless 1 in 10 people in your reference class have bowel cancer (surely not!), your odds of having it are less than 50:50.
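
For comparison with the guerrilla arithmetic, here is a sketch of the same calculation without rounding. It uses only the three figures quoted above; the variable names are my own.

```python
sensitivity = 0.67         # P(positive | cancer), as quoted above
specificity = 0.91         # so P(positive | no cancer) = 1 - 0.91 = 0.09
base_rate = 54 / 100_000   # base rate quoted above

likelihood_ratio = sensitivity / (1 - specificity)
prior_odds = base_rate / (1 - base_rate)
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio: {likelihood_ratio:.2f}")
print(f"posterior odds against: about {1 / posterior_odds:.0f}:1")
print(f"P(cancer | positive test): {posterior_prob:.4f}")
```

The unrounded answer lands at roughly 250:1 against, in the same ballpark as the 200:1 obtained in the head, which is the point of the odds form.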

I hope that this reformulation of Bayes, mathematically trivial as it is, serves you as well as it now serves me. Even if you don’t actually calculate (hard to do in the messiness of the real world), knowing how it works is, I think, very epistemically salutary.
_______

[1] 7=10 in guerrilla arithmetic. We spit on your bourgeois Peano axioms.

42 comments:

Gadfly, November 29, 2012 9:00 PM
Noam Chomsky, as part of a great interview (non-political, all scientific issues) in the Atlantic, indicated he doesn't have a lot of use for Bayesian thinking, in general, at least as it's used (or overused) today. The interview is very worth reading for dozens of reasons: http://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/?single_page=true

Richard, November 30, 2012 12:36 AM
Thank you for this. I always found Bayes' Theorem easier to remember if the denominator is simply P(E), knowing this can be expanded as necessary. Multiplying both sides by P(E) gives P(H|E)*P(E) = P(E|H)*P(H). Both sides of this equality are identical to P(H^E) (the probability of both the hypothesis and the evidence) -- the reverse derivation of the theorem. So I can always mentally reconstruct the equation from the identity, P(H^E) = P(H^E), if I forget the full form. That's the advantage of understanding a theorem rather than just learning by rote.

I never thought about it in terms of odds, so from now on I will have another shorthand form O(H|E) = O(H) * S(E), where S(E) is the strength of the evidence as you've defined it.

I also never thought about applying it to a jury trial. My only jury experience was a civil trial -- and it would have been difficult to apply in that context. We knew who the plaintiff's doctors were, and the medical records were all there; it was a matter of deciding which procedures were done improperly and the damages to attribute. But in a criminal trial, I see that we could start with a small prior probability 1/P that the suspect is guilty, where P is the population of individuals in reasonable proximity at the time of the crime (P could be anywhere from 1 to millions), and construct a product of terms Si for exhibits i = 0 through N: O(H|E) = S0*S1*...*SN/P.

DaveS, November 30, 2012 7:02 AM
Another great post Ian, one can almost see your car applying Bayesian analysis before changing its program from collision avoidance mode into drunk driver avoidance mode. And promoting that occult test by crunching some numbers could save people considering colonoscopies a lot of money.

And you can even apply Bayes to anything you like, using extremely loose definitions of nearly every word except for 'that' and 'given', as in "What's the probability that gods exist given that humans created machines?"

What really fascinates me is what happens when we also loosen the definition of 'given' in such questions, as in "What is the probability that 'x' is occurring given the past or future occurrence of 'y'?" You might say we have lots of numbers on the priors but none on the future events. But we do, in the form of bets, stock prices, polls, etc.

Wonder if belief data vs 'real' data could be factored into the equation? If the weather sucks, will work on this over the weekend. Or not, - if the hypothesis sucks, will work on enjoying whatever weather we get.

Richard, December 01, 2012 1:29 AM
Another point to remember is that the proof that "absence of evidence is evidence of absence" is Bayesian. See http://lesswrong.com/lw/ih/absence_of_evidence_is_evidence_of_absence/

Aaron Shure, December 01, 2012 3:02 AM
Interesting post. I'm going to have to re-read a couple of times to see if I can really get it. But looking at the history of the podcast, I will say that I've noted that Julia originally rejected the Anthropic Principle (sorry I'm too lazy to look up the episode) on frequentist grounds, but later defended the Simulation Argument on Bayesian grounds. I would say this is a sad turn of events. In my mind, both of those arguments suffer from similar weaknesses, but Bayes somehow gives license where frequentism doesn't. It makes me wonder if Bayes leaves the barn door open, as I assume the good Reverend had hoped, to assertions of religious or quasi-religious ideas. Or maybe it's the broader "subjectivist view" that is to blame. And, if I understand it correctly, the danger lies in the general population not understanding the difference. Can Bostrom's claim that there is a 20% chance that there is a creator god have the same epistemological validity as a meteorologist's claim that there is a 20% chance of rain tomorrow? Does either of those compare to the claim that there is a 1/6 chance of rolling a 3 on a die?

Anonymous, December 03, 2012 4:36 AM
How would you apply it to solve this problem?
A guy is accused of murdering someone.
Evidence: the victim is his wife, and she was stabbed to death.

Knowing nothing else, we'd just take all the cases of wives who were stabbed to death, and see what fraction of them were stabbed by their husband.

But how would you use the likelihood ratio here? What are the prior odds of guilt, knowing nothing about the guy? 2 in 7 billion? Is the probability that the victim is his wife greater if he killed her or if he didn't kill her? Weird question.

Il Censore, December 03, 2012 5:31 AM
Thank you, Ian, I find this version of the theorem way clearer.

I was also able to start an argument using the odds form of Bayes' theorem, and I credited your article: "Betting on horses and the resurrection of Jesus (I)".

Thank you again!

Richard, December 04, 2012 11:23 PM
Ian, I'd love to hear your thoughts on the Doomsday Arguments of Gott, Carter, Bostrom, et al. I think there is a perfectly good refutation of these arguments, but from what I've read, the proponents are aware of the refutation and don't accept it. There seems to be a specific misapplication of Bayesian inference at work. I can go into more detail later...

Richard, December 11, 2012 2:32 AM
Ian, thanks, I am looking forward to it!