Bayes’ rule exercises: part 2

1. The Prosecutor's Fallacy

Bayes’ theorem often applies when considering the probability that a person is guilty of a crime. Indeed, misunderstanding it leads to what is called the prosecutor’s fallacy: treating the small probability that a random person fits the evidence as if it were the probability that an accused who does fit the evidence is innocent.

Let’s consider the following case:

In a murder case you have found a sample of the murderer’s DNA, and there is a 0.1% chance that a randomly chosen person’s DNA matches this sample. You have found a man whose DNA does match.

The correct interpretation is NOT that there is a 0.1% chance that this man is not the murderer, i.e., a 99.9% chance that he is the murderer.

Bayes’ theorem tells us that to calculate this last probability (the probability that the man is guilty, given that his DNA matches), one must also take into account the probability that a random person is the murderer, which is extremely low; say it is 0.01%.

Let’s use the following notation:

  • Event A = The man is guilty
  • Event B = The man’s DNA matches the one found

Then we have

  • P(B) = 0.1%
  • P(A) = 0.01%

The probability we are interested in is P(A|B): the probability that the man is guilty, given that his DNA matches the killer’s.

Bayes’ theorem gives us that P(A|B) = P(B|A) * P(A) / P(B)

Now P(B|A) is the probability that the man’s DNA would match the killer’s if he is indeed the killer, which should be pretty close to 1 if your DNA testing is any good! P(A) and P(B) are given above, so in the end:

P(A|B) ≈ 1 × 0.01% / 0.1% = 10%

Most definitely not a cause for putting someone in prison!
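
The arithmetic above can be checked with a short Python sketch. The variable names are illustrative and not part of the original exercise, and P(B|A) = 1 is the same assumption the text makes. The second calculation is an added variation: it assumes the 0.1% figure is the match probability for an innocent person and expands P(B) by the law of total probability, which lowers the answer slightly.

```python
# Sketch of the DNA-match calculation; names are illustrative, not from the article.
p_guilty = 0.0001            # P(A): prior that a random person is the murderer (0.01%)
p_match = 0.001              # P(B): chance that a random person's DNA matches (0.1%)
p_match_given_guilty = 1.0   # P(B|A): assumed perfect test, as in the text

# Bayes' theorem using the article's shortcut P(B) = 0.1%.
p_guilty_given_match = p_match_given_guilty * p_guilty / p_match
print(f"Shortcut:          {p_guilty_given_match:.1%}")

# Assumption: read 0.1% as P(B | not A) and expand P(B) by total probability.
p_match_total = p_match_given_guilty * p_guilty + p_match * (1 - p_guilty)
p_guilty_given_match_exact = p_match_given_guilty * p_guilty / p_match_total
print(f"Total probability: {p_guilty_given_match_exact:.1%}")
```

Under either reading the posterior stays in the neighborhood of 10%, far from the 99.9% the fallacy suggests.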

Obviously, in actual cases, this is not the only thing to take into account. If the man’s DNA matches the killer’s, and he also matches a description of the killer, has no alibi, and shoes covered in blood were found in his house, the odds would change considerably.


2. Sick Child and Doctor

A doctor is called to see a sick child. The doctor has prior information that 90% of sick children in that neighborhood have the flu, while the other 10% are sick with measles. Let F stand for the event that a child is sick with flu and M for the event that a child is sick with measles. Assume for simplicity that F ∪ M = Ω, i.e., that there are no other maladies in that neighborhood.

A well-known symptom of measles is a rash; denote the event that a child has a rash by R. P(R|M) = 0.95. However, children with flu occasionally develop a rash as well, so that P(R|F) = 0.08.

Upon examining the child, the doctor finds a rash. What is the probability that the child has measles?

P(M|R) = P(R|M) P(M) / (P(R|M) P(M) + P(R|F) P(F)) = .95 × .10 / (.95 × .10 + .08 × .90) ≈ 0.57.

This is nowhere near the 95% given by P(R|M).
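
As a quick check of the formula above, here is a minimal Python sketch. The helper name `posterior_two_causes` is hypothetical, and it relies on the exercise's assumption that flu and measles are the only two possibilities.

```python
def posterior_two_causes(prior_a, like_a, prior_b, like_b):
    """Return P(A | evidence) when A and B are the only possible causes."""
    joint_a = like_a * prior_a   # P(evidence and A)
    joint_b = like_b * prior_b   # P(evidence and B)
    return joint_a / (joint_a + joint_b)


# A = measles (M), B = flu (F), evidence = rash (R).
p_measles_given_rash = posterior_two_causes(
    prior_a=0.10, like_a=0.95,   # P(M), P(R|M)
    prior_b=0.90, like_b=0.08,   # P(F), P(R|F)
)
print(round(p_measles_given_rash, 2))  # 0.57
```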


3. Incidence of Breast Cancer

In a study, physicians were asked to estimate the probability of breast cancer in a woman who was initially thought to have a 1% risk of cancer but who ended up with a positive mammogram result (a mammogram accurately classifies about 80% of cancerous tumors and 90% of benign tumors). Of 100 physicians, 95 estimated the probability of cancer to be about 75%. Do you agree?

Introduce the events:

  • P - mammogram result is positive,
  • B - tumor is benign,
  • M - tumor is malignant.

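The woman’s 1% prior risk gives P(M) = 0.01 and P(B) = 0.99, and since a mammogram correctly classifies 90% of benign tumors, P(P|B) = 0.10. Bayes' formula in this case is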

P(M|P) = P(P|M) P(M) / (P(P|M) P(M) + P(P|B) P(B)) = .80 × .01 / (.80 × .01 + .10 × .99) ≈ 0.075 ≈ 7.5%.

A far cry from the common estimate of 75%.
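
One way to see why the answer is so low is to count people instead of multiplying probabilities. The sketch below is an illustration only: the cohort of 10,000 women is a hypothetical number chosen to keep the counts whole, and the rates are the ones stated in the exercise.

```python
# Natural-frequency version of the mammogram exercise (hypothetical cohort size).
women = 10_000

with_cancer = women * 1 // 100             # 1% prior risk -> 100 women
without_cancer = women - with_cancer       # 9,900 women

true_positives = with_cancer * 80 // 100       # 80% of cancers are detected -> 80
false_positives = without_cancer * 10 // 100   # 10% of benign cases flagged -> 990

# Among all positive mammograms, what share actually has cancer?
share_with_cancer = true_positives / (true_positives + false_positives)
print(f"{true_positives} of {true_positives + false_positives} positives "
      f"have cancer: {share_with_cancer:.1%}")
```

The 990 false positives from the large healthy group swamp the 80 true positives, which is why the result lands near 7.5% instead of 75%.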


4. Smoking and Lung Cancer

Suppose 0.1% of the American population currently has lung cancer, that 90% of all lung cancer cases are smokers, and that 21% of those without lung cancer also smoke. (These values are fairly close to the values given on the American Lung Association web site as of 2011.) Consider the following questions.

  • What percent of smokers have lung cancer?
  • What percent of non-smokers have lung cancer?
  • How much more likely is a smoker to have lung cancer than a non-smoker?

We begin by recognizing two variables for each subject: smoking and lung cancer. Let S be the event that a subject smokes and L the event that a subject has lung cancer, and write S′ and L′ for their complements (non-smokers and those without lung cancer). The data given specify P(L) = 0.001, P(S|L) = 0.90 and P(S|L′) = 0.21. The questions ask for P(L|S) and P(L|S′). Since the conditional probabilities in the given information and in the questions are reversed, we recognize the need for Bayes’ theorem. We therefore construct a tree diagram, with the lung cancer data in the first set of branches to take advantage of the given information.


After including the original data, we recognize that the probabilities on each set of branches must add to one, and the probabilities along branches must multiply to the final result. With the completed tree diagram, we can answer the questions.

  • P(L|S) = P(L∩S) / P(S) = 0.0009 / (0.0009 + 0.20979) = 0.0042717
    Therefore, approximately 0.43% of all smokers have lung cancer.

  • P(L|S′) = P(L∩S′) / P(S′) = 0.0001 / (0.0001 + 0.78921) = 0.0001267
    Therefore, approximately 0.013% of all non-smokers (S′) have lung cancer.

  • P(L|S) / P(L|S′) = 0.0042717 / 0.0001267 = 33.72
    Therefore, smokers are almost 34 times more likely to have lung cancer than non-smokers. (This is slightly higher than the American Lung Association is reporting, but then we have made a number of simplifications in our statement of the problem.)
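
The tree-diagram bookkeeping above can be reproduced with a short Python sketch. The dictionary layout and variable names are illustrative assumptions; only the three input probabilities come from the exercise.

```python
# Inputs from the exercise.
p_l = 0.001               # P(L): has lung cancer
p_s_given_l = 0.90        # P(S | L): smokers among lung cancer cases
p_s_given_not_l = 0.21    # P(S | not L): smokers among everyone else

# Leaves of the tree: multiply the probabilities along each branch.
leaves = {
    ("L", "S"): p_l * p_s_given_l,
    ("L", "not S"): p_l * (1 - p_s_given_l),
    ("not L", "S"): (1 - p_l) * p_s_given_not_l,
    ("not L", "not S"): (1 - p_l) * (1 - p_s_given_not_l),
}

# Marginals for smoking: add the leaves that share the same second label.
p_s = leaves[("L", "S")] + leaves[("not L", "S")]
p_not_s = leaves[("L", "not S")] + leaves[("not L", "not S")]

p_l_given_s = leaves[("L", "S")] / p_s
p_l_given_not_s = leaves[("L", "not S")] / p_not_s
relative_risk = p_l_given_s / p_l_given_not_s

print(f"P(L|S)     = {p_l_given_s:.7f}")
print(f"P(L|not S) = {p_l_given_not_s:.7f}")
print(f"ratio      = {relative_risk:.2f}")
```

The four leaves sum to one, which is a handy sanity check whenever a tree like this is filled in by hand.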

This analysis does not say that smoking caused lung cancer. The only way statistics can be used to determine causation is for the researcher to control the variables involved. To do this, the researcher would have to randomly assign some subjects to smoke and others not to smoke, and then determine the incidence of lung cancer among the participants. However, controlling the lives of human subjects in this way would not be acceptable.


5. False Positives

One prominent manufacturer of medical tests offers a test for chlamydia (a sexually transmitted disease) that has a sensitivity of 76.4% and a specificity of 93.2%. In other words, the test correctly identifies 76.4% of individuals tested who have the disease by giving a positive result, and correctly identifies 93.2% of the individuals who are healthy by giving a negative result. Currently, it is estimated that 1.5% of the American population has chlamydia. If one individual is randomly selected from the population and tests positive for chlamydia, what is the probability that he/she does not have the disease?

Our variables are the presence of chlamydia and the test result. Let C be the event that a person has chlamydia and T the event that the test result is positive, and write C′ and T′ for their complements. The data state that P(C) = 0.015, P(T|C) = 0.764, and P(T′|C′) = 0.932, so P(T|C′) = 1 − 0.932 = 0.068. The question asks for P(C′|T). Once again, we construct a tree diagram.

The two branches that end in a positive test are P(C∩T) = 0.015 × 0.764 = 0.01146 and P(C′∩T) = 0.985 × 0.068 = 0.06698. We then have:

P(C′|T) = P(C′∩T) / P(T) = 0.06698 / (0.01146 + 0.06698) = 0.8539

In other words, there is an 85% chance that a positive result is a false positive. This is a horrendous result. You might think that if the sensitivity of the test were higher, the situation would be rectified. However, the real problem is the rarity of the condition in the general population. Such medical tests should not be done unless there is some reason to suspect the disease, and if a positive result occurs, a second (different) test might be appropriate to confirm the first.
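
To illustrate the point about sensitivity versus rarity, the sketch below recomputes the false-positive share under two hypothetical variations. The function name, the perfect-sensitivity scenario, and the 30% prevalence are assumptions made for illustration; only the first call uses the figures from the exercise.

```python
def false_positive_share(prevalence, sensitivity, specificity):
    """Return P(no disease | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity                # P(disease and positive)
    false_pos = (1 - prevalence) * (1 - specificity)   # P(healthy and positive)
    return false_pos / (true_pos + false_pos)


# Figures from the exercise.
base = false_positive_share(prevalence=0.015, sensitivity=0.764, specificity=0.932)

# Hypothetical: a test that never misses a case, same population.
perfect_sensitivity = false_positive_share(0.015, 1.0, 0.932)

# Hypothetical: same test, but used where 30% of those tested are infected.
high_prevalence = false_positive_share(0.30, 0.764, 0.932)

print(f"Exercise figures:    {base:.1%}")
print(f"Perfect sensitivity: {perfect_sensitivity:.1%}")
print(f"30% prevalence:      {high_prevalence:.1%}")
```

Even a perfectly sensitive test leaves most positives false at 1.5% prevalence, while the same test applied to a high-risk group gives far fewer false alarms.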

