A couple of days ago, one of my juniors in college (Indian Institute of Science Education and Research, Kolkata) found a spreadsheet lying unguarded in the guest account of our computing system.
This spreadsheet contained the blood types of the Masters students across five batches (‘07-‘13). With this information he made a nice little bar diagram of the frequency distribution of blood types, which I lift here:
When I looked at this graph, the first thing I noticed was that it shows the distribution across both the antigen type (A/B/AB/O) and the Rh factor (+/-). It struck me that, by constructing a contingency table, it may be possible to check whether these two properties are independent of each other.
For this I first constructed from the graph the two-way contingency table of the joint distribution of antigen and Rh factor:
|Marginal totals||77||120||29||128||354 (total)|
Consider now the marginal distributions, i.e. the subtotals along the last row and column. If we divide these numbers by the total number of data points (354), we obtain the relative marginal frequencies. For a large sample, these may be identified with the probabilities of occurrence of each property irrespective of the other.
Now comes the important part. If the occurrence of these two properties is independent, then the joint probability in any cell is the product of the marginal probabilities for that row and column (P(X∩Y)=P(X)P(Y) for any two independent events X and Y). In terms of frequencies, this means that the number in any cell should be the product of the subtotals for that row and column, divided by the total (N = 354). You can think of it this way: the frequency in any cell is one of its marginal frequencies multiplied by the other relative marginal frequency (which, under independence, equals the conditional probability). For example, for the A+ cell:
n(A+) = n(A)·P(+) = n(A)·n(+)/N.
Taking this assumption of independence then, it is possible to construct a contingency table inwards starting from just the outer marginal frequencies. This has been constructed below with the marginal frequencies of the actual data, and rounded to the nearest integers.
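The construction described above can be sketched in a few lines of Python. Note that the counts below are hypothetical placeholders chosen for illustration, not the data from the spreadsheet, and the function name `expected_under_independence` is my own. The sketch also keeps the expected frequencies as fractions instead of rounding them to integers.

```python
# Hypothetical observed counts (NOT the data from this post).
# Rows: Rh factor (+, -). Columns: antigen type (A, B, AB, O).
observed = [
    [80, 120, 30, 130],  # Rh+
    [10, 12, 4, 14],     # Rh-
]


def expected_under_independence(table):
    """Build the table of expected counts from the marginal totals alone."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Each cell is (row subtotal * column subtotal) / total.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_under_independence(observed)

for label, row in zip(["Rh+", "Rh-"], expected):
    print(label, [round(x, 1) for x in row])
```

By construction, the expected table has the same row and column subtotals as the observed one; only the way the counts are spread across the cells differs.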
If you compare the joint frequencies in the two tables, you can see immediately that they are very close, lending support to the assumption that these factors are independent.
For a more graphical idea, I decided to plot cluster bars of the actual and the computed frequencies assuming independence.
The average (signed) error between the actual and computed frequencies was 0 over the eight blood types. That doesn’t say much: errors of opposite signs cancel, and since both tables share the same marginal totals, the signed errors must sum to zero apart from rounding. The root mean squared error was 1.16, which is also tiny in comparison to the frequencies themselves.
Thus, the data are consistent with the antigen type and the Rh factor being independent properties.
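As a rough sketch of how the two error measures above could be computed, here is a small Python example. It uses hypothetical counts (not the spreadsheet data), and the variable names are my own. It compares observed counts against unrounded expected counts.

```python
import math

# Hypothetical observed counts (NOT the data from this post).
# Rows: Rh factor (+, -). Columns: antigen type (A, B, AB, O).
observed = [
    [80, 120, 30, 130],
    [10, 12, 4, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts assuming independence: row subtotal * column subtotal / total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Signed error in each of the eight cells.
errors = [
    o - e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]

mean_error = sum(errors) / len(errors)
rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))

print("mean signed error:", round(mean_error, 6))
print("root mean squared error:", round(rmse, 3))
```

The mean signed error comes out as zero (up to floating-point noise) for any table treated this way, which is why the root mean squared error is the more informative of the two.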
There are stronger, more explicit tests of association than what has been done here. The usual correlation coefficient, though, cannot be calculated here because the variable (blood type) is not a quantitative one. However, there are others that you can read about here.
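One commonly used explicit test for this kind of table is Pearson's chi-squared test of independence, with Cramér's V as a measure of the strength of association. I am assuming here that this is the kind of test meant above. The sketch below uses hypothetical counts (not the spreadsheet data) and plain Python. The critical value 7.815 is the standard tabulated chi-squared value for 3 degrees of freedom at the 5% level.

```python
import math

# Hypothetical observed counts (NOT the data from this post).
# Rows: Rh factor (+, -). Columns: antigen type (A, B, AB, O).
observed = [
    [80, 120, 30, 130],
    [10, 12, 4, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson's chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1) = 1 * 3 = 3.
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Tabulated critical value for 3 degrees of freedom at the 5% level.
CRITICAL_5_PERCENT_DOF_3 = 7.815
reject_independence = chi2 > CRITICAL_5_PERCENT_DOF_3

# Cramér's V: 0 means no association, 1 means perfect association.
min_dim = min(len(observed), len(observed[0])) - 1
cramers_v = math.sqrt(chi2 / (grand_total * min_dim))

print("chi-squared:", round(chi2, 3), "with", dof, "degrees of freedom")
print("reject independence at 5%:", reject_independence)
print("Cramér's V:", round(cramers_v, 3))
```

A statistic below the critical value means the data give no reason to reject independence. It does not prove independence, and the chi-squared approximation becomes rough when some expected counts are small, as they tend to be for the Rh-negative cells.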
If you want to test this method or any of the other methods on other datasets of blood types, this page provides quite a bit of data for various countries (although when I last checked, some of it did not seem to be entirely accurate).