A discussion at work yesterday led to a somewhat interesting sort of problem. The original statement was something like:
Given N children from the same parents, if one child dies, how accurately can we reproduce that child’s genome given the still-existing children?
For obvious reasons, “dies” was being tacitly assumed as something more like “is exiled.”
The actual problem statement was more like:
If a child dies, how many siblings of that child must we have access to to reproduce that child’s genome?
But that question, I fear, is even less well formed. Let’s take a stab at explaining why, and then we will come to a better question that I think has a little intrigue to it.
Statement and Definitions
Let’s assume, without loss of generality, that a person is actually a base pairing vector. A person can thus be represented as a sequence where each element
is a base pairing. A person is, further, entirely defined by this sequence (for our purposes), and thus, a person
with
base pairings requires exactly
pieces of information to specify.*
Let us further state that the act of procreation consists of the joining of two humans, and
by the operation
,
,
where means x or y, each with probability 50%.
Non-rigorous Dismissal of Original Problem
It follows then, that has a range of size
for given
; this is the possible set of children two parents can produce. Further, it is obvious that given two children
and
of the same parents
,
is completely independent of
; that is, no information is shared between them; they are chosen independently of each other from the range of
. Thus, I state without formal proof that there is no possibility of constructing child
from the set of that child’s siblings; there is simply no information that allows us to reproduce the particular selection that lead to its birth.
A Perhaps More Interesting Problem
So, as I promised at the beginning of today’s lecture, this conversation sparked what seems to me a more interesting question.
Given parents
who have birthed
children, how close can we get to creating a new child that accurately matches the distribution of outputs of the procreation function,
?
An alternative formulation would be:
Given parents
, and some
, how many children
must
and
have such that we have probability >=
of constructing
such that the difference in output from
to
is <=
?
I will try to come at this from behind instead of providing a more obviously straightforward wording. Then I will discuss how this is secretly a probability question, and then I will throw my hands up and hope a smarter reader than I comes to a solution.
Discussion of the More Interesting Problem
Let’s start with the simplest case: given a person , we are to construct a sibling of that person. We know that
was born out of a set of
children; unfortunately, we know exactly 1 element of that set. Thus, if we are to construct a sibling for
, that sibling will be identical to
. If we call this
**, then the range of
is
, and so we have a new parentage function that returns
of the possible results. This is a very, very bad parentage function, and
will not be very excited with its new siblings. In terms of our first problem definition, we have
, and our “error bar” is basically 100%.
Now let’s consider the situation where we have two siblings, . We know that
has exactly 50% of the possible genome.*** We also know that
has exactly 50% of the possible genome. What is the overlap here? Well, we assume that on average we now have 75% of the genome. Closer inspection of
and
could do us some good: for example, if
, we immediately have the entire parental genome and can reconstruct
perfectly. But on average we can expect 75% of the information is available to us. It is clear that we can construct a much better
from two children than from one.
Better discussion of success
Let’s consider a very small example and use that to build our definition of success. Let’s assume we have , each with a genome of length 1. Then the range of
is
, where it is not necessarily true that
.
Assume, wlog, that their child has the genome
. Note, of course, that we do not know, given
, (remember we do not have access to the genomes of
and
), whether or not
. I believe the probability of that statement can be calculated given the set of possible values (for real DNA that’s… 4 I think? It’s been a long time since I took biology), but I’m not going to do it, at least not right now; the point is that we assume that with probability > 0,
.
Now let’s sample both and
, our constructed function,
times, where
is “very large.” For
, we will find that we have obtained
about
times, and
about
times, whereas for
, we have obtained
exactly
times. That means we have an error rate of
; we would expect our construction to produce the same value as the true procreation function half the time.
This leads us to what seems to me a fairly good formalization of our statement involving s and
s, as well as a point that I’ve been skirting around for a while. That point, of course, is that
is not a function, and should not be written as
; it is a probability distribution, and a more appropriate notation would be something like
. I, to my long-lasting chagrin, have received next to no formal education in probability, and thus cannot well appeal to the theory for discussing the “similarity” or “closeness” of two probability distributions, but I hope to provide an explanation that crystallizes at least my intuition.
So let us then say that the “difference between” two probability distributions is the difference in their hypothetical, ideal histograms. So if you have
that gives
50% of the time and
the rest, and
that gives
40% of the time and
60% of the time, we say that the difference is .2: given 100 samples, we will have 80 matching samples and 20 unaccounted for (10 more
samples for
, and 10 less
).
Further, I see no reason to require their ranges to match: if instead we had ****, we would still have the same .2 difference between them.
This concept of “difference” in parentage distributions is what we are trying to minimize (or, in the alternative formulation, trying to determine).
Why I can’t solve this problem
In short: the 75% figure I gave before, for how much of the total information you have given two children, is also a probability distribution: you could have 0%, and you could have 100%; 75% is merely an average. I’ve been circling the problem a little but I think it is outside my grasp. However, I hope it was an interesting thing to think about, and if you have any input, solutions, etc, please let me know!
I have also considered other digressions (further exploring the 1-element genome as the number of siblings grows large, for example), but I think they are too elementary to bother typing up. Happy to add more to this in bits and pieces if anyone is interested!
*I am skimming over several things here. For example, a genome is, in general, compressible. This is not relevant for the discussion, however. For another, I’m using “piece of information” in place of “bit” because I don’t want to go into how much information that is in bits. I would simply represent a person as a bit vector, but that causes actually more confusion down the line; I would go the other way and represent a person as a byte vector but even in this hypothetical setting it pains me to waste that much memory. Let’s just pretend that a “piece of information” is somewhere between a bit and a byte and move on with our lives.
**This is simply a function of no inputs, since we do not have
‘s parents available; I am aware this is sloppy, but bear with me.
***This is not true, since it’s possible that for some elements each parent had the same information; i.e. if 2 identical twins had children, all of those children would be twins of both each other and their parents. However, it is not possible for us to know that, based on just the child. We have 50% of the information, in the sense that we know each element of the child’s genome is either its mother’s or its father’s; we do not know if they are the same, and no number of siblings can ever prove that they are.
****I just made this notation up; in case it’s not clear, it means that produces x 40% of the time, etc. The ordered pairs in the set are of the form (element of the range, fraction of production).
What? Out of a hundred samples, P_1 yields x in 50 of them, and of
these, P_2 matches it with another x 50*0.4 = 20 times. In another 50
samples, P_1 yields y, and P_2 matches it 50*0.6 = 30 times, for a
total of 20 + 30 matching samples. (And it would be the same for any
other P_2 over x and y; you can’t expect to do better or worse
than chance at guessing a fair coinflip.)
But anyway, I don’t think the perhaps-more-interesting problem is
hard, modulo the philosophical question of how to formalize the
natural-language concept of similarity for probability distributions
(the Overmind suggests Bhattacharyya distance?). To
compute what you should believe about the parents’ genotypes (and
therefore the distribution of possible children) given observations of
successive children, just start with your prior probability
distribution over parent genotypes, and apply Bayes’s theorem. Say that each parent has a single “allele” (scare quotes because this isn’t how real-world biology works) a or b, and the child randomly gets an allele from one parent. Then we might think about the problem like this:
And then, for example, if we keep observing children carrying the a alleles, we can compute exactly how increasingly confident we should be that both parents have the a allele: