On Children

A discussion at work yesterday led to a somewhat interesting sort of problem. The original statement was something like:

Given N children from the same parents, if one child dies, how accurately can we reproduce that child’s genome given the still-existing children?

For obvious reasons, “dies” was being tacitly assumed as something more like “is exiled.”

The actual problem statement was more like:

If a child dies, how many siblings of that child must we have access to to reproduce that child’s genome?

But that question, I fear, is even less well formed. Let’s take a stab at explaining why, and then we will come to a better question that I think has a little intrigue to it.

Statement and Definitions

Let’s assume, without loss of generality, that a person is actually a base pairing vector. A person can thus be represented as a sequence a_1, a_2, ... where each element a_i is a base pairing. A person is, further, entirely defined by this sequence (for our purposes), and thus, a person A with N base pairings requires exactly N pieces of information to specify.*

Let us further state that the act of procreation consists of the joining of two humans, a_1, a_2, ... a_N and b_1, b_2, ... b_N by the operation P,

P(A, B) = (a_1 \odot b_1), (a_2 \odot b_2), ... (a_N \odot b_N),

where x \odot y means x or y, each with probability 50%.

Non-rigorous Dismissal of Original Problem

It follows then, that P has a range of size 2^N for given A, B; this is the possible set of children two parents can produce. Further, it is obvious that given two children C_1 and C_2 of the same parents A, B, C_1 is completely independent of C_2; that is, no information is shared between them; they are chosen independently of each other from the range of P(A, B). Thus, I state without formal proof that there is no possibility of constructing child C_k from the set of that child’s siblings; there is simply no information that allows us to reproduce the particular selection that lead to its birth.

A Perhaps More Interesting Problem

So, as I promised at the beginning of today’s lecture, this conversation sparked what seems to me a more interesting question.

Given parents M, F who have birthed k children, how close can we get to creating a new child that accurately matches the distribution of outputs of the procreation function, P?

An alternative formulation would be:

Given parents M, F, and some \epsilon > 0, \phi > 0, how many children N must M and F have such that we have probability >= \phi of constructing P' such that the difference in output from P to P' is <= \epsilon?

I will try to come at this from behind instead of providing a more obviously straightforward wording. Then I will discuss how this is secretly a probability question, and then I will throw my hands up and hope a smarter reader than I comes to a solution.

Discussion of the More Interesting Problem

Let’s start with the simplest case: given a person A, we are to construct a sibling of that person. We know that A was born out of a set of 2^N children; unfortunately, we know exactly 1 element of that set. Thus, if we are to construct a sibling for A, that sibling will be identical to A. If we call this P'**, then the range of P' is A, and so we have a new parentage function that returns \dfrac{1}{2^N} of the possible results. This is a very, very bad parentage function, and A will not be very excited with its new siblings. In terms of our first problem definition, we have k=1, and our “error bar” is basically 100%.

Now let’s consider the situation where we have two siblings, A, B. We know that A has exactly 50% of the possible genome.*** We also know that B has exactly 50% of the possible genome. What is the overlap here? Well, we assume that on average we now have 75% of the genome. Closer inspection of A and B could do us some good: for example, if a_i \not= b_i \forall i <= \mathbb{N}, we immediately have the entire parental genome and can reconstruct P perfectly. But on average we can expect 75% of the information is available to us. It is clear that we can construct a much better P' from two children than from one.

Better discussion of success

Let’s consider a very small example and use that to build our definition of success. Let’s assume we have M, F, each with a genome of length 1. Then the range of P(M, F) is \left\{ {\alpha, \beta } \right\}, where it is not necessarily true that \alpha \not= \beta.

Assume, wlog, that their child C has the genome [\alpha]. Note, of course, that we do not know, given C, (remember we do not have access to the genomes of M and F), whether or not \alpha = \beta. I believe the probability of that statement can be calculated given the set of possible values (for real DNA that’s… 4 I think? It’s been a long time since I took biology), but I’m not going to do it, at least not right now; the point is that we assume that with probability > 0, \alpha \not= \beta.

Now let’s sample both P(M, F) and P', our constructed function, N times, where N is “very large.” For P, we will find that we have obtained \left\{ {\alpha} \right\} about \frac{N}{2} times, and \left\{ {\beta} \right\} about \frac{N}{2} times, whereas for P', we have obtained \left\{ {\alpha} \right\} exactly N times. That means we have an error rate of \frac{1}{2}; we would expect our construction to produce the same value as the true procreation function half the time.

This leads us to what seems to me a fairly good formalization of our statement involving \epsilons and \phis, as well as a point that I’ve been skirting around for a while. That point, of course, is that P is not a function, and should not be written as P(M, F); it is a probability distribution, and a more appropriate notation would be something like P_{M,F}. I, to my long-lasting chagrin, have received next to no formal education in probability, and thus cannot well appeal to the theory for discussing the “similarity” or “closeness” of two probability distributions, but I hope to provide an explanation that crystallizes at least my intuition.

So let us then say that the “difference between” two probability distributions P_1, P_2 is the difference in their hypothetical, ideal histograms. So if you have P_1 that gives x 50% of the time and y the rest, and P_2 that gives x 40% of the time and y 60% of the time, we say that the difference is .2: given 100 samples, we will have 80 matching samples and 20 unaccounted for (10 more x samples for P_1, and 10 less y).

Further, I see no reason to require their ranges to match: if instead we had P_2 = {(x, .4), (y, .5), (z, .1)}****, we would still have the same .2 difference between them.

This concept of “difference” in parentage distributions is what we are trying to minimize (or, in the alternative formulation, trying to determine).

Why I can’t solve this problem

In short: the 75% figure I gave before, for how much of the total information you have given two children, is also a probability distribution: you could have 0%, and you could have 100%; 75% is merely an average. I’ve been circling the problem a little but I think it is outside my grasp. However, I hope it was an interesting thing to think about, and if you have any input, solutions, etc, please let me know!

I have also considered other digressions (further exploring the 1-element genome as the number of siblings grows large, for example), but I think they are too elementary to bother typing up. Happy to add more to this in bits and pieces if anyone is interested!


*I am skimming over several things here. For example, a genome is, in general, compressible. This is not relevant for the discussion, however. For another, I’m using “piece of information” in place of “bit” because I don’t want to go into how much information that is in bits. I would simply represent a person as a bit vector, but that causes actually more confusion down the line; I would go the other way and represent a person as a byte vector but even in this hypothetical setting it pains me to waste that much memory. Let’s just pretend that a “piece of information” is somewhere between a bit and a byte and move on with our lives.

**This P' is simply a function of no inputs, since we do not have A‘s parents available; I am aware this is sloppy, but bear with me.

***This is not true, since it’s possible that for some elements each parent had the same information; i.e. if 2 identical twins had children, all of those children would be twins of both each other and their parents. However, it is not possible for us to know that, based on just the child. We have 50% of the information, in the sense that we know each element of the child’s genome is either its mother’s or its father’s; we do not know if they are the same, and no number of siblings can ever prove that they are.

****I just made this notation up; in case it’s not clear, it means that P_2 produces x 40% of the time, etc. The ordered pairs in the set are of the form (element of the range, fraction of production).

Advertisements
This entry was posted in math. Bookmark the permalink.

One Response to On Children

  1. So if you have P_1 that gives x 50% of the time and y the
    rest, and P_2 that gives x 40% of the time and y 60% of the time, we
    say that the difference is .2: given 100 samples, we will have 80
    matching samples and 20 unaccounted for (10 more x samples for P_1,
    and 10 less y).

    What? Out of a hundred samples, P_1 yields x in 50 of them, and of
    these, P_2 matches it with another x 50*0.4 = 20 times. In another 50
    samples, P_1 yields y, and P_2 matches it 50*0.6 = 30 times, for a
    total of 20 + 30 matching samples. (And it would be the same for any
    other P_2 over x and y; you can’t expect to do better or worse
    than chance at guessing a fair coinflip.)

    But anyway, I don’t think the perhaps-more-interesting problem is
    hard, modulo the philosophical question of how to formalize the
    natural-language concept of similarity for probability distributions
    (the Overmind suggests Bhattacharyya distance?). To
    compute what you should believe about the parents’ genotypes (and
    therefore the distribution of possible children) given observations of
    successive children, just start with your prior probability
    distribution over parent genotypes, and apply Bayes’s theorem. Say that each parent has a single “allele” (scare quotes because this isn’t how real-world biology works) a or b, and the child randomly gets an allele from one parent. Then we might think about the problem like this:

    #!/usr/bin/python3
    
    from collections import namedtuple, defaultdict
    from itertools import product
    
    AllelePairing = namedtuple('AllelePairing', ('F', 'M'))
    
    possible_worlds = list(AllelePairing(*alleles) for alleles
                           in product(('a', 'b'), repeat=2))
    prior_beliefs = {world: 1/len(possible_worlds) for world in possible_worlds}
    
    def prediction(belief):
        outcomes = defaultdict(lambda: 0)
        for allele in belief:
            outcomes[allele] += 0.5
        return outcomes
    
    def update(beliefs, observation):
        # the fundamental theorem of epistemology:
        # P(hypothesis | evidence) = P(evidence | hypothesis) * P(hypothesis)
        #                             / P(evidence)
        new_beliefs = {}
        prior_on_evidence = sum((prediction(belief)[observation] * probability)
                                for belief, probability in beliefs.items())
        for belief, probability in beliefs.items():
            new_beliefs[belief] = (prediction(belief)[observation] *
                                   probability / prior_on_evidence)
        return new_beliefs
    
    def belief_sequence(beliefs, observations):
        for observation in observations:
            beliefs = update(beliefs, observation)
            yield beliefs

    And then, for example, if we keep observing children carrying the a alleles, we can compute exactly how increasingly confident we should be that both parents have the a allele:

    >>> for belief_at_time in belief_sequence(prior_beliefs, ('a',)*6):
    ...     print(belief_at_time[AllelePairing('a', 'a')])
    ...
    0.5
    0.6666666666666666
    0.8
    0.8888888888888888
    0.9411764705882353
    0.9696969696969698

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s