A discussion at work yesterday led to a somewhat interesting sort of problem. The original statement was something like:

Given N children from the same parents, if one child dies, how accurately can we reproduce that child’s genome given the still-existing children?

For obvious reasons, “dies” was being tacitly assumed as something more like “is exiled.”

The actual problem statement was more like:

If a child dies, how many siblings of that child must we have access to to reproduce that child’s genome?

But that question, I fear, is even less well formed. Let’s take a stab at explaining why, and then we will come to a better question that I think has a little intrigue to it.

#### Statement and Definitions

Let’s assume, without loss of generality, that a person is actually a base pairing vector. A person can thus be represented as a sequence where each element is a base pairing. A person is, further, entirely defined by this sequence (for our purposes), and thus, a person with base pairings requires exactly pieces of information to specify.*

Let us further state that the act of procreation consists of the joining of two humans, and by the operation ,

,

where means x or y, each with probability 50%.

#### Non-rigorous Dismissal of Original Problem

It follows then, that has a range of size for given ; this is the possible set of children two parents can produce. Further, it is obvious that given two children and of the same parents , is completely independent of ; that is, no information is shared between them; they are chosen *independently* of each other from the range of . Thus, I state without formal proof that there is no possibility of constructing child from the set of that child’s siblings; there is simply no information that allows us to reproduce the particular selection that lead to its birth.

#### A Perhaps More Interesting Problem

So, as I promised at the beginning of today’s lecture, this conversation sparked what seems to me a more interesting question.

Given parents who have birthed children, how close can we get to creating a new child that accurately matches the distribution of outputs of the procreation function, ?

An alternative formulation would be:

Given parents , and some , how many children must and have such that we have probability >= of constructing such that the difference in output from to is <= ?

I will try to come at this from behind instead of providing a more obviously straightforward wording. Then I will discuss how this is secretly a probability question, and then I will throw my hands up and hope a smarter reader than I comes to a solution.

#### Discussion of the More Interesting Problem

Let’s start with the simplest case: given a person , we are to construct a sibling of that person. We know that was born out of a set of children; unfortunately, we know exactly 1 element of that set. Thus, if we are to construct a sibling for , that sibling will be identical to . If we call this **, then the range of is , and so we have a new parentage function that returns of the possible results. This is a very, very bad parentage function, and will not be very excited with its new siblings. In terms of our first problem definition, we have , and our “error bar” is basically 100%.

Now let’s consider the situation where we have two siblings, . We know that has exactly 50% of the possible genome.*** We also know that has exactly 50% of the possible genome. What is the overlap here? Well, we assume that on average we now have 75% of the genome. Closer inspection of and could do us some good: for example, if , we immediately have the entire parental genome and can reconstruct perfectly. But on average we can expect 75% of the information is available to us. It is clear that we can construct a *much better* from two children than from one.

#### Better discussion of success

Let’s consider a very small example and use that to build our definition of success. Let’s assume we have , each with a genome of length 1. Then the range of is , where it is not necessarily true that .

Assume, wlog, that their child has the genome . Note, of course, that we do not know, given , (remember we do not have access to the genomes of and ), whether or not . I believe the probability of that statement can be calculated given the set of possible values (for real DNA that’s… 4 I think? It’s been a long time since I took biology), but I’m not going to do it, at least not right now; the point is that we assume that with probability > 0, .

Now let’s sample both and , our constructed function, times, where is “very large.” For , we will find that we have obtained about times, and about times, whereas for , we have obtained exactly times. That means we have an error rate of ; we would expect our construction to produce the same value as the true procreation function half the time.

This leads us to what seems to me a fairly good formalization of our statement involving s and s, as well as a point that I’ve been skirting around for a while. That point, of course, is that is not a function, and should not be written as ; it is a *probability distribution*, and a more appropriate notation would be something like . I, to my long-lasting chagrin, have received next to no formal education in probability, and thus cannot well appeal to the theory for discussing the “similarity” or “closeness” of two probability distributions, but I hope to provide an explanation that crystallizes at least my intuition.

So let us then say that the “difference between” two probability distributions is the difference in their hypothetical, ideal histograms. So if you have that gives 50% of the time and the rest, and that gives 40% of the time and 60% of the time, we say that the difference is .2: given 100 samples, we will have 80 matching samples and 20 unaccounted for (10 more samples for , and 10 less ).

Further, I see no reason to require their ranges to match: if instead we had ****, we would still have the same .2 difference between them.

This concept of “difference” in parentage distributions is what we are trying to minimize (or, in the alternative formulation, trying to determine).

#### Why I can’t solve this problem

In short: the 75% figure I gave before, for how much of the total information you have given two children, is **also** a probability distribution: you could have 0%, and you could have 100%; 75% is merely an average. I’ve been circling the problem a little but I think it is outside my grasp. However, I hope it was an interesting thing to think about, and if you have any input, solutions, etc, please let me know!

I have also considered other digressions (further exploring the 1-element genome as the number of siblings grows large, for example), but I think they are too elementary to bother typing up. Happy to add more to this in bits and pieces if anyone is interested!

*I am skimming over several things here. For example, a genome is, in general, compressible. This is not relevant for the discussion, however. For another, I’m using “piece of information” in place of “bit” because I don’t want to go into how much information that is in bits. I would simply represent a person as a bit vector, but that causes actually more confusion down the line; I would go the other way and represent a person as a *byte* vector but even in this hypothetical setting it pains me to waste that much memory. Let’s just pretend that a “piece of information” is somewhere between a bit and a byte and move on with our lives.

**This is simply a function of no inputs, since we do not have ‘s parents available; I am aware this is sloppy, but bear with me.

***This is not true, since it’s possible that for some elements each parent had the same information; i.e. if 2 identical twins had children, all of those children would be twins of both each other and their parents. However, it is *not possible for us to know that*, based on just the child. We have 50% of the *information*, in the sense that we know each element of the child’s genome is either its mother’s or its father’s; we do not know if they are the same, and no number of siblings can ever prove that they are.

****I just made this notation up; in case it’s not clear, it means that produces x 40% of the time, etc. The ordered pairs in the set are of the form (element of the range, fraction of production).

What? Out of a hundred samples, P_1 yields x in 50 of them, and of

these, P_2 matches it with another x 50*0.4 = 20 times. In another 50

samples, P_1 yields y, and P_2 matches it 50*0.6 = 30 times, for a

total of 20 + 30 matching samples. (And it would be the same for any

other P_2 over x and y; you can’t expect to do better

orworsethan chance at guessing a fair coinflip.)

But anyway, I don’t think the perhaps-more-interesting problem is

hard, modulo the philosophical question of how to formalize the

natural-language concept of

similarityfor probability distributions(the Overmind suggests Bhattacharyya distance?). To

compute what you should believe about the parents’ genotypes (and

therefore the distribution of possible children) given observations of

successive children, just start with your prior probability

distribution over parent genotypes, and apply Bayes’s theorem. Say that each parent has a single “allele” (scare quotes because this isn’t how real-world biology works)

aorb, and the child randomly gets an allele from one parent. Then we might think about the problem like this:And then, for example, if we keep observing children carrying the

aalleles, we can compute exactly how increasingly confident we should be that both parents have theaallele: