Disclaimer

This information HAS errors and is made available WITHOUT ANY WARRANTY OF ANY KIND and without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. It is not permissible to be read by anyone who has ever met a lawyer or attorney. Use is confined to Engineers with more than 370 course hours of engineering.
If you see an error contact:
+1(785) 841 3089
inform@xtronics.com

Genetic differences and information theory


How should we measure and understand genetic differences in a meaningful way?

OK, my curiosity has me again - I often read things and find that instead of understanding better I have ended up with many more questions than answers.

I was listening to a lecture (celebrating Dawkins' book, The Selfish Gene) and the professor said that, of the information in our chromosomes, about 2% codes for proteins, perhaps 3% is used for control, and the remaining 95% is thought to be made up of remnants of retroviruses and empty "line genes" that don't do anything.

Couple that with a typical quote about the difference between human and chimp genes, this one from National Geographic:

"The goal is to answer the basic question: What makes us humans?" said Eichler. Eichler and his colleagues found that the human and chimp sequences differ by only 1.2 percent in terms of single-nucleotide changes to the genetic code. But 2.7 percent of the genetic difference between humans and chimps are duplications, in which segments of genetic code are copied many times in the genome. "If genetic code is a book, what we found is that entire pages of the book duplicated in one species but not the other," said Eichler. "This gives us some insight into the genetic diversity that's going on between chimp and human and identifies regions that contain genes that have undergone very rapid genomic changes."


And this one:

The new estimate could be a little misleading, said Saitou Naruya, an evolutionary geneticist at the National Institute of Genetics in Mishima, Japan. "There is no consensus about how to count numbers or proportion of nucleotide insertions and deletions," he said.

So I still don't think I have a good feel for the magnitude of the difference.

I have an interest in how to measure the difference in data - it is of key importance in data compression algorithms.

Reversing a block of data adds little real difference, and an insertion or deletion that merely shifts the position of other data is not much of a true difference either. So how do they measure these gene differences in a meaningful way?

When they say that humans vary from chimps by only a few percent, that could mean a lot of different things. Are they only looking at the 5% that counts (that would make sense to me), or are they overstating it by looking at all the junk DNA? I don't think they are looking at mitochondrial DNA. How do they deal with reversals and relocations? Are they only looking at the DNA that codes for protein amino acid sequences? How can this difference be expressed in a meaningful way?

On Linux-based computers there is a command called diff that creates a difference file. This command deals with insertions and deletions well, but it is not so good at reversals or rearrangements of blocks. Related to this program is one called rsync.
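To make the point about rearranged blocks concrete, here is a small sketch. It uses Python's difflib as a stand-in for diff (an assumption on my part: the two use different matching algorithms, but both report only insertions, deletions and replacements, never moves). The helper names edit_ops and lines_touched are made up for this illustration.

```python
import difflib

def edit_ops(a, b):
    """Return the non-matching opcodes difflib finds when turning a into b."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

def lines_touched(ops):
    """Count lines deleted from the old version plus lines added in the new one."""
    return sum((i2 - i1) + (j2 - j1) for _tag, i1, i2, j1, j2 in ops)

original = ["A", "B", "C", "D", "E", "F"]
inserted = ["A", "B", "X", "C", "D", "E", "F"]  # one genuinely new line
swapped = ["D", "E", "F", "A", "B", "C"]        # two blocks trade places, nothing new

ins_ops = edit_ops(original, inserted)
swap_ops = edit_ops(original, swapped)

print(lines_touched(ins_ops))   # 1: the insertion is reported compactly
print(lines_touched(swap_ops))  # 6: the swap shows up as 3 lines added plus 3 deleted
```

The swapped version contains no new information at all, yet a line-oriented difference reports every line as changed. That is the kind of overstatement I worry about in the gene comparisons.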

rsync was originally written by Andrew Tridgell as the basis of his PhD thesis. He is a key person, and in some ways more important to Linux than Linus. (BTW - for a bit of self-reference, I use rsync to transfer updates of this web site!) See binary diffs for more on this.

He was solving a problem that often comes up with computer files where there are different versions of, let's say, a text file. Instead of sending the entire file, he came up with a system that breaks the file into parts, creates a hash for each part (a mathematical method that identifies a block of data with a type of checksum), compares the hashes on both ends, and then transmits only the differences (which are compressed as well). This allows files to be 'synchronized' without sending the whole thing, thus speeding up the process. Other attempts to create compact differences use more complex algorithms at the expense of processor time.
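The block-and-hash idea can be sketched in a few lines. This is a toy under stated assumptions, not rsync's actual code: the block size of 8 bytes, the use of SHA-256, and the function names block_signatures, make_delta and apply_delta are all mine. Real rsync pairs a cheap rolling checksum with a stronger hash so it does not have to rehash at every offset the way this sketch does.

```python
import hashlib

BLOCK = 8  # tiny block size, chosen only to keep the example readable

def block_signatures(old: bytes) -> dict:
    """Receiver side: map the hash of each full block of the old file to its index."""
    sigs = {}
    for idx in range(len(old) // BLOCK):
        chunk = old[idx * BLOCK:(idx + 1) * BLOCK]
        sigs.setdefault(hashlib.sha256(chunk).digest(), idx)
    return sigs

def make_delta(new: bytes, sigs: dict) -> list:
    """Sender side: describe the new file as block copies plus literal bytes."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + BLOCK]
        idx = sigs.get(hashlib.sha256(chunk).digest()) if len(chunk) == BLOCK else None
        if idx is not None:
            if literal:  # flush any unmatched bytes seen before this block
                delta.append(("literal", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", idx))
            pos += BLOCK
        else:
            literal.append(new[pos])  # no known block starts here; slide by one byte
            pos += 1
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta: list) -> bytes:
    """Receiver side: rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, value in delta:
        out += old[value * BLOCK:(value + 1) * BLOCK] if kind == "copy" else value
    return bytes(out)

old = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD"
new = b"AAAAAAAAxyzBBBBBBBBDDDDDDDDCCCCCCCC"  # 3 bytes inserted, last two blocks swapped
delta = make_delta(new, block_signatures(old))
rebuilt = apply_delta(old, delta)
```

Here only the three inserted bytes travel as literal data. The swapped blocks cost nothing beyond their index numbers, which is why this kind of scheme handles relocations better than a plain diff.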

The important point here is how to measure the actual meaningful difference. It seems to me that looking only at the 5%, and then finding the best compression of the difference data, would give us a representation that has meaning. A fraction formed from the best compression of the difference over the best compression of the useful data is then the way to go.
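A rough sketch of that fraction follows, with loud caveats. zlib is nowhere near the "best" compressor, its 32 KB window means this only works on short sequences, and the way I estimate "the compressed difference" (the extra bytes needed to compress the second sequence once the first has been seen) is my own assumption. The names csize and difference_fraction, and the random test sequences, are invented for the example.

```python
import random
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib as a stand-in for an ideal compressor."""
    return len(zlib.compress(data, 9))

def difference_fraction(reference: bytes, other: bytes) -> float:
    """Extra compressed bytes needed for `other` given `reference`,
    divided by the compressed size of `other` on its own."""
    extra = csize(reference + other) - csize(reference)
    return max(extra, 0) / csize(other)

rng = random.Random(0)
reference = bytes(rng.choice(b"ACGT") for _ in range(2000))

# A close relative: the same sequence with five single-letter changes.
near = bytearray(reference)
for position in rng.sample(range(len(near)), 5):
    near[position] = ord("A") if near[position] != ord("A") else ord("C")
near = bytes(near)

# An unrelated sequence of the same length.
unrelated = bytes(rng.choice(b"ACGT") for _ in range(2000))

near_fraction = difference_fraction(reference, near)
unrelated_fraction = difference_fraction(reference, unrelated)
```

The close relative should come out with a much smaller fraction than the unrelated sequence. Note that zlib knows nothing about reversed blocks, so a reversal would still be charged at full price here; a compressor built for DNA would be needed to treat those fairly.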

I've also heard estimates of how much information our genes represent - and again I don't know if they are looking at both the real data and the junk data in these estimates. Is it 700MB, or compressed to 250MB? Or does this include all the junk? Is there a difference in how the junk genetic information compresses?
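A back-of-the-envelope check helps here. The figure of roughly 3.1 billion base pairs for one copy of the human genome is an assumption I am plugging in, not something from the estimates quoted above; the 5% share is the "useful" portion mentioned earlier on this page.

```python
BASE_PAIRS = 3_100_000_000  # assumed approximate length of one copy of the human genome
BITS_PER_BASE = 2           # four letters (A, C, G, T) need log2(4) = 2 bits each
USEFUL_SHARE = 0.05         # the "5% that counts" from the lecture figures above

raw_bytes = BASE_PAIRS * BITS_PER_BASE // 8
raw_megabytes = raw_bytes / 1_000_000
useful_megabytes = raw_megabytes * USEFUL_SHARE

print(raw_megabytes)     # 775.0
print(useful_megabytes)  # 38.75
```

If that assumed length is about right, an estimate in the neighborhood of 700MB looks like the whole genome at 2 bits per letter, junk included, with no compression at all. That is my inference from the arithmetic, not something the estimates themselves state.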

Update

As if someone had read this page! There was an article in American Scientist, September-October 2007, called "Sorting Out the Genome: To put your genes in order, flip them like pancakes".

Here Brian Hayes talks about genetic inversions (blocks of genes that have flipped end over end) and about figuring out what the actual information content is.

There is still a need for a good book by a mathematician that would nail genetic difference measurements to the wall of understanding.

Meaningful Differences

My understanding is that the only directly useful regions on the DNA are the genes -- every protein that an organism can possibly make is coded by some gene. However, if there are also differences from one species to the next in the genes that encode control information, these will certainly also come into play, particularly as we move further from our own species (for instance, toward cell processes involved in photosynthesis, which are controlled by genes). My understanding is that molecular biologists can now tell, just from the DNA sequence, where each gene starts and stops.

I agree that the "differ by only 1.2 percent" number is a little vague. Since the number of genes humans had was completely unknown before the human genome project published its working draft of the human DNA sequence in 2000, and the "1.2 percent" number was published before that, I'm pretty sure the "1.2 percent" number includes *everything*, both coding genes and non-coding "junk".

I agree that it would be interesting to know, in detail, what exactly is different between chimps and humans. One of many ways to categorize the differences between one cell and another is:

I hear that the Great Ape Genome Project plans to publish some results Real Soon Now. I hope that makes it possible to gain a better understanding of the differences and similarities between humans and others.

Information Theory

It goes without saying that measuring differences involves not only singular comparisons (as in traditional statistics), but also the weighted inclusion of patterns that repeat (as in time series, or future past weighted inclusions from Bayesian analysis). But how might our conclusions change once we begin to arbitrarily decide what is important and what is not? If a particle has binary properties of spin and other quantum states, why not also a unit of meaning? Let's look at the idea of meaning for a second.

According to Shannon, the ratio and intensity of noise to meaningful bits of data set both the maximum possible efficiency of information transmission and the lowest possible error rate in DNA transcription, and these in turn limit the maximum amount of information that can be transmitted in that genome per specimen. Oh, but let us remove those "useless sequences" and see whether, and by how much, we have decreased the entropy! Who among us can know, can see the line clearly, between entropy and order?
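For anyone who wants to try the "remove the useless sequences and look at the entropy" experiment, here is a minimal sketch of the simplest Shannon measure, the entropy per letter computed from letter frequencies alone. The function name is mine, and this zeroth-order measure is an assumption of convenience: it ignores the order of the letters entirely, so it cannot see repeats, reversals or any longer-range structure.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(seq: str) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over the letter frequencies of seq."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

uniform = entropy_bits_per_symbol("ACGT" * 10)  # all four letters equally common
constant = entropy_bits_per_symbol("AAAA")      # only one letter ever appears
skewed = entropy_bits_per_symbol("AAAAAAAC")    # mostly one letter

print(uniform)   # 2.0
print(constant)  # 0.0
```

A sequence with all four letters equally common reaches the 2 bits per letter ceiling, and anything more lopsided falls below it. Whether a stretch of "junk" raises or lowers the figure therefore depends on its letter statistics, not on whether it does anything useful.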

