Introns First

Abstract Knowing how introns originated should greatly enhance our understanding of the information we carry in our DNA. Gilbert's suggestion that introns initially arose to facilitate recombination still stands, though not for the reason he gave. Reanney's alternative, that evolution, from the early "RNA world" to today's DNA-based world, would require the ability to detect and correct errors by recombination, now seems more likely. Consistent with this, introns are richer than exons in the potential to extrude the stem-loop structures needed for the homology search that can lead to heteroduplex formation and the recognition of base mismatches. In nucleic acid sequences that were unable concomitantly to encode sufficient stem-loop potential, protein-encoding potential was constrained to arise as segments (exons) interrupted by segments rich in stem-loop potential (introns). Thus, sequences with properties that we now deem intronic are likely to have preceded the emergence of exons.

Keywords Exons . Introns . RNA world . Stem-loops . Conventional phenotype . Genome phenotype

This paper was presented at "Genome Brno 2," a workshop on "Structural and Functional Diversity of Genomes" held September 2012 at the Augustinian Abbey of St. Thomas, Brno, Czech Republic. The YouTube site hosts the talk. An extended, more elementary version, "Introns and Exons," is also available there as an 18-video course for high school and college students.

A Degree of Clarity

Sometimes it is important to know the order of events. But sometimes this seems an academic exercise. Did giraffes with longer necks better detect approaching predators and then, as a useful by-product, become less vertically challenged in their grazing? Or could it have been the other way round? And, in the unlikely event that we could obtain a definite answer, would it really matter (Wilkinson and Ruxton 2012)? For introns and exons it would matter. In the likely event that we will obtain a definite answer to whether, in the general case, one preceded the other, I show here that we will have greatly enhanced our understanding of the information we carry in our DNA. For the general case, the alternatives are easily set out: Segments of DNA with properties that we would now deem as intronic, and segments of DNA with properties that we would now deem exonic, arose simultaneously. On the other hand, exons might have preceded introns, or introns might have preceded exons. And if exons were first, did introns arise early, or late, in evolutionary time? Likewise, if introns were first, did exons arise early, or late?

Before these questions could be asked, exons and introns had to be discovered. Prior to this, genomes had been neatly divided into genic (usually protein-encoding) and non-genic sectors. Then in 1977 the unexpected discovery that genes were "split" or "interrupted" by non-protein-encoding segments led to the coinings "exon" and "intron," and a flurry of speculation as to intron origins (Gilbert 1978; Reanney 1978, 1979; Darnell 1978; Crick 1979). It was easy to think of introns as "informationally irrelevant" (Doolittle 1978), an idea consistent with the view that our genomes were laden with "junk DNA," perhaps with devilishly selfish intent (Orgel et al. 1980). As facts and arguments contended with counterfacts and counterarguments, the fortunes of various hypotheses waxed and waned over succeeding decades. However, some in the field recently pronounced that at last "a degree of clarity has been reached in the study of the evolution of eukaryotic gene architecture". While not attempting "a comprehensive coverage," yet dealing with "several aspects that appear directly relevant for understanding evolution of introns and eukaryote gene structure," they concluded that "an introns first scenario is not supported by any evidence" (Rogozin et al. 2012). I agree that some clarity has been reached, but it is a clarity that allows us merely to begin sorting out the various hypotheses in the light of the accumulated evidence. I will argue here that at least one version of "introns first," which is featured in two textbooks (Forsdyke 2011a, 2011b), is still very much in contention.

Technology-Driven Discovery

Introns emerged with the systematic application of new technology. Just as the phenomenon now known as genetic linkage emerged around 1900 when Bateson and Saunders applied Mendelian technology (brother-sister matings through the generations) and found certain characters to be "partially coupled," or "coupled," rather than independently inherited (Cock and Forsdyke 2008), so the intron phenomenon emerged in 1977 when electron microscopy was applied to the visualization of R-loops in hybridized DNA molecules (Witkowski 1988; Morange 1998). This was soon supported by another new technology - DNA sequencing (Gilbert 1981). Here we are concerned with yet another new technology, the computer calculation of optimum secondary structure in single-stranded nucleic acid sequences, which can be represented both pictorially and as a stability number (expressed in negative kilocalories per mol.; Le and Maizel 1989; Zuker 1990). For such structures there are three determinants - the types and order of bases, and sequence length. If, for purposes of comparison, lengths are kept constant, then there are just two determinants, base composition and order. These can be considered independently as base composition-dependent and base order-dependent stabilities, which sum together to give the total stability (Forsdyke 2007a). The base order-dependent component can be determined by subtracting the base composition-dependent component from the total stability value (Fig. 1). Before moving to the application of this technology, I first review various ideas that followed the discovery of introns.

Fig 1 Determination of base order-dependent fold potential. A natural sequence (horizontal line at left) is computer folded by successive reiteration to obtain a structure of stability -30 kcal/mol (total fold potential). Base order is randomized to produce ten shuffled sequences that share with the natural sequence only their base composition. These are each folded to obtain corresponding stability values. Idiosyncrasies due to each distinctive base order are averaged out (at right) to determine the contribution of base composition (-20 kcal/mol) to the total fold potential of the natural sequence. The contribution of base order to the total fold potential of the natural sequence (here -10 kcal/mol) is determined by subtraction.
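The subtraction described in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published folding procedure: `toy_stability` is a hypothetical stand-in for a real folding program (it merely scores pairing between the two halves of a sequence with made-up energies), and the function and parameter names are invented here. Only the decomposition itself, total minus the mean of the shuffled values, follows the text.

```python
import random
from statistics import mean
from typing import Callable, Tuple

# Made-up pairing energies (more negative = more stable); illustration only.
PAIR_ENERGY = {("G", "C"): -3.0, ("C", "G"): -3.0,
               ("A", "T"): -2.0, ("T", "A"): -2.0}


def toy_stability(seq: str) -> float:
    """Hypothetical stand-in for a folding program.

    Pairs base i with base n-1-i (a single hairpin) and sums the toy
    energies. A real analysis would call a thermodynamic folding tool.
    """
    n = len(seq)
    return sum(PAIR_ENERGY.get((seq[i], seq[n - 1 - i]), 0.0)
               for i in range(n // 2))


def fold_components(seq: str,
                    fold: Callable[[str], float],
                    n_shuffles: int = 10,
                    seed: int = 0) -> Tuple[float, float, float]:
    """Return (total, base composition-dependent, base order-dependent)."""
    rng = random.Random(seed)
    total = fold(seq)
    shuffled_values = []
    for _ in range(n_shuffles):
        bases = list(seq)
        rng.shuffle(bases)          # keeps composition, destroys order
        shuffled_values.append(fold("".join(bases)))
    composition = mean(shuffled_values)
    order = total - composition     # the subtraction shown in Fig. 1
    return total, composition, order


if __name__ == "__main__":
    hairpin = "TACGACGC" + "TTTT" + "GCGTCGTA"
    print(fold_components(hairpin, toy_stability))
```

With the values quoted in the caption, the same arithmetic gives -30 - (-20) = -10 kcal/mol for the base order-dependent component. Passing the folding routine in as an argument keeps the decomposition independent of whichever folding program is actually used.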

Positive Role in Recombination

Early in 1978, within a few months of their discovery, one of the sequencing pioneers, Walter Gilbert, suggested that introns had arisen to facilitate recombination (Gilbert 1978). As had been pointed out by Thomas Hunt Morgan (1911), the more closely coupled are two genes, the less likely will their coupling be broken by recombinational shuffling. Gilbert suggested that if the length of an individual gene were increased by virtue of its introns, the probability of recombination between different parts of that gene would be increased. By creating new combinations of parts of the gene product (e.g. the shuffling of protein domains), the number of variant forms offered to natural selection would be increased. This could speed up evolution if variation, not selection for that variation, was rate-limiting.

Gilbert went further to suggest that an organism that modified the sequence of its introns to favor recombination would be at a selective advantage: "Middle repetitious sequences within introns may create hot spots for recombination to rearrange the exonic sequences." At that time a view similar to Gilbert's, involving conservation of intron sequences with the potential to form stem-loop structures that would engage in "topological reshufflings," was advanced by Darryl Reanney (1978). But, noting the high mutation rate in the intron sequences then available, Gilbert soon shifted position declaring that "it is not their sequence that is relevant, but their length. Their function is to move the exons apart along the chromosome" (Gilbert 1981). Some, however, doubted Nature's prescience in creating introns in the hope of opportunities for recombination many generations later. It seemed necessary that introns should have offered some more proximate advantage (Doolittle 1978; Crick 1979).

Negative Role of Recombination

The seemingly high mutation rate in introns prompted Philip Leder and his colleagues in 1978 to propose that introns would speed up evolution, not by enhancing, but by preventing recombination (Tiemeier et al. 1978). Supposing an adaptive advantage for the generation of globin gene duplicates within an organism, they noted that: "Immediately after the original duplication event, it is probable that extensive homology existed between these gene segments." Since recombination required close sequence identity, variation in introns "would reduce the target size for possible recombination and serve to stabilize or fix the two - - globin gene copies." Noting that the "free and easy nucleotide substitution that occurs in introns should serve as a buffer against mispairing," a respected arbiter, Russell Doolittle (1985), came to agree that, whether introns supported recombination and "exon shuffling" as proposed by Gilbert, or decreased recombination as proposed by Leder, it would be much to the advantage of an organism to have them. However, the question as to whether one role or the other (or neither) had been instrumental in the actual origin of introns was left unanswered.

Special Cases

There was much excitement when the various domains of immunoglobulins seemed in accord with the Gilbert hypothesis, introns being located at domain boundaries (Robertson 1977). But as more protein sequences were examined, immunoglobulins came to be viewed as special cases (Crick 1979). And even though Gilbert (1978) had drawn attention to the fact that "genes with no protein product, such as the tRNA genes in yeast and the rRNA genes in Drosophila," contained introns, at first it was easy also to dismiss non-protein-encoding genes as special cases, despite the finding that the 5' and 3' non-coding regions of some protein-encoding genes contained introns (Crick 1979). When, in the 1990s, long non-coding RNAs laden with introns became evident (Pfeifer and Tilghman 1994), such dismissal seemed less valid. Those who had been trying to relate intron positions to boundaries between protein domains had been following a false trail. And the notion of introns as places for "free and easy nucleotide substitution" was also losing ground. Walter Schaffner and his colleagues pointed to the "long known paradox - - that most introns are preserved even though their actual sequence hardly seems to matter" (Matsuo et al. 1994).

Exons as Another Line of Defence

Supporting Leder's case for a negative role of introns in recombination, Schaffner noted that "even the few sequence mismatches in introns that typically occur between different strains can dramatically lower the efficiency of homologous recombination" (my italics). Thus, in agreement with Leder's "homology interruption hypothesis," he proposed, from studies of the POU domain transcription factor gene family, that there would be a "homology-reducing effect of divergent introns". But Schaffner then went much further (Matsuo et al. 1994). If it was important to decrease recombination, why leave it to introns? Exons might help: "The frequency of homologous recombination among POU domain genes could be reduced not only by interrupting exons with introns, but also by minimizing sequence identity within exons." And would this exon identity-minimization be random, or would some base changes be more effective than others? While introns might represent "a first barrier" against homologous recombination between members of gene families, another "line of defense" would be "the overall sequence composition and especially synonymous codon choice" [my italics].

Exons and Introns Defend by Changing GC%

On checking actual sequences, the Schaffner group found that, while the amino acid sequences of the various POU domains (in genes encoding Oct-1, Oct-2, and Pit-1) tended to remain identical, the corresponding exon sequences were "quite different." This meant that, rather than changes in bases essential for specifying amino acids (first and second codon positions), it was changes in the remaining sequence (third codon positions) that would lower recombination efficiency. They observed that: "The G+C content of the Oct-2 POU domain DNA is high, while that of the Oct-1 POU domain is intermediate," and "the Pit-1 POU domain is A+T-rich." Thus they related failure of these genes to recombine with each other, to switches between synonymous codons that would change the GC% (i.e. changes at third positions). Since the GC% values of synonymous codon positions resembled that of introns (D'Onofrio et al. 1991; Vinogradov 2001), then failure to recombine would also associate with intronic GC% values. In other words, both the first and second "lines of defence" might be using the same weapon against recombination - differences in GC%. By the same token, similarities in GC% might favor recombination. But how GC% values might affect recombination remained to be explained.
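The distinction between overall GC% and GC% at third codon positions can be made concrete with a short Python sketch. The two coding sequences below are invented for illustration (they are not POU domain sequences); both encode the same peptide, Leu-Pro-Ala-Gly, using different synonymous codons. The function names are likewise assumptions introduced here.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc3_percent(coding_seq: str) -> float:
    """GC% restricted to third codon positions of an in-frame coding sequence."""
    if len(coding_seq) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    return gc_percent(coding_seq[2::3])   # bases 3, 6, 9, ...


if __name__ == "__main__":
    # Same peptide (Leu-Pro-Ala-Gly), different synonymous codon choices.
    gc_rich = "CTGCCGGCGGGC"   # CTG CCG GCG GGC
    at_rich = "CTTCCAGCAGGT"   # CTT CCA GCA GGT
    for name, seq in (("GC-rich", gc_rich), ("AT-rich", at_rich)):
        print(name, round(gc_percent(seq), 1), gc3_percent(seq))
```

In this toy case the encoded peptide is identical, yet the third-position GC% swings from 100 to 0, which is the kind of synonymous divergence the Schaffner group related to failure of recombination.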

Introns First

As usually employed, the terms "introns early" and "introns late" refer to whether exons acquired introns early, or late, in evolution, with the implication that exons arose simultaneously with, or preceded, introns. Although it can be considered a subset of "introns early," the term "introns first" implies that segments of DNA with properties that we would now deem intronic preceded exons. In other words, nucleic acid sequences were to some degree, and perhaps entirely, intronic in nature, and segments became actually defined as intronic when they acquired exon borders (Penny et al. 2009).

The "introns first" idea was advanced by Darryl Reanney (1979). He portrayed the RNA splicing required to remove intron segments from primary RNA transcripts as a hold-over from early life forms in an "RNA world." This preceded the evolution of modern forms where the DNA 'legislature' (information source) is largely dissociated from the protein 'executive,' which is specified by that information. In the RNA world there were no proteins, and RNA molecules were their own legislature and executive. Indeed, Reanney saw that the "'mosaic' RNAs produced by splicing are strictu sensu recombinant molecules in that they contain data drawn from different parts of the genome. - - It seems logical therefore to suggest that RNA:RNA splicing is the primitive mode of genetic recombination." Pointing to an important role of "previous folding" in interactions between separate single-stranded RNAs, Reanney deduced that "topology must have had a key role in the selection of the splice site - - explicable if nucleotides near the site are required to be unpaired in order to provide a recognition mechanism through base pairing with an independent RNA."

Proof-Reading

At that time agreeing with Gilbert on the adaptive advantages of segment shuffling (i.e. a function that increased variation), Reanney went further to suggest that "this type of recombination developed from the proofreading function which seems to be a universal correlate of DNA synthesis" (i.e. a function that decreased variation). Thus today's "generalized recombination could be regarded as an extension of the proofreading function from which it evolved." However, as the "introns early" case (Darnell 1978; Doolittle 1978) grew stronger, Reanney (1984) took a firmer position on the importance of decreasing variation to the origin of introns.

If proof-reading was so necessary in the present DNA world, then it might have been even more necessary in the earlier RNA world. In this world all "genes" would have been part of error-prone RNA molecules and their evolution would have depended on the parallel development of mechanisms for detecting and correcting errors. To this extent, it could be said that "genes" and introns arose hand-in-hand. But, in the context of the later-appearing protein-encoding genes, the scenario can be better described as "introns first" (Penny et al. 2009). It seems likely that prior development of sophisticated error detection and correction capacities would have been critical for genomes to evolve protein-encoding capacity.

Positive Role in Error-Detection

Noting that the error-free transmission of electronic information requires the interruption of message sequences by non-message, error-detecting, sequences, which operate by parity-check mechanisms (Hamming 1980), I suggested a parity check mechanism for the operation of error-detecting sequences in introns (Forsdyke 1981). Although gaining no clear support from subsequent studies (Liebovitch et al. 1996; Battail 2007; Faria et al. 2012), four key postulates still seem valid:

1. Introns contribute positively to the hereditary transmission of error-free genetic information.
2. The structure of single-stranded DNA is involved.
3. Some form of parity between bases is involved.
4. The error-detecting function affects DNA pervasively at regular intervals - just as breathing interrupts human speech, or as adverts ("pauses for station identification") interrupt TV programs.

These postulates provided the basis for a recombination-dependent error-checking mechanism, the conception of which began with the unearthing of Chargaff's, long-forgotten, second parity rule.
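For readers unfamiliar with the electronic analogy, a single even-parity bit is the simplest parity check. The Python sketch below shows only that engineering idea; it makes no claim about how a DNA-based check would operate, and the function names are assumptions introduced here.

```python
from typing import List


def add_parity_bit(message: List[int]) -> List[int]:
    """Append one non-message bit so the total number of 1s is even."""
    return message + [sum(message) % 2]


def parity_ok(word: List[int]) -> bool:
    """True if the received word still has an even number of 1s."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    sent = add_parity_bit([1, 0, 1, 1])
    print(sent, parity_ok(sent))          # intact word passes the check

    corrupted = sent.copy()
    corrupted[2] ^= 1                     # flip a single bit in transit
    print(corrupted, parity_ok(corrupted))
```

A single parity bit reveals that an error has occurred but not where it lies, which is why identifying the correct copy becomes a separate problem.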

Chargaff's Second Parity Rule

Chargaff's first parity rule provided a basis for the Watson-Crick structure for duplex DNA, namely that there was parity between purines on one strand and pyrimidines in the other - base A pairing with base T, and base G pairing with base C (Chargaff 1951). By the same token, the dinucleotide AG on one strand would pair with the dinucleotide CT on the antiparallel complementary strand - a rule that applied to all 16 possible dinucleotides. Similarly, trinucleotide AGT would pair with complementary trinucleotide ACT on the opposite strand of a duplex, and this would apply for all 32 pairs of possible trinucleotides. This numerical parity between oligonucleotides on complementary strands followed naturally from the structure of duplex DNA.
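The oligonucleotide pairings quoted above follow from taking the reverse complement, since the two strands of a duplex are antiparallel. A short Python sketch (function names are assumptions introduced here) reproduces the AG/CT and AGT/ACT examples and counts the 32 complementary trinucleotide pairs.

```python
from itertools import product
from typing import FrozenSet, Set

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(oligo: str) -> str:
    """Sequence of the antiparallel partner strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(oligo))


def complementary_pairs(k: int) -> Set[FrozenSet[str]]:
    """Group all 4**k oligonucleotides of length k with their partners."""
    pairs = set()
    for bases in product("ACGT", repeat=k):
        oligo = "".join(bases)
        pairs.add(frozenset((oligo, reverse_complement(oligo))))
    return pairs


if __name__ == "__main__":
    print(reverse_complement("AG"))       # CT
    print(reverse_complement("AGT"))      # ACT
    print(len(complementary_pairs(3)))    # 32
```

No trinucleotide is its own reverse complement, because the middle base would have to pair with itself, so the 64 trinucleotides fall into exactly 32 pairs.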

Chargaff's second parity rule was that the first parity rule equivalences also apply pervasively, although not quite so precisely, to DNA single strands (Rudner et al. 1968). As with the first rule, the equivalences also extended to oligonucleotides (Prabhu 1993). An implication of this was that, single stranded DNA, although normally forming part of a duplex, could have an independent potential to form stem-loop structures that, at least in the stems, would exhibit parity between complementary bases and oligonucleotides (Fig. 2).

Fig 2 The potential of a DNA duplex (top) to reversibly extrude stem-loops (bottom) when complementary oligonucleotide sequences are both present in equal quantities on one strand (Chargaff's second parity rule), and closely located. Here the octonucleotide TACGACGC in the top strand, complements the octonucleotide GCGTCGTA in the same strand, to form the stem in an extruded secondary structure. A symmetrical stem-loop is formed by corresponding octonucleotides in the bottom strand
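One way to examine the second parity rule on a single strand is to compare the count of each oligonucleotide with the count of its reverse complement on that same strand. The Python sketch below is illustrative only: the deviation measure and the function names are assumptions introduced here, not a published statistic. It also confirms the octonucleotide pairing stated in Fig. 2.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo: str) -> str:
    """Reverse complement of a DNA string written 5' to 3'."""
    return oligo.translate(_COMPLEMENT)[::-1]


def kmer_counts(strand: str, k: int) -> Counter:
    """Counts of all overlapping oligonucleotides of length k."""
    return Counter(strand[i:i + k] for i in range(len(strand) - k + 1))


def parity_deviation(strand: str, k: int) -> float:
    """0.0 when every k-mer is as frequent as its reverse complement on
    the same strand; 1.0 when no k-mer has any partner at all."""
    counts = kmer_counts(strand, k)
    total = sum(counts.values())
    keys = set(counts) | {reverse_complement(o) for o in counts}
    diff = sum(abs(counts[o] - counts[reverse_complement(o)]) for o in keys)
    return diff / (2 * total)   # each complementary pair was visited twice


if __name__ == "__main__":
    stem_5p = "TACGACGC"
    print(reverse_complement(stem_5p))            # GCGTCGTA, as in Fig. 2
    strand = stem_5p + reverse_complement(stem_5p)
    print(parity_deviation(strand, 1))            # 0.0: A=T and G=C
    print(parity_deviation("AAAA", 1))            # 1.0: no parity
```

A strand that carries both halves of a stem necessarily shows base parity in that region, which is the link between the second parity rule and stem-loop potential drawn in the text.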

Stem-Loops and Introns

A new technology - computer-aided structure determination (Fig. 1) - facilitated the demonstration that the potential for the extrusion of stem-loop structures from duplex DNA ("fold potential") was pervasively distributed along the DNA molecules of numerous biological species (Forsdyke 1995a-c; 1996). Furthermore, when decomposed into base order-dependent and base composition-dependent components, it was evident that fluctuations in fold potential were largely due to the base order-dependent component (Zhang et al. 2008a; Fig. 3). The latter provided a powerful means of analyzing the distribution of fold potential between introns and exons.

FORS-D Analysis of DNA segment of Nematode worm

Fig 3 Potential stabilities of secondary structures extruded from duplex DNA, as assessed for a 200 base window moved in 50 base steps along the sequence. Total stability values (A) decompose into base order-dependent components (B) and base composition-dependent components (C). The distribution of values for the top strand (blue line) and bottom strand (red line) closely correspond. This is a 40 kb segment of chromosome I of C. elegans (nucleotides 2500 to 42500)
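The windowing scheme in Fig. 3 (a 200 base window moved in 50 base steps) can be sketched as follows. The function names are assumptions introduced here, and `score` stands for any per-window measure, such as a fold stability value supplied by a folding program.

```python
from typing import Callable, List, Tuple


def window_starts(seq_len: int, window: int = 200, step: int = 50) -> List[int]:
    """Zero-based start positions of every full window along a sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return list(range(0, seq_len - window + 1, step))


def scan(seq: str,
         score: Callable[[str], float],
         window: int = 200,
         step: int = 50) -> List[Tuple[int, float]]:
    """Apply a scoring function to each window; returns (start, score) pairs."""
    return [(start, score(seq[start:start + window]))
            for start in window_starts(len(seq), window, step)]


if __name__ == "__main__":
    # Assuming a segment of exactly 40,000 bases:
    starts = window_starts(40_000)
    print(len(starts), starts[0], starts[-1])
```

Under that assumed length, the scan yields 797 overlapping windows, each sharing 150 bases with its neighbour, which is why the plotted profiles vary smoothly.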

That fold potential was much greater in introns was particularly evident in the case of genes under positive Darwinian selection, where introns could be more conserved than exons (Fig. 4); but high intronic fold potential could also be demonstrated in other genes (Forsdyke 1996; Dawson and Yamamoto 1999; Bechtel et al. 2008).

Snake Venom Exons and Introns. Fold Analysis

Fig 4 In a gene under positive Darwinian selection, conserved introns harbor enhanced stem-loop potential, assessed as stability of secondary structure. Conversely, there is high base substitution frequency (continuous black line) and low base order-dependent fold potential (blue line with triangles) in exons. The numbered grey boxes indicate the locations of the four exons of the rattlesnake venom gene encoding a basic subunit of venom phospholipase A₂; dashed vertical lines show, consecutively, the beginning of exon 1, the beginning of the protein-coding part of exon 1, the end of the protein-coding part of exon 4, and the end of exon 4. Values were determined for a 200 base window moving in 50 base steps along the sequence. Substitutions are base differences relative to the rattlesnake phospholipase A₂ acidic subunit gene. The two genes are likely to have arisen from a common ancestral gene. A similar result was obtained when the comparison was with Habu snake phospholipase A₂

The greater ability of introns to order bases to support the extrusion of stem-loop structures from duplex DNA was readily rationalized in terms of conflicting pressures. In exons the pressure to order bases for stem-loop potential ("fold pressure") would conflict with the pressure to encode amino acids ("protein pressure"). Third codon positions, being less subject to protein pressure, could be seen as "mini-introns" that would allow some base ordering to support fold potential in exons.

Stem-Loops and Recombination

To initiate legitimate recombination - homologous recombination - between nucleic acids, there must be pairing between complementary, or closely complementary, sequences. Crick (1971) proposed that, for this to occur between two duplex DNA molecules, the strands in each duplex would need to unpair locally so that they could test each other for complementarity. But the finding of Jun-ichi Tomizawa that recombining complementary single-stranded RNA molecules first interact by reversible "kissing" between the loops of stem-loop structures (Tomizawa 1984) prompted Kleckner and Weiner (1993) to suggest that the locally unpaired single-stranded DNA molecules would also interact by way of loop-loop interactions. Since recombination was a genome-wide characteristic of DNA, this was consistent with the observed pervasive distribution of stem-loop potential throughout genomes. This pervasive pressure would account for the need for protein-encoding capacity to arise in segments (exons) interrupted by segments with high stem-loop potential. Much evidence supporting the role of stem-loops in generalized recombination has since accumulated (Forsdyke 2007b). But how would recombination allow error-detection and correction?

Recombination and Error-Correction

Reanney (1979) stated:

"Generalised recombination - - developed from the proofreading function which seems to be a universal correlate of DNA synthesis. The proteins needed for generalised recombination could easily have been recruited from the molecules used in the cut and patch mechanism(s) upon which proofreading depends. - - Significantly, branch migration and strand isomerization show up base mismatches as transient heteroduplex lesions. Since these lesions can be recognized and repaired, generalised recombination could be regarded as an extension of the proofreading function from which it evolved. Preferential correction to wildtype rather than mutant genotype can be achieved - - by enzymic recognition of base mismatches in non-methylated (that is, newly made) strands. - - Thus, reciprocal recombination between equal length DNAs may, in part, be a mechanism for minimising variation, not promoting it."

Here Reanney is describing error-detection and correction by the process which, when it occurs in a genic region, is referred to as gene conversion - the directional transfer of information from a gene on one chromosome to that on another (Yang et al. 2012). What Reanney called "preferential correction to wildtype," was a topic of my first introns paper (Forsdyke 1981). If there is an error in a text, you want the error to be noted and corrected, not compounded. In other words, if you are comparing two lines of text (or two strands of DNA) you want to know, not merely that there has been an error, but which line is the wrong line and which line is the correct line. Information in the latter is used to correct the former.

Reanney pointed to strand marking by methylation. Instead of erasure, some methyl marks can persist transgenerationally. So DNA, in computer jargon, is not just "read-only memory", but "read-and-write memory," with the writing persisting for at least a few generations. A modification of this epigenetic marking was suggested by Virgil Reese (2002). He noted that the cell either "knows" which strand is incorrect, or is uncertain. In the latter case, it can mark strands as "suspicious" by methylation. Sometimes the methyl mark can be carried through to future generations where the suspiciously marked strand may find itself paired with a non-suspiciously marked strand. Correction from the latter can then be implemented.

The trouble with this, as noted with some intron hypotheses, is that Nature has to do something with no adaptive advantage in one generation in the hope that it will be useful to a future generation. Does Nature have such foresight? Sometimes something useful in one generation can be adapted for another role in a future generation. In the nineteenth century Samuel Butler (1926) noted: "I have gone out sketching and forgotten my water-dipper; among my traps I always find something that will do, for example, the top of my tin case (for holding pencils). This is how organs come to change their uses and hence their forms, or at any rate partly how." Today we make the same point with the "spandrels" metaphor (Gould 1993). Once introns were in existence, there was indeed ample opportunity for them to assume other roles, such as domain shuffling (Gilbert 1978), harboring regulatory and "selfish" elements (Orgel et al. 1980), preventing recombination (Tiemeier et al. 1978) and developing certain asymmetries between top and bottom strands that violate Chargaff's second parity rule (Forsdyke and Bell 2004; Zhang et al. 2008b). We can also note that methylation predisposes a strand to exchange a T residue for a C residue. Thus, an initial transient epigenetic event has the potential to influence our genomes more permanently. The "writing" became indelible.

GC% Differences Affect Stem-Loop Extrusion

The ability of introns to defend against recombination was seen by the Leder and Schaffner groups as a way of preserving paralogous genes within members of a species. For this Schaffner pointed to a role of differences in base composition - GC% - but left the mechanism unexplained. Along the lines of Le and Maizel (1989), it has been shown that the structure of extruded stem-loops would be sensitive to very small differences in GC% (Forsdyke 2007b, 2011c). This should suffice to prevent recombination. By the same token, it has been argued that base composition differences between two members of a species would serve to prevent the meiotic pairing of their chromosomes in the gonads of their offspring, so enforcing their reproductive isolation - an isolation that could lead to branching speciation (Fig. 5). Thus, seeking to know how introns originated helps us today to approach Darwin's great question - how do species originate?

Difference between within-species and attempted between-species meiosis

Fig 5 Similar extruded stem-loop structures when GC percentage values are equal (left), favor strand pairing, and hence, favor recombination. Dissimilar extruded stem-loop structures when GC percentage values are not equal (right), prevent strand pairing, and hence, prevent recombination. The potentially homologous DNA duplexes, of either paternal (P) or maternal (M) origin, would attempt to pair during meiosis in the gonad of their child (hybrid). The hybrid is either fertile (left) or sterile (right)

Acknowledgement
Queen's University hosts my intron webpages.

References

Battail G (2007) Information theory and error-correcting codes in genetics and biological evolution. In: Barbieri M (ed) Introduction to biosemiotics. The new biological synthesis. Springer, Dordrecht, pp 299-345

Bechtel JM, Wittenschlaeger T, Dwyer T, Song J, Arunachalam S, Ramakrishnan SK, Shepard S, Fedorov A (2008) Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structures. BMC Genomics 9:284 doi:10.1186/1471-2164-9-284

Butler S (1926) The Shrewsbury edition of the works of Samuel Butler. Jones HF, Bartholomew AT (eds) Vol. 20, Jonathan Cape, London, p. 13

Chargaff E (1951) Structure and function of nucleic acids as cell constituents. Fed Proc 10:654-659

Cock AG, Forsdyke DR (2008) Treasure your exceptions. The science and life of William Bateson. Springer, New York, pp 339-377

Crick F (1971) General model for chromosomes of higher organisms. Nature 234:25-27

Crick F (1979) Split genes and RNA splicing. Science 204:264-271

Darnell JE (1978) Implications of RNA-RNA splicing in evolution of eukaryotic cells. Science 202:1257-1260

Dawson WK, Yamamoto K (1999) Mean free energy topology for nucleotide sequences of varying composition on secondary structure calculations. J Theor Biol 201:113-140

D'Onofrio G, Mouchiroud D, Aïssani B, Gautier C, Bernardi G (1991) Correlations between the compositional properties of human genes, codon usage, and amino acid composition of proteins. J Mol Evol 32:504-510

Doolittle RF (1985) The genealogy of some recently evolved vertebrate proteins. Trends Biochem Sci 10:233-237

Doolittle WF (1978) Genes in pieces: were they ever together? Nature 272:581-582

Faria LCB, Rocha ASL, Kleinschmidt JH, Silva-Filho MC, Bim E, Herai RH, Yamagishi MEB, Palazzo R (2012) Is a genome a codeword of an error-correcting code? PLOS One 7(5):e36644 doi:10.1371/journal.pone.0036644

Forsdyke DR (1981) Are introns in-series error detecting sequences? J Theor Biol 93:861-866

Forsdyke DR (1995a) A stem-loop "kissing" model for the initiation of recombination and the origin of introns. Mol Biol Evol 12:949-958

Forsdyke DR (1995b) Conservation of stem-loop potential in introns of snake venom phospholipase A₂ genes: an application of FORS-D analysis. Mol Biol Evol 12:1157-1165

Forsdyke DR (1995c) Relative roles of primary sequence and (G+C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J Mol Evol 41:573-581

Forsdyke DR (1996) Different biological species "broadcast" their DNAs at different (G+C)% "wavelengths". J Theor Biol 178:405-417

Forsdyke DR (2007a) Calculation of folding energies of single-stranded nucleic acid sequences: conceptual issues. J Theor Biol 248:745-753

Forsdyke DR (2007b) Molecular sex: the importance of base composition rather than homology when nucleic acids hybridize. J Theor Biol 249:325-330

Forsdyke DR (2011a) The interrupted gene. In: Lewin's genes X. Krebs JE, Goldstein ES, Kilpatrick ST (eds) Jones and Bartlett, Boston, pp. 79-97, 172-175

Forsdyke DR (2011b) Evolutionary bioinformatics, 2nd edition. Springer, New York, pp. 249-266

Forsdyke DR (2011c) The selfish gene revisited: reconciliation of Williams-Dawkins and conventional definitions. Biol Theor 5:246-255

Forsdyke DR, Bell SJ (2004) Purine-loading, stem-loops, and Chargaff's second parity rule: a discussion of the application of elementary principles to early chemical observations. Applied Bioinformatics 3:3-8

Gilbert W (1978) Why genes in pieces? Nature 271:501

Gilbert W (1981) DNA sequencing and gene structure. Science 214:1305-1312

Gould SJ (1993) Fulfilling the spandrels of world and mind. In: Understanding scientific prose. Selzer J (ed) University of Wisconsin Press, Madison, pp. 310-336

Hamming RW (1980) Coding and information theory. Prentice-Hall, Englewood Cliffs

Kleckner N, Weiner BM (1993) Potential advantages of unstable interactions for pairing of chromosomes in meiotic, somatic and premeiotic cells. Cold Spring Harb Symp Quant Biol 58:553-565

Le S-Y, Maizel JV (1989) A method for assessing the statistical significance of RNA folding. J Theor Biol 138:495-510

Liebovitch LS, Tao Y, Todorov AT, Levine L (1996) Is there an error-correcting code in the base sequence of DNA? Biophys J 71:1539-1544

Matsuo K, Clay O, Kunzler P, Georgiev O, Urbanek P, Schaffner W (1994) Short introns interrupting the Oct-2 POU domain may prevent recombination between POU family genes without interfering with potential POU domain 'shuffling' in evolution. Biol Chem Hoppe-Seyler 375:675-683

Morange M (1998) A history of molecular biology. Harvard University Press, Cambridge MA, pp 204-214

Morgan TH (1911) Random segregation versus coupling in Mendelian inheritance. Science 34:384

Orgel LE, Crick FHC, Sapienza C (1980) Selfish DNA. Nature 288:645-646

Penny D, Hoeppner MP, Poole AM, Jeffares DC (2009) An overview of the intron-first theory. J Mol Evol 69:527-540

Pfeifer K, Tilghman SM (1994) Allele-specific gene expression in mammals: the curious case of imprinted RNAs. Genes Devel 8:1867-1874

Prabhu VV (1993) Symmetry observations in long nucleotide sequences. Nucleic Acids Res 21:2797-2800

Reanney DC (1978) Noncoding sequences in adaptive genetics. In: Fox CF, Todaro GJ, Stevens JG (eds) Persistent viruses. Proceedings of the 1978 ICN-UCLA symposium on molecular and cellular biology held in Keystone, Colorado, February 1978. Academic Press, New York, pp. 311-330

Reanney DC (1979) RNA splicing and polynucleotide evolution. Nature 277:598-600

Reanney DC (1984) RNA splicing as an error-screening mechanism. J Theor Biol 110:315-321

Reese V (2002) Mutation repair: a proposed mechanism that would enable complex genomes to better resist mutational entropy, and which suggests a novel function for meiosis. The Human Behavior and Evolution Society 14th Annual Meeting, Rutgers University. Abstracts of presentations to session on "New Developments in Biology," June 21, p. 40

Robertson M (1977) Immunoglobulin genes and the immune response. Nature 269:648-650

Rogozin IB, Carmel L, Csuros M, Koonin EV (2012) Origin and evolution of spliceosomal introns. Biology Direct 7:11. doi:10.1186/1745-6150-7-11

Rudner R, Karkas JD, Chargaff E (1968) Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proc Natl Acad Sci USA 60:921-922

Tiemeier DC, Tilghman SM, Polsky FI, Seidman JG, Leder A, Edgell MH, Leder P (1978) A comparison of two cloned mouse β-globin genes and their surrounding and intervening sequences. Cell 14:237-245

Tomizawa J (1984) Control of ColE1 plasmid replication: the process of binding of RNA I to the primer transcript. Cell 38:861-870

Vinogradov AE (2001) Within-intron correlation with base composition of adjacent exons in different genomes. Gene 276:143-151

Wilkinson DM, Ruxton GD (2012) Understanding selection for long necks in different taxa. Biol Rev 87:616-630

Witkowski JA (1988) The discovery of 'split' genes: a scientific revolution. Trends Biochem Sci 13:110-113

Yang S, Yuan Y, Wang L, Li J, Wang W, Liu H, Chen J-Q, Hurst LD, Tian D (2012) Great majority of recombination events in Arabidopsis are gene conversion events. Proc Natl Acad Sci USA 109:20992-20997

Zhang C, Xu S, Wei J-F, Forsdyke DR (2008a) Microsatellites that violate Chargaff's second parity rule have base order-dependent asymmetries in the folding energies of complementary DNA strands and may not drive speciation. J Theor Biol 254:168-177

Zhang C, Li W-H, Krainer AR, Zhang MQ (2008b) RNA landscape of evolution for optimal exon and intron discrimination. Proc Natl Acad Sci USA 105:5797-5802

Zuker M (1990) Prediction of optimal and suboptimal secondary structure for RNA. Meth Enzymol 183:281-306

End Note (Feb 2013) Help from handling editor

So much of the early intron literature having been in Nature, this paper was first submitted there (17 May 2012), but was declined for review (22 May 2012). It was then submitted to Biological Theory, where there were initially two conflicting anonymous reviews, one for and one against. The handling editor then tried, with some difficulty, to find new reviewers. On consulting those who might be able to advise on possible reviewers, he later commented that: "In the process I discovered a lot about the sociology of your field!" Eventually the paper was assigned to "someone who is a trustworthy authority on introns, who has seen your reply to the original reviewers, and whom the editors of the journal who were involved assigned as the "final arbiter" in the case." Happily the "final arbiter," whose review was in German, was positive. I hope most who read the paper will agree with his/her judgement.

End Note (Nov 2014) Are introns a burden?

Sadly we learn of the unexpected death of the above handling editor (Werner Callebaut, born in 1952), in early November 2014.

In March 2014 BMC Evolutionary Biology published an interesting paper on the topic of "intron burden" (Gorlova et al. 2014). The view that human introns are a "burden" was held to be supported by a negative correlation between gene expression and total intron size or number within a gene, with high expression being deemed as something positive and low expression being deemed as something negative. But, even if these premises are accepted, the curves are bimodal. Figure 3a of the paper, for example, shows that as intron number increases from zero to three, the expression level of the corresponding gene increases dramatically. Over this range the correlation is distinctly positive. Following the authors' line of reasoning, this can be interpreted as showing that introns are beneficial, but that when their length or number within a gene exceeds certain limits, a possibly independent detrimental effect, much less evident at lower lengths or numbers, intervenes. In other words, when intron number exceeds 3, or intron length exceeds 5 kb, introns are sustained in a gene even though gene expression decreases as their number or length grows. Thus, the benefits of introns could be very great. Genes that could not weather this presumed detrimental effect would have had to shed or shorten introns.
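The point that an overall negative correlation can coexist with a positive correlation over the lower range may be illustrated with a minimal sketch. The data points and the helper name `pearson_r` are hypothetical, invented purely for illustration; they are not taken from Gorlova et al. (2014).

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data (NOT from the paper): expression rises steeply
# up to three introns, then declines slowly as intron number grows.
intron_number = [0, 1, 2, 3, 4, 6, 8, 10, 15, 20, 30, 40]
expression = [2, 5, 8, 10, 9, 8, 7, 6, 5, 4, 3, 2]

# Correlation over the whole range of intron numbers.
r_all = pearson_r(intron_number, expression)

# Correlation restricted to genes with zero to three introns.
low = [(n, e) for n, e in zip(intron_number, expression) if n <= 3]
r_low = pearson_r([n for n, _ in low], [e for _, e in low])

print(f"all genes: r = {r_all:.2f}; 0-3 introns: r = {r_low:.2f}")
```

With these invented values the whole-range coefficient is negative while the coefficient for zero to three introns is strongly positive, so a single summary statistic can mask a rising phase of the kind described for Figure 3a.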

Gorlova et al. also note a positive correlation of intron size and number with evolutionary conservation of a gene (Fig. 2 of their paper). This correlation was most dramatic over the lower range of intron length and number. The positive correlation is consistent with observations in bacteria (usually without introns) and yeast (with few introns). Here, conservation (low evolution rate) correlates with high expression level (the so-called expression-evolution rate anticorrelation). Thus, the data from microorganism and human genomes are in agreement in the case of genes with few or short introns. Over this range, as intron number and length increase, conservation increases. The possibility that introns might have aided that conservation was not considered.

"Burden" is a loaded term, implying that net costs might outweigh any benefits. Genes that are conserved must either depend on efficient natural selection to eliminate organisms with mutations, or be accurately corrected when they mutate. For genes that are not conserved, the luxury of accurate correction would seem less pressing. Thus, if introns were concerned with maintaining genome integrity, then more intron "burden" in conserved genes would be expected. On the other hand, in extremis, there are the rarer positively selected genes, which vary rapidly in amino acid sequence (Gorlova et al. assess conservation at the protein sequence level; "CI values"). By virtue of these rapid amino acid changes, such genes are favored by natural selection and accumulate mutations that Gorlova et al. refer to as "functional polymorphisms." For positively selected genes, these amino acid changing mutations do not imply any lack of "functional significance" or "functional importance." But here the error-correcting role can rely less on exonic synonymous sites, so any error-correction that is required comes to depend more on introns, which can then appear more conserved than exons (see Fig. 4 in above paper). Then the intron "burden" could increase.

Genes that are conserved over long evolutionary time periods (presumed to reflect their functional importance) would seem to have had more opportunity to resolve their conflicts with other forms of information that need to pass through the generations in the same genomes. Such forms include stem-loop pressure, GC-pressure and, for exons, AG-pressure ("R loading"). Sometimes synonymous substitutions [SYN] would suffice and protein sequence conservation (low chain-terminating [NON] and missense [MIS] base substitution values) would be apparent. Thus, the NON/SYN and MIS/SYN base substitution ratios would be low, and intron number/length could be high. But if high synonymous values and increases in intron number/length were insufficient, amino acid changes (non-synonymous mutations [MIS]) might be necessary for resolving conflicts. For example, an exchange of arginine for lysine (both basic amino acids) might suffice. Or more extreme exchanges might be required, ultimately extending to chain-terminating [NON] mutations. In this circumstance, both MIS and SYN values can change and intron number/length can be less. Thus, there could be "a significant negative correlation between MIS/SYN and the number of introns (R = −0.05, n = 3,363, P = 0.006)."
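As a minimal sketch of the substitution ratios just described, the following computes MIS/SYN and NON/SYN from per-gene counts. The function name `substitution_ratios` and all counts are hypothetical and chosen only for illustration; Gorlova et al. may have derived their values differently.

```python
def substitution_ratios(syn, mis, non):
    """Return MIS/SYN and NON/SYN ratios from base substitution counts.

    syn: synonymous substitutions [SYN]
    mis: missense substitutions [MIS]
    non: chain-terminating substitutions [NON]
    """
    if syn <= 0:
        raise ValueError("ratios are undefined without synonymous substitutions")
    return {"MIS/SYN": mis / syn, "NON/SYN": non / syn}


# Hypothetical counts (NOT from the paper).
# A conserved gene: conflicts resolved mainly at synonymous sites.
conserved = substitution_ratios(syn=40, mis=4, non=0)
# A less conserved gene: amino acid changes were also needed.
variable = substitution_ratios(syn=40, mis=30, non=2)

print(conserved)
print(variable)
```

On the argument above, the gene with the low ratios is the one expected to tolerate the higher intron number or length.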

Gorlova O, Fedorov A, Logothetis C, Amos C, Gorlov I (2014) BMC Evol Biol 14:50

To access some early online discussion of the work described in the above paper, please (Click Here).

Placed here 09 Feb 2013 and last edited 21 Sep 2020 by Donald Forsdyke