The backbone of helical DNA is made of sugar-phosphate-sugar linkages, in which the phosphates of the phosphodiester groups carry negative charges. The A-T and G-C base pairs of a specific sequence present interactive surfaces in the major groove and, less commonly, in the minor groove. Although some N and O atoms of the bases are engaged in hydrogen bonding with the partner base, other NH and C=O groups remain exposed and provide surfaces for contact with the amino acid side chains of proteins.
Proteins have charged amino acids that can bind to specific bases, to the charged phosphates, or to both. Binding to the phosphate groups of the backbone is not sequence specific; it is a purely electrostatic interaction.
Watson-Crick base pairing within the helix exposes two surfaces, one facing the major groove and the other facing the minor groove. These surfaces of the base pairs are available for interaction with charged and polar amino acids (a.a.) in proteins.
Besides the atoms involved in base pairing, other groups on the bases are free to interact with amino acid side chains.
Complementary base pairing
Figure: AT and GC base pairs have available H-bond donors and acceptors. The helical DNA has a major groove and a minor groove; the major groove presents polar groups for binding amino acid side chains of the protein, while the phosphates of the sugar-phosphate backbone provide negative charges to which positively charged side chains can bind. The recognition of sequence-specific bases, in the major groove in most cases, by specific protein motifs is fascinating.
Proteins can interact specifically with DNA through electrostatic, H-bond, and hydrophobic interactions. AT and GC base pairs act as H-bond donors and acceptors that are exposed in the major and minor grooves of the double-stranded DNA helix, allowing specific protein-DNA interactions.
The major groove provides interactive surfaces for noncovalent bonding with amino acid side chains.
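Why the major groove is used in most cases can be made concrete with a small sketch. It assumes the standard textbook coding of the groups each base pair exposes (A = acceptor, D = donor, M = methyl, H = nonpolar hydrogen); the codes and the function names are illustrative assumptions, not taken from the text above.

```python
# Chemical groups each base pair exposes on the floor of a groove, read across
# the groove from the first-named base to its partner.
# A = H-bond acceptor, D = H-bond donor, M = methyl group, H = nonpolar hydrogen.
# These codes are the standard textbook assignment, supplied here as an assumption.
MAJOR_GROOVE = {"A:T": "ADAM", "T:A": "MADA", "G:C": "AADH", "C:G": "HDAA"}
MINOR_GROOVE = {"A:T": "AHA", "T:A": "AHA", "G:C": "ADA", "C:G": "ADA"}


def groups_by_signature(groove):
    """Group base pairs that present the same pattern in the given groove."""
    groups = {}
    for pair, signature in groove.items():
        groups.setdefault(signature, []).append(pair)
    return groups


def count_distinguishable(groove):
    """Number of base-pair classes a protein could tell apart in this groove."""
    return len(groups_by_signature(groove))


if __name__ == "__main__":
    print("major groove:", groups_by_signature(MAJOR_GROOVE))
    print("minor groove:", groups_by_signature(MINOR_GROOVE))
```

Under this coding, all four base-pair orientations look different from the major groove, whereas the minor groove only separates A:T/T:A from G:C/C:G.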
Proteins show greater diversity and more complex structural organization than DNA sequences, yet DNA provides specificity through its nitrogenous bases. The R-groups of basic residues such as Lysine, Arginine and Histidine, and of polar amide residues such as Asparagine and Glutamine, can readily interact with the adenine of an A:T base pair and the guanine of a G:C base pair; the NH2 and C=O groups of the base pairs preferentially form hydrogen bonds with Glutamine, Asparagine, Arginine and Lysine residues.
Although individual hydrogen bonds are weak, the binding of a protein to DNA at several positions along its length, in a sequence-specific manner, can cause conformational change not only in the protein but also in the DNA. The DNA may bend by 80 to 90 degrees, become distorted to the point of melting, or be compacted.
The key to DNA-protein interaction is specificity. For example, the tetrameric lac repressor (LacI) binds to an operator sequence of 20 to 22 base pairs. The E. coli genome is 4.6 x 10^6 base pairs long, with an enormous number of possible sequences, yet the repressor binds tightly to the lac operator and not to other regions. This shows that a close match between the amino acid sequence of the protein motif and the nucleotide sequence of the DNA is required for binding. Crystal structures of DNA bound to specific proteins show which groups of the protein contact which groups of the bases.
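The scale of this specificity can be estimated with a rough calculation. The sketch below assumes a random genome with equal base frequencies and counts only exact matches; both are simplifying assumptions, and the function name is hypothetical.

```python
def expected_random_matches(site_length, genome_length, both_strands=True):
    """Expected number of exact chance occurrences of one specific site.

    Assumes a random sequence in which A, C, G and T are equally likely,
    which real genomes only approximate.
    """
    positions = genome_length - site_length + 1   # possible start positions
    strands = 2 if both_strands else 1            # a site can lie on either strand
    return strands * positions / 4 ** site_length


if __name__ == "__main__":
    genome = 4_600_000  # base pairs, as quoted in the text
    for n in (20, 22):
        print(f"{n}-bp site: {expected_random_matches(n, genome):.2e} chance matches")
```

For sites of 20 to 22 bp the expected number of chance matches is far below one, which is consistent with a single operator being unique within the genome.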
Bending of the helix and its opening into a bubble are important features of the initiation of transcription and replication. This is the function of RNA polymerase-transcription factor (RNAP-TF) complexes; when such a complex fails to initiate efficiently, other DNA-binding proteins, bound elsewhere, help open the helix and activate the RNAP-TF complex so that it transcribes efficiently, or they inhibit transcription. Similarly, origin-specific protein complexes bind to DNA to initiate replication.
Proteins specialized in binding to DNA in a sequence-specific manner come in many sizes, structures and shapes.
Some of the structural motifs found in DNA-binding proteins have been well characterized. They include zinc finger, leucine zipper, helix turn helix and helix loop helix motifs, along with a few other modules. A protein may carry one of these motifs or a combination of them.
All these types of proteins contain a region that interacts with DNA in a sequence-specific manner (the DNA binding domain), which is most important for identifying the site; they also have domains for protein-protein interaction through which they dimerize or form multimers. Some of these proteins have specific charged regions whose contact with the basal transcriptional apparatus can activate that apparatus.
The following sections describe cellular proteins that bind to DNA in a sequence-specific manner, together with the functions they perform.
Helix turn helix proteins:
Helix turn helix proteins contain two short helices, each 7 to 9 amino acids long, linked by a non-helical segment of 3 to 4 amino acids that turns the chain by nearly 180 degrees in space (an average estimate, not to be taken literally). Examples include the Cro (66 a.a.) and cI (236 a.a.) proteins of phage Lambda, the lac repressor (LacI) of the E. coli lac operon, the cAMP receptor protein (CRP, also called catabolite activator protein or CAP, 209 a.a.), and a few other transcription factors.
Homeodomain transcription factors, such as the Hox proteins, regulate genes involved in the development of body structures; examples include Antp, eve, Oct-1, Oct-2, Hox and others.
Each of these homeobox proteins has three helical regions separated by a beta turn of 3 to 4 amino acids. Because most of them share a similar domain, it is called the homeodomain. Homeodomain genes in humans and Drosophila are clustered in complexes and determine the identity of organs; most of their products act as transcription factors.
One of the helices makes contact with the major groove of the DNA and binds in a sequence-specific manner (by hydrogen bonding), while the protein also binds nonspecifically to the phosphate groups.
Each of these proteins also has another helix for protein-protein interaction, which mostly stems from hydrophobic amino acids on one surface of the helix.
Many of these proteins also have sites, outside the structural motifs, for binding substrates.
This diagram shows helix turn helix motifs that bind to a specific sequence of the DNA.
This figure shows the helix turn helix motifs of the Lambda repressor bound to the OR1 site.
The lactose repressor protein, by binding to the operator, prevents RNAP from moving forward and thus represses expression of the gene.
The CAP protein, in contrast, when bound by cAMP binds to a specific region of the promoter and activates gene expression by contacting the RNAP-sigma complex. In this interaction, one of the loops of the CAP protein touches one of the subunits of the enzyme complex.
These proteins bind to the promoter region in a sequence-specific manner. Through looping of the DNA and protein-protein interaction, they contact the basal transcriptional apparatus and activate the enzyme complex.
Helix loop Helix proteins:
Proteins with two helical segments of 12 to 15 amino acids each, separated by an unstructured loop, are called helix loop helix proteins, after their helix-loop-helix motif.
These proteins are well suited to dimerization through hydrophobic amino acids, that is, through amphipathic helix-helix interactions in each of the helical segments.
Such motifs are found in the E12, E47, MyoD, Myf-5 and Myc proteins (myc is a proto-oncogene). They are classified according to the type of tissue in which they are expressed.
· Group A: ubiquitously expressed proteins, including mammalian E12 and E47 and Drosophila daughterless (da).
· Group B: proteins expressed in a tissue-specific manner, such as mammalian MyoD and Drosophila achaete-scute (AS-C).
· Myc class: a class by itself; its partner proteins and its targets are different.
This protein shows Helix loop Helix domains plus zipper.
Depending on the protein, an unstructured region at the C-terminal or N-terminal end carries basic amino acids at specific positions. These charged basic residues are required for sequence-specific binding of the protein to DNA. Such proteins are called bHLH proteins (b = basic).
HLH proteins can dimerize with proteins of the same kind or of a different kind to form homodimers or heterodimers. They bind DNA with one side of the protein and dimerize with the other.
E12 and E47 bind to immunoglobulin enhancer regions. MyoD (myogenic) and Myf-5 are involved in myogenesis. The myc gene, the cellular counterpart of an oncogene, is involved in growth regulation, and its protein has these motifs.
A group of gene products with such motifs is involved in the development of Drosophila.
Some HLH proteins lack the basic amino acids, e.g. the Id proteins. Others lack the basic end but have a few prolines in its place. These proteins play an important role in development, and they use what is called combinatorial dimerization.
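The idea of combinatorial dimerization can be illustrated by simply enumerating the formal pairings among a few monomers. This is only a counting sketch: it does not say which pairs actually form or bind DNA in a cell, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def possible_dimers(monomers):
    """All unordered pairings of the monomers, self-pairs included."""
    return list(combinations_with_replacement(monomers, 2))


def classify(dimer):
    """Label a pairing as a homodimer or a heterodimer."""
    first, second = dimer
    return "homodimer" if first == second else "heterodimer"


if __name__ == "__main__":
    # Names taken from the text; the choice of this particular set is arbitrary.
    monomers = ["E12", "E47", "MyoD", "Myf-5", "Id"]
    for dimer in possible_dimers(monomers):
        print("-".join(dimer), classify(dimer))
```

With n monomers there are n(n+1)/2 formal pairings, so even a small set of HLH proteins allows many distinct regulatory complexes.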
Zinc finger proteins:
Zinc finger proteins are globular proteins that present a finger-shaped motif, the zinc finger motif, to bind DNA in a sequence-specific manner. There are innumerable examples; to quote a few: Sp1, TFIIIA and ADR1.
The name zinc finger derives from the loop that forms when a single zinc ion is coordinated by 2 cysteines on one side of the polypeptide and 2 histidines on the other, or by 4 cysteines, two on either side. The coordination geometry is tetrahedral. Each finger is a loop of about 23 amino acids; where there is more than one finger, the linker region between two loops is seven to eight amino acids.
In most zinc finger proteins, the N-terminal part of the loop, after the cysteines, forms a beta sheet and the right side of the loop forms an alpha helix.
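The Cys2-His2 arrangement described above can be searched for in a protein sequence with a regular expression. The spacing used here (C, 2 to 4 residues, C, 12 residues, H, 3 to 5 residues, H) is the commonly quoted consensus and is an assumption not stated in the text; the test sequence is synthetic, and the function name is hypothetical.

```python
import re

# Commonly quoted Cys2-His2 spacing: C-x(2,4)-C-x(12)-H-x(3,5)-H.
# This simplified pattern ignores the conserved hydrophobic positions.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2_fingers(protein):
    """Return (start index, matched segment) for each non-overlapping candidate."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]


if __name__ == "__main__":
    # Synthetic sequence built to fit the pattern; not a real protein.
    synthetic = "AAAA" + "CAAC" + "A" * 12 + "HAAAH" + "GGG"
    for start, segment in find_c2h2_fingers(synthetic):
        print(start, segment, len(segment))
```

A match only flags a candidate finger; whether it folds around zinc has to be confirmed experimentally or from a structure.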
This diagram shows Helix and beta sheets brought together by Zinc ion bound to two Cysteine and two Histidine side chains.
This diagram shows the components of a zinc finger (ZiF) protein in a ball-and-stick model.
The number of fingers varies from protein to protein; e.g. the Sp1 protein has 3 zinc fingers, each with 2 Cys and 2 His. Each has a beta sheet on the left side of the finger and an alpha helix on the right.
Each finger binds in the major groove, at successive positions along the helix, and the fingers recognize consensus sequences.
Many Drosophila transcription factors have these zinc finger motifs in their protein domains.
A single zinc finger of the Xenopus laevis transcription factor Xfin folds into a compact globular structure, as in other zinc finger proteins. This finger has two successive beta strands and one alpha helix. Two Cys and two His, coordinated to a single zinc ion, link the beta strands to the alpha helix. One Cys from each of the beta strands takes part in the coordination.
Zif268 from mouse has three zinc fingers; each binds in the major groove and contacts 3 bp. In this binding, arginine side chains form hydrogen bonds with guanine bases.
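The "one finger, three base pairs" rule can be sketched by splitting a binding site into triplets. The 9-bp site used here (GCGTGGGCG) and the convention that finger 1 reads the 3'-most triplet are assumptions taken from the commonly described Zif268-DNA complex, not from the text above; the function name is hypothetical.

```python
def finger_subsites(site, bp_per_finger=3):
    """Assign consecutive triplets of a 5'->3' site to fingers.

    Assumes finger 1 contacts the 3'-most triplet, so fingers are numbered
    in the opposite direction to the way the site is written.
    """
    if len(site) % bp_per_finger != 0:
        raise ValueError("site length must be a multiple of bp_per_finger")
    triplets = [site[i:i + bp_per_finger]
                for i in range(0, len(site), bp_per_finger)]
    return {f"finger {n}": triplet
            for n, triplet in enumerate(reversed(triplets), start=1)}


if __name__ == "__main__":
    for finger, triplet in finger_subsites("GCGTGGGCG").items():
        print(finger, triplet)
```

Three fingers at 3 bp each account for the 9-bp footprint, and the guanine-rich triplets fit the arginine-guanine contacts mentioned above.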
Interestingly, some proteins with zinc finger motifs can bind specifically to the stem-loop structure of certain RNAs. For example, TFIIIA, which has zinc finger motifs, binds to the internal promoter region and also to 5S RNA.
The translation initiation factor eIF2 also has zinc finger motifs. Mutations in its gene influence the recognition of initiator codons.
The retroviral nucleocapsid protein binds to genomic RNA using its zinc finger motifs.
Steroid receptors have 4 Cys bound to a central Zn, i.e. a 2 Cys plus 2 Cys pattern. They bind to short palindromic sequences.
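What "palindromic" means for double-stranded DNA can be checked in a few lines: a site is palindromic when it reads the same as its reverse complement. The half-sites AGAACA and TGTTCT used in the example are the commonly quoted glucocorticoid response element consensus, given here as an assumption; the function names are hypothetical.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the site equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def is_inverted_repeat(left_half, right_half):
    """True if two half-sites are reverse complements of each other."""
    return right_half.upper() == reverse_complement(left_half)


if __name__ == "__main__":
    print(is_palindromic("GAATTC"))
    print(is_inverted_repeat("AGAACA", "TGTTCT"))
```

Because each half-site is the reverse complement of the other, the two subunits of a receptor dimer can each contact an identical sequence.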
Glucocorticoid and estrogen receptors have 2 zinc fingers each. The zinc finger regions of the glucocorticoid receptor form dimers.
Depending on the protein, the number of fingers ranges from one to ten or more.
The tetrahedral structure can form by coordination of either 2 Cys and 2 His or 4 Cys.
Leucine Zipper proteins:
These proteins have a stretch of amino acids rich in hydrophobic leucine, and the leucines lie on one side of a right-handed helix.
The helix has about 3.5 residues per turn, so a leucine recurs every second turn, that is, at every seventh amino acid residue.
To illustrate, take a, b, c, d, e, f, g as the seven positions of one repeating segment (heptad) of the helix, where a and d are hydrophobic. Hydrophobic amino acids then fall on the same side of the helix at intervals of about 3.5 residues, which is one turn of the helix.
If two such chains have the same type of helix and hydrophobicity, they readily interact with one another and form a coiled coil.
When two right-handed helices coil around each other they form a left-handed coiled coil, while the helical regions at the base bind to certain DNA sequences, such as the CAAT (CCAAT) box of a rat liver gene and the core of the SV40 enhancer.
Leucine zipper protein on opposing amphiphilic helices
This is another leucine zipper protein bound to DNA, shown as a wire model.
Most LZ proteins have amino acid sequences with specific functional motifs: a DNA binding region, a connector, the leucine zipper, and an invariant Asn next to the DNA binding domain.
GCN4 is a leucine zipper protein that binds to the asymmetric CREB site; another binds to the asymmetric AP-1 site.
This diagram shows the view from the top of the coiled coils showing the opposing hydrophobic residues providing the force for interaction and binding to each other.
The rat liver TF called C/EBP, an enhancer binding protein, binds to the CAAT response element. In this protein a 28-a.a. region has leucine at every 7th position. This causes dimerization of the proteins into coiled coils, hence the name zipper proteins.
Yeast GCN4, a transcription factor, and many proteins encoded by proto-oncogenes have such structural domains, although this is not true of all such transcription factors. GCN4, a basic zipper (bZIP) protein, consists of 281 residues. A stretch of about 30 residues contains roughly four heptad repeats, at 3.5 to 3.6 residues per turn, and coils into an alpha helix of about 8 turns.
Two such proteins can dimerize. The N-terminal part of this domain remains open; it consists of 16 amino acid residues that are basic in nature and engage in sequence-specific DNA binding.
The C-terminal part forms a coiled coil with the leucine zipper pattern.
Such leucine zipper proteins can form both homodimers and heterodimers.
C/EBP, an enhancer factor, binds to the CAAT box of a rat liver gene and also to the SV40 core enhancer sequence. The protein has four such repeat segments, with a leucine at every 7th residue; this amounts to 28 residues, or eight turns at 3.5 a.a. per turn.
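The "leucine at every 7th residue" pattern and the turn arithmetic can both be checked with a short sketch. The regular expression looks for a leucine followed by any six residues, repeated four times (28 residues); the test sequence is synthetic and the function names are hypothetical.

```python
import re


def leucine_zipper_regex(repeats=4):
    """Pattern for `repeats` consecutive heptads, each starting with leucine."""
    return re.compile(r"(?:L.{6}){%d}" % repeats)


def find_zipper(protein, repeats=4):
    """Return (start index, segment) of the first candidate zipper, or None."""
    match = leucine_zipper_regex(repeats).search(protein)
    return None if match is None else (match.start(), match.group())


def helical_turns(n_residues, residues_per_turn=3.5):
    """Number of helical turns spanned by n_residues."""
    return n_residues / residues_per_turn


if __name__ == "__main__":
    synthetic = "MK" + "LAAAAAA" * 4 + "GG"  # synthetic, not a real protein
    hit = find_zipper(synthetic)
    print(hit)
    print(helical_turns(len(hit[1])))
```

Four heptads give 4 x 7 = 28 residues, and 28 / 3.5 = 8 turns, matching the figures quoted for C/EBP. A match on the leucine spacing alone is only suggestive, since it does not test whether the segment is helical.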
An enhancer binding factor called AP-1 binds to the enhancer region of SV40; it is a heterodimer of the Jun and Fos factors. These contain 5 such leucine repeats.
Below, some DNA binding proteins are shown in their 3-D glory, for the sheer pleasure of seeing them in their intimate binding positions.
This ribbon diagram shows the TBP protein binding as a monomer to the TATAAA box.
This is the same TBP, with helices and beta sheets, in a 3-D model.
This is another protein model showing how the protein is bound to DNA; the binding bends the DNA.
This ribbon model of the GAL4 protein shows the leucine zipper region, the loop and helix domains, and the N-terminal DNA binding motifs.
This is a glucocorticoid receptor protein bound to its cognate DNA sequence.
Recent database entries (2785 or more) for transcription factors and other DNA binding proteins are grouped into several superclasses.
Some of the TFs have been classified into superclasses:
Superclass: Basic domains:
Leucine zipper (bZIP),
Helix loop helix (bHLH),
Helix loop helix/leucine zipper (bHLH-Zip).
Superclass: Zinc-coordinating DNA binding domains:
Cys4 zinc finger of nuclear receptor type,
Cys6 cysteine-zinc cluster,
Zinc fingers of alternating composition.
Superclass: Helix-turn-helix:
Fork head/winged helix,
Heat shock factors.
Superclass: Beta-scaffold factors (with minor groove contact):
RHR (Rel homology region),
Beta barrel alpha helix transcription factors,
TATA binding proteins,
Heteromeric CCAAT factors,
Grainy head proteins,
Cold shock domain factors.
Superclass: Other TFs:
Copper fist proteins,
E1A-like factors,
AP2/EREBP-related factors.
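For working with such a classification, the superclasses and their member classes can be held in a simple lookup table. The sketch below uses only the entries listed above; the variable and function names are hypothetical, and the table is not a complete copy of any database.

```python
# Superclass -> member classes, restricted to the entries listed in the text.
TF_SUPERCLASSES = {
    "Basic domains": [
        "Leucine zipper (bZIP)",
        "Helix loop helix (bHLH)",
        "Helix loop helix/leucine zipper (bHLH-Zip)",
    ],
    "Zinc-coordinating DNA binding domains": [
        "Cys4 zinc finger of nuclear receptor type",
        "Cys6 cysteine-zinc cluster",
        "Zinc fingers of alternating composition",
    ],
    "Helix-turn-helix": [
        "Fork head/winged helix",
        "Heat shock factors",
    ],
    "Beta-scaffold factors": [
        "RHR (Rel homology region)",
        "Beta barrel alpha helix transcription factors",
        "TATA binding proteins",
        "Heteromeric CCAAT factors",
        "Grainy head proteins",
        "Cold shock domain factors",
    ],
    "Other TFs": [
        "Copper fist proteins",
        "E1A-like factors",
        "AP2/EREBP-related factors",
    ],
}


def superclass_of(tf_class):
    """Return the superclass containing tf_class, or None if it is not listed."""
    for superclass, classes in TF_SUPERCLASSES.items():
        if tf_class in classes:
            return superclass
    return None


if __name__ == "__main__":
    print(superclass_of("TATA binding proteins"))
    print(superclass_of("Heat shock factors"))
```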
DNA binding proteins:
The DNA binding proteins that bind directly in a sequence-specific manner, together with the proteins that bind to DNA-bound proteins, are numerous and may run to several thousand.
Eukaryotes and Prokaryotes:
Helix turn helix proteins:
Homeodomain proteins (Drosophila Hox genes),
RAP1 protein (repressor and activator protein of telomeres),
TFIIB family, including winged HTH members,
IFN regulatory proteins,
Transcription factors (a family of proteins),
Helix loop helix proteins:
Class A factors, MyoD, Achaete-scute, Tal/Twist/Hen.
Helix loop helix-leucine zipper proteins:
SREBP, USF1 and USF2,
Zinc finger proteins: Cys4 zinc finger, Cys2-His2 zinc finger factors,
Beta-beta-Alpha zinc finger proteins,
Nuclear receptor proteins (a family),
Loop-sheet-helix type proteins,
GAL4 type proteins,
Steroid hormone receptors,
Thyroid hormone receptors,
Leucine zipper type (bZIP):
AP-1, CREB, C/EBP,
Plant G-box proteins,
Beta scaffold factors: RHR (Rel homology region),
NFAT, STAT, p53, MADS box proteins (SRF),
TBP, HMG box proteins (SOX and SRY),
Other alpha helix:
E2 protein of Papilloma virus,
Histone family of proteins,
HMG family of proteins,
MADS box family proteins,
TATA box binding protein (TBP),
Met repressor proteins,
Tus replication Ter proteins,
IHF (integration host factor),
Transcription factor t-domain protein,
Rel homology region proteins,
DNA binding enzymes:
Endonuclease V proteins,
DNA mismatch endonucleases,
DNA pol I,
Most of the restriction endonucleases,
DNA pol beta,
Proteins that bind directly or indirectly to DNA, classified according to their functions:
1. Constitutively active: present in all cells at all times, e.g. Sp1, NF1, CAAT binding proteins.
2. Conditionally active: they require activation. They can be:
Developmental (cell specific): GATA, HNF, MyoD, Myf-5, Hox, winged helix.
Signal dependent: extracellular ligand dependent; intracellular ligand dependent (SREBP, p53); or cell membrane receptor dependent, the last being either resident nuclear factors (CREB, AP-1) or latent cytoplasmic factors that move into the nucleus when activated (STAT, R-SMAD, NF-kB, Notch, NFAT, etc.).
Based on their structural (domain) and functional features, these factors have been classified into different classes and families.
Superclass: Basic domains:
Leucine zipper factors, helix loop helix factors, helix loop helix/leucine zipper factors, NF family, RF-X family and bHSH family.
Superclass: Zinc binding:
Cys4 zinc fingers of nuclear receptor type, diverse Cys4 zinc fingers, Cys2-His2 zinc fingers, Cys6 cysteine-zinc clusters, zinc fingers of alternating composition, and their related members.
Superclass: Helix turn helix: homeodomain, paired box, fork head/winged helix, heat shock factors, tryptophan clusters and TEA domain members.
Superclass: Beta-scaffold factors: RHR, STAT, p53, MADS box, beta-barrel alpha helix, TATA binding proteins (TBP), HMG box, heteromeric CCAAT factors, grainy head, cold shock domain, Runt and others.
Superclass 0: Other TFs:
HMG-like, copper fist, pocket domain, E1A, AP2, EREBP, ARF, ABI and RAV.