LOCAL ALIGNMENT WITH SIP4

First start up sip4 by typing in:

sip4 &

Then select the Simple sub option from the Load sequences option of the File pull down menu

Load in two sequences from the default EMBL sequence database. Load in:

For the Horizontal sequence, the EMBL entry with the EntryName xlacacr

For the Vertical sequence, the EMBL entry with the EntryName xlactcag

Once you have things as illustrated, click on the OK button of the Load sequences window. The selected sequences will be read in by sip4 and their details displayed in sip4's Output window.

You have loaded two Xenopus laevis actin sequences. One is genomic DNA (xlactcag) and the other is the corresponding cDNA (xlacacr).

Next select the Local alignment option from the Comparison pull down menu.

In the local alignment window, change the setting for the penalty for each residue in gap from its default setting of 0.2 to 1 and then click OK.

You have made gaps substantially more expensive to extend than by default, but retained the default cost for starting a gap.
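The effect of that change can be sketched with a little arithmetic. The snippet below is only an illustration: it assumes a gap costs a fixed starting penalty plus the per-residue penalty times the gap length, and the starting penalty of 3.0 is a made-up value, not sip4's default. The function name gap_cost is likewise hypothetical.

```python
def gap_cost(length, gap_start, per_residue):
    """Cost of one gap under an assumed affine scheme:
    a fixed cost to start the gap plus a cost for each residue in it."""
    return gap_start + per_residue * length


GAP_START = 3.0  # hypothetical value, NOT taken from sip4

# Compare the per-residue settings used in this exercise (0.2 and 1).
for gap_length in (1, 10, 100):
    cheap = gap_cost(gap_length, GAP_START, 0.2)
    dear = gap_cost(gap_length, GAP_START, 1.0)
    print(f"gap of {gap_length:3d}: per-residue 0.2 -> {cheap:6.1f}, "
          f"per-residue 1 -> {dear:6.1f}")
```

For a one-residue gap the two settings barely differ. For an intron-sized gap the per-residue term dominates, which is why the change matters so much for these two sequences.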

A textual representation of the alignment you compute will appear in the Output window of sip4. A corresponding graphical representation will also be generated in a new window labelled sip plot.

Look in the Output window. Use the Output window scroll bar to look at the computed alignment.

Move to the sip plot window. The diagonal line indicates the aligned regions. Click on the crosshairs button. Note that as you move the crosshairs around, the indicated positions in the Horizontal and Vertical sequences are shown in boxes in the top right hand corner of the plot. Click on the crosshairs button once more and the crosshairs will disappear.

Double click with the middle mouse button near to the line indicating the aligned regions. The sequence_display window will appear.

Move to the sequence_display and click on Nearest match. This will move the sequence_display to one end of the aligned region. Click on the Lock button. Move around the sequence_display using the sequence_display scroll bars and/or the graphics screen crosshair (hold the middle mouse button down over the green and blue lines indicating the position of the sequence_display) to look along the aligned region. Note that movement of the sequence_display is strictly limited to the current alignment diagonal.

Click on the sequence_display Lock button again (turning Lock off). Move the sequence_display once more using the scroll bars and/or the graphic screen crosshair. Note you are no longer constrained to the current alignment diagonal.

Get rid of the sequence display by selecting the Exit option from its File pull down menu.

Once again, select the Local alignment option from the Comparison pull down menu. As before, set the penalty for each residue in gap to 1. This time, also click on the alignments above score button. This asks that all alignments scoring more than 20 (by default) are shown. The default is that only the one best alignment is displayed. Click OK.

Look at the textual output in the sip4 Output window and the corresponding graphical output. Note that this time sip4 has reported 7 aligned regions (corresponding to the 7 exons represented in both sequences and separated by 6 introns in the genomic sequence).

You have your two graphical alignments one on top of the other. Separate them by picking up either one (middle mouse button held down over the coloured square next to the graphic to be repositioned) and moving it out of the current graphics display window (you can also move it just above or below other graphics on the current window).

For the third time, select the Local alignment option from the Comparison pull down menu. This time, accept all the default settings and click OK.

Look at the textual output in the sip4 Output window and the corresponding graphical output. Note that this time sip4 has reported only 1 aligned region, but that region spans 5 of the 7 regions reported by the previous analysis. The default settings of the gap penalty values are such that sip4 introduces gaps in the cDNA to match the introns in the genomic sequence.
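The behaviour just described can be reproduced on toy data. The sketch below is a generic local alignment with start and per-residue gap penalties (the textbook Smith-Waterman/Gotoh recurrences), not sip4's own code. The sequences, the scores (+1 match, -1 mismatch) and the gap start penalty of 2.0 are all invented for illustration; only the per-residue values 0.2 and 1 come from this exercise.

```python
def local_align(a, b, match=1.0, mismatch=-1.0, gap_open=2.0, gap_extend=0.2):
    """Best local alignment of a and b; a gap of k residues costs
    gap_open + gap_extend * k (an assumed scheme, not sip4's).
    Returns (score, (a_start, a_end), (b_start, b_end)), 1-based inclusive."""
    neg = float("-inf")
    n, m = len(a), len(b)
    # Each cell holds (score, i0, j0): the best score of an alignment ending
    # here, and the matrix cell just before that alignment's first pair.
    H = [[(0.0, i, j) for j in range(m + 1)] for i in range(n + 1)]
    E = [[(neg, 0, 0)] * (m + 1) for _ in range(n + 1)]  # ends in a gap in a
    F = [[(neg, 0, 0)] * (m + 1) for _ in range(n + 1)]  # ends in a gap in b
    best = (0.0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, i0, j0 = H[i][j - 1]
            e, ei, ej = E[i][j - 1]
            E[i][j] = max((s - gap_open - gap_extend, i0, j0),
                          (e - gap_extend, ei, ej))
            s, i0, j0 = H[i - 1][j]
            f, fi, fj = F[i - 1][j]
            F[i][j] = max((s - gap_open - gap_extend, i0, j0),
                          (f - gap_extend, fi, fj))
            s, i0, j0 = H[i - 1][j - 1]
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max((0.0, i, j), (s + pair, i0, j0), E[i][j], F[i][j])
            if H[i][j][0] > best[0]:
                best = H[i][j] + (i, j)
    score, i0, j0, i1, j1 = best
    return score, (i0 + 1, i1), (j0 + 1, j1)


# Toy data: two 12-base "exons" and a 20-base "intron" (all invented).
exon1, exon2, intron = "ACGGACCAGGCA", "GACCGAGCAAGC", "T" * 20
cdna = exon1 + exon2
genomic = exon1 + intron + exon2

for per_residue in (0.2, 1.0):
    score, cdna_span, genomic_span = local_align(cdna, genomic,
                                                 gap_extend=per_residue)
    print(per_residue, round(score, 1), cdna_span, genomic_span)
```

With the per-residue penalty at 0.2, gapping the 20-base intron costs less than the second exon is worth, so the best alignment runs across both exons. With the penalty at 1 the gap costs more than the exon earns, and the best alignment covers a single exon only.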

Look at both the textual and graphical output. Note that the alignment does not include the whole of both sequences. Move the graphical output into a separate window.

For the fourth and final time, select the Local alignment option from the Comparison pull down menu. This time, click only on the alignments above score button and then click the OK button. This time you generate 3 alignments: one covering 5 of the 7 exons and two others, each covering one of the remaining two exons.

Select the results manager from the View pull down menu. Remove the 4 raster plot entries. Go to the Output window. Note that using your right hand mouse button you can output any of your textual creations to a disc file. Instead, remove all textual output so you have a nice clean start for the next section.

GLOBAL ALIGNMENT WITH SIP4

Select Sequence manager from the File pull down menu. Put the mouse over the xlacacr entry and hold down the right hand button. Select the Set range option.

Set the Start position of xlacacr to 500. Set the End position of xlacacr to 800

Put the mouse over the xlactcag entry and hold down the right hand button. Select the Set range option.

Set the Start position of xlactcag to 4700. Set the End position of xlactcag to 5500
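For readers who think in code, setting a range amounts to taking a sub-sequence. The sketch below assumes positions are counted from 1 and that both Start and End are included, which agrees with the numbering shown in the alignment output for this exercise. The function name subrange and the stand-in sequence are hypothetical.

```python
def subrange(seq, start, end):
    """Return residues start..end of seq, counting from 1, both ends included."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("range must lie within the sequence")
    return seq[start - 1:end]


# A stand-in sequence; the real entries come from the EMBL database.
stand_in = "ACGT" * 2000

print(len(subrange(stand_in, 500, 800)))    # length of a 500..800 range
print(len(subrange(stand_in, 4700, 5500)))  # length of a 4700..5500 range
```

Note that an inclusive range 500..800 contains 301 residues, not 300.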

Put the mouse over the xlacacr (500..800) entry and hold down the right hand button. Select the Horizontal option.

Put the mouse over the xlactcag (4700..5500) entry and hold down the right hand button. Select the Vertical option.

Select Align sequences from the Comparison pull down menu. Note the very much larger default value for penalty for each residue in gap. By default, long gaps will be expensive. This method is therefore far less likely than the local method to gap the introns in xlactcag correctly.

Click on the OK button of the align sequences window for a default global alignment.

Look first at the textual output in the sip4 Output window. The regions you are aligning contain 2 of the 7 exons in both xlacacr and xlactcag. It should be clear that sip4 has correctly aligned one of the two exons but aligned the other with the intervening intron. The gap penalties were such that, from the simplistic viewpoint of the program, this is preferable to matching the intron in xlactcag with a gap in xlacacr.

Note that at the bottom of the display sip4 notes that it has:

Added sequence xlacacr_s1_a4
Added sequence xlactcag_s2_a5

If you look again at your Sequence manager, you will see that there are two new entries. These are the aligned portions of xlacacr and xlactcag, including padding characters.

Percentage mismatch 71.9

               500       510       520       530       540       550
      xlacacr_s1 ************************************************************
     xlactcag_s2 tggatctttcttctgtagaataacctttgctaattaggccttaacattcatctattcttc
              4700      4710      4720      4730      4740      4750
               560       570       580       590       600       610
      xlacacr_s1 **taccacaggtatcgttcttgactctggtgatggtgtcacccacaatgtccccatctat
                   :: : :::::::::::::::::::::::::::::::::::::::::::::::::::::
     xlactcag_s2 tttatc*caggtatcgttcttgactctggtgatggtgtcacccacaatgtccccatctat
              4760      4770      4780      4790      4800      4810
               620       630       640       650       660       670
      xlacacr_s1 gaaggttatgctctgccccatgccatccagcgtctggacctagctggtagagacctcaca
                 ::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::
     xlactcag_s2 gaaggttatgctctgccccatgccatccagcgtctggacctggctggtagagacctcaca
              4820      4830      4840      4850      4860      4870
               680       690       700       710       720       730
      xlacacr_s1 gattacctcatgaagatcctgactgaacgtggctactcctttgtgacaacagctgaaagg
                 :::::::::::::::::::::::::::::::::::::::::::::::::::: : :
     xlactcag_s2 gattacctcatgaagatcctgactgaacgtggctactcctttgtgacaacaggtaagttc
              4880      4890      4900      4910      4920      4930
               740       750       760       770       780       790
      xlacacr_s1 gaaattgtccgtgacatcaaggaa**aagctgtgctatgtggc*tttg**gact**ttga
                  :   ::       ::: :::    : :     :: : :: :::: ::    : :
     xlactcag_s2 tatcatgctaaatccatagagggcctacacaaacttagatagcatttgccgaagaataca
              4940      4950      4960      4970      4980      4990
               800       810       820       830       840       850
      xlacacr_s1 gaatgaaatggccaccgctgcctcatcctcctccctggagaagagctatgagcttccc*g
                 :::: : :    ::   ::    :   : : :   : :::: : : ::   :   :
     xlactcag_s2 gaatctattatatacacttggaagaaaattatgtcattataaga*caacaagaaacagtg
              5000      5010      5020      5030      5040      5050
               860       870       880       890       900       910
      xlacacr_s1 acggtcag*gtc************************************************
                 :: :::: : :
     xlactcag_s2 acagtcacagactgatacatcagctgggctatgcactaattattgaaccttgtgatatgt
              5060      5070      5080      5090      5100      5110
               920       930       940       950       960       970
      xlacacr_s1 ************************************************************
     xlactcag_s2 agcaattatgtcttataaaaaagtcataggacccccgggcacataccgaaaaataactcc
              5120      5130      5140      5150      5160      5170
               980       990      1000      1010      1020      1030
      xlacacr_s1 ************************************************************
     xlactcag_s2 cccatagatgagaatctcacgtagaatcattgaatgaggacatttactgtgaccacaaag
              5180      5190      5200      5210      5220      5230
              1040      1050      1060      1070      1080      1090
      xlacacr_s1 ************************************************************
     xlactcag_s2 cagaacatcttactaataagagagaaaatagccacaatactgaaaataatgaacttgtga
              5240      5250      5260      5270      5280      5290
              1100      1110      1120      1130      1140      1150
      xlacacr_s1 ************************************************************
     xlactcag_s2 tttttttcaatgtttctgtagaataactcttcagagtttaatctcattatgctttgtttt
              5300      5310      5320      5330      5340      5350
              1160      1170      1180      1190      1200      1210
      xlacacr_s1 ************************************************************
     xlactcag_s2 tgccccatacagctgaaagggaaattgtccgtgacatcaaggaaaagctgtgctatgtgg
              5360      5370      5380      5390      5400      5410
              1220      1230      1240      1250      1260      1270
      xlacacr_s1 ************************************************************
     xlactcag_s2 ctttggactttgagaatgaaatggccaccgctgcctcatcctcctccctggagaagagct
              5420      5430      5440      5450      5460      5470
              1280      1290      1300
      xlacacr_s1 ***********************
     xlactcag_s2 atgagcttcccgacggtcaggtc
              5480      5490      5500
Added sequence xlacacr_s1_a4
Added sequence xlactcag_s2_a5
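The Percentage mismatch figure at the top of this output can be understood with a small sketch. How sip4 arrives at its figure is not stated here, so the code below makes an assumption: every column of the padded alignment is counted, and a column containing a padding character (*) counts as a mismatch. The function name and the toy strings are invented.

```python
PAD = "*"  # padding character as it appears in the output above


def percent_mismatch(top, bottom):
    """Percentage of alignment columns that are not identical residues.
    Assumption: a column containing a pad counts as a mismatch."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must be the same length")
    mismatches = sum(
        1 for x, y in zip(top, bottom)
        if x == PAD or y == PAD or x != y
    )
    return 100.0 * mismatches / len(top)


# Toy example: two padded columns, three identities, one substitution.
print(percent_mismatch("**ACGT", "TTACGA"))
```

On this reading, the long padded stretches where the intron has no partner would contribute heavily to a high mismatch figure.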

Now take a look at your graphical output.

You should see a single diagonal line passing through the correctly aligned exon.

Using your textual alignment as a guide, double click with your middle button on the diagonal somewhere around where the alignment is real.

This will cause the sequence_display window to come into view and the green and blue sequence position cross hairs to appear.

Click on the Nearest match button in the sequence_display window to move exactly to a properly aligned region. Click on the Lock button and move along the aligned exon.

Remove your incorrect graphical alignment and your sequence display.

Do further Global alignments, changing the alignment parameters until you generate a believable alignment like the one illustrated graphically here.

Hint (as if you need one): You have to make gaps, particularly long gaps, cheaper and/or correctly aligned bases better rewarded and/or incorrectly aligned bases less severely penalised.

Once you have succeeded, remove all your results in order to start afresh for the next section.

DOT PLOTS WITH SIP4

First do a dot plot of the whole of xlacacr against the whole of xlactcag. Logically, one would do this before playing around with the Global and/or Local alignment tools of sip4. Dot plots are for generating an overview showing roughly how sequences compare. This overview should be used to plan the use of the textual alignment tools.

To retain a little "mystery" we leave the obvious first step until last in this exercise.

So, load the whole of xlacacr as the Horizontal sequence and the whole of xlactcag as the Vertical sequence.

Select the Find similar spans option from the Comparison pull down menu and request a default dot plot by clicking the OK button.
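The idea behind a windowed dot plot can be sketched in a few lines. This is a simplified illustration, not sip4's method: it scores a window by counting identical residues, whereas sip4's scoring may differ, and the function name dot_plot and the toy sequences are invented.

```python
def dot_plot(horizontal, vertical, window, min_score):
    """Return (x, y) start positions (counting from 1) of every pair of
    windows whose number of identical residues is at least min_score."""
    dots = []
    for x in range(len(horizontal) - window + 1):
        for y in range(len(vertical) - window + 1):
            # Compare the two windows position by position along a diagonal.
            score = sum(
                1 for k in range(window)
                if horizontal[x + k] == vertical[y + k]
            )
            if score >= min_score:
                dots.append((x + 1, y + 1))
    return dots


# Toy example: the horizontal sequence occurs once inside the vertical one.
print(dot_plot("GATTACA", "TTGATTACATT", window=7, min_score=7))
```

A run of dots whose x and y positions increase together forms a diagonal line, which is how each similar region shows up in the plot.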

The dot plot clearly shows the 7 exons that were revealed bit by bit during the previous analyses.

Fine, but not that interesting as dot plots go. The exons revealed are all very strong (almost identical regions). They offer but a small challenge to sip4.

Remove all current results, textual and graphical.

Select the Sequence manager from the File pull down menu. Put the mouse cursor over each sequence in turn, depress the right hand mouse button and select the delete option. Your Sequence manager should be empty when you have finished.

Select the Simple sub option from the Load sequences option of the File pull down menu. Select the SWISSPROT database for both the Horizontal and Vertical sequences. Enter the Entry Name egfr_human for both the Horizontal and Vertical sequences.

Click the OK button of the Load sequence window.

Select the Find similar spans option from the Comparison pull down menu and change the default window length of 11 to 25.

Note how the default minimum score adjusts automatically to reflect the change in window length.

Click on the OK button in the find similar spans window.

sip4 notices that you have given it a sequence to compare against itself. By default, for self comparisons sip4 does not plot the inevitable leading diagonal or the mirror image top half of the plot.

This plot illustrates that, contrary to common initial reaction, comparing a sequence with itself is not silly. Such plots can show up interesting internal features.

Here you should see a fairly strong diagonal line of dots indicating clear evidence of a reasonably faithful repeat of the first 300 or so amino acids.

Also, at the end of this repeat there appears to be a region of several other very short repeats.

To investigate the repeat region, bring up the sequence_display by double clicking with the middle mouse button near an interesting feature. Click on the Nearest match button of the sequence_display window to position the display exactly over the region of interest.

Move along the repeat region by clicking on the Lock button, sliding the sequence_display scroll bar along a bit, clicking on the Lock button once more (to unlock the sequence displays) and then clicking on the Nearest match button once more.

Once you have seen enough, remove the sequence_display and the plot. Next produce the same plot again, but this time compute the whole plot, including the leading diagonal that merely indicates that egfr_human is remarkably similar to itself.

To ensure the whole plot is generated, select the Hide duplicate matches option from the Options pull down menu.

Once more, select the Find similar spans option from the Comparison pull down menu and change the default window length of 11 to 25. Click the OK button.

You should see the full dot plot as illustrated to the left of this text.

The next step is to show the effect of varying the window size used for the dot plot. Varying the window size is the most effective way of controlling the sensitivity of a dot plot: smaller window sizes generate more sensitive plots.

First, configure the plot you have just computed so that it is displayed using thick white dots. To do this, put your mouse cursor over the coloured square corresponding to the plot you have just generated. Hold your right mouse button down and select the configure option.

A window labelled cbox will emerge. In this window, adjust the Line width setting to 4 and position the Red, Green, Blue sliders to produce White (i.e. slide all three to the extreme right). This done, click on the cbox OK button.

The next step is to redraw the dot plot using a smaller (default size of 11) window length. Smaller window lengths generate more accurate plots.

Select once more the Find similar spans option from the Comparison pull down menu. This time compute a default dot plot by simply clicking on the OK button.

The second, more accurate plot is drawn on top of the less accurate (window length of 25) plot. In order to make comparison of the two plots as easy as possible, draw the plot on top in black with thin lines. Use the configure option from the second plot to do this as before.

The superimposed plots show three important effects of using more accurate smaller window sizes. They are:

Small features are missed with larger window sizes, but picked up with smaller window sizes (Good news - see particularly the region of small repeats).

Most very small features mean nothing. Thus smaller window sizes increase noise (Bad news - see particularly the regions where nothing of particular importance is evident).

Major features are drawn in more accurately but less obviously with smaller window sizes (as dot plots are about showing the existence of features generally rather than revealing great detail, this is generally Bad news - see particularly the main repeat region).
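The first of these effects can be demonstrated on a toy sequence. The sketch below is an illustration only: it counts identical residues in each window (sip4's scoring may differ), compares a sequence with itself while skipping the leading diagonal and the mirror image half, and uses an invented sequence of arbitrary symbols containing one exact 6-residue repeat. Both thresholds require 80% identity within the window.

```python
def self_dots(seq, window, min_score):
    """Dots for a sequence compared with itself, reporting each pair of
    windows once (y > x), so the leading diagonal is never included."""
    dots = []
    last = len(seq) - window
    for x in range(last + 1):
        for y in range(x + 1, last + 1):
            score = sum(
                1 for k in range(window) if seq[x + k] == seq[y + k]
            )
            if score >= min_score:
                dots.append((x + 1, y + 1))
    return dots


# Toy sequence: "ACDEFG" occurs twice; every other symbol is unique.
toy = "ACDEFG" + "hiklmnpq" + "ACDEFG" + "rstvwyabcdefgjouxz"

small = self_dots(toy, window=5, min_score=4)    # 4 of 5 = 80%
large = self_dots(toy, window=15, min_score=12)  # 12 of 15 = 80%
print(len(small), len(large))
```

The 5-residue window picks up the short repeat. The 15-residue window cannot, because a 6-residue repeat can never supply 12 identities.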

Try zooming into an interesting region (for example the region with the small repeats) by holding the Ctrl key and the right hand mouse button down and defining a rectangle around the region you wish to magnify.

To "unzoom", use the Back button.

Finally, try using the Help buttons. Help buttons are available from all sip4 windows. Depending on how things are set up, very full context-dependent Help will be made available either in a web browser window or in a display tool specific to the Staden package.