Sequence Alignment

How do we find the best alignment of two sequences?

Introduction

An alignment is a gapped arrangement of coordinates between sequences. In other words, we want to figure out the 'best' position to start and end sequences, and where to put insert gaps. Our goal is to minimize 'edit distance' - the number of mutations to get from one sequence to another.

We first need to decide how we are going to compare (score) alignments - how can we decide whether an alignment is the "best"?

Scoring

Let's start with a simple scoring algorithm. For every position in the alignment, we might want to decrease the score for every mismatched base or gap, and we probably want to increase the score for a matched base. If we want to incentivize not inserting as many gaps, we can choose to decrease the score more for every gap.

The scores for all the mismatches, gaps, and matches are totaled for the alignment. Play around with aligning these sequences and see how various scores impact the total score. Click on the arrows to insert gaps, and click on gaps to remove them. See how different scoring systems affect what alignments are "better".

G
A
C
C
C
A
T
A
C
G
.
.
.
.
.
|
|
.
|
|
T
G
A
A
T
A
T
T
C
G

Gap (-1): 0
Mismatch (-2): 6
Match (1): 4
Score: -8

Gap (-2): 0
Mismatch (-1): 6
Match (1): 4
Score: -2

The Dotplot

While scoring is important for determining what makes a "better" alignment, it is an easy task compared to actually aligning sequences. Let's first explore a better way to represent alignments: the dot plot.

Dot plots represent all possible matches in an alignment. In a dot plot, we mark all possible match locations (in the following visualizations, they are marked with a 1). A path through a dot plot represents a possible alignment of the two sequences. Every valid path through the dot plot can represent every possible alignment. Diagonal path segments represent a direct alignment (aka, segments of alignment with no gaps). A vertical path segment indicates a gap in the top sequence, and a horizontal path segment indicates a gap in the sequence on the left.

Try playing with your alignment from earlier and seeing how it affects the path through the dot plot.

G A C C C A T A C G
T

1

G

1

1

A

1

1

1

A

1

1

1

T

1

A

1

1

1

T

1

T

1

C

1

1

1

1

G

1

1

Notice how your higher scoring alignments pass through more matches.

The dot plot gives us a simple representation of all alignments as three potential moves starting from the cell in the bottom right: diagonal (pairing two bases), left (inserting a gap in the left sequence), and up (inserting a gap in the top sequence). Now, we can reformulate our alignment problem as finding the best path through the dot plot. While this is a good representation of our problem, we still have the computational challenge of calculating the best path.

Needleman-Wunsch Algorithm

Luckily, a lot of smart people have already thought about this challenge. Let's examine one of the first algorithms developed to tackle aligning sequences - the Needleman-Wunsch algorithm.

This algorithm is an example of dynamic programming, dividing the whole sequence alignment into smaller problems - finding the best alignment of every sub-sequence. The algorithm is guaranteed to find the best alignment for a given scoring system and pair of sequences.

The steps are also pretty intuitive. First, create an empty dot plot with one extra row and column. The gap penalty is repeatedly applied to fill in the first row and column. Then, starting from the top left and moving left to right, top to bottom, apply the following formula.

S(i,j)=max{S(i1,j1)+score(ai,bi)S(i,j1)+gapS(i1,j)+gap

Just like the dot plot, a vertical or horizontal move indicates inserting a gap in one of the sequences. This formula chooses the "best" path from either matching the pairs, inserting a gap on the first sequence, or inserting a gap on the second sequence, based on whether the score for matching/not-matching versus inserting a gap is higher.

i-1 i
j-1 S(i-1, j-1) S(i, j-1)
j S(i-1, j) S(i, j)

At each step applying the scoring formula, we also record from which cell (or cells) the best score was calculated from. In other words, we keep track of the path that gives us the best score. After all the scores have been calculated, we will "traceback" the best path from the bottom right to to the top left to recover the alignment. If we move up, we insert a gap in the top sequence. If we move left, we insert a gap in the left sequence. If we move diagonally, we take a base pair from both sequences.

If that was too complicated (or if I did a bad job explaining), let's watch it in action!

G A C C C A T A C G

T

G

A

A

T

A

T

T

C

G

Conclusion

Eventually, I'll add some more sections here. For example, the Smith-Waterman algorithm is extremely similar to the Needleman-Wunsch algorithm, but finds optimal local alignments. Another thing to discuss are scoring matrices - having more complicated scoring than match vs. mismatch like PAM and BLOSUM. And finally, word-based heuristic methods like BLAST and FASTA for larger-scale alignments and searches.

For now, I hope that was an interesting demo for a very useful algorithm.

References

  • Fay, Justin. Lecture. BIOL253 - Computational Biology Lecture, Spring 2023.

  • "Needleman-Wunsch Algorithm". Wikipedia, Accessed October 2024.