THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
Lecture 2: Homology and similarity
Bioinformatics
Paweł Kędzierski
Contents of the lecture
1 Homology
1.1 Definition
1.2 Homology and similarity
2 Similarity
2.1 Measures of similarity
2.2 Scoring matrices for proteins
2.3 Statistical meaning
3 The rule of transitivity

1 Homology
1.1 Definition

Evolution of organisms and genes
The evolution of organisms (speciation) is the effect of differentiation and selection. The differentiation of species works by accumulation of changes in their genomes (mutations).
Study of evolution on the level of genes and genomes is the subject of molecular phylogenetics or molecular evolution.
A group of species which differentiated from a common ancestor is called a clade (sometimes a phylum, but the latter term has a more specific meaning in taxonomy). Clade is a general term applicable both to groups of related organisms and to groups of related genes.
Genes from a single clade (having a common ancestor gene) are called homologous. This term can NOT be applied to organisms.

Homology of biological sequences
Homology means having the same ancestor gene, period.
There are no measures of homology (or could you have 35% of a brother?)
The problem is that we usually make (educated) guesses about homology – it is OK to state "there is a 10% chance of homology" if you have calculated such a probability.
The term "homologous" is commonly applied to proteins and RNA in the sense that they are products of homologous genes.
In biology, a homologous trait has a broader sense of any characteristic (e.g. a phenotype feature) which is derived from a common ancestor.

Variants of homology
Evolution of genes does not necessarily follow speciation.
A gene may be duplicated within a species. If both copies survive in the genome, they often diverge to provide specialized or different functions. Such homologs are called paralogs.
Genes may be transferred from one organism to another, even a completely unrelated one. This is especially common among prokaryotes (that is how resistance to antibiotics spreads), but it happens to higher organisms, too. Homologous genes transferred to unrelated organisms are called xenologs.
The "proper" homology, when the genes accumulate mutations during speciation of their host organisms, but (usually) preserve their function, is called orthology.

Homology    Differentiated due to   Gene function
orthology   speciation              usually the same
paralogy    gene duplication        similar or different
xenology    gene transfer           usually the same

A more complex combination of the above is possible.
As a rule of thumb: homologs from two species related by a common ancestor are probably orthologs; homologs from the same organism are paralogs.
Discerning xenology is more difficult – it should be proven that the gene homologs are missing in the ancestor – but the existence of homologs in unrelated species is a strong suggestion.

Exercise
Long, long ago, there lived a cell X with gene A providing function α.
A great-great-...-grandchild of cell X lysed in a horrible way, exposing his DNA to the public. A spectator Y got transfected and discovered himself to have gained a useful function α.
Zillions of years later, a student studied descendants of X (dX) and descendants of Y (dY). He discovered homology between gene A' from dX, with function α, and two genes B and C from dY. B provided function α, and C function β.
Name the homology relations between A, A', B and C.
Project co-financed by European Union within European Social Fund
1.2 Homology and similarity

Similarity
Similarity of sequences can be measured.
Many different measures of similarity are in use.
The seemingly most intuitive measure of similarity – the percentage of identical residues – is the most misleading one!
a) AGCT
b) C
Sequence a) has 25% residues identical with b)

a) C
b) AGCT
Sequence a) has 100% residues identical with b)

Relation between similarity and homology
"Invention" of a functional protein structure is extraordinarily rare.
Mutations are accepted if protein function is not destroyed.
Retained function means retained structure.
Some amino acids at specific positions must be retained, too – a limited number.
Other amino acids are replaceable.
The genetic code is degenerate.
Conclusions:
High similarity over a significant length is a consequence of homology.
Low similarity neither suggests nor excludes homology.

Problems with measuring DNA similarity
Unknown reading frame and coding strand
Silent mutations: 20 amino acids / 61 codons
Two genes may have 33% different bases and still encode an identical protein!
High threshold of insignificant similarity:
1. assume equal probability¹ of each base: 1/4
2. make a sequence, drawing base after base randomly
3. you get 25% bases identical with anything...
Conclusion: Homologous proteins may have quite different genes, with similarity often below the level of statistical significance.
¹ This assumption is for simplicity of the example, but is not met in practice.

2 Similarity
2.1 Measures of similarity

Counting residues
All similarity measures count identical and different residues in compared sequences and sum up scores:
$$S = \sum_{i=1}^{\mathrm{length}} s_i$$

Identity matrix
    A  G  T  C
A   1  0  0  0
G   0  1  0  0
T   0  0  1  0
C   0  0  0  1
AAAGGGCCCTTT
AGCGATCTATCG
S = 4

Transition vs Transversion
    A  G  T  C
A   2  1  0  0
G   1  2  0  0
T   0  0  2  1
C   0  0  1  2
AAAGGGCCCTTT
AGCGATCTATCG
S = 12

BLASTN
    A   G   T   C
A   5  −4  −4  −4
G  −4   5  −4  −4
T  −4  −4   5  −4
C  −4  −4  −4   5
AAAGGGCCCTTT
AGCGATCTATCG
S = −12

Scoring similarity with matrices
The table of score points used to calculate similarity is named scoring matrix or substitution matrix.
The sum of scores for every pair of compared residues is called raw score.
It depends on the scoring matrix and is proportional to the compared length of the sequences (of course!).
Raw score does not mean anything if we don't know what scoring matrix was used – a good similarity measure must give consistent results.

Bit score
Normalized score of similarity is called bit score.
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
$S'$ is bit score;
$S$ is raw score or simply score;
$\lambda$ and $K$ depend on the scoring matrix used to calculate $S$.
Bit score depends only on the compared sequences (their similarity and length) . . .
. . . unless there are gaps!
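
The raw scores quoted under "Counting residues" (4, 12 and −12 for the same pair of sequences) can be checked with a short sketch. It assumes an ungapped, position-by-position comparison of two equal-length sequences; the function name `raw_score` and the constant names are illustrative and do not come from the lecture.

```python
# Row/column order of the matrices as printed in the handout.
BASES = "AGTC"

IDENTITY = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
TRANSITION_TRANSVERSION = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
BLASTN = [
    [5, -4, -4, -4],
    [-4, 5, -4, -4],
    [-4, -4, 5, -4],
    [-4, -4, -4, 5],
]


def raw_score(seq_a, seq_b, matrix, alphabet=BASES):
    """Sum the matrix scores over all residue pairs (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    index = {residue: k for k, residue in enumerate(alphabet)}
    return sum(matrix[index[a]][index[b]] for a, b in zip(seq_a, seq_b))


seq_a = "AAAGGGCCCTTT"
seq_b = "AGCGATCTATCG"
scores = {
    "identity": raw_score(seq_a, seq_b, IDENTITY),
    "transition/transversion": raw_score(seq_a, seq_b, TRANSITION_TRANSVERSION),
    "BLASTN": raw_score(seq_a, seq_b, BLASTN),
}
print(scores)  # identity: 4, transition/transversion: 12, BLASTN: -12
```

The same pair of sequences scores 4, 12 or −12 depending only on the matrix, which is why a raw score cannot be interpreted without knowing the matrix behind it.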
Advantages of measuring similarity of proteins
No reading frame problems.
Alphabet of 20 residues vs 4 bases:
Similar analysis gives a threshold of insignificance at 5% of identity²
Statistical analysis gives the real threshold of about 15% of identity, but:
1. compared sequences must be long enough.
2. it depends on sequence composition and structure.
Conclusion: When looking for homology, one should compare protein sequences.
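
The "draw residues at random" argument behind the 25% (DNA) and 5% (protein) figures can be simulated directly. This is only a sketch under the same simplifying assumption as the text, namely equal probability of every residue; the function name `random_identity`, the sequence length and the seed are arbitrary choices.

```python
import random


def random_identity(alphabet, length=100_000, seed=0):
    """Fraction of identical positions between two random sequences."""
    rng = random.Random(seed)
    seq_a = rng.choices(alphabet, k=length)  # uniform residue frequencies
    seq_b = rng.choices(alphabet, k=length)
    return sum(x == y for x, y in zip(seq_a, seq_b)) / length


dna_identity = random_identity("ACGT")
protein_identity = random_identity("ACDEFGHIKLMNPQRSTVWY")
print(f"DNA:     {dna_identity:.3f}  (1/4  = 0.250 expected)")
print(f"protein: {protein_identity:.3f}  (1/20 = 0.050 expected)")
```

With uniform frequencies the expected identity of unrelated sequences is simply one over the alphabet size, so the larger protein alphabet leaves far more room between "random" and "significant".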
The PAM scoring
How the PAM scoring matrices were calculated:
The mutation probability matrix was calculated for a given evolutionary distance in PAM, e.g.
$$M_{\mathrm{PAM70}} = M^{70}$$
The scoring matrix – PAM70 in this case – has elements $s_{ij}$ calculated as:
$$s_{ij} = \log \frac{m_{ij}}{f_j}$$
where $m_{ij}$ is an element of $M_{\mathrm{PAM70}}$, and $f_j$ is the frequency of the $j$-th amino acid.
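
The two steps above (raise the one-step matrix to the 70th power, then take log-odds against the residue frequencies) can be sketched on a toy example. Everything numeric here is invented for illustration: a hypothetical three-letter alphabet, made-up frequencies `FREQ` and a made-up 1-PAM matrix `M1` (about one change per 100 residues); none of it is Dayhoff's data. The base-10 logarithm is also an assumption, since the formula above says only "log".

```python
import math

# Hypothetical frequencies of three residues (not real amino acid data).
FREQ = [0.5, 0.3, 0.2]

# Hypothetical 1-PAM matrix: M1[i][j] = probability that residue i becomes j.
# Each row sums to 1; the frequency-weighted off-diagonal mass is about 0.01.
M1 = [
    [0.9930, 0.0050, 0.0020],
    [0.0083, 0.9867, 0.0050],
    [0.0050, 0.0075, 0.9875],
]


def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """m multiplied by itself `power` times (identity matrix for power 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


def log_odds(m, freq):
    """s_ij = log10(m_ij / f_j), following the formula in the text."""
    n = len(freq)
    return [[math.log10(m[i][j] / freq[j]) for j in range(n)] for i in range(n)]


M70 = mat_pow(M1, 70)      # extrapolation to a distance of 70 PAM
S70 = log_odds(M70, FREQ)  # toy "PAM70" scoring matrix

for row in S70:
    print("  ".join(f"{score:+.2f}" for score in row))
```

In this toy case the diagonal scores come out positive (a residue is still found unchanged more often than chance predicts) and the off-diagonal scores negative, which is the qualitative behaviour expected of a log-odds matrix at moderate distances.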
More on PAM matrices
The PAM distance is the number of mutations accumulated per 100 residues.
PAM10 means 10 mutations and roughly 90% identical residues, but PAM100 does not mean 0% identity, because the same position may mutate more than once.
If mutations were scattered uniformly, PAM70 would correspond to 55% identity, PAM100 to 45%, and PAM120 to 40% identity.
In reality, some fragments of protein sequences mutate at a much higher rate than other, conserved ones. PAM scoring matrices do not account for such differences.

Pros and cons of PAM
Advantages of PAM scoring matrices:
Well founded theoretical model
Easy to extrapolate to a long evolutionary distance
Disadvantages:
Based on very limited sequence data available in 1978. Probabilities were calculated from 1572 observed substitutions, for a 20x20 matrix!
Extrapolation amplifies errors.
Averaged model – a constant rate of (accepted) mutations is assumed over the entire sequence.
Inferior when sequence similarity is low and fragmentary.

Similarity of amino acid residues
What does it really mean, and how to score it?
Similar codons?
Similar size?
Polarity or hydrophobicity?
Acidity or basicity?
Functional groups?
H-bonding properties?
Ion coordination properties?
Nucleophilic properties?
Importance of properties is function-dependent, but we seek a one-for-all measure. . .

2.2 Scoring matrices for proteins
The first idea: Percent Accepted Mutations
Based on phylogeny of protein sequences sharing at least 85% of identical residues;
Matrix $M$ of mutation probabilities $m_{ij}$ of all single amino acid substitutions ($i \to j$), calculated from frequencies of mutations;
$M$ defines a unit of evolutionary distance: 1 PAM is defined as one accepted mutation per 100 amino acids.
Probabilities of multiple mutations (e.g. first $i \to j$, then $j \to k$) are given by $m_{ij} \cdot m_{jk}$. Therefore, $M$ probabilities can be extrapolated to any evolutionary distance by (matrix) multiplication.
Dayhoff, M.O. et al., "A model of evolutionary change in proteins", Atlas of Protein Sequence and Structure, vol. 5(3), pp. 345–352 (1978).
² If residues had equal probabilities of occurrence.

BLOSUM Matrices
BLOSUM stands for BLOck SUbstitution Matrix.
It was observed that even distantly related proteins have fragments of relatively high conservation with few or no gaps.
The scores were calculated directly from frequencies of pairwise differences in aligned, contiguous blocks.
The scores account for multiple substitutions implicitly – there was no theoretical model of evolution.
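BLOSUM80 was calculated from blocks in which sequences at least 80% identical were clustered and counted as one, so it reflects substitutions between sequences that are less than 80% identical;
BLOSUM62 used a clustering threshold of 62% of identity, etc.
Henikoff, S., Henikoff, J., "Amino acid substitution matrices from protein blocks", Proc. Natl. Acad. Sci. USA 89, pp. 10915–10919 (1992).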
Pros and cons of BLOSUM Matrices
Advantages:
Based on much more data than PAM;
Specifically addresses distant homology;
No extrapolation, no error amplification;
Better than PAM for detection of local similarity;
BLOSUM62 is good at finding distant homologs in most cases;
BLOSUM45 or 50 is advisable for detection of very weak similarity, but over long sequences.
Disadvantages:
There is no theoretical model of substitution behind them.
Averaged model, too – but focused on conserved parts of distant homologs.
BLOSUM and PAM comparison
In reality, the rates of mutations in conserved and non-conserved regions of proteins are very different. The equivalence of various PAM and BLOSUM matrices, based on practical experience, is presented here: http://www.ebi.ac.uk/help/matrix.html

PAM        BLOSUM
PAM 100    BLOSUM 90
PAM 120    BLOSUM 80
PAM 160    BLOSUM 60
PAM 200    BLOSUM 52
PAM 250    BLOSUM 45
2.3 Statistical meaning

Interpretation of similarity
Neither the raw score $S$ nor the normalized bit score $S'$ can be interpreted alone. In order to either confirm or deny homology, one must consider the statistical significance of the score:
The probability that a sequence of the same or higher score $M$ may exist in the database of size $N$ simply by chance:
$$p(M \ge S) = 1 - \exp\left(-K \cdot N \cdot n \cdot e^{-\lambda S}\right)$$
where $S$ is the score, $n$ is the sequence length, and $K$ and $\lambda$ are specific for the scoring matrix;
or
The expectation value $E$ of the number of unrelated sequences with similarity $M \ge S$:
$$E = K \cdot N \cdot n \cdot e^{-\lambda S}$$
Both $p(M \ge S)$ and the E-value depend on the lengths of the sequence $n$ and of the database $N$.
Both are close to zero when the similarity is statistically significant.
The E-value is easier to comprehend: $E = 10$ reads "You may expect 10 unrelated hits at least this similar by chance".
The longer the extent of similarity, the better the statistical significance. Short matches may be unrelated even with 100% identity.

3 The rule of transitivity

Homology – the rule of transitivity
Let the symbol || denote homology.
Given the following assumptions:
The similarity of sequences A and B over their entire length (or almost) implies homology (beyond a reasonable doubt).
The same is true for sequences B and C.
Then:
A must be homologous to C, even if their pairwise similarity is insignificant.
(A||B) & (B||C) ⇒ (A||C)
Pitfall: [figure not recovered]
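
As a numerical companion to the formulas of section 2.3, the sketch below converts raw scores into bit scores, E-values and probabilities. The parameter values (`LAMBDA`, `K`, the sequence length and the database size) are assumptions picked only to produce readable numbers; real values depend on the scoring matrix and the database actually searched. The function names are illustrative.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value(raw, lam, k, seq_len, db_size):
    """Expected number of chance hits scoring at least `raw`."""
    return k * db_size * seq_len * math.exp(-lam * raw)


def p_value(raw, lam, k, seq_len, db_size):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision when E is tiny.
    return -math.expm1(-e_value(raw, lam, k, seq_len, db_size))


LAMBDA, K = 0.318, 0.134            # assumed matrix-specific parameters
SEQ_LEN, DB_SIZE = 250, 50_000_000  # assumed n and N

results = []
for raw in (40, 60, 80, 120):
    bits = bit_score(raw, LAMBDA, K)
    e = e_value(raw, LAMBDA, K, SEQ_LEN, DB_SIZE)
    p = p_value(raw, LAMBDA, K, SEQ_LEN, DB_SIZE)
    results.append((raw, bits, e, p))
    print(f"S = {raw:3d}   S' = {bits:6.1f} bits   E = {e:9.3g}   p = {p:9.3g}")
```

Two properties are worth checking in the output: for small E the probability and the E-value practically coincide, and the E-value can be written as $N \cdot n \cdot 2^{-S'}$, so with the bit score in hand $\lambda$ and $K$ are no longer needed.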