lecture12_text.pdf

(647 KB) Pobierz
THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
Lecture 12: Biopython
Bioinformatics
Paweł K˛ dzierski
e
Contents of the lecture
2
Overview of Python programming lan-
guage
Simple syntax (quick to learn)
High-level, general purpose language (all and any type of
problems can be adressed)
Rich libraries of ready-to-use solutions (math, statistics, plot-
ting, databases, web, . . . )
Modern (object-oriented)
Contents
1
2
3
Overview of bio-related programming tools
Overview of Python programming language
Using Biopython
3.1 Features . . . . . . . . . . . . . . . . . . . . . . .
3.2 Documentation . . . . . . . . . . . . . . . . . . .
3.3 Examples . . . . . . . . . . . . . . . . . . . . . .
1
1
3
3
3
3
Why python?
1
Overview of bio-related programming
tools
Online resources:
http://www.python.org
http://docs.python.org/tutorial/
http://www.python.org.pl
Program structure
A program consists of definitions and instructions.
Definitions
are used to introduce new ”things” and giving
them some names.
Names
A name is any continuous chain of letters, digits and under-
scores with the exception that it can not start with a digit. The
names identify ”things” or objects in the program
Instructions
are
actions.
They are executed from left to right
on a line, and from the top to bottom.
There are also
control instructions
which may make the pro-
gram to skip some fragments, or to go back and re-execute
some fragments.
Within the computers,
everything
is a sequence of bits (the OS,
the software, data, . . . ). Its the
interpretation
of the bits which is
meaningful to humans.
The programming languages must be able to properly interpret the
stream of bits they work on. It is the essence of
data typing.
Python recognizes data
within the program
by the way it
is written (this is called
dynamic typing):
’text’ "text" ’’’text’’’ """text"""
integer numbers:
17 017 +0x17 -0xff
Introduction
The „informatics” part of bioinformatics:
Algorithms (still active area of research)
Statistical analysis of data and results
Database access: both local (SQL) and via WWW (specific)
Batch analyses (time consuming calculations)
Automation of routine tasks (e.g. think of Phylip)
High throughput analysis (genome-wide, population-wide,
microchip data etc.)
Bioinformatics related programming packages
All these are coordinated by the Open Bioinformatics Foundation
(O|B|F).
BioPerl
http://www.bioperl.org
Biopython
http://www.biopython.org
BioJava
http://www.biojava.org
BioRuby
http://www.bioruby.org
BioSQL
Project co-financed by European Union within European Social Fund
1
THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
real numbers:
-1.0 .5 12. 1e2 -1.01e-4
sequences of data:
’abcd’ (1,’a’) [1,2,3]
dictionaries (mappings):
{1:’a’, -1:’b’}
Functions
number = input(’Give me a number:’)
while number > 0:
print number
number += 1 # increase by one
if number > 100: break
A repeatable or general task can be separated from the main flow
of the instructions and defined as a standalone
function.
Then, such
Useful Python resources
a function can be executed as a whole (”called”) using syntax
subprocess (for automation of batch tasks)
function_name().
The function can accept data (the
parameters)
and/or produce some result (return
value).
General scientific calculations: scipy
Examples:
http://www.scipy.org
def hello1():
print ’Hello!’
def hello2(who):
print ’Hello,’, who, ’- I like you :-)’
def sum(a,b):
return a+b
Classes
Classes are used as definitions of containers for data and func-
tions working on that data.
Simple and general classes are built into Python: for example a list
of (any) items, which can be extended, modified, rearranged or
sorted.
A specialized class may represent MSA, with the abilities to access
columns, calculate consensus or conservativity, read the MSA from
a file or convert it to a different format.
class sack:
pass
mybag = sack()
mybag.text = ’Something to remember’
mybag.pin = ’1234’
print mybag.pin
Syntax is based on whitespace.
Python code
must
be properly indented. Otherwise you will
observe the dreadful
SyntaxError
or
IndentationError
messages.
variable = input(’Enter a number: ’)
if variable > 0:
print ’The number’, variable, ’is positive.’
elif variable == 0:
print ’It is zero.’
else:
print ’Your number:’, variable, ’, is negative.’
print "OK, I’m done."
For executing instructions over a sequence of items, use
for
loop:
for item in (’one’, ’second’, ’third’):
print item
If it is unknown a priori, how many times the loop will be
executed, one can execute or
break
the loop conditionally:
Plotting: matplotlib (Matlab alike), gnuplot, Mayavi (3D). . .
http://matplotlib.sf.net
http://gnuplot-py.sf.net, http://www.gnuplot.info
http://code.enthought.com/projects/mayavi/
Interface to the R statistical package: RPy
http://rpy.sf.net
http://www.r-project.org
Molecular Modeling in Python:
Molecular Modeling ToolKit (MMTK)
Biochemical ALgorithms Library (BALL)
PyQuante
Molecular Visualization
Embedded Python Molecular Viewer (ePMV)
http://epmv.scripps.edu/
Programs embedding Python
Modeller
VMD
NWChem
Python Imaging Library (PIL)
http://www.pythonware.com/products/pil/
PDF document generation:
ReportLab
http://www.reportlab.com
A
or write LTEX files
http://www.latex-project.org/
HTML document generation – directly or with modules like
HTMLgen, HTMLTags, HyperText or XIST
Graphical User Interface libraries
If you want that „pro” look & feel
Graphical User Interfaces for Python:
Tkinter (default, mature, relatively simple)
other based on Tkinter: easygui, Python Mega Widgets
(PMW)
http://www.ferg.org/easygui/
http://pmw.sf.net
wxPython (free, feature rich)
http://www.wxpython.org
http://www.wxwidgets.org
2
Project co-financed by European Union within European Social Fund
THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
PyQt (free/commercial, professional)
http://www.riverbankcomputing.co.uk/software/
pyqt/
http://qt.nokia.com
To make standalone Windows executables:
py2exe
http://www.py2exe.org
If the file has named fields (e.g. GenBank), one can access
them in resulting data structure by name (Python dictionary
alike);
The same data structures can also be initialized from database
queries or from command line.
3.2
Documentation
3
Using Biopython
There are books on Biopython and other mentioned Python
libraries (Amazon, O’Reilly)
Biopython tutorial
http://biopython.org/DIST/docs/tutorial/
Biopython Cookbook
http://biopython.org/DIST/docs/cookbook/
Biopython Wiki
http://biopython.org/wiki/Main_Page
Why Biopython
The functionality of Bio-. . . libraries is equivalent among lan-
guages, but:
Python is simple to learn and efficient to use;
fully object oriented;
simpler to learn (and to read code) than Perl;
more compact (quicker to write) than Java;
has many standard and additional libraries.
3.3
Examples
Word of caution: Biopython is a moving target.
This is partially because the supported WWW services are even
more a moving target, too (especially NCBI!); and in part due to
rapid development of bioinformatics and the „Bio-” packages.
The examples were working at the time of this writing, but. . .
They may stop/have stopped working at some time.
If this happens with your own scripts, don’t panic; such
changes are documented in Biopython FAQ (biopython.
org/DIST/docs/tutorial/Tutorial.html\#htoc5).
The most severe incompatibilities are introduced by WWW
database services. If this happens, upgrade your Biopython
version.
Example 1: Querying Entrez (NCBI)
NCBI provide server-side CGI scripts (eutils) with defined
HTTP API
http://www.ncbi.nlm.nih.gov/books/NBK25500/
This API is programmed into
Entrez
object of Biopython
Functions of this object reflect the names of eutils:
einfo
– information on Entres databases
esearch
– searching (querying) databases
The result is a list of accession numbers (ID’s)
efetch
– retrieve the real data (by ID/AC)
Outline of using eutils (not only via Biopython):
The first result is a „handle” – representing the connection to
the server;
3
3.1 Features
Data structures
Biopython has parsers for many file formats:
FASTA
GenBank
PubMed and Medline
UniGene
SwissProt
Clustalw
BLAST output (standalone and WWW versions)
ExPASy Enzyme and Prosite
SCOP
Data are represented by objects with rich interfaces;
Files are parsed into data structures with easy access to con-
tents
If the file has multiple entries (e.g. FASTA), one can loop over
the entries;
Project co-financed by European Union within European Social Fund
THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
Reading from the „handle” retrieves the result in form of an
XML document (or text). The handle could only be read once;
It is best to parse this XML to something (more) usable.
Biopython provides a parser –
Entrez.read()
– which con-
>>> handle
= Entrez.einfo(db=’pubmed’)
verts XML to a Python data structure (usually, a dictionary);
>>> result
= Entrez.read(handle)
>>> pubm_info = result[’DbInfo’]
Retrieving data is (at least) a two-step process; first one gets
>>> print ’DB description:’, pubm_info[’Description’]
AC ID’s only (from
esearch),
then these ID’s could be used
DB description: PubMed bibliographic record
with
efetch.
>>> print ’Count of records:’, pubm_info[’Count’]
Count of records: 21809524
Code snippets
>>> for field in pubm_info[’FieldList’]:
print ’%(Name)s: %(Description)s’ % field
Following the behaviour of interactive Python interpreter, the
...
instructions will be distinguished by
>>>
prompt;
If an instruction (or block of them) continue over more than
one line, it will be presented with secondary prompt (ellipsis);
Output printed by the program will be
colored.
>>>
x = 1
>>> for word in ’Biopython’, ’lecture’:
...
print word
Biopython
lecture
This is an example how to query which databases are available (and
their names to be used in programs).
>>> from Bio import Entrez
>>> Entrez.email = "Pawel.Kedzierski@pwr.wroc.pl"
>>> handle
= Entrez.einfo()
>>> xml_result
=
handle.read()
>>> print xml_result
<?xml version="1.0"?>
The result (of the last
for
loop): list of PubMed field names and
descriptions:
ALL: All terms from all searchable fields
UID: Unique number assigned to publication
FILT: Limits the records
TITL: Words in title of publication
WORD: Free text associated with publication
MESH: Medical Subject Headings assigned to publication
MAJR: MeSH terms of major importance to publication
AUTH: Author(s) of publication
JOUR: Journal abbreviation of publication
ECNO: EC number for enzyme or CAS registry number
SUBS: CAS chemical name or MEDLINE Substance Name
...
Use
term=’query’
with
esearch
to query databases:
Database specific info is also retrieved using
einfo.
For example,
to search a specific field, it is good to confirm that this field exists
in the database:
>>> handle = Entrez.esearch(db=’pubmed’, term=’biopython’)
>>> result = Entrez.read(handle)
>>> print ’Keys:’, result.keys()
Keys: [u’Count’, u’IdList’, u’QueryTranslation’, ... ]
>>> print result[’QueryTranslation’]
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoRes... biopython[All Fields]
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInf...
>>> print "List of ID’s:", result[’IdList’]
List of ID’s: [’22399473’, ’21666252’, ’21210977’, ... ]
<eInfoResult>
<DbList>
Use
efetch
to retrieve database records – by ID:
<DbName>pubmed</DbName>
>>> handle = Entrez.efetch(db=’pubmed’,
<DbName>protein</DbName>
...
id=’22399473’,
<DbName>nuccore</DbName>
...
rettype=’gb’, retmode=’text’)
The XML result could be parsed to get Python data structure
>>> result =
handle.read()
(dictionary-alike). Just use
Entrez.read()
to read (& parse) the
>>> print result
result:
1. Methods Mol Biol. 2012;856:513-27.
>>> handle = Entrez.einfo()
Sharing programming resources between Bio* projects throug
>>> result =
Entrez.read(handle)
and native call stack strategies.
>>> print result.keys()
[u’DbList’]
Prins P, Goto N, Yates A, Gautier L, Willis S, Fields C, K
>>> print result[’DbList’]
...
[’pubmed’, ’protein’, ’nucleotide’, ’nuccore’, ’nucgss’,
’nucest’, ’structure’, ’genome’, ’genomeprj’,
The same record, but we want structured information (XML):
’bioproject’, ’biosample’, ’biosystems’, ’gap’,
’blastdbinfo’, ’books’, ’cdd’, ’clone’, ... ]
>>> handle = Entrez.efetch(db=’pubmed’,
Project co-financed by European Union within European Social Fund
4
THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
...
id=’22399473’,
...
rettype=’gb’,
retmode=’xml’)
>>> result =
Entrez.read(handle)
This time, the result is NOT a dictionary. Actually one may fetch
multiple database records at once, so a sequence of records is ac-
tually expected. Parsing XML data from efetch always returns a
sequence-like object with results.
Note that the text result on previous slide has the number 1. at the
title – if there had been more, they would have been numbered
accordingly.
. . . continuing with structured results of fetching record no.
22399473
from PubMed:
>>> print ’Sequence of’, len(result), ’result(s)’
Sequence of 1 result(s)
>>> first = result[0]
>>> print ’Keys of first:’, first.keys()
Keys of first: [u’MedlineCitation’, u’PubmedData’]
>>> handle = Entrez.efetch(db=’nucleotide’,
...
id=’164513135’,
...
rettype=’gb’, retmode=’text’)
>>> result =
handle.read()
>>> print result
LOCUS
AM889703 836 bp DNA linear PLN 11-MAR-2009
DEFINITION Cypripedium calceolus chloroplast partial matK
ACCESSION
AM889703
VERSION
AM889703.1 GI:164513135
KEYWORDS
matK; maturase K.
SOURCE
chloroplast Cypripedium calceolus
ORGANISM Cypripedium calceolus
...
To get only the sequence, one may request FASTA format:
>>> handle = Entrez.efetch(db=’nucleotide’,
...
id=’164513135’,
...
rettype=’fasta’, retmode=’text’)
>>> result = handle.read()
We want article data, and they are to be found in
>>> print result
>gi|164513135|emb|AM889703.1| Cypripedium calceolus chloro
MedlineCitation:
TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTT
>>> print first[’MedlineCitation’].keys()
TGCATTTATTGCGATTGCTTTTCCACGAATATCATTATTTTAATAGTCT
[u’PMID’, u’DateCreated’, u’Article’, u’KeywordList’, ...
...]
>>> article = first[’MedlineCitation’][’Article’]
This is a normal Python string and it can be processed as such:
Information on articles:
>>> print ’’.join( result.split(’\n’)[1:] )
TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTTTGCATTTAT
>>> print article.keys()
[u’ArticleDate’, u’Pagination’, u’AuthorList’,
Fetching multiple FASTA sequences could be done using comma
u’Language’, u’Affiliation’, u’ArticleTitle’,
separated ID’s as a single string:
u’PublicationTypeList’, u’ELocationID’, u’Abstract’,
>>> handle = Entrez.efetch(db=’nucleotide’,
u’Journal’]
...
id=’164513135, 164513137, 380749748’,
>>> print article[’Pagination’][’MedlinePgn’]
...
rettype=’fasta’, retmode=’text’)
513-27
>>> result = handle.read()
>>> print article[’Journal’][’Title’]
>>> print result
Methods in molecular biology (Clifton, N.J.)
>gi|164513135|emb|AM889703.1| Cypripedium calceolus chloro
The data structures may be complicated; e.g. getting directly from
TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTTT
query result to abstract would require:
...
>gi|164513137|emb|AM889704.1| Cypripedium henryi chloropla
result[0][’MedlineCitation’][’Article’][’AbstractText’][0]
TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTTT
...
Example 2: Searching for sequences
>gi|380749748|gb|JN589959.1| UNVERIFIED: Cypripedium irape
How to query the Nucleotide dabatbase:
Using the XML format for access to both the sequence and the
>>> handle = Entrez.esearch(db=’nucleotide’,
annotations is easy, too:
...
term=’Cypripedioideae[ORGN] AND matK[GENE]’)
>>> handle = Entrez.efetch(db=’nucleotide’,
>>> result = Entrez.read(handle)
id=’164513135’,
>>> print ’Found’, result[’Count’], ’hits in database’...
...
rettype=’gb’, retmode=’xml’)
Found 75 hits in database
>>> result =
Entrez.read(handle)
>>> print result[’IdList’]
[’164513135’, ’164513137’, ’380749748’, ’327184538’,
>>> print
results[0][’GBSeq_sequence’]
tcgaacttagttcaaatcctgcaatgctggatcaaggatgttccttctttgca...
’327184534’, ’327184530’, ’327184526’, ’327184522’,
>>> for info in result:
’327184518’, ’327184514’, ’327184510’, ’327184506’,
...
for key in ’GBSeq_accession-version’,’GBSeq_moltype’
’327184502’, ’327184498’, ’327184494’, ’327184490’,
...
print ’%-25s %s’ % (key, info[key])
’327184486’, ’327184482’, ’327184478’, ’327184474’]
GBSeq_accession-version
AM889703.1
Now fetch the result:
GBSeq_moltype
DNA
Project co-financed by European Union within European Social Fund
5
Zgłoś jeśli naruszono regulamin