lecture12_text.pdf

(647 KB) Pobierz

THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY

Lecture 12: Biopython

Bioinformatics

Paweł K˛ dzierski

Contents of the lecture

Overview of Python programming lan-

guage

•

Simple syntax (quick to learn)

•

High-level, general purpose language (all and any type of

problems can be adressed)

•

Rich libraries of ready-to-use solutions (math, statistics, plot-

ting, databases, web, . . . )

•

Modern (object-oriented)

Contents

Overview of bio-related programming tools

Overview of Python programming language

Using Biopython

3.1 Features . . . . . . . . . . . . . . . . . . . . . . .

3.2 Documentation . . . . . . . . . . . . . . . . . . .

3.3 Examples . . . . . . . . . . . . . . . . . . . . . .

Why python?

Overview of bio-related programming

tools

Online resources:

http://www.python.org

http://docs.python.org/tutorial/

http://www.python.org.pl

•

Program structure

A program consists of deﬁnitions and instructions.

•

Deﬁnitions

are used to introduce new ”things” and giving

them some names.

•

Names

A name is any continuous chain of letters, digits and under-

scores with the exception that it can not start with a digit. The

names identify ”things” or objects in the program

•

Instructions

are

actions.

They are executed from left to right

on a line, and from the top to bottom.

There are also

control instructions

which may make the pro-

gram to skip some fragments, or to go back and re-execute

some fragments.

Within the computers,

everything

is a sequence of bits (the OS,

the software, data, . . . ). Its the

interpretation

of the bits which is

meaningful to humans.

The programming languages must be able to properly interpret the

stream of bits they work on. It is the essence of

data typing.

•

Python recognizes data

within the program

by the way it

is written (this is called

dynamic typing):

’text’ "text" ’’’text’’’ """text"""

integer numbers:

17 017 +0x17 -0xff

Introduction

The „informatics” part of bioinformatics:

•

Algorithms (still active area of research)

•

Statistical analysis of data and results

•

Database access: both local (SQL) and via WWW (speciﬁc)

•

Batch analyses (time consuming calculations)

•

Automation of routine tasks (e.g. think of Phylip)

•

High throughput analysis (genome-wide, population-wide,

microchip data etc.)

Bioinformatics related programming packages

All these are coordinated by the Open Bioinformatics Foundation

(O|B|F).

•

BioPerl

http://www.bioperl.org

•

Biopython

http://www.biopython.org

•

BioJava

http://www.biojava.org

•

BioRuby

http://www.bioruby.org

•

BioSQL

Project co-ﬁnanced by European Union within European Social Fund

THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY

real numbers:

-1.0 .5 12. 1e2 -1.01e-4

sequences of data:

’abcd’ (1,’a’) [1,2,3]

dictionaries (mappings):

{1:’a’, -1:’b’}

•

Functions

number = input(’Give me a number:’)

while number > 0:

print number

number += 1 # increase by one

if number > 100: break

A repeatable or general task can be separated from the main ﬂow

of the instructions and deﬁned as a standalone

function.

Then, such

Useful Python resources

a function can be executed as a whole (”called”) using syntax

•

subprocess (for automation of batch tasks)

function_name().

The function can accept data (the

parameters)

and/or produce some result (return

value).

•

General scientiﬁc calculations: scipy

Examples:

http://www.scipy.org

def hello1():

print ’Hello!’

def hello2(who):

print ’Hello,’, who, ’- I like you :-)’

def sum(a,b):

return a+b

•

Classes

Classes are used as deﬁnitions of containers for data and func-

tions working on that data.

Simple and general classes are built into Python: for example a list

of (any) items, which can be extended, modiﬁed, rearranged or

sorted.

A specialized class may represent MSA, with the abilities to access

columns, calculate consensus or conservativity, read the MSA from

a ﬁle or convert it to a different format.

class sack:

pass

mybag = sack()

mybag.text = ’Something to remember’

mybag.pin = ’1234’

print mybag.pin

•

Syntax is based on whitespace.

Python code

must

be properly indented. Otherwise you will

observe the dreadful

SyntaxError

IndentationError

messages.

variable = input(’Enter a number: ’)

if variable > 0:

print ’The number’, variable, ’is positive.’

elif variable == 0:

print ’It is zero.’

else:

print ’Your number:’, variable, ’, is negative.’

print "OK, I’m done."

•

For executing instructions over a sequence of items, use

for

loop:

for item in (’one’, ’second’, ’third’):

print item

•

If it is unknown a priori, how many times the loop will be

executed, one can execute or

break

the loop conditionally:

•

Plotting: matplotlib (Matlab alike), gnuplot, Mayavi (3D). . .

http://matplotlib.sf.net

http://gnuplot-py.sf.net, http://www.gnuplot.info

http://code.enthought.com/projects/mayavi/

•

Interface to the R statistical package: RPy

http://rpy.sf.net

http://www.r-project.org

•

Molecular Modeling in Python:

Molecular Modeling ToolKit (MMTK)

Biochemical ALgorithms Library (BALL)

PyQuante

•

Molecular Visualization

Embedded Python Molecular Viewer (ePMV)

http://epmv.scripps.edu/

•

Programs embedding Python

Modeller

VMD

NWChem

•

Python Imaging Library (PIL)

http://www.pythonware.com/products/pil/

•

PDF document generation:

ReportLab

http://www.reportlab.com

or write LTEX ﬁles

http://www.latex-project.org/

•

HTML document generation – directly or with modules like

HTMLgen, HTMLTags, HyperText or XIST

•

Graphical User Interface libraries

If you want that „pro” look & feel

Graphical User Interfaces for Python:

•

Tkinter (default, mature, relatively simple)

•

other based on Tkinter: easygui, Python Mega Widgets

(PMW)

http://www.ferg.org/easygui/

http://pmw.sf.net

•

wxPython (free, feature rich)

http://www.wxpython.org

http://www.wxwidgets.org

Project co-ﬁnanced by European Union within European Social Fund

THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY

•

PyQt (free/commercial, professional)

http://www.riverbankcomputing.co.uk/software/

pyqt/

http://qt.nokia.com

To make standalone Windows executables:

•

py2exe

http://www.py2exe.org

•

If the ﬁle has named ﬁelds (e.g. GenBank), one can access

them in resulting data structure by name (Python dictionary

alike);

•

The same data structures can also be initialized from database

queries or from command line.

3.2

Documentation

Using Biopython

•

There are books on Biopython and other mentioned Python

libraries (Amazon, O’Reilly)

•

Biopython tutorial

http://biopython.org/DIST/docs/tutorial/

•

Biopython Cookbook

http://biopython.org/DIST/docs/cookbook/

•

Biopython Wiki

http://biopython.org/wiki/Main_Page

Why Biopython

The functionality of Bio-. . . libraries is equivalent among lan-

guages, but:

•

Python is simple to learn and efﬁcient to use;

•

fully object oriented;

•

simpler to learn (and to read code) than Perl;

•

more compact (quicker to write) than Java;

•

has many standard and additional libraries.

3.3

Examples

Word of caution: Biopython is a moving target.

This is partially because the supported WWW services are even

more a moving target, too (especially NCBI!); and in part due to

rapid development of bioinformatics and the „Bio-” packages.

•

The examples were working at the time of this writing, but. . .

•

They may stop/have stopped working at some time.

•

If this happens with your own scripts, don’t panic; such

changes are documented in Biopython FAQ (biopython.

org/DIST/docs/tutorial/Tutorial.html\#htoc5).

•

The most severe incompatibilities are introduced by WWW

database services. If this happens, upgrade your Biopython

version.

Example 1: Querying Entrez (NCBI)

•

NCBI provide server-side CGI scripts (eutils) with deﬁned

HTTP API

http://www.ncbi.nlm.nih.gov/books/NBK25500/

•

This API is programmed into

Entrez

object of Biopython

•

Functions of this object reﬂect the names of eutils:

–

einfo

– information on Entres databases

–

esearch

– searching (querying) databases

The result is a list of accession numbers (ID’s)

–

efetch

– retrieve the real data (by ID/AC)

Outline of using eutils (not only via Biopython):

•

The ﬁrst result is a „handle” – representing the connection to

the server;

3.1 Features

Data structures

Biopython has parsers for many ﬁle formats:

•

FASTA

•

GenBank

•

PubMed and Medline

•

UniGene

•

SwissProt

•

Clustalw

•

BLAST output (standalone and WWW versions)

•

ExPASy Enzyme and Prosite

•

SCOP

•

Data are represented by objects with rich interfaces;

•

Files are parsed into data structures with easy access to con-

tents

•

If the ﬁle has multiple entries (e.g. FASTA), one can loop over

the entries;

Project co-ﬁnanced by European Union within European Social Fund

THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY

•

Reading from the „handle” retrieves the result in form of an

XML document (or text). The handle could only be read once;

•

It is best to parse this XML to something (more) usable.

Biopython provides a parser –

Entrez.read()

– which con-

>>> handle

= Entrez.einfo(db=’pubmed’)

verts XML to a Python data structure (usually, a dictionary);

>>> result

= Entrez.read(handle)

>>> pubm_info = result[’DbInfo’]

•

Retrieving data is (at least) a two-step process; ﬁrst one gets

>>> print ’DB description:’, pubm_info[’Description’]

AC ID’s only (from

esearch),

then these ID’s could be used

DB description: PubMed bibliographic record

with

efetch.

>>> print ’Count of records:’, pubm_info[’Count’]

Count of records: 21809524

Code snippets

>>> for field in pubm_info[’FieldList’]:

print ’%(Name)s: %(Description)s’ % field

•

Following the behaviour of interactive Python interpreter, the

...

instructions will be distinguished by

>>>

prompt;

•

If an instruction (or block of them) continue over more than

one line, it will be presented with secondary prompt (ellipsis);

•

Output printed by the program will be

colored.

>>>

x = 1

>>> for word in ’Biopython’, ’lecture’:

...

print word

Biopython

lecture

This is an example how to query which databases are available (and

their names to be used in programs).

>>> from Bio import Entrez

>>> Entrez.email = "Pawel.Kedzierski@pwr.wroc.pl"

>>> handle

= Entrez.einfo()

>>> xml_result

handle.read()

>>> print xml_result

<?xml version="1.0"?>

The result (of the last

for

loop): list of PubMed ﬁeld names and

descriptions:

ALL: All terms from all searchable fields

UID: Unique number assigned to publication

FILT: Limits the records

TITL: Words in title of publication

WORD: Free text associated with publication

MESH: Medical Subject Headings assigned to publication

MAJR: MeSH terms of major importance to publication

AUTH: Author(s) of publication

JOUR: Journal abbreviation of publication

ECNO: EC number for enzyme or CAS registry number

SUBS: CAS chemical name or MEDLINE Substance Name

...

Use

term=’query’

with

esearch

to query databases:

Database speciﬁc info is also retrieved using

einfo.

For example,

to search a speciﬁc ﬁeld, it is good to conﬁrm that this ﬁeld exists

in the database:

>>> handle = Entrez.esearch(db=’pubmed’, term=’biopython’)

>>> result = Entrez.read(handle)

>>> print ’Keys:’, result.keys()

Keys: [u’Count’, u’IdList’, u’QueryTranslation’, ... ]

>>> print result[’QueryTranslation’]

<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoRes... biopython[All Fields]

"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInf...

>>> print "List of ID’s:", result[’IdList’]

List of ID’s: [’22399473’, ’21666252’, ’21210977’, ... ]

Use

efetch

to retrieve database records – by ID:

<DbName>pubmed</DbName>

>>> handle = Entrez.efetch(db=’pubmed’,

<DbName>protein</DbName>

...

id=’22399473’,

<DbName>nuccore</DbName>

...

rettype=’gb’, retmode=’text’)

The XML result could be parsed to get Python data structure

>>> result =

handle.read()

(dictionary-alike). Just use

Entrez.read()

to read (& parse) the

>>> print result

result:

1. Methods Mol Biol. 2012;856:513-27.

>>> handle = Entrez.einfo()

Sharing programming resources between Bio* projects throug

>>> result =

Entrez.read(handle)

and native call stack strategies.

>>> print result.keys()

[u’DbList’]

Prins P, Goto N, Yates A, Gautier L, Willis S, Fields C, K

>>> print result[’DbList’]

...

[’pubmed’, ’protein’, ’nucleotide’, ’nuccore’, ’nucgss’,

’nucest’, ’structure’, ’genome’, ’genomeprj’,

The same record, but we want structured information (XML):

’bioproject’, ’biosample’, ’biosystems’, ’gap’,

’blastdbinfo’, ’books’, ’cdd’, ’clone’, ... ]

>>> handle = Entrez.efetch(db=’pubmed’,

Project co-ﬁnanced by European Union within European Social Fund

THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY

...

id=’22399473’,

...

rettype=’gb’,

retmode=’xml’)

>>> result =

Entrez.read(handle)

This time, the result is NOT a dictionary. Actually one may fetch

multiple database records at once, so a sequence of records is ac-

tually expected. Parsing XML data from efetch always returns a

sequence-like object with results.

Note that the text result on previous slide has the number 1. at the

title – if there had been more, they would have been numbered

accordingly.

. . . continuing with structured results of fetching record no.

22399473

from PubMed:

>>> print ’Sequence of’, len(result), ’result(s)’

Sequence of 1 result(s)

>>> first = result[0]

>>> print ’Keys of first:’, first.keys()

Keys of first: [u’MedlineCitation’, u’PubmedData’]

>>> handle = Entrez.efetch(db=’nucleotide’,

...

id=’164513135’,

...

rettype=’gb’, retmode=’text’)

>>> result =

handle.read()

>>> print result

LOCUS

AM889703 836 bp DNA linear PLN 11-MAR-2009

DEFINITION Cypripedium calceolus chloroplast partial matK

ACCESSION

AM889703

VERSION

AM889703.1 GI:164513135

KEYWORDS

matK; maturase K.

SOURCE

chloroplast Cypripedium calceolus

ORGANISM Cypripedium calceolus

...

To get only the sequence, one may request FASTA format:

>>> handle = Entrez.efetch(db=’nucleotide’,

...

id=’164513135’,

...

rettype=’fasta’, retmode=’text’)

>>> result = handle.read()

We want article data, and they are to be found in

>>> print result

>gi|164513135|emb|AM889703.1| Cypripedium calceolus chloro

MedlineCitation:

TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTT

>>> print first[’MedlineCitation’].keys()

TGCATTTATTGCGATTGCTTTTCCACGAATATCATTATTTTAATAGTCT

[u’PMID’, u’DateCreated’, u’Article’, u’KeywordList’, ...

...]

>>> article = first[’MedlineCitation’][’Article’]

This is a normal Python string and it can be processed as such:

Information on articles:

>>> print ’’.join( result.split(’\n’)[1:] )

TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTTTGCATTTAT

>>> print article.keys()

[u’ArticleDate’, u’Pagination’, u’AuthorList’,

Fetching multiple FASTA sequences could be done using comma

u’Language’, u’Affiliation’, u’ArticleTitle’,

separated ID’s as a single string:

u’PublicationTypeList’, u’ELocationID’, u’Abstract’,

>>> handle = Entrez.efetch(db=’nucleotide’,

u’Journal’]

...

id=’164513135, 164513137, 380749748’,

>>> print article[’Pagination’][’MedlinePgn’]

...

rettype=’fasta’, retmode=’text’)

513-27

>>> result = handle.read()

>>> print article[’Journal’][’Title’]

>>> print result

Methods in molecular biology (Clifton, N.J.)

>gi|164513135|emb|AM889703.1| Cypripedium calceolus chloro

The data structures may be complicated; e.g. getting directly from

TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTTT

query result to abstract would require:

...

>gi|164513137|emb|AM889704.1| Cypripedium henryi chloropla

result[0][’MedlineCitation’][’Article’][’AbstractText’][0]

TCGAACTTAGTTCAAATCCTGCAATGCTGGATCAAGGATGTTCCTTCTTT

...

Example 2: Searching for sequences

>gi|380749748|gb|JN589959.1| UNVERIFIED: Cypripedium irape

How to query the Nucleotide dabatbase:

Using the XML format for access to both the sequence and the

>>> handle = Entrez.esearch(db=’nucleotide’,

annotations is easy, too:

...

term=’Cypripedioideae[ORGN] AND matK[GENE]’)

>>> handle = Entrez.efetch(db=’nucleotide’,

>>> result = Entrez.read(handle)

id=’164513135’,

>>> print ’Found’, result[’Count’], ’hits in database’...

...

rettype=’gb’, retmode=’xml’)

Found 75 hits in database

>>> result =

Entrez.read(handle)

>>> print result[’IdList’]

[’164513135’, ’164513137’, ’380749748’, ’327184538’,

>>> print

results[0][’GBSeq_sequence’]

tcgaacttagttcaaatcctgcaatgctggatcaaggatgttccttctttgca...

’327184534’, ’327184530’, ’327184526’, ’327184522’,

>>> for info in result:

’327184518’, ’327184514’, ’327184510’, ’327184506’,

...

for key in ’GBSeq_accession-version’,’GBSeq_moltype’

’327184502’, ’327184498’, ’327184494’, ’327184490’,

...

print ’%-25s %s’ % (key, info[key])

’327184486’, ’327184482’, ’327184478’, ’327184474’]

GBSeq_accession-version

AM889703.1

Now fetch the result:

GBSeq_moltype

DNA

Project co-ﬁnanced by European Union within European Social Fund

Plik z chomika:

twoj.inzynier

Inne pliki z tego folderu:

lecture12_handouts.pdf (6158 KB)
lecture12_text.pdf (647 KB)

lecture12_text.pdf

Plik z chomika:

Inne pliki z tego folderu:

Inne foldery tego chomika: