The GFS2 Filesystem
Steven Whitehouse
Red Hat, Inc.
swhiteho@redhat.com
Abstract
The GFS2 filesystem is a symmetric cluster filesystem designed to provide a high performance means of sharing a filesystem between nodes. This paper will give an overview of GFS2's main subsystems, features and differences from GFS1 before considering more recent developments in GFS2 such as the new on-disk layout of journaled files, the GFS2 metadata filesystem and what can be done with it, fast & fuzzy statfs, optimisations of readdir/getdents64, and optimisations of glocks (cluster locking). Finally, some possible future developments will be outlined.

To get the most from this talk you will need a good background in the basics of Linux filesystem internals and clustering concepts such as quorum and distributed locking.
1 Introduction

The GFS2 filesystem is a 64-bit, symmetric cluster filesystem which is derived from the earlier GFS filesystem. It is primarily designed for Storage Area Network (SAN) applications in which each node in a GFS2 cluster has equal access to the storage. In GFS and GFS2 there is no concept of a metadata server; all nodes run identical software and any node can potentially perform the same functions as any other node in the cluster.

In order to limit access to areas of the storage to maintain filesystem integrity, a lock manager is used. In GFS2 this is a distributed lock manager (DLM) [1] based upon the VAX DLM API. The Red Hat Cluster Suite provides the underlying cluster services (quorum, fencing) upon which the DLM and GFS2 depend.

It is also possible to use GFS2 as a local filesystem with the lock_nolock lock manager instead of the DLM. The locking subsystem is modular and is thus easily substituted in case of a future need for a more specialised lock manager.

2 Historical Detail

The original GFS [6] filesystem was developed by Matt O'Keefe's research group at the University of Minnesota. It used SCSI reservations to control access to the storage and ran on SGI's IRIX.

Later versions of GFS [5] were ported to Linux, mainly because the group found there was considerable advantage during development due to the easy availability of the source code. The locking subsystem was developed to give finer grained locking, initially by the use of special firmware in the disk drives (and eventually, also RAID controllers) which was intended to become a SCSI standard called dmep. There was also a network based version of dmep called memexp. Both of these standards worked on the basis of atomically updated areas of memory based upon a "compare and exchange" operation.

Later, when it was found that most people preferred the network based locking manager, the Grand Unified Locking Manager, gulm, was created, improving the performance over the original memexp based locking. This was the default locking manager for GFS until the DLM (see [1]) was written by Patrick Caulfield and Dave Teigland.

Sistina Software, Inc. was set up by Matt O'Keefe and began to exploit GFS commercially in late 1999/early 2000. Ken Preslan was the chief architect of that version of GFS (see [5]) as well as the version which forms Red Hat's current product. Red Hat acquired Sistina Software, Inc. in late 2003 and integrated the GFS filesystem into its existing product lines.

During the development and subsequent deployment of the GFS filesystem, a number of lessons were learned about where the performance and administrative problems occur. As a result, in early 2005 the GFS2 filesystem was designed and written, initially by Ken Preslan and more recently by the author, to improve upon the original design of GFS.

The GFS2 filesystem was submitted for inclusion in Linus' kernel and after a lengthy period of code review and modification, was accepted into 2.6.16.
3 The on-disk format

The on-disk format of GFS2 has, intentionally, stayed very much the same as that of GFS. The filesystem is big-endian on disk and most of the major structures have stayed compatible in terms of offsets of the fields common to both versions, which is most of them, in fact.

It is thus possible to perform an in-place upgrade of GFS to GFS2. When a few extra blocks are required for some of the per-node files (see the metafs filesystem, Subsection 3.5) these can be found by shrinking the areas of the disk originally allocated to journals in GFS. As a result, even a full GFS filesystem can be upgraded to GFS2 without needing the addition of further storage.

3.1 The superblock

GFS2's superblock is offset from the start of the disk by 64k of unused space. The reason for this is entirely historical in that in the dim and distant past, Linux used to read the first few sectors of the disk in the VFS mount code before control had passed to a filesystem. As a result, this data was being cached by the Linux buffer cache without any cluster locking. More recent versions of GFS were able to get around this by invalidating these sectors at mount time, and more recently still, the need for this gap has gone away entirely. It is retained only for backward compatibility reasons.

3.2 Resource groups

Following the superblock are a number of resource groups. These are similar to ext2/3 block groups in that their intent is to divide the disk into areas which helps to group together similar allocations. Additionally in GFS2, the resource groups allow parallel allocation from different nodes simultaneously as the locking granularity is one lock per resource group.

On-disk, each resource group consists of a header block with some summary information followed by a number of blocks containing the allocation bitmaps. There are two bits in the bitmap for each block in the resource group. This is followed by the blocks for which the resource group controls the allocation.

The two bits are nominally allocated/free and data (non-inode)/inode, with the exception that the free inode state is used to indicate inodes which are unlinked, but still open (see Table 1).

  Bit Pattern   Block State
  00            Free
  01            Allocated non-inode block
  10            Unlinked (still allocated) inode
  11            Allocated inode

Table 1: GFS2 Resource Group bitmap states

In GFS2 all metadata blocks start with a common header which includes fields indicating the type of the metadata block for ease of parsing, and these are also used extensively in checking for run-time errors.

Each resource group has a set of flags associated with it which are intended to be used in the future as part of a system to allow in-place upgrade of the filesystem. It is possible to mark resource groups such that they will no longer be used for allocations. This is the first part of a plan that will allow migration of the content of a resource group to eventually allow filesystem shrink and similar features.
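
To illustrate the two-bits-per-block scheme, the following sketch decodes a block's state from a raw resource group bitmap using the encoding of Table 1. The packing order (four blocks per byte, least significant bits first) and the helper names are assumptions made for this example rather than GFS2's actual on-disk helpers.

#include <stdint.h>
#include <stdio.h>

/* Assumed encoding of the two-bit states, following Table 1. */
enum rgrp_block_state {
    BLK_FREE           = 0, /* 00 */
    BLK_ALLOC_DATA     = 1, /* 01: allocated non-inode block */
    BLK_UNLINKED_INODE = 2, /* 10: unlinked, but still open, inode */
    BLK_ALLOC_INODE    = 3, /* 11: allocated inode */
};

/*
 * Return the two-bit state of block 'blk' within a resource group,
 * given the raw bitmap bytes.  Four blocks are packed per byte; the
 * exact bit ordering is an assumption for illustration only.
 */
static enum rgrp_block_state get_block_state(const uint8_t *bitmap, uint32_t blk)
{
    uint8_t byte = bitmap[blk / 4];
    unsigned shift = (blk % 4) * 2;

    return (enum rgrp_block_state)((byte >> shift) & 0x3);
}

int main(void)
{
    /* A toy bitmap: blocks 0..3 in states 00, 01, 10 and 11. */
    uint8_t bitmap[] = { 0x00 | (1 << 2) | (2 << 4) | (3 << 6) };

    for (uint32_t blk = 0; blk < 4; blk++)
        printf("block %u: state %u\n", blk, (unsigned)get_block_state(bitmap, blk));
    return 0;
}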
3.3 Inodes

GFS2's inodes have retained a very similar form to those of GFS in that each one spans an entire filesystem block with the remainder of the block being filled either with data (a "stuffed" inode) or with the first set of pointers in the metadata tree.

GFS2 has also inherited GFS's equal height metadata tree. This was designed to provide constant time access to the different areas of the file. Filesystems such as ext3, for example, have different depths of indirect pointers according to the file offset, whereas in GFS2 the tree is constant in depth no matter what the file offset is.

Initially the tree is formed by the pointers which can be fitted into the spare space in the inode block, and is then grown by adding another layer to the tree whenever the current tree size proves to be insufficient.

Like all the other metadata blocks in GFS2, the indirect pointer blocks also have the common metadata header. This unfortunately also means that the number of pointers they contain is no longer an integer power of two. This, again, was to keep compatibility with GFS, and in the future we eventually intend to move to an extent based system rather than change the number of pointers in the indirect blocks.
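
The equal-height property can be made concrete with a small calculation: given a block size, the number of pointers that fit in the inode block, and the number of pointers per indirect block, the tree height needed for a given file size follows directly. The figures used below are illustrative assumptions, not GFS2's actual geometry.

#include <stdint.h>
#include <stdio.h>

/*
 * Compute the metadata tree height required to map 'file_size' bytes.
 * Height 0 means the data (or the first pointers) fit in the inode
 * block itself; this is a simplification of the "stuffed" case.
 */
static unsigned tree_height(uint64_t file_size,
                            uint64_t block_size,     /* e.g. 4096 */
                            uint64_t ptrs_per_block, /* pointers per indirect block */
                            uint64_t ptrs_in_inode)  /* pointers in the inode block */
{
    /* Maximum bytes addressable by a tree of the current height. */
    uint64_t max_bytes = ptrs_in_inode * block_size;
    unsigned height = 1;

    if (file_size <= block_size)
        return 0;

    while (file_size > max_bytes) {
        max_bytes *= ptrs_per_block; /* each extra level multiplies the reach */
        height++;
    }
    return height;
}

int main(void)
{
    /* Illustrative figures: 4 KiB blocks, 509 pointers per indirect block
     * (not a power of two, since the common header steals space) and
     * 483 pointers in the inode block. */
    uint64_t sizes[] = { 2048, 1 << 20, 1ULL << 30, 1ULL << 40 };

    for (unsigned i = 0; i < 4; i++)
        printf("%llu bytes -> height %u\n",
               (unsigned long long)sizes[i],
               tree_height(sizes[i], 4096, 509, 483));
    return 0;
}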
3.3.1 Attributes

GFS2 supports the standard get/change attributes ioctl() used by ext2/3 and many other Linux filesystems. This allows setting or querying the attributes listed in Table 2.

  Attribute      Symbol   Get or Set
  Append Only    a        Get and set on regular inodes
  Immutable      i        Get and set on regular inodes
  Journaling     j        Set on regular files, get on all inodes
  No atime       A        Get and set on all inodes
  Sync Updates   S        Get and set on regular files
  Hashed dir     I        Get on directories only

Table 2: GFS2 Attributes

As a result GFS2 is directly supported by the lsattr(1) and chattr(1) commands. The hashed directory flag, I, indicates whether a directory is hashed or not. All directories which have grown beyond a certain size are hashed, and Section 3.4 gives further details.
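
Because this is the same ioctl() interface used by ext2/3, a userspace program can query the flags directly. The sketch below reads a file's flags with FS_IOC_GETFLAGS and reports those listed in Table 2; the file path is simply an example.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS and the FS_*_FL definitions */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    int flags = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* The same ioctl that lsattr(1) uses. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        perror("FS_IOC_GETFLAGS");
        close(fd);
        return 1;
    }
    close(fd);

    printf("%s:%s%s%s%s%s%s\n", path,
           (flags & FS_APPEND_FL)       ? " append-only" : "",
           (flags & FS_IMMUTABLE_FL)    ? " immutable"   : "",
           (flags & FS_JOURNAL_DATA_FL) ? " journaled"   : "",
           (flags & FS_NOATIME_FL)      ? " noatime"     : "",
           (flags & FS_SYNC_FL)         ? " sync"        : "",
           (flags & FS_INDEX_FL)        ? " hashed-dir"  : "");
    return 0;
}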
3.3.2 Extended Attributes & ACLs

GFS2 supports the extended attribute types user, system and security. It is therefore possible to run SELinux on a GFS2 filesystem.

GFS2 also supports POSIX ACLs.
3.4 Directories

GFS2's directories are based upon the paper "Extendible Hashing" by Fagin [3]. Using this scheme GFS2 has a fast directory lookup time for individual file names which scales to very large directories. Before ext3 gained hashed directories, this was the single most common reason for using GFS as a single node filesystem.

When a new GFS2 directory is created, it is "stuffed," in other words the directory entries are pushed into the same disk block as the inode. Each entry is similar to an ext3 directory entry in that it consists of a fixed length part followed by a variable length part containing the file name. The fixed length part contains fields to indicate the total length of the entry and the offset to the next entry.

Once enough entries have been added that it's no longer possible to fit them all in the directory block itself, the directory is turned into a hashed directory. In this case, the hash table takes the place of the directory entries in the directory block and the entries are moved into a directory "leaf" block.

In the first instance, the hash table size is chosen to be half the size of the inode disk block. This allows it to coexist with the inode in that block. Each entry in the hash table is a pointer to a leaf block which contains a number of directory entries. Initially, all the pointers in the hash table point to the same leaf block. When that leaf block fills up, half the pointers are changed to point to a new block and the existing directory entries are moved to the new leaf block, or left in the existing one, according to their respective hash values.

Eventually, all the pointers will point to different blocks, assuming that the hash function (in this case a CRC-32) has resulted in a reasonably even distribution of directory entries. At this point the directory hash table is removed from the inode block and written into what would be the data blocks of a regular file. This allows the doubling in size of the hash table which then occurs each time all the pointers are exhausted.

Eventually, when the directory hash table has reached a maximum size, further entries are added by chaining leaf blocks to the existing directory leaf blocks.

As a result, for all but the largest directories, a single hash lookup results in reading the directory block which contains the required entry.

Things are a bit more complicated when it comes to the readdir function, as this requires that the entries in each hash chain are sorted according to their hash value (which is also used as the file position for lseek) in order to avoid the problem of seeing entries twice, or missing them entirely, in case a directory is expanded during a set of repeated calls to readdir. This is discussed further in the section on future developments.
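
The lookup path can be sketched as: hash the name with CRC-32, use part of the hash to index the table of leaf pointers, then scan the selected leaf. The in-memory layout, which bits of the hash index the table, and the helper names below are illustrative assumptions rather than GFS2's real implementation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy CRC-32 (reflected, polynomial 0xEDB88320), bitwise for brevity. */
static uint32_t crc32_name(const char *name)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (; *name; name++) {
        crc ^= (uint8_t)*name;
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}

/* A much-simplified in-memory "leaf block" holding a few entries. */
struct entry { const char *name; uint64_t inum; };
struct leaf  { struct entry entries[8]; int count; };

/* Linear scan of one leaf; real leaves may also be chained. */
static int leaf_lookup(const struct leaf *leaf, const char *name, uint64_t *inum)
{
    for (int i = 0; i < leaf->count; i++)
        if (strcmp(leaf->entries[i].name, name) == 0) {
            *inum = leaf->entries[i].inum;
            return 0;
        }
    return -1;
}

/*
 * Extendible-hashing style lookup: the table has 1 << depth pointers
 * and the top 'depth' bits of the CRC-32 select the slot.  Which bits
 * are used is an assumption for this sketch.
 */
static int dir_lookup(struct leaf **table, unsigned depth,
                      const char *name, uint64_t *inum)
{
    uint32_t hash = crc32_name(name);
    uint32_t slot = depth ? (hash >> (32 - depth)) : 0;

    return leaf_lookup(table[slot], name, inum);
}

int main(void)
{
    /* depth == 1: two table slots, both still pointing at one leaf. */
    struct leaf l = { { { "a.txt", 100 }, { "b.txt", 101 } }, 2 };
    struct leaf *table[2] = { &l, &l };
    uint64_t inum;

    if (dir_lookup(table, 1, "b.txt", &inum) == 0)
        printf("b.txt -> inode %llu\n", (unsigned long long)inum);
    return 0;
}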
3.5 The metadata filesystem

There are a number of special files created by mkfs.gfs2 which are used to store additional metadata related to the filesystem. These are accessible by mounting the gfs2meta filesystem specifying a suitable gfs2 filesystem. Normally users would not do this operation directly since it is done by the GFS2 tools as and when required.

Under the root directory of the metadata filesystem (called the master directory in order that it is not confused with the real root directory) are a number of files and directories. The most important of these is the resource index (rindex) whose fixed-size entries list the disk locations of the resource groups.

3.5.1 Journals

Below the master directory there is a subdirectory which contains all the journals belonging to the different nodes of a GFS2 filesystem. The maximum number of nodes which can mount the filesystem simultaneously is set by the number of journals in this subdirectory. New journals can be created simply by adding a suitably initialised file to this directory. This is done (along with the other adjustments required) by the gfs2_jadd tool.

3.5.2 Quota file

The quota file contains the system wide summary of all the quota information. This information is synced periodically, and also based on how close each user is to their actual quota allocation. This means that although it is possible for a user to exceed their allocated quota (by a maximum of two times) this is in practice extremely unlikely to occur. The time period over which quota syncs take place is adjustable via sysfs.

3.5.3 statfs

The statfs files (there is a master one, and one in each per_node subdirectory) contain the information required to give a fast (although not 100% accurate) result for the statfs system call. For large filesystems mounted on a number of nodes, the conventional approach to statfs (i.e., iterating through all the resource groups) requires a lot of CPU time and can trigger a lot of I/O, making it rather inefficient. To avoid this, GFS2 by default uses these files to keep an approximation of the true figure which is periodically synced back up to the master file.

There is a sysfs interface to allow adjustment of the sync period, or alternatively to turn off the fast & fuzzy statfs and go back to the original 100% correct, but slower, implementation.
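
From an application's point of view the fast & fuzzy behaviour is transparent: the ordinary statfs(2)/statvfs(3) call is what receives the approximate figures. A minimal caller, assuming a GFS2 filesystem mounted at /mnt/gfs2 (the mount point is just an example):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs sv;
    const char *mnt = "/mnt/gfs2";   /* example mount point */

    if (statvfs(mnt, &sv) != 0) {
        perror("statvfs");
        return 1;
    }
    /* With fast & fuzzy statfs these figures are approximations that
     * are periodically synced back to the master statfs file. */
    printf("%s: block size %lu, blocks %llu, free %llu, files %llu\n",
           mnt, sv.f_bsize,
           (unsigned long long)sv.f_blocks,
           (unsigned long long)sv.f_bfree,
           (unsigned long long)sv.f_files);
    return 0;
}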
3.5.4 inum

These files are used to allocate the no_formal_ino part of GFS2's struct gfs2_inum structure. This is effectively a version number which is mostly used by NFS, although it is also present in the directory entry structure as well. The aim is to give each inode an additional number to make it unique over time. The master inum file is used to allocate ranges to each node, which are then replenished when they've been used up.
4 Locking

Whereas most filesystems define an on-disk format which has to be largely invariant and are then free to change their internal implementation as needs arise, GFS2 also has to specify its locking with the same degree of care as for the on-disk format to ensure future compatibility.

GFS2 internally divides its cluster locks (known as glocks) into several types, and within each type a 64-bit lock number identifies individual locks. A lock name is the concatenation of the glock type and glock number, and this is converted into an ASCII string to be passed to the DLM. The DLM refers to these locks as resources. The glock types and their uses are listed in Table 3. Each resource is associated with a lock value block (LVB) which is a small area of memory which may be used to hold a few bytes of data relevant to that resource. Lock requests are sent to the DLM by GFS2 for each resource which GFS2 wants to acquire a lock upon.

  Lock type   Use
  Non-disk    mount/umount/recovery
  Meta        The superblock
  Inode       Inode metadata & data
  Iopen       Inode last closer detection
  Rgrp        Resource group metadata
  Trans       Transaction lock
  Flock       flock(2) syscall
  Quota       Quota operations
  Journal     Journal mutex

Table 3: GFS2 lock types

All holders of DLM locks may potentially receive callbacks from other intending holders of locks should the DLM receive a request for a lock on a particular resource with a conflicting mode. This is used to trigger an action such as writing back dirty data and/or invalidating pages in the page cache when an inode's lock is being requested by another node.

GFS2 uses three lock modes internally: exclusive, shared and deferred. The deferred lock mode is effectively another shared lock mode which is incompatible with the normal shared lock mode. It is used to ensure that direct I/O is cluster coherent by forcing any cached pages for an inode to be disposed of on all nodes in the cluster before direct I/O commences. These are mapped to the DLM's lock modes (only three of the six modes are used) as shown in Table 4.

The DLM's DLM_LOCK_NL (Null) lock mode is used as a reference count on the resource to maintain the value of the LVB for that resource. Locks for which GFS2 doesn't maintain a reference count in this way (or are unlocked) may have the content of their LVBs set to zero upon the next use of that particular lock.
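
The mapping of GFS2's internal lock modes onto DLM modes, and the construction of a lock name from a glock type and number, can be sketched as follows. The DLM mode names mirror those of the Linux DLM interface (copied here so the example is self-contained), but the example glock type and the exact lock-name formatting are illustrative assumptions rather than details taken from the paper.

#include <stdint.h>
#include <stdio.h>

/* DLM lock modes, mirroring the Linux DLM interface. */
enum dlm_mode {
    DLM_LOCK_NL = 0,  /* null             */
    DLM_LOCK_CR = 1,  /* concurrent read  */
    DLM_LOCK_CW = 2,  /* concurrent write */
    DLM_LOCK_PR = 3,  /* protected read   */
    DLM_LOCK_PW = 4,  /* protected write  */
    DLM_LOCK_EX = 5,  /* exclusive        */
};

/* GFS2's three internal modes, as described in the text. */
enum glock_mode { GL_SHARED, GL_DEFERRED, GL_EXCLUSIVE };

/* Only three of the six DLM modes are used by GFS2. */
static enum dlm_mode glock_to_dlm(enum glock_mode m)
{
    switch (m) {
    case GL_SHARED:    return DLM_LOCK_PR; /* normal shared access */
    case GL_DEFERRED:  return DLM_LOCK_CW; /* shared, but incompatible with
                                              PR; used for direct I/O */
    case GL_EXCLUSIVE: return DLM_LOCK_EX;
    }
    return DLM_LOCK_NL;
}

/*
 * Build a resource name from the glock type and 64-bit lock number.
 * The fixed-width hex rendering here is an assumption; the paper only
 * says the two are concatenated and converted to an ASCII string.
 */
static void glock_name(char *buf, size_t len, unsigned type, uint64_t number)
{
    snprintf(buf, len, "%8x%16llx", type, (unsigned long long)number);
}

int main(void)
{
    char name[32];

    glock_name(name, sizeof(name), 2 /* hypothetical inode glock type */, 0x1234);
    printf("resource \"%s\", mode %d\n", name, glock_to_dlm(GL_SHARED));
    return 0;
}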
5 NFS

The GFS2 interface to NFS has been carefully designed to allow failover from one GFS2/NFS server to another, even if those GFS2/NFS servers have CPUs of a different endianness. In order to allow this, the filehandles must be constructed using the fsid= method. GFS2 will automatically convert endianness during the decoding of the filehandles.

6 Application writers' notes

In order to ensure the best possible performance of an application on GFS2, there are some basic principles which need to be followed. The advice given in this section can be considered a FAQ for application writers and system administrators of GFS2 filesystems.

There are two simple rules to follow:

- Make maximum use of caching
- Watch out for lock contention
When GFS2 performs an operation on an inode, it first has to gain the necessary locks, and since this potentially requires a journal flush and/or page cache invalidate on a remote node, it can be an expensive operation. As a result, for best performance in a cluster scenario it is vitally important to ensure that, wherever possible, applications do not contend for locks on the same set of files. GFS2 uses one lock per inode, so directories may become points of contention when large numbers of inserts and deletes occur in the same directory from multiple nodes. This can rapidly degrade performance.
The single most common question asked relating to GFS2 performance is how to run an SMTP/IMAP email server in an efficient manner. Ideally the spool directory is broken up into a number of subdirectories, each of which can be cached separately, resulting in fewer locks being bounced from node to node and less data being flushed when it does happen. It is also useful if the locality of the nodes to a particular set of directories can be enhanced using other methods (e.g., DNS) in the case of an email server which serves multiple virtual hosts.
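
One hypothetical way to apply this advice is to derive the spool subdirectory from a hash of the mailbox name, so that each node tends to work within a stable subset of directories. The layout, bucket count and helper below are illustrative assumptions, not a prescription from the paper.

#include <stdio.h>
#include <stdint.h>

/*
 * Spread mailboxes across a fixed set of spool subdirectories so that
 * nodes tend to work in different directories, reducing lock bouncing
 * on any single directory inode.
 */
#define SPOOL_ROOT "/var/spool/mail"   /* example path */
#define SPOOL_BUCKETS 64               /* example bucket count */

/* Simple FNV-1a string hash; any stable hash would do. */
static uint32_t hash_name(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* Build "/var/spool/mail/<bucket>/<mailbox>" into buf. */
static void mailbox_path(char *buf, size_t len, const char *mailbox)
{
    snprintf(buf, len, "%s/%02u/%s", SPOOL_ROOT,
             (unsigned)(hash_name(mailbox) % SPOOL_BUCKETS), mailbox);
}

int main(void)
{
    char path[256];

    mailbox_path(path, sizeof(path), "alice@example.com");
    printf("%s\n", path);
    return 0;
}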