Applying signal theory to the analysis of biomolecules

Gerhard Kauer & Helmut Blöcker 

Department of Genome Analysis, GBF – German Research Centre for Biotechnology, Mascheroder Weg 1, D-38124 Braunschweig, Germany

 

With getting a signal from the biomolecule to be investigated, the proven techniques for signal-analysis become available. One attribute of information-carrying biomolecules like DNA or Protein is, that their data-set is really large-scale. The amount of data to be taken into account reaches easily billions of single data points. The signal derived from them is often very complex. “Speed matters”, so that this fact must be considered for calculus needed.

In technical applications, very complicated functions (e.g. measurement data) can be transferred into a mathematically much more easy-to-handle sum of sine and cosine functions via Fourier transformation (polynomials, Figure 1). In this connection, the Fourier transform is a function which can indicate the amplitude and phase of the sine function associated with each frequency. The amplitude describes the deflection of the sine function, whereas the phase defines the function's passage through the ordinate origin (cf. Figure 2).

This procedure goes back to the ingenious and passionate mathematician Jean-Baptiste-Joseph Fourier (1768-1830).

Especially for very large data volumes, Fourier transformation, accelerated by special DFFT signal processors, represents an efficient analysis platform. Many mathematical operations for the analysis of these often rather complex measurement data can be applied much more easily and rapidly after transformation.

 

 

Figure 1:  A measured signal can be represented by the addition of two sine functions (green and blue function). The blue function is not phase-shifted and displays the amplitude a = 1. The green function is phase-shifted (F = 1.14) and has a larger elongation.

 

The procedures of Fourier analysis described here have proved efficient in many technical applications. By transferring primary sequence information into a signal, it is finally possible to modify proven techniques e.g. from digital image analysis and speech recognition so that they can lead to interesting results in the field of genome analysis and intergenomic comparisons. When representing the primary structure of a biomolecule (e.g. DNA sequence code) as a signal, it is not only conceivable to use the methods of transformation analysis (Fourier, Hartley, Hough) but, in principle, the mathematics of wavelet analysis and the methods of fuzzy logic could also be applied. For this purpose, however, a method must first be found to represent DNA as a signal in terms of signal theory. Furthermore, a method must be found to efficiently transform the enormous sequence data of whole genomes from the spatial domain into the frequency domain. This requires a basic understanding of the transformations (e.g. Fourier transformation) to be used and of their boundary conditions as well as the determination of the coefficients.

The methods described in the following section now enables, in principle, the mathematical procedure of Fourier transformation to be also applied to sequence data sets of biomolecules (DNA and protein).

 

Mathematical Methods

Continuous Fourier transformation

Fourier transformation is clearly invertible. The Fourier-transformed data set can be recalculated again into the original data by the inverse operation (inverse Fourier transformation) without any loss of information. F(u) is said to be the Fourier transform of f(x). The transition from f to F is designated Fourier transformation (FT), the transition from F to f inverse Fourier transformation (IFT) and described by the following symbols:

f(x) o¾· F(u) (one-dimensional) and f(x,y) o=· F(u,v) (two-dimensional)

 

The aim is to determine the coefficients with which the sine and cosine functions of the polynomials ultimately provide the function to be represented. A periodic function of period 2p can be expanded into a series of trigonometric functions:

 

Equation 1

 

The expansion of a function into such a Fourier series is called harmonic analysis. Since in practice the operation is often terminated after a finite number of terms, an approximation by a trigonometric polynomial is obtained. In the Hilbert space L²(-p,p) the system of functions 1/Ö(2p), 1/Öp cos x, 1/Öp sin x, 1/Öp cos 2x, 1/Öp sin 2x 1/Öp cos 3x, 1/Öp sin 3x, ... is a complete orthonormal system. The mean square error

Equation 2

 

for the approximation of a function f(x) Î L²(-p,p) by a trigonometric polynomial

Equation 3

 

precisely becomes minimal for ak, bk, if ak and bk are the Fourier coefficients of the function f(x):

Equation 4

 

A non-periodic function f(x) cannot be expanded into a Fourier series. If, however, f(x) satisfies the Dirichlet conditions, the Fourier integral can be expanded through the passage to the limit l®¥ in the range of (-l , l).

Equation 5

 

With the Fourier coefficients:

Equation 6

 

Equation 7

is obtained

 

or in a complex notion:

Equation 8

 

The double integral may be understood as a "superimposition" of two integrals. These two integrals describe the path to Fourier transformation or inverse Fourier transformation, where F(y) is the Fourier transform of f(x).

Equation 9

 

If a Fourier series is expanded, a periodic function with period 2l is obtained as the sum of harmonic oscillations with the frequencies nn= n p/l; n= 1, 2, 3.... . The Fourier integral, in contrast, shows the function f(x) as the sum of an infinite number of oscillations with continuously varying frequency n. The Fourier integral expands the function so to speak into a continuous spectrum.

Continuous Fourier Transformation of a Rectangular Pulse.

 

Figure 2: A rectangular pulse function is represented by the sum of a finite trigonometric polynomial. The Fourier coefficients are shown as crosses (red = cosine terms, blue = sine terms). The calculated Fourier integral (cf. Equation 18) is plotted as a green function. The upper diagram shows the result for a 5th‑order polynomial, whereas the bottom diagram shows the result for 20 sine-cosine terms. The increase of the Fourier coefficients for the harmonics and the more precise modeling of the rectangular function can be seen.

 

In order to illustrate the functioning of Fourier transformation, the simple case of a rectangular pulse function f(x) is first of all considered:

Equation 10

 

The Fourier integral is given by:

Equation 11

 

For |x| > To, f(x)=0 and hence also F(n)= 0.

For |x| < To, f(x)=A. The integral is therefore determined as follows for the range (-To to To):

Equation 12

 

Every harmonic function of constant frequency n has this property:

Equation 13

 

Substituted into the Fourier integral determined:

Equation 14

 

Equation 15

 

Calculated within the limits -To to To:

Equation 16

is obtained.

 

Because sin(-x) = -sin(x):

Equation 17

follows.

 

Because cos(x)-cos(-x)=0, hence follows:

Equation 18

 

Equation 18 was substituted into Figure 2 to represent the Fourier transform. There exist Fourier transformations and their inverses for most practical cases of scientific measurement data. Nevertheless, a Fourier transform only exists if the following three conditions are fulfilled:

 

a)

Equation 19

b)

If f(x) = b(x) sin(2pnx+a), where a and n are arbitrary constants, as well as b(x+k) < b(x), and the function f(x)/x is absolutely integrable for |x| > l > 0 in the sense of Equation 19.

c)

If conditions a) and b) are fulfilled, the functions must be boundedly variant. In a finite interval, a curve of finite length is present.

A detailed description of the criteria for a), b) and c) can be found in [1]. For the method of signal generation from primary structures presented in this study, the three conditions are fulfilled.

Discrete Fast Fourier Transformation (DFFT).

In practice, for measurements alone there are generally no known functions available which could be processed via continuous Fourier analysis. This applies, in particular, to the sequences of biomolecules to be analysed, where the aim is rather to analyse more or less large data sets of a discrete nature. In the present study, these are, for example, the billions of nucleotides of a genome or rather their signal data. If these values originate from a "normal" measurement series, they are additionally distorted by the actual measuring process; one speaks of disturbed data. A prominent example is detector noise. In the present study, however, the signal data sets are generated from the sequence data of DNA by applying known physico-chemical facts to these data. Thus, for example, normalized enthalpy data can be used for duplex stability in relation to the respective neighbouring nucleotide in order to generate a signal (see below). In this case, the factor of "detector noise" is deleted and an undisturbed signal may be assumed. In order to process such discrete data by Fourier transformation on computers, a Fourier transformation is required which can be used specifically for these data: discrete Fourier transformation. It can be derived as a special case from continuous Fourier analysis1.

First of all, the continuous signal h(x) is scanned. This is done via a scanning function Da(x), which ensures an infinite sequence of equidistant delta functions. The delta function is undefined at its location of appearance, has an area of 1 below the function and otherwise exhibits the value of zero:

 

d(x-xo)=0; xxo

 

Equation 20: The delta function can be imagined as a pulse of very small width, but great pulse height.

 

The resultant scanning signal is thus given by the product h(x) Da(x):

Equation 21

 

where T is the scanning signal. In the following, a window for the data under consideration will be prepared. The scanning signal will be limited. This is achieved by multiplying the scanning signal by a rectangular function. T0 defines the width of this window. It is important that the boundaries of this window are between two scans:

Equation 22

 

The "observation window" is thus installed by multiplication of the scanning signal by this rectangular function f(x):

Equation 23

 

Further investigations thus take place in this "observation window", where N indicates the number of equidistant delta functions. N=To/T:

Equation 24

 

The signal originally assumed to be continuous is now already under discrete consideration. It is now still necessary to scan the Fourier transform of this signal to finally enable a completely discrete consideration of the transformation pair. This is done by a further scanning function, Db(u), which has its equidistant delta functions (interval = 1/To) in the range of the Fourier transform. In the spatial domain, this is achieved by multiplication by the scanning function. The analogous operation for multiplication in the frequency domain is convoluting in the spatial domain. Scanning function Db(x) is therefore convoluted with the scanning signal h(x) Da(x) f(x):

Equation 25

Convolution [h(x) Da(x) f(x)] * Db(x) is obtained by substituting:

Equation 26

 

Calculation of the equation gives:

Equation 27

Function h(x) is an approximation of h(x) and describes a periodic function of period T0. Since the Fourier transform of the periodic function h(x) consists of a sequence of equidistant delta functions, the Fourier transform h(x) is determined by:

Equation 28

 

 

 

 

 

 

 

 

 

 

Substituting Equation 26 into Equation 28 gives:

Equation 29

 

The Fourier transform of a periodic function consists of N values, if the N scanning values of the original function h(x) are regarded as a period of a periodic function. Integration over only one period gives:

Equation 30

 

 

And finally:

Equation 31

 

Because of T0=N T 

Equation 32

is written.

 

The final notation of Fourier transformation (Equation 27) is thus as follows:

Equation 33

 

H(u) is the approximation of H(u) and is understood as H(n/NT).



[1] Brigham, E.O.,  The Fast Fourier Transform. (Prentice Hall Inc., New York, 1974)