Applying signal theory to the analysis of biomolecules
Gerhard Kauer & Helmut
Blöcker
Department of Genome Analysis, GBF – German Research Centre for Biotechnology, Mascheroder Weg 1, D-38124 Braunschweig, Germany
With getting a signal from the biomolecule to be investigated, the proven techniques for signal-analysis become available. One attribute of information-carrying biomolecules like DNA or Protein is, that their data-set is really large-scale. The amount of data to be taken into account reaches easily billions of single data points. The signal derived from them is often very complex. “Speed matters”, so that this fact must be considered for calculus needed.
In technical applications, very complicated functions (e.g. measurement data) can be transferred into a mathematically much more easy-to-handle sum of sine and cosine functions via Fourier transformation (polynomials, Figure 1). In this connection, the Fourier transform is a function which can indicate the amplitude and phase of the sine function associated with each frequency. The amplitude describes the deflection of the sine function, whereas the phase defines the function's passage through the ordinate origin (cf. Figure 2).
This procedure goes back to the ingenious and passionate mathematician Jean-Baptiste-Joseph Fourier (1768-1830).
Especially for very large data volumes, Fourier transformation, accelerated by special DFFT signal processors, represents an efficient analysis platform. Many mathematical operations for the analysis of these often rather complex measurement data can be applied much more easily and rapidly after transformation.

Figure 1: A measured signal can be represented by the addition of two sine functions (green and blue function). The blue function is not phase-shifted and displays the amplitude a = 1. The green function is phase-shifted (F = 1.14) and has a larger elongation.
The procedures of Fourier analysis described here have proved efficient in many technical applications. By transferring primary sequence information into a signal, it is finally possible to modify proven techniques e.g. from digital image analysis and speech recognition so that they can lead to interesting results in the field of genome analysis and intergenomic comparisons. When representing the primary structure of a biomolecule (e.g. DNA sequence code) as a signal, it is not only conceivable to use the methods of transformation analysis (Fourier, Hartley, Hough) but, in principle, the mathematics of wavelet analysis and the methods of fuzzy logic could also be applied. For this purpose, however, a method must first be found to represent DNA as a signal in terms of signal theory. Furthermore, a method must be found to efficiently transform the enormous sequence data of whole genomes from the spatial domain into the frequency domain. This requires a basic understanding of the transformations (e.g. Fourier transformation) to be used and of their boundary conditions as well as the determination of the coefficients.
The methods described in the following section now enables, in principle, the mathematical procedure of Fourier transformation to be also applied to sequence data sets of biomolecules (DNA and protein).
Fourier transformation is clearly invertible. The Fourier-transformed data set can be recalculated again into the original data by the inverse operation (inverse Fourier transformation) without any loss of information. F(u) is said to be the Fourier transform of f(x). The transition from f to F is designated Fourier transformation (FT), the transition from F to f inverse Fourier transformation (IFT) and described by the following symbols:
f(x) o¾· F(u) (one-dimensional) and f(x,y) o=· F(u,v) (two-dimensional)
The aim is to determine the coefficients with which the sine and cosine functions of the polynomials ultimately provide the function to be represented. A periodic function of period 2p can be expanded into a series of trigonometric functions:

Equation 1
The expansion of a function into such a Fourier series is called harmonic analysis. Since in practice the operation is often terminated after a finite number of terms, an approximation by a trigonometric polynomial is obtained. In the Hilbert space L²(-p,p) the system of functions 1/Ö(2p), 1/Öp cos x, 1/Öp sin x, 1/Öp cos 2x, 1/Öp sin 2x 1/Öp cos 3x, 1/Öp sin 3x, ... is a complete orthonormal system. The mean square error

Equation 2
for the approximation of a function f(x) Î L²(-p,p) by a trigonometric polynomial

Equation 3
precisely becomes minimal for ak, bk, if ak and bk are the Fourier coefficients of the function f(x):

Equation 4
A non-periodic function f(x) cannot be expanded into a Fourier series. If, however, f(x) satisfies the Dirichlet conditions, the Fourier integral can be expanded through the passage to the limit l®¥ in the range of (-l , l).

Equation 5
With the Fourier coefficients:

Equation 6

Equation 7
is obtained
or in a complex notion:

Equation 8
The double integral may be understood as a "superimposition" of two integrals. These two integrals describe the path to Fourier transformation or inverse Fourier transformation, where F(y) is the Fourier transform of f(x).

Equation 9
If a Fourier series is expanded, a periodic function with period 2l is obtained as the sum of harmonic oscillations with the frequencies nn= n p/l; n= 1, 2, 3.... . The Fourier integral, in contrast, shows the function f(x) as the sum of an infinite number of oscillations with continuously varying frequency n. The Fourier integral expands the function so to speak into a continuous spectrum.

Figure 2: A rectangular pulse function is represented by the sum of a finite trigonometric polynomial. The Fourier coefficients are shown as crosses (red = cosine terms, blue = sine terms). The calculated Fourier integral (cf. Equation 18) is plotted as a green function. The upper diagram shows the result for a 5th‑order polynomial, whereas the bottom diagram shows the result for 20 sine-cosine terms. The increase of the Fourier coefficients for the harmonics and the more precise modeling of the rectangular function can be seen.
In order to illustrate the functioning of Fourier transformation, the simple case of a rectangular pulse function f(x) is first of all considered:

Equation 10
The Fourier integral is given by:

Equation 11
For |x| > To, f(x)=0 and hence also F(n)= 0.
For |x| < To, f(x)=A. The integral is therefore determined as follows for the range (-To to To):

Equation 12
Every harmonic function of constant frequency n has this property:
![]()
Equation 13
Substituted into the Fourier integral determined:

Equation 14
![]()
Equation 15
Calculated within the limits -To to To:
![]()
Equation 16
is obtained.
Because sin(-x) = -sin(x):
![]()
Equation 17
follows.
Because cos(x)-cos(-x)=0, hence follows:
![]()
Equation 18 was substituted into Figure 2 to represent the Fourier transform. There exist Fourier transformations and their inverses for most practical cases of scientific measurement data. Nevertheless, a Fourier transform only exists if the following three conditions are fulfilled:
a)

b)
If f(x) = b(x) sin(2pnx+a), where a and n are arbitrary constants, as well as b(x+k) < b(x), and the function f(x)/x is absolutely integrable for |x| > l > 0 in the sense of Equation 19.
c)
If conditions a)
and b) are fulfilled, the functions must be boundedly variant. In a finite interval, a curve of finite length is
present.
A detailed description of the criteria for a), b) and c) can be found in [1]. For the method of signal generation from primary structures presented in this study, the three conditions are fulfilled.
In practice, for measurements alone there are generally no known functions available which could be processed via continuous Fourier analysis. This applies, in particular, to the sequences of biomolecules to be analysed, where the aim is rather to analyse more or less large data sets of a discrete nature. In the present study, these are, for example, the billions of nucleotides of a genome or rather their signal data. If these values originate from a "normal" measurement series, they are additionally distorted by the actual measuring process; one speaks of disturbed data. A prominent example is detector noise. In the present study, however, the signal data sets are generated from the sequence data of DNA by applying known physico-chemical facts to these data. Thus, for example, normalized enthalpy data can be used for duplex stability in relation to the respective neighbouring nucleotide in order to generate a signal (see below). In this case, the factor of "detector noise" is deleted and an undisturbed signal may be assumed. In order to process such discrete data by Fourier transformation on computers, a Fourier transformation is required which can be used specifically for these data: discrete Fourier transformation. It can be derived as a special case from continuous Fourier analysis1.
First of all, the continuous signal h(x) is scanned. This is done via a scanning function Da(x), which ensures an infinite sequence of equidistant delta functions. The delta function is undefined at its location of appearance, has an area of 1 below the function and otherwise exhibits the value of zero:
d(x-xo)=0; x≠xo

Equation 20: The delta function can be imagined as a pulse of very small width, but great pulse height.
The resultant scanning signal is thus given by the product h(x) Da(x):

Equation 21
where T is the scanning signal. In the following, a window for the data under consideration will be prepared. The scanning signal will be limited. This is achieved by multiplying the scanning signal by a rectangular function. T0 defines the width of this window. It is important that the boundaries of this window are between two scans:

Equation 22
The
"observation window" is thus installed by multiplication of the
scanning signal by this rectangular function f(x):

Equation 23
Further investigations thus take place in this "observation window", where N indicates the number of equidistant delta functions. N=To/T:

Equation 24
The signal originally assumed to be continuous is now already under discrete consideration. It is now still necessary to scan the Fourier transform of this signal to finally enable a completely discrete consideration of the transformation pair. This is done by a further scanning function, Db(u), which has its equidistant delta functions (interval = 1/To) in the range of the Fourier transform. In the spatial domain, this is achieved by multiplication by the scanning function. The analogous operation for multiplication in the frequency domain is convoluting in the spatial domain. Scanning function Db(x) is therefore convoluted with the scanning signal h(x) Da(x) f(x):

Equation 25
Convolution [h(x) Da(x) f(x)] * Db(x) is obtained by substituting:

Calculation of the equation gives:

Function h(x) is an approximation of h(x) and describes a periodic function of period T0. Since the Fourier transform of the periodic function h(x) consists of a sequence of equidistant delta functions, the Fourier transform h(x) is determined by:
Equation 28
Substituting Equation 26 into Equation 28 gives:

Equation 29
The Fourier transform of a periodic function consists of N values, if the N scanning values of the original function h(x) are regarded as a period of a periodic function. Integration over only one period gives:

Equation 30
And finally:

Equation 31
Because of T0=N T

Equation 32
is written.
The final notation of Fourier transformation (Equation 27) is thus as follows:

Equation 33
H(u) is the approximation of H(u) and is understood as H(n/NT).