
Machine Learning Techniques for the Analysis of Functional Data in Computational Biology

The amount of data in typical computational biology (bioinformatics) applications tends to be quite large but remains on a manageable scale. In contrast, astrophysical applications have huge amounts of data, and medical research often has only a rather limited number of samples. The challenges in bioinformatics seem to be:

• Diversity and inconsistency of biological data,

• Unresolved functional relationships within the data,

• Variability of different underlying biological applications/problems.

As in many other areas, this requires the use of adaptive and implicit methods, as provided by machine learning.

Protein function, interaction, and localization are definitely among the key research areas in bioinformatics where machine learning techniques can be applied beneficially. Protein localization data, whether at the tissue, cell, or even subcellular level, are essential for understanding specific functions and regulation mechanisms in a quantitative manner. The data can be obtained, for example, by fluorescence measurements of appropriately labelled proteins. The challenge is then to recognize different proteins or classes of proteins, which usually leads either to an unsupervised clustering problem or, if available a priori information is to be considered, to a supervised classification task. A number of different neural networks have been used here. Owing to the underlying measurement technique, artifacts are often observed and have to be eliminated. Since the definition of these artifacts is not straightforward, trainable methods are used here too. In this context, support vector machines have also been applied successfully to separate artifacts from all other data.

Spectral Data in Bioinformatics

Frequently used measurement techniques providing spectral data are mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (NMR). Typical fields in biochemistry and medicine where such techniques are applied are the analysis of small molecules, e.g., metabolite studies, and studies of medium or larger molecules, e.g., peptides and small proteins in the case of mass spectrometry. One major objective is the search for potential biomarkers in complex body fluids like serum, plasma, urine, saliva, or cerebrospinal fluid in the case of MS, or the search for characteristic metabolites resulting from metabolism in cells (NMR).

Spectral data in this field have in common that the raw functional data vectors representing the spectra are very high-dimensional, usually containing many thousands of dimensions, depending on the resolution of the measurement instruments and/or the specific task. Moreover, the raw spectra are usually contaminated with high-frequency noise and systematic baseline disturbances. Thus, before any data analysis can be done, advanced pre-processing has to be applied. Application-specific knowledge can be incorporated at this stage. Machine learning methods, including neural networks, offer alternatives to traditional methods like averaging or discrete wavelet transformation.
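To make the two disturbances named above concrete, here is a minimal sketch of a traditional pre-processing pass in plain Python. It is an illustration only, not the method used in the work described here: the function names, the window sizes, and the choice of a rolling minimum as the baseline estimate are all assumptions made for this example.

```python
def moving_average(signal, window=5):
    """Damp high-frequency noise with a centered moving average.

    Near the edges the window is truncated, so fewer points are averaged.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def rolling_minimum_baseline(signal, window=51):
    """Assumed baseline estimate: the minimum within a sliding window.

    The window should be wider than the widest expected peak, otherwise
    the peak itself is treated as baseline.
    """
    half = window // 2
    return [
        min(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]


def preprocess(signal, smooth_window=5, baseline_window=51):
    """Smooth the spectrum, then subtract the estimated baseline."""
    smoothed = moving_average(signal, smooth_window)
    baseline = rolling_minimum_baseline(smoothed, baseline_window)
    return [s - b for s, b in zip(smoothed, baseline)]
```

Both window sizes would have to be tuned to the instrument resolution, which is one place where the application-specific knowledge mentioned above enters.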

Preprocessed spectra often remain high-dimensional. For further complexity reduction, peak lists of the spectra are usually generated and then analyzed in place of the full spectra. These peak lists can be considered a compressed, information-preserving encoding of the originally measured spectra. The peak picking procedure has to locate the peaks within the spectrum and quantify their positions and shape/height. The peaks are identified by scanning all local maxima and the associated peak endpoints, followed by signal-to-noise (S/N) thresholding, which yields the desired peak list. This method is usually applied to the average spectrum generated from the set of spectra to be investigated. This approach works well if the spectra belong to a common set or to two groups of similar size with similar content to be analyzed.
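The peak picking steps described above (average spectrum, local maxima, peak endpoints, S/N thresholding) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure of any particular software package: the noise level is estimated as the median absolute deviation of the intensities, the baseline as their median, and the default threshold of 3 is arbitrary. All function and field names are hypothetical.

```python
from statistics import median


def average_spectrum(spectra):
    """Point-wise mean of equally long intensity vectors."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def estimate_noise(intensities):
    """Assumed noise estimate: median absolute deviation from the median."""
    center = median(intensities)
    return median([abs(x - center) for x in intensities])


def pick_peaks(intensities, snr_threshold=3.0):
    """Return a peak list for one spectrum (typically the average spectrum)."""
    baseline = median(intensities)
    # Guard against division by zero on a perfectly flat spectrum.
    noise = estimate_noise(intensities) or 1e-12
    last = len(intensities) - 1
    peaks = []
    for i in range(1, last):
        # A local maximum is strictly higher than both neighbours.
        if not (intensities[i - 1] < intensities[i] > intensities[i + 1]):
            continue
        # Walk downhill on each side to find the peak endpoints.
        left = i
        while left > 0 and intensities[left - 1] < intensities[left]:
            left -= 1
        right = i
        while right < last and intensities[right + 1] < intensities[right]:
            right += 1
        # Keep only peaks whose height above baseline exceeds the S/N cut.
        snr = (intensities[i] - baseline) / noise
        if snr >= snr_threshold:
            peaks.append({
                "index": i,
                "height": intensities[i],
                "left": left,
                "right": right,
                "snr": snr,
            })
    return peaks
```

A typical call would be `pick_peaks(average_spectrum(spectra))`, where `spectra` is a list of preprocessed intensity vectors sampled on the same axis. Because the peaks are taken from the average, a peak present in only a small subgroup of spectra can fall below the threshold, which is consistent with the remark that the approach works well for a common set or for two groups of similar size.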
