The commercialization of bioinformatics
Keywords: Computational biology, Nucleotide sequences, Pharmaceuticals.
Biological research has experienced a paradigm shift from in vivo or in vitro experimentation ("wet work") to computer-based, or in silico, experimentation. This is a development that relies upon bioinformatics, which is the management and analysis of biological information stored in databases. The best-known use of bioinformatics lies in the manipulation and storage of nucleic acid and protein sequences. Biological polymers, such as nucleic acid molecules (e.g., DNA) or proteins, consist of simple building blocks of nucleotides or amino acid residues. As a result, the composition of a nucleic acid molecule or a protein can be represented as a sequence of digital symbols from a limited alphabet.
The connection between molecular biology and computer science can be viewed as the outcome of coincidental timing. Mini- and bench-top computers began to appear in laboratories at the same time that researchers were adopting techniques of cloning and nucleic acid sequencing. Thus, the tools needed to store, search, and analyze the new sequence data developed alongside the tools necessary to generate the data.
After the formation of DNA and protein databases, software for searching them slowly became available. The first methods were simple, and involved hunting for keyword matches and short sequence words. These approaches were followed by sophisticated pattern matching and alignment-based software. Today, researchers rely upon software for a variety of activities, such as reading nucleotide sequences from electrophoresis gels, predicting encoded protein sequences, identifying primers for gene amplification, sequence comparison or alignment, database searching, analyzing evolutionary relatedness, pattern recognition, and structure prediction. The advancement of bioinformatics has benefited greatly from progress in computer speed.
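The idea of a sequence as symbols from a limited alphabet, and the early search approach of hunting for short sequence words, can be illustrated with a minimal Python sketch. It does not reproduce any particular historical program. The function names, the word length of six, and the toy sequences are all assumptions chosen for illustration.

```python
from collections import defaultdict

DNA_ALPHABET = set("ACGT")  # the four nucleotide symbols


def is_valid_dna(seq):
    """Return True if every symbol in seq belongs to the DNA alphabet."""
    return set(seq.upper()) <= DNA_ALPHABET


def build_word_index(database, k):
    """Map each k-letter word to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_word_hits(query, index, k):
    """List (query_offset, name, db_offset) for every shared k-letter word."""
    hits = []
    for i in range(len(query) - k + 1):
        for name, j in index.get(query[i:i + k], []):
            hits.append((i, name, j))
    return hits


# Hypothetical toy database; the sequences are made up for illustration.
database = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTTTCCCCGGGGAAAA",
}
index = build_word_index(database, k=6)
hits = find_word_hits("CCGTACGTTA", index, k=6)
for query_offset, name, db_offset in hits:
    print(query_offset, name, db_offset)
```

Indexing the database once and then looking up each query word keeps the search proportional to the query length rather than to the database size, which is one plausible reason word-based methods were attractive on the limited hardware of the time. Alignment-based software goes further by scoring mismatches and gaps, which this sketch does not attempt.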
Coincidentally, computer speed and the amount of sequence data have been growing at roughly the same rate since the late 1980s, apparently doubling about every 15 to 18 months by 1998. Some claim that it is becoming increasingly difficult to separate advances in biotechnology from advances in high-performance computing. Access to sequence data is critical, and much of the new sequence data are distributed over the Internet. Currently, there are at least 400 Internet-accessible databases of biological data. The Internet also provides a means to distribute software, and enables researchers to perform sophisticated analyses on remote servers. Thus, the performance of bioinformatics relies upon developments in computer hardware and software. However, it is the excessive amount of sequence data that has driven the development of bioinformatics, a circumstance that can be traced to the establishment of the Human Genome Project. This endeavor was conceived in the mid-1980s, and was widely discussed in the press and scientific community through the end of the 1980s. In the United States, the Human Genome Project officially started on October 1, 1990, as a 15-year program to map and sequence the complete set of human chromosomes, as well as those of several model organisms. The goal of sequencing an estimated three billion base pairs of the human genome was ambitious, considering that few laboratories in 1990 had sequenced even 100,000 nucleotides. By 1993, the Human Genome Project had become an established international effort, which included nine countries and the European Community. The strategy of this international project was to make a series of maps of each human chromosome at increasingly finer resolutions. According to this approach, chromosomes were divided into smaller fragments that could be cloned, and then, fragments were arranged to correspond to their locations on a chromosome. After mapping, each of the ordered fragments would be sequenced. 
The prevailing view is that the bulk of useful information about the human genome can be gained from the regions of DNA that encode proteins. Analysis of these nucleotide sequences allows elucidation of the corresponding amino acid sequences. Although this seems simple, a significant problem is that gene density in the human genome is exceptionally low, and only about 3% of the genome encodes proteins. In the early 1990s, J. Craig Venter, a researcher at the National Institutes of Health (Bethesda, MD), and his colleagues devised a new way to find genes. Rather than taking the Human Genome Project strategy of sequencing chromosomal DNA one base at a time, Venter's group isolated messenger RNA molecules, copied these RNA molecules into DNA molecules, and then sequenced a part of the DNA molecules to create expressed sequence tags, or "ESTs." These ESTs could be used as handles to isolate the entire gene. Venter's method, therefore, focused on the "active" portion of the genome, which was producing messenger RNA for protein synthesis. The EST approach has generated enormous databases of nucleotide sequences, and facilitated the construction of a preliminary transcript map of the human genome. The development of the EST technique is considered to have demonstrated the feasibility of high-throughput gene discovery, and to have provided a key impetus for the growth of the genomics industry. As a result of the Human Genome Project and the parallel EST-based sequencing approaches, sequence data began to appear at an extraordinary rate. By mid-1999, the amount of GenBank nucleotide sequence data was doubling every 14 months, and a genetics laboratory could easily produce 100 gigabytes of data each day. Human DNA was not the only source of data for the burgeoning databases. Researchers were also sequencing genomes from bacteria, fungi, plants, and animals of research and agricultural interest.
Nevertheless, the commercialization of bioinformatics was strongly motivated by the belief that bioinformatics could profoundly alter the way that drugs are developed. That is, bioinformatics-derived information had a ready market - the traditional pharmaceutical industry. In the mid-1990s, pharmaceutical companies were primed for the new approaches of bioinformatics due to a lack of innovative new products in the traditional drug pipeline. This impending profit-gap is a particularly significant problem in view of the industry's annual growth rates. Within the next decade, the leading pharmaceutical companies may need to bring to market ten times as many compounds per year as they currently manage just to maintain growth levels of 10 to 15%, a rate anticipated by investors. Another problematic trend for the pharmaceutical industry is that a large number of blockbuster drugs will lose patent protection within the next few years. According to one estimate, drugs with sales approaching US $25 billion in revenues will come off-patent by the year 2002. Consider Merck & Co. (Whitehouse Station, NJ) as an example. Within the next two years, Merck will lose U.S. patent protection for five major products, which brought the company US $4.38 billion in U.S. sales and royalties during 1999 alone. In light of these trends, the pharmaceutical industry is turning to bioinformatics-based approaches to revitalize drug discovery programs. Biotechnology companies have tried a number of commercialization strategies to meet this need. One approach is to sell data in the form of complete genes or gene fragments, which others can use to identify potential drugs or drug targets. Human Genome Sciences, Inc. (Rockville, MD), the first company to commercialize genomics, used this tactic. Incyte Pharmaceuticals, Inc. 
(Palo Alto, CA) was another of the early biotechnology companies to engage in high-throughput computer-aided nucleotide sequencing to identify new genes and their corresponding proteins with potential therapeutic applications. Incyte's basic strategy was to predict biological function by comparing partial sequences with known sequences, and to offer companies non-exclusive access to its genomic information. However, the increasing numbers of contributions to public sequence databases indicate that time is running out for this early strategy of selling sequence information. Another approach to commercializing bioinformatics is to add value to sequence data by linking structural information about genes with observations about gene function. For example, CuraGen (New Haven, CT) and Gene Logic (Gaithersburg, MD) market information on gene expression. Companies are also adding value to sequence data by creating high-value intellectual property like validated drug targets. This strategy is illustrated by Millennium Pharmaceuticals Inc. (Cambridge, MA), which struck a US $465 million deal to provide Bayer AG (Leverkusen, Germany) with 225 drug targets relevant to cardiovascular disease, cancer, osteoporosis, pain, liver fibrosis, hematology and viral infections. The commercialization of bioinformatics has produced an industry that provides bioinformatics tools, a function previously fulfilled by university researchers. These bioinformatics support companies are offering new data analysis tools and software platforms for data management, expression profile analysis, links to sequence and annotation databases, function prediction based on pathway information, data mining, and data tracking of automated processes. Companies that implement these new bioinformatics tools face problems that arise when trying to compare, store, and analyze data produced from multiple platforms.
Since there are no clearly accepted bioinformatics industry leaders, biotechnology and pharmaceutical companies are operating one or more outside systems alongside their own proprietary system to produce expression data. This situation leads to a practice of capturing only the lowest common denominator of data. Thus, the integration of biological data will require some form of standardization. This is particularly important in view of the wealth of data available from diverse sources (and in diverse formats) on the Internet. Another challenge facing those who rely upon bioinformatics to enhance drug discovery is that data analysis is becoming a bottleneck in the drug discovery process. Data mining encompasses the use of pattern recognition technologies and statistical techniques to examine large amounts of data. The objective of data mining is to discover meaningful new correlations, patterns, and trends. Considering the bottleneck, data mining technology appears to represent the pacing technology of a company that uses bioinformatics for drug discovery.
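As a rough illustration of what discovering correlations can mean in practice, the following Python sketch scans a set of gene expression profiles for strongly correlated pairs. It is a minimal example under stated assumptions and does not describe any vendor's data mining product. The gene names, the expression values, and the 0.9 threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(profiles, threshold):
    """Return (gene1, gene2, r) for every pair with |r| >= threshold."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        r = pearson(p1, p2)
        if abs(r) >= threshold:
            pairs.append((g1, g2, r))
    return pairs


# Hypothetical expression levels measured under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
pairs = correlated_pairs(profiles, threshold=0.9)
for g1, g2, r in pairs:
    print(g1, g2, round(r, 3))
```

Comparing every pair grows quadratically with the number of genes, which hints at why analysis, rather than data generation, can become the pacing step once datasets are large.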