CODATA-ICSTI Portal Proposal NSF DRAFT

PROJECT SUMMARY (1 page)
Include a self-contained description of the activity that would result if the proposal is funded. Write in the 3rd person and include a statement of objectives and methods to be employed. Clearly address in separate statements intellectual merit and broader impacts. Proposals that do not do so will be returned without review.

PROJECT GOAL

A basic tenet of science is the need for external confirmation of reported research results. Another is using past basic data as building blocks for future research. The practice of science requires that primary data be available and usable. Biological research suffers greatly from the difficulty of general access to archival data. Published papers most often are interpretations of primary data, and the detailed data are not accessible. There are many portals for accessing published papers, e.g., the US National Library of Medicine and National Agricultural Library. There are no comparable portals to compendia covering the wide range of primary biological observations. GenBank and other macromolecular sequence and structure databanks do not link to data on the source organisms.

The primary goal of this project is to develop an Internet portal to facilitate interactions among scientists and scientific data managers by annotation and communication of the existence of available sources of basic biological data. The process requires community consideration of the issues, challenges, and future trends related to the long-term preservation of, and access to, diverse biological data resources.

OBJECTIVES and METHODS

The primary objectives of this project are (1) to make available information on the existence and availability of primary biological data, (2) to raise the visibility of the issues and challenges concerning archiving and preserving access to these data, and (3) to create and facilitate a social network of researchers, data professionals, and data center managers who are working on, and interested in, permanent access to scientific biological data. The resulting portal will locate useful biological data compendia and encourage their long-term accessibility and preservation. The project will use emerging and evolving web-based technologies both to organize its content and to facilitate a social network of portal contributors and users. The project will use a tiered organizational structure. A core operating group will administer and organize the project and its required facilities. A steering group of collaborating experts will advise on various areas of biological data, on social networking in science, on annotation of databases, and on archiving of information. An international group of point persons will advise on candidate data sources and local archiving facilities (real or projected). The primary result envisioned is an online clearinghouse of information on (1) the location and description of available databases of primary biological data and (2) resources that address the issues, problems, solutions, and both current and future data preservation and access practices in biology and related disciplines.

The project will accomplish its objectives by developing and deploying a web-based internet gateway of information about and links to:


 * Scientific and technical primary biological data and data collections;
 * Information about biology data preservation and access procedures, technologies, standards, and policies;
 * Biology discipline-specific and cross-disciplinary preservation projects and activities; and
 * Expert points of contact regarding data preservation and access in biology and associated disciplines.

INTELLECTUAL MERIT

The production of digital scientific and technical data in biology and other observational and experimental sciences currently is growing geometrically. In addition to increases in data volumes, projects in biology are inter- and multi-disciplinary in nature and depend on the continuing availability of data collected in different fields. Exemplary basic areas clearly dependent on biological data are ecology, environmental studies, and medicine. Applied areas with such dependency include bioremediation and control, genetic engineering, drug development, industrial fermentation, forestry management, and aquaculture. The need for permanent access to basic scientific data within and across fields of inquiry requires solutions based on technology and social networks of researchers, projects, and disparate data collections. Practicable data management solutions require the availability of information about existing projects, accessible data collections, management practices, and regulatory policies that support data preservation, data sharing, and data access.

This project will utilize emerging Internet and web-based applications and services to establish an Internet portal and gateway to provide current, practical, and useful information regarding permanent access to primary scientific data resources. The portal has three objectives: (1) locate and provide links to existing accessible compendia of basic data relevant to biology, (2) provide information on current and emerging preservation and access practices, and (3) facilitate development of a community of practice among researchers and data managers. The information sharing will include links to scientific data and to information on preservation and access procedures, technologies, standards, and policies; and discipline-specific and cross-disciplinary preservation and access projects and activities. The community aspect will build a social network of biological and general data management experts across the content disciplines. As long-term data management grows in importance, and data collections increase in size and complexity, it is crucial for researchers and data managers to be able to share information about their respective projects and to build social and professional relationships that can help them understand and resolve the many practical and operational problems and challenges they face. Long-term management of scientific and technical data is fast becoming a profession, and it needs the kinds of online support that have been available for other, related professions, such as digital librarians.

BROADER IMPACTS

The project will collect and disseminate news regarding data preservation and access to collections, conferences, reports, and studies. This, in turn, will support the evaluation of many individual and institutional data collection and management practices and processes. These experiences, grounded in what works and what works well, will help promote and make visible good long-term data preservation and access practices. Such educational promotion is critically needed in biology, since most data compendia are poorly annotated and largely lacking in adequate metadata.

This project also will provide experience in using emerging Internet-based technologies and tools to facilitate information sharing and online social networking in the sciences. Prototypes need to be built and used to discover and explore long-term requirements for using these tools to support a data preservation and access community of practice. These requirements will necessarily include design needs and constraints on technologies (broadly construed), information regarding the needs and skills of users, and impacts of support of long-term preservation and access on organizations and institutions associated with scientific research and education.

The project will operate under the endorsements of the International Council for Science (ICSU) Committee on Data for Science and Technology (CODATA) and the International Council for Scientific and Technical Information (ICSTI). The project will promote the portal content broadly and provide information on, and operational experiences with, biological datasets that will inform related efforts in other disciplines and other countries.

PROJECT DESCRIPTION (15 pages)
i. RESULTS FROM PRIOR NSF SUPPORT
Must be included for any PI or co-PI for past 5 years. State Title, NSF award #, amount & period of support. Summarize the results of completed work to date and resulting publications.

ii. INTRODUCTION

NEED FOR A PORTAL SUPPORTING SCIENTIFIC DATA COLLECTION AND MANAGEMENT

As the volume and use of scientific and technical data and information created in all areas of science continue to grow exponentially, the effective long-term preservation of, and access to, these information resources increases in importance as an essential component of the global scientific research infrastructure. The rate of growth in science data volumes, as well as the consequences of this growth, was recently documented in a Nature special commentary on 2020 Computing [1]. The need for "curated data repositories where information is actively acquired, organized, maintained and distributed" was a key recommendation of a recent needs assessment of plant biology databases [2].

Substantial progress has been, and is being, made in the development and deployment of digital libraries of published and grey literature. Browsing the online archives of DIGLIB, an e-mail discussion list for digital libraries researchers and librarians, provides a wealth of information on the many institutions, projects, and reports that detail these developments [3]. The preservation of, and permanent access to, digital scientific and technical (S&T) data and information in many cases poses greater, and significantly different, challenges than does the preservation of data and information stored in print formats. These challenges are not only technical, but involve new scientific, financial, organizational, management, legal, and policy considerations. Moreover, while many of the challenges that require sustainable solutions are the same for digital data and information across all scientific and technical disciplines, others are distinct or unique for certain disciplines or types of data.

The growth of the internet as a component of scientific research and publication has been accompanied by the development of digital libraries, collections, and communities of digital librarians, computer scientists, information specialists, and collection managers. This has served to facilitate the development of digital document indexing and metadata standards to support both preservation and search and discovery. The work of these communities is helping many public and private libraries develop and establish digital collection management practices.

However, the same degree of professional community and networking is not available to those who manage the many growing collections of digital S&T data. In 2002 the U.S. National Committee for CODATA (a committee of the National Research Council) approached the international Committee on Data for Science and Technology (CODATA) and the International Council for Scientific and Technical Information (ICSTI) to develop and maintain an Internet portal to scientific data and information preservation and access resources. It is envisioned that this portal will promote and integrate the work of ICSU bodies, especially of the CODATA Task Group on the Preservation of and Access to S&T Data in Developing Countries, with the work being done in this area by other relevant groups and organizations.

The development of a professional network and community is a large undertaking, and, to be successful, it is best done incrementally. The primary task of this portal project is to raise the visibility of the issues and challenges concerning archiving and preserving access to biological scientific and technical data and information, and to facilitate establishing an international social network of researchers, data professionals, and data center managers in the biological and related fields. By so doing, the project will encourage, by example, holders of useful data to properly curate the data and make them widely available. Biology is a good first candidate because the field includes disciplines that are virtually completely involved with and dependent on computing (e.g., bioinformatics), as well as disciplines that are often paper-based and where the description of primary scientific objects remains difficult to standardize (e.g., microbiology, macromolecular annotation) [Krichevsky, M.I. 2004. Taxonomy: a Moving Target for Sequence Data. In Database Annotation in Molecular Biology: Principles and Practice. A. M. Lesk, Ed. John Wiley & Sons, Ltd. Chichester]. This project will establish a portal that provides information about and links to:

 * Biological data, data collections, and data preservation projects;
 * Scientific and technical data and information preservation procedures, technologies, standards, and policies; and
 * Expert points of contact (national and international) regarding data preservation and access in biology and associated disciplines.

CODATA and ICSTI, the principal endorsing organizations, are cognizant of many of the other national and international efforts in digital archiving and access that are already underway outside the ICSU family of organizations, particularly in the professional archiving and library communities. Of particular relevance is the Global Information Commons for Science, announced by CODATA and ICSU at the World Summit on the Information Society, held in Tunis, Tunisia in November 2005. The rationale, objectives, and scope of that activity are described here: http://www.codata.org/wsis/GlobalInfoCommonsInitiative.html

The scope of this activity is a subset of that envisioned for the Global Information Commons for Science. That initiative will undertake to explore, document, and promote practical and workable policies and practices to enable permanent access both to the data generated by scientific research and to the published and synthesized research results. In effect, this project, by gathering and documenting both the issues of S&T data preservation and the established and evolving digital S&T data management practices and standards, will provide a gateway to information and resources that will be used and leveraged by the Global Information Commons for Science, as well as other initiatives.

Through annotated links, the portal will point to data sets that have adequate metadata, consistent formats, and the potential to remain archived. Language is, and will remain for the foreseeable future, a problem. A partial solution is to network among biologists on an international scale. This will mitigate the problem of access to the many valuable primary data sets throughout the world.

The portal also will provide value by establishing a collaborative editorial practice that evaluates and makes available links to content on and about scientific and technical data and information archiving that best fulfills the portal objectives stated above. There are many changes in both technology and human communication practices that are having an impact on how research is done. It is important that these changing norms of interaction be leveraged to assist in providing permanent access to S&T data and information. These research resources will not be preserved by accident.

Raising visibility for preservation of and permanent access to S&T data is a challenging undertaking. By starting with building the requisite network in the biological sciences, the project will increase both communication about shared problems and coordination on shared solutions for data preservation and access in the biological sciences. There is enough disciplinary diversity to discover where common shared solutions work and where individual and local solutions are the most practicable. It is expected that what is learned by implementing this portal and building this network will be useful in other broad disciplinary research areas.

iii. PROJECT PLAN
Include Goals, Objectives and Deliverables. Describe your Activities. Provide enough information as to why you and your team are expert enough to accomplish the goal. What facilities and resources are available? Consider including a Logic Model. How will you address a diversity component? Include for each Objective: (1) Rationale (2) Methods/Activities (3) Expected Results/Deliverables & (4) Limitations & alternatives

The team involved with this portal has experience in research, information technology development and deployment, as well as connections to a wide network of data centers and data managers. Bionomics International has experience in a wide variety of data-focused projects in the biological sciences. The relationship with CODATA and ICSTI further extends this network and will enable the portal to gather relevant information from a wide variety of sources, individual as well as institutional.

The primary objectives of this project are (1) to identify, annotate, and make available access links to compendia of primary biological data, (2) to raise the visibility of the issues and challenges concerning archiving and preserving access to biological scientific and technical data and information, and (3) to create and facilitate an online social network of biologists, data professionals, and data center managers. The primary result of the project is an Internet portal that serves as an online clearinghouse of information and as a networking resource, providing a practical mechanism for bringing primary biological data sources together with those needing such data. The mechanisms for bridging this gap are education in good archiving practices and the locating and providing of contacts and communication paths among the disparate disciplines required for meaningful primary biological data archiving. A fundamental aspect of the effort is locating existing and potential archiving sites with adequate institutional support to ensure long-term survival.

With a few notable exceptions, accessible data archives are not part of the culture of biology. The proposed project is a start at changing this aspect of the culture. The task requires a complex of techniques. At the root of much of the problem lies a common attitude: once the data have been recorded and have been the subject of publication, the biologist wishes to go on to further studies. There is little motivation to share the base data with others.

The proposed project has elements to help with the required motivation. The first is to establish a mechanism for finding accessible base databases of utility to the working biologist. A field biologist would like to find data on the location of particular interest. The molecular biologist would like to test alternate clustering and cladistic algorithms to help choose the one most appropriate to the problem at hand. The developer of microbial identification kits for ecological studies would like to have data on the types of organisms found in saline environments. Thus, the first element of motivation is to provide a mechanism for locating such data. The various macromolecular sequence banks serve this archiving role for a specific community.

However, a central repository for all basic biological data is impractical. Rather, biology needs a distributed set of archive repositories for primary data. The next element is to find, and make public through the proposed portal, exemplary existing sources of primary data. At best, the coverage of the initial set of databases will be spotty and not interoperable. However, this initial set of links through the portal can serve as a practical example for the community.

Concurrent with the establishment of the above elements, the portal will contain components that address the clear technical challenges of data archiving as well as new scientific, financial, organizational, management, legal, and policy considerations. An international network of biologists, data managers, archivists, administrators, and person-to-person networking experts will provide consultation through the portal on establishing and maintaining biological data archives.

OBJECTIVE 1: Develop and implement mechanisms and personnel to locate, annotate, and disseminate information on primary data repositories of utility to biologists.
 * Rationale: Many disciplines, by their intrinsic nature, produce and use publicly accessible primary data. Examples of such disciplines include meteorology, oceanography, space sciences, and genome sequencing. With the exception of macromolecular sequences, primary observational data in biology largely are unavailable. Museums may make available, from their records, information on the geographic distribution of species of birds. However, the primary data that led to the identification of the individual specimen are unavailable. Thus, no external study of the phenotypic characteristics by others is possible. The utility of the availability of primary data to others is illustrated by the anecdote of Spiegelman's recalculation of the exquisite data compiled by Carl and Gertrude Lindegren, which led to a better understanding of Mendelian inheritance in yeast [Mendelian Inheritance of Adaptive Enzymes, Carl C. Lindegren, S. Spiegelman, and Gertrude Lindegren, Proc Natl Acad Sci U S A. 1944 November 15; 30(11): 346–352]. Further, the ability to access primary data of others may provide baseline data for studies of diversity changes. In general, combining data sets from varied contributors would produce sets beyond the capacity of the individual investigator to produce. A compendium of primary data on grasses around the world would allow better judgement on species distribution, since species identification varies. The same variation occurs in such biota as mollusks, bacteria, and the many indigenous but unnamed organisms in the general environment. There is currently no general mechanism devoted to making primary biological data available to the scientific community. Such a mechanism, especially one that is annotated and curated, would make possible new and unique findings. To be truly useful, the mechanism must provide efficient and straightforward functions [???].
 * Methods/Activities: Collecting and disseminating information on the existence, or projected existence, of repositories of primary data requires an infrastructure of facilities and personnel. An Operations Secretariat will coordinate the full effort and organize the establishment of the other components. A Steering Committee of biologists, information professionals, and archivists will set standards for annotation and curation as well as oversee the editing of portal content. The location of databases requires diverse input. A network of point persons, distributed by biological discipline and geography, will locate and annotate candidate databases subject to final curation by the Operations Secretariat.
 * Expected Results/Deliverables: The list of accessible data sets starts with modest content. As the content grows in both volume and disciplinary coverage, a critical mass is expected. Thus, by education and example, the content will grow. Because the entries are annotated and curated, the expectation is acceptance of the system by the target user community of professionals.
 * Limitations & alternatives: Finding and gaining access to appropriate databases with adequate metadata is the most important limitation. There is always the limitation of attempting too much too soon. Success requires the incremental building of the accession list under the guidance of the Steering Committee.

The main alternative is the laborious search for databases, either through time-consuming literature searches to locate candidate data sets or, more usually, through personal networking by various forms of correspondence or conversation. The time consumed in the traditional networking methods increases dramatically when the data sets sought are outside of the seeker's immediate discipline.

OBJECTIVE 2: Raise visibility and descriptions of the issues and challenges of archiving and preserving access to biological scientific and technical data and information.
 * Rationale: Locating and making available biological and related data sets of use to the community requires awareness of the existence of, and need for, archived primary data. Long-term preservation and management of S&T data collections are critical to the continuation of scientific research and the use of research results. With the growing involvement of, and dependence on, computing technology, there is an increasing awareness that provision for long-term access to research results must be integral to the design of any data-generating project. This need, in turn, requires changes in planning and execution of projects. The project plan must incorporate definitive data archiving practices, supported by changes in institutional policies and economics. The required changes will follow when all participants in research endeavors (researchers, data managers, and institutional administrators) understand the implicit needs, possible solutions, and trade-offs. Furthermore, the underlying computing infrastructure constantly undergoes development and innovation. Adapting to a changing infrastructure requires timely availability of information about technology trends. As with any new method applicable to scientific research, the scientist must gain knowledge of these infrastructure methodology improvements. Today, much of the information regarding management of, and permanent access to, scientific data, and the data per se, are held locally in project descriptions, research reports, and papers; on the websites of research institutions and national and international data centers; or in published research articles and workshop proceedings. There is a clear need for a portal or gateway that provides a triage point for learning about the existence of primary data sets and about preservation and access issues. The projected portal consists of an up-to-date, annotated collection of pointers to projects, resulting primary data, reports, and people.
 * Methods/Activities: The primary method will be to build a dynamic and timely portal of information. The portal will provide a well-documented collection of links to relevant data sets, projects, reports, and people. All entries in the portal will be evaluated for clarity and adequacy of information. For a pointer to a data set to be included, the data set must have understandable metadata. The portal will gather and disseminate information on events, news items, and discussion on key issues in permanent access to S&T data. These items will be published using newly emerging web communication technologies, such as syndication. The portal will also provide a searchable archive of this information, thereby creating another online resource for future use.
 * Expected Results/Deliverables: This project will deliver an up-to-date online information resource together with a searchable collection of annotated content.
 * Limitations & alternatives: One well-known alternative to providing the kind of information specified is simply an e-mail distribution service that is often used to publicize and distribute disciplinary and organizational information. A portal that supports more content management features will allow searching and access to the material by standard web browsers.

OBJECTIVE 3: Create and facilitate an online social network of biology researchers, data professionals, and data center managers.
 * Rationale: The ongoing development of digital collections of published information (books, journals, grey literature, etc.) has been accompanied by the development of a professional community of digital librarians. This community includes librarians, information scientists, computer programmers, as well as web specialists and information architects. Building on the existing library profession, a professional community is emerging that consistently works on the many issues and challenges presented by managing collections of digital materials. In contrast, a similar professional community does not exist for people primarily involved with management of scientific and technical data collections. It is clear that such support is crucial if the numerous and growing collections of digital data are to be effectively and efficiently managed for the long-term. This project will provide support for building such a professional community for the biological sciences. It will use new and emerging web-based technologies to establish an online social network of people who can use the portal to learn about each other and each other's work.

PROJECT OUTCOMES


 * Measurable outcomes include:
    * Number and distribution of:
       * Databases
       * Contributing producers
       * Archiving institutions
       * Portal users
       * Workshops on archiving practices
       * Participants in workshops
    * Didactic material in portal on construction of biological metadata
    * Links to sources of good archiving techniques
    * Reduction of overhead in locating archived data and archiving techniques by savings in:
       * Connection costs (especially important in developing nations)
       * Users' time due to organized and annotated portal contents
    * National, regional and international activities resulting from establishing portal
 * Discernible but not directly measurable outcomes include:
    * Facilitation of international sharing of information through agreed archiving conventions
    * Ability to easily combine and exchange data
    * Allows larger studies
    * Reduces duplication
    * Enhances commercial utilization of biological resources
    * Promotes educational resources in biology
    * Provides permanent access (archive) to research and survey data
    * Enhances access to international resources and collaborators
    * Improves overall efficiency (cost effectiveness) of biological resources for private sector and public good uses
    * Provides a model for, and communication pathway to, other disciplines

PROJECT TIMELINE

iv. MANAGEMENT PLAN
Detail how the project will be managed. Who is responsible overall for the project? Include specific roles and responsibilities information for PI, co-PIs and senior personnel. Are there advisory boards? Industry relations? How will communications be handled? Quarterly meetings? Include activities, persons responsible and timeline. Will the project be sustainable, and how, after the NSF funding period?

The project will be managed by a principal from Bionomics International, a not-for-profit, tax-exempt non-governmental organization based in the United States. The management of the portal development and operations will be led by the Principal Investigator, supported by an editorial committee for content and a steering committee for oversight. The work of the PI will include project administration, and the project may include a separate Portal Site Editor to oversee portal content production and operation. All management and operational communication will be carried out by e-mail, telephone, and other computer-based means. The primary work actors, tasks and responsibilities are described in the following table.

v. EVALUATION/ASSESSMENT PLAN
This is an area that is receiving more emphasis from NSF. Be sure to include details about how your project and objectives will be evaluated and how results will be analyzed. Project evaluation should be reflected in the budget as well.

In order to assure that the Portal returns value, the following process measures are proposed for monitoring and evaluating portal activity.


 * Web site user metrics
 * Web site performance metrics
 * Content access metrics
 * Average response times
 * System availability
 * Qualitative measures of users and usability gathered from interviews and on-line feedback.
 * Organization participation (will partially be measured by activity of contributing editors and steering committee).
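
As an illustration only, the response-time and availability measures listed above could be computed from periodic health-check records along the following lines. The record layout, the field names, and the sample values are hypothetical and are not part of the proposal; the sketch shows one possible calculation, not a chosen monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Probe:
    """One hypothetical health-check result for the portal."""
    ok: bool            # True if the portal answered successfully
    response_ms: float  # time taken to answer, in milliseconds


def average_response_ms(probes):
    """Mean response time over successful probes; None if there are none."""
    times = [p.response_ms for p in probes if p.ok]
    return mean(times) if times else None


def availability(probes):
    """Fraction of probes that succeeded; None if no probes were recorded."""
    if not probes:
        return None
    return sum(1 for p in probes if p.ok) / len(probes)


# Illustrative sample data only; real values would come from monitoring logs.
sample = [
    Probe(True, 120.0),
    Probe(True, 180.0),
    Probe(False, 0.0),
    Probe(True, 150.0),
]

print(average_response_ms(sample))  # 150.0
print(availability(sample))         # 0.75
```

Failed probes are excluded from the response-time average so that outages are reported through the availability figure rather than distorting the timing figure.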

vi. DISSEMINATION
How will results be broadly conveyed?

The Portal will announce its content and attract web access by several established methods.


 * The Portal content will reference other existing online resources: documents, websites, weblogs, etc.
 * The Portal itself will provide value-added descriptive information about its content. It is expected that this will generate web traffic when made public.
 * The Portal weblog module will use existing web-based discovery practices, such as Technorati tags, to announce the availability of new content.
 * The Portal will provide web syndication feeds for the major content categories.
 * The Portal will be announced in several online venues and through professional conferences, papers, special seminars, etc.
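
A per-category syndication feed of the kind mentioned above could, for example, be generated as a minimal RSS 2.0 document. The sketch below uses only the Python standard library; the function name, the category label, the example.org addresses, and the entry fields are assumptions made for illustration and do not describe the portal's actual software.

```python
import xml.etree.ElementTree as ET


def build_feed(category, portal_url, entries):
    """Build a minimal RSS 2.0 document for one content category."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description for the channel.
    ET.SubElement(channel, "title").text = f"Portal updates: {category}"
    ET.SubElement(channel, "link").text = portal_url
    ET.SubElement(channel, "description").text = f"New portal content in {category}"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "description").text = entry["summary"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical entry used only to show the shape of the output.
feed_xml = build_feed(
    "Data collections",
    "https://portal.example.org/",
    [
        {
            "title": "Example annotated data set",
            "link": "https://portal.example.org/entries/1",
            "summary": "Short annotation describing the data set and its metadata.",
        }
    ],
)
print(feed_xml)
```

One feed per major content category lets a reader subscribe only to the areas of interest, such as data collections or preservation practices.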

vii. SUMMARY
Summarize project goals and expected outcomes, including how they pertain to broader impacts.

Providing online information about scientific and technical data and information preservation and access procedures, technologies, standards, and policies; discipline-specific and cross-disciplinary archiving projects and activities; and expert points of contact will go a long way towards promoting practicable sharing of biological data collections. Since much scientific research is multi-disciplinary, successfully documenting and coordinating information about resources and people in biology will yield experience and lessons that can be applied in other fields.

The goals of the portal are also consistent with the digital data preservation activities and objectives of both CODATA and ICSTI, as well as ICSU and its other member organizations. This project will provide an internet-based resource not otherwise available, and will support scientists and data managers in many disciplines and internationally as well.

This is a time when interest in preserving digital content is growing in science, education, government, and business. It is important for the issues, needs, and special considerations of scientific and technical data and information to be available to support and influence the systems development that is already well underway, and to promote best practices across the international scientific and technical communities. Finally, and perhaps most important, this portal project will also help to bring together information about the many different organizations and institutions with commitments and programs focused on permanent access to scientific and technical data.

REFERENCES CITED (use consistent style – APA, IEEE, Chicago, MLA, etc)
1. Szalay, A. & Gray, J. (2006). 2020 Computing: Science in an exponential world. Nature, published online 22 March 2006, doi:10.1038/440413a. Retrieved April 21, 2006 from http://www.nature.com/news/2006/060320/full/440413a.html

2. Stein, L. et al. Plant Biology Databases: A Needs Assessment. Retrieved April 21, 2006 from http://www.gramene.org/resources/plant_databases.pdf

3. DIGLIB, A discussion list for digital libraries researchers and librarians, International Federation of Library Associations and Institutions. Retrieved April 21, 2006 from http://www.ifla.org/II/lists/diglib.htm

BIOGRAPHICAL SKETCHES (2 pages; may vary)
Include Professional preparation, appointments, publications (has limits), synergistic activities and collaborators. See GPG for specific details and formatting required. See NSF CV Outline.

BUDGET
BUDGET JUSTIFICATION (3 pages)