Research data is a valuable raw material for scientists. In research projects, data sets are often collected over periods of several years and at high cost. The resulting data sets are unique and almost irreplaceable: many experiments and long-term studies cannot be repeated with reasonable effort.
The question of how to store research data in the long term is therefore of central importance even during the planning of a research project. Compliance rules require that data be stored and remain reusable for a period of 10 years.
What are the criteria for long-term electronic archiving of research data and what are the challenges involved in designing digital archives?
- Reusability of the data:
Other researchers should also be able to use the collected data. To enable availability and seamless access, standardized formats and interfaces should be used for storage.
- Retention period:
The lifespan of common storage media is shorter than the prescribed retention period of 10 years. During the archiving period, the storage media must therefore be replaced. This means that the data must be migrated to new data carriers. Alternatively, storage media can be selected whose service life covers the archiving period.
Compared to the total costs incurred in research projects, the costs for the storage infrastructure are relatively low. Nevertheless, the financial aspect should not be neglected: growing analysis possibilities make comprehensive data acquisition and storage increasingly attractive, even in research. The volumes of data that the infrastructure has to hold are growing, which in turn increases hardware costs.
Data and storage management for research institutions
A sustainable strategy for archiving the resulting data should therefore already be considered when designing a research project. This way, the data remains available and usable in the electronic long-term archive.
Research institutions that have already collected large amounts of data often face the problem of full storage systems and costly storage expansions. Optimizing the storage infrastructure can help here to relieve high-performance primary storage and archive older data.
- Information Lifecycle Management (ILM) is a storage strategy that manages data across its complete lifecycle.
With the help of the ILM approach, data is stored at the storage tier that corresponds to its respective lifecycle phase.
- Data is stored within a multi-level, hierarchical storage architecture.
Fast primary storage is provided for current, "hot" data. Inactive, "cold" data that has not been used for some time is archived on secondary storage.
- Data is moved to the lower storage tier automatically according to individually defined rules – so-called tiering. In consultation with users, administrators can, for example, set up archiving so that data of a specified age is automatically archived for a specified period of time.
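An age-based tiering rule of the kind described above can be sketched in a few lines. The following is a minimal illustration only; the function, paths, and 180-day threshold are hypothetical examples, not part of any PoINT product:

```python
import shutil
import time
from pathlib import Path

def tier_cold_files(primary: Path, archive: Path,
                    age_threshold_days: int = 180,
                    now=None):
    """Move files not modified within the threshold from primary
    to archive storage, preserving the directory layout."""
    now = now if now is not None else time.time()
    cutoff = now - age_threshold_days * 86400  # seconds per day
    moved = []
    # Materialize the listing first so moves don't disturb iteration.
    for path in list(primary.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            target = archive / path.relative_to(primary)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

In a real deployment such a rule would run on a schedule and would typically also leave a stub or link behind so that users can still find archived files under their original paths.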
In order to ensure the secure long-term archiving of research data, considerations should already be made during project planning with regard to the volumes and types of data that will be generated. If you know what needs to be stored, you can plan a storage and archiving strategy that will reliably and efficiently secure the valuable data and keep it available.
Software for efficient storage and long-term archiving of research data
For the long-term archiving of research data and digital documents, a specialized software solution is used that quickly and reliably stores the accumulating data on the selected storage systems and ensures long-term, barrier-free access to the data.
Various considerations should play a role in the selection of this software:
- Which storage technologies are supported?
The greater the range of supported storage technologies and storage media, the more flexible users remain. Existing hardware can continue to be used, and storage expansions are not restricted to specific manufacturers or technologies. Hardware can be selected with regard to its suitability and costs.
- Is seamless access to the data possible?
In order to be able to access archived data and digital documents seamlessly, it is advisable to store them in standardized formats. The archiving software should then provide the appropriate interfaces to enable the data to be read again quickly and reliably. Adherence to standards for data storage ensures that the data can also be used by other researchers.
- Does the software allow archive migrations?
A retention period of 10 years exceeds the lifetime of many common storage media. During the retention period, the data in the long-term archive must therefore be migrated to new media. If the archiving software also provides a migration function, no additional software or service is required.
Long-term archiving in the field of research with solutions by PoINT
Our software solutions for data archiving are used in renowned research institutions. With PoINT Storage Manager and PoINT Archival Gateway, institutes can cover different use cases, securely archive their valuable data and meet compliance requirements. At the same time, both solutions provide possibilities to quickly access the archived data. Thus, the data remains both readable and usable for further analyses in the future.
File-based long-term archiving with PoINT Storage Manager
Our file-based archiving solution PoINT Storage Manager implements an information lifecycle management within a multi-tier storage architecture:
- Based on individually defined policies, the software automatically moves data to the appropriate storage tier and storage medium within the storage architecture.
- PoINT Storage Manager supports storage systems independent of manufacturer and technology so that users can choose customized hardware solutions.
- Older data, which is no longer needed directly for analysis purposes, is moved to the archive storage tier, where it is stored in a compliant and secure way.
- With PoINT Storage Manager, users can access archive data via the familiar user interface.
- The solution performs automated and non-disruptive archive migrations.
Case Study: Max Planck Institute Bad Nauheim
The Max Planck Institute for Heart and Lung Research (MPI) has permanently retained terabytes of measurement data using online storage. This includes data that is only rarely accessed. The MPI holds petabytes of data overall. In order to reduce the costs associated with maintaining such large volumes, the institute chose to use the PoINT Storage Manager software, which enables long-term preservation, transparent read access and multiple media formats.
For more information, see our Case Study.
The solution for Big Data Storage: PoINT Archival Gateway
Research projects often accumulate data volumes in the petabyte range that have to be stored securely over long periods. For this purpose, our software PoINT Archival Gateway offers high-performance, scalable object storage with a standardized S3 interface.
- PoINT Archival Gateway stores the data on tape, a particularly cost-efficient and long-life storage medium for archiving cold data.
- With a media lifetime of up to 30 years, tape also covers long retention periods and thus makes archive migrations unnecessary.
- The solution receives data via the standardized S3 interface, and data is also accessed via this API.
- With growing data volumes PoINT Archival Gateway is flexibly scalable.
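Because the gateway exposes the standardized S3 API, any S3-compatible client or SDK can write to and read from the archive. One useful consequence of that standard, sketched below with only the Python standard library (this is generic S3 behavior, not PoINT-specific code): for a plain single-part PUT, the Content-MD5 request header and the ETag returned by the endpoint both derive from the MD5 of the payload, so a client can verify that an archived object arrived intact.

```python
import base64
import hashlib

def s3_put_headers(data: bytes) -> dict:
    """Integrity-related headers for a single-part S3 PUT.

    Content-MD5 (base64 of the MD5 digest) lets the S3 endpoint
    reject an upload that was corrupted in transit."""
    digest = hashlib.md5(data).digest()
    return {
        "Content-MD5": base64.b64encode(digest).decode("ascii"),
        "Content-Length": str(len(data)),
    }

def expected_etag(data: bytes) -> str:
    """ETag a standard S3 endpoint returns for a plain single-part
    PUT: the hex MD5 of the object body, enclosed in quotes."""
    return '"{}"'.format(hashlib.md5(data).hexdigest())
```

A client can compare the ETag in the PUT response against `expected_etag(data)` before deleting its local copy. Note that multipart uploads and some server-side encryption modes use a different ETag scheme, so this check applies to plain single-part uploads only.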
Case Study: EMBL European Bioinformatics Institute (EMBL-EBI)
EMBL’s European Bioinformatics Institute (EMBL-EBI) stores research data and compressed data ranging from less than 1 MB to 100 GB in size, with a current dataset of approximately 50 PB. Backup and long-term archiving data is written to tape. The previous in-house solution wrote the data from object storage to a disk file system and then to tape media. However, this approach did not provide sufficient performance for the growing data volumes, and only 90% of the tapes' capacity was utilized. With the introduction of PoINT Archival Gateway, the research institute now has a high-performance and cost-effective solution that writes data directly to tape via the standardized S3 interface and copes with the data growth.
For more information, see our Case Study.