
Replica Creation and Replica Selection in Data Grid Service

Replica selection is interesting because it does not build on top of the core services, but rather relies on the functions provided by the replica management component described in the preceding section. Replica selection is the process of choosing a replica that will provide an application with data access characteristics that optimize a desired performance criterion, such as absolute performance (i.e. speed), cost, or security. The selected file instance may be local or accessed remotely. Alternatively, the selection process may initiate the creation of a new replica whose performance will be superior to the existing ones.

Where replicas are to be selected based on access time, Grid information services can provide information about network performance, and perhaps the ability to reserve network bandwidth, while the metadata repository can provide information about the size of the file. Based on this, the selector can rank all of the existing replicas to determine which one will yield the fastest data access time. Alternatively, the selector can consult the same information sources to determine whether there is a storage system that would result in better performance if a replica were created on it.
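The ranking step can be illustrated with a minimal sketch. Here we assume the selector has already obtained the file size from the metadata repository and per-location bandwidth estimates from the information service; the function names, locations, and figures are all hypothetical, and a real selector would use a far richer cost model.

```python
# Hypothetical sketch of replica ranking by estimated access time.
# Bandwidth figures would come from Grid information services; the file
# size would come from the metadata repository.

def estimated_access_time(size_bytes, bandwidth_bytes_per_s):
    """Crude cost model: transfer time = file size / available bandwidth."""
    return size_bytes / bandwidth_bytes_per_s

def select_replica(replicas, size_bytes):
    """Return the replica location with the lowest estimated access time.

    `replicas` maps a location name to the measured bandwidth (bytes/s)
    between that location and the requesting application.
    """
    return min(replicas,
               key=lambda loc: estimated_access_time(size_bytes, replicas[loc]))

# Example: a 2 GB file with three existing replicas.
replicas = {
    "storage-a": 10e6,   # 10 MB/s
    "storage-b": 50e6,   # 50 MB/s
    "storage-c": 25e6,   # 25 MB/s
}
best = select_replica(replicas, size_bytes=2e9)
print(best)  # storage-b
```

The same cost model could be evaluated for candidate storage systems that do not yet hold a replica, to decide whether creating a new one would beat all existing replicas.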

A more general selection service may consider access to subsets of a file instance. Scientific experiments often produce large files containing data for many variables, time steps, or events, and some application processing may require only a subset of this data. In this case, the selection function may provide an application with a file instance that contains only the needed subset of the data found in the original file instance. This can obviously reduce the amount of data that must be accessed or moved.

This type of replica management has been implemented in other data-management systems. For example, STACS is often capable of satisfying requests from High Energy Physics applications by extracting a subset of data from a file instance. It does this using a complex indexing scheme that represents application metadata for the events contained within the file. Other mechanisms for providing similar functionality may be built on application metadata obtainable from self-describing file formats such as NetCDF or HDF.

Providing this capability requires the ability to invoke filtering or extraction programs that understand the structure of the file and produce the required subset of data. This subset becomes a file instance with its own metadata and physical characteristics, which are provided to the replica manager. Replication policies determine whether this subset is recognized as a new logical file (with an entry in the metadata repository and a file instance recorded in the replica catalog), or whether the file should be known only locally, to the selection manager.
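A toy version of such an extraction program can be sketched as follows. Real filters would understand the file's internal structure (events, variables, time steps); here a plain byte range stands in for that structure, and the metadata dictionary handed back to the replica manager is an invented stand-in for whatever interface the replica manager actually exposes.

```python
# Hypothetical sketch of a subsetting filter: extract a byte range from a
# source file instance and describe the result as a new file instance.
# A real filter would interpret application-level structure, not raw bytes.

import os
import tempfile

def extract_subset(source_path, offset, length, subset_path):
    """Copy `length` bytes starting at `offset` from the source file
    instance into a new file instance, and return metadata describing it
    for hand-off to the replica manager (interface invented here)."""
    with open(source_path, "rb") as src, open(subset_path, "wb") as dst:
        src.seek(offset)
        dst.write(src.read(length))
    return {
        "name": subset_path,
        "size": os.path.getsize(subset_path),
        "derived_from": source_path,   # provenance link to the original
    }

# Demonstration with a throwaway file instance.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "events.dat")
    with open(src, "wb") as f:
        f.write(b"0123456789")
    meta = extract_subset(src, offset=2, length=5,
                          subset_path=os.path.join(d, "subset.dat"))
    print(meta["size"])  # 5
```

Whether the resulting metadata is registered globally or kept local to the selection manager is exactly the policy decision described above.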

Data selection with subsetting may exploit Grid-enabled servers, whose capabilities involve common operations such as reformatting data, extracting a subset, converting data for storage in a different type of system, or transferring data directly to another storage system in the Grid. The utility of this approach has been demonstrated as part of the Active Data Repository. The subsetting function could also exploit the more general capabilities of a computational Grid such as that provided by Globus. This offers the ability to support arbitrary extraction and processing operations on files as part of a data management activity.


Storage Systems and the Grid Storage API

In a Grid environment, data may be stored in different locations and on different devices with different characteristics. Mechanism neutrality implies that applications should not need to be aware of the specific low-level mechanisms required to access data at a particular location. Instead, applications should be presented with a uniform view of data and with uniform mechanisms for accessing that data. These requirements are met by the storage system abstraction and our grid storage API. Together, these define our data access service.

1. Data Abstraction: Storage Systems

We introduce as a basic data grid component what we call a storage system, which we define as an entity that can be manipulated with a set of functions for creating, destroying, reading, writing, and manipulating the attributes of named sequences of bytes called file instances. Notice that our definition of a storage system is a logical one: a storage system can be implemented by any storage technology that can support the required access functions. Implementations that target Unix file systems, HTTP servers, hierarchical storage systems such as HPSS, and network caches such as the Distributed Parallel Storage System (DPSS) are certainly envisioned. In fact, a storage system need not map directly to a single low-level storage device. For example, a distributed file system that manages files distributed over multiple storage devices or even sites can serve as a storage system, as can an SRB system that serves requests by mapping to multiple storage systems of different types.

Our definition of a file instance is also logical rather than physical. A storage system holds data, which may actually be stored in a file system, database, or other system; we do not care about how data is stored but specify simply that the basic unit that we deal with is a named sequence of uninterpreted bytes. The use of the term "file instance" for this basic unit is not intended to imply that the data must live in a conventional file system. For example, a data grid implementation might use a system such as SRB to access data stored within a database management system. A storage system will associate with each of the file instances that it contains a set of properties, including a name and attributes such as its size and access restrictions. The name assigned to a file instance by a particular storage system is arbitrary and has meaning only to that storage system. In many storage systems, a name will be a hierarchical directory path. In other systems such as SRB, it may be a set of application metadata that the storage system maps internally to a physical file instance.
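The logical nature of the abstraction can be made concrete with a short sketch: an abstract interface over named byte sequences, plus one possible backing implementation. The method names and the attribute set are illustrative choices, not the actual Grid storage API.

```python
# Sketch of the storage-system abstraction: any technology that can
# create, read, delete, and report attributes of named byte sequences
# (file instances) can serve as a storage system. Names are illustrative.

import abc
import os
import tempfile

class StorageSystem(abc.ABC):
    @abc.abstractmethod
    def write(self, name, data): ...
    @abc.abstractmethod
    def read(self, name): ...
    @abc.abstractmethod
    def delete(self, name): ...
    @abc.abstractmethod
    def attributes(self, name): ...

class UnixFileStorage(StorageSystem):
    """One possible implementation, backed by an ordinary directory tree;
    an HTTP server, HPSS, or DPSS backend would present the same interface."""
    def __init__(self, root):
        self.root = root
    def _path(self, name):
        return os.path.join(self.root, name)
    def write(self, name, data):
        with open(self._path(name), "wb") as f:
            f.write(data)
    def read(self, name):
        with open(self._path(name), "rb") as f:
            return f.read()
    def delete(self, name):
        os.remove(self._path(name))
    def attributes(self, name):
        st = os.stat(self._path(name))
        return {"size": st.st_size}

# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    store = UnixFileStorage(d)
    store.write("replica.dat", b"grid data")
    data = store.read("replica.dat")
    size = store.attributes("replica.dat")["size"]
    print(size)  # 9
```

The point of the abstraction is that clients written against `StorageSystem` never see which backend holds the bytes.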

2. Grid Storage API

The behavior of a storage system as seen by a data grid user is defined by the data grid storage API, which defines a variety of operations on storage systems and file instances. Our understanding of the functionality required in this API is still evolving, but it certainly should include support for remote requests to read and/or write named file instances and to determine file instance attributes such as size. In addition, to support optimized implementation of replica management services (discussed below) we require a third-party transfer operation used to transfer the entire contents of a file instance from one storage system to another.
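The essence of the third-party transfer operation can be sketched as follows: the client orchestrates the copy, but the data flows directly between the two storage systems rather than through the client. In-memory dicts stand in for remote storage systems here; a real implementation would use a transfer protocol between the endpoints, and every name below is invented for illustration.

```python
# Hypothetical sketch of a third-party transfer: copy the entire contents
# of a named file instance from a source storage system to a destination
# storage system, without routing the data through the requesting client.

class MemoryStorage:
    """Toy stand-in for a remote storage system."""
    def __init__(self):
        self.files = {}
    def read(self, name):
        return self.files[name]
    def write(self, name, data):
        self.files[name] = data

def third_party_transfer(source, name, destination):
    """The client issues one request; the byte movement happens between
    the two storage systems (simulated here by direct method calls)."""
    destination.write(name, source.read(name))

# Replicating a file instance from one storage system to another.
src, dst = MemoryStorage(), MemoryStorage()
src.write("calib.dat", b"detector calibration bytes")
third_party_transfer(src, "calib.dat", dst)
print(dst.read("calib.dat") == src.read("calib.dat"))  # True
```

This is the primitive on which an efficient replica manager can build: creating a replica is then a catalog update plus one third-party transfer.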

While the basic storage system functions just listed are relatively simple, various data grid considerations can increase the complexity of an implementation. For example, storage system access functions must be integrated with the security environment of each site to which remote access is required. Robust performance within higher-level functions requires reservation capabilities within storage systems and network interfaces. Applications should be able to provide storage systems with hints concerning access patterns, network performance, and so forth that the storage system can use to optimize its behavior. Similarly, storage systems should be capable of characterizing and monitoring their own performance; this information, when made available to storage system clients, allows them to optimize their behavior. Finally, data movement functions must be able to detect and report errors. While it may be possible to recover from some errors within the storage system, other errors may need to be reported back to the remote application that initiated the movement.
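The split between errors recovered inside the storage system and errors reported back to the application might be sketched like this. The retry policy and exception type are assumptions for illustration; real systems would distinguish transient from permanent failures far more carefully.

```python
# Sketch of error handling in data movement: retry presumed-transient
# failures a bounded number of times inside the storage layer, then
# surface the error to the application that initiated the movement.

def move_with_retry(transfer, attempts=3):
    """`transfer` is a callable performing one movement attempt and
    raising OSError on failure (an assumption of this sketch). Returns
    the transfer's result on success; re-raises the last error once
    attempts are exhausted so the initiating application can react."""
    last_error = None
    for _ in range(attempts):
        try:
            return transfer()          # success: nothing to report
        except OSError as e:
            last_error = e             # transient failure: retry locally
    raise last_error                   # unrecoverable: report to caller
```

A usage pattern would wrap each third-party transfer attempt in `move_with_retry`, so only failures that persist across retries reach the remote application.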
