A RAM disk is a portion of RAM that is treated as if it were disk storage. That is, it can host a file system through which applications deal with files in the same way they deal with files on other disks, using the same file API, but with much better performance.
RAM is the fastest storage medium accessible to applications, far faster than other storage media (especially HDDs and SSDs). However, it is a very expensive medium, which makes it a scarce resource with relatively limited capacity (measured in gigabytes, while the capacity of other storage media is measured in terabytes). Moreover, it is volatile memory that loses its data on a system restart, shutdown, or other power-loss scenarios. It is important to keep these limitations in mind when thinking of ways to leverage the outstanding performance of a RAM disk.
- Scarcity limits the amount of data that can be stored, which could rule a solution idea out. However, the limit can be pushed further by paying more for capacity, as machines now ship with hundreds of gigabytes of RAM.
- The volatility of RAM can work for or against a solution idea: RAM is unfit for persistent data, but it fits volatile, temporary data well.
- RAM performance is always a win for the solution. It is the reason we consider RAM disks in the first place.
In the big data applications domain, we list two possible generic use cases:
- Storing temporary and intermediate discardable data
- Storing persistent data backed by persistent storage medium
Storing temporary and intermediate discardable data:
In the big data applications domain, most temporary and intermediate data is volatile in nature; it is nonpermanent (by definition) and discardable, in the sense that its existence is tied to the existence of the application. While the application is running, this data is generated and kept because it is required for processing to finish. Once the application is done processing, the data is no longer needed and can be discarded.
Moreover, in the majority of big data applications, the discardability of temporary data extends well beyond that: it is discardable not only after the application terminates but also during the application's lifetime. That is, an application can regenerate this data (in whole or in part) if it is lost for some reason. In general, this is a design principle of big data frameworks.
In a MapReduce framework, shuffle files are intermediate files that are written to disk and transferred over the network. They are volatile in the sense described above. However, they are comparable in size to the input data; in almost all situations they are as big as the (big) input, unless the processing framework does something more with the data, such as writing the files compressed or partitioned, either across time (i.e. processing one partition at a time) or across several machines. Writing these files to a RAM disk would definitely improve shuffle performance, especially when multiple jobs with different workloads are running.
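As a minimal sketch of the idea, the snippet below hash-partitions records into one gzip-compressed file per partition using the ordinary file API. The directory `/mnt/ramdisk` is a hypothetical tmpfs mount point (not something the article prescribes); the sketch falls back to a regular temp directory so it runs anywhere:

```python
import gzip
import os
import tempfile

# Hypothetical shuffle directory; on a real deployment this would point at a
# tmpfs/ramfs mount such as /mnt/ramdisk. Fall back to a regular temp dir so
# the sketch is runnable anywhere.
SHUFFLE_DIR = "/mnt/ramdisk" if os.path.isdir("/mnt/ramdisk") else tempfile.mkdtemp()

def write_shuffle_files(records, num_partitions):
    """Hash-partition (key, value) records into one gzip file per partition."""
    paths = []
    for p in range(num_partitions):
        path = os.path.join(SHUFFLE_DIR, "shuffle-part-%d.gz" % p)
        # Same file API as any other disk; compression shrinks the footprint
        # on the scarce RAM disk.
        with gzip.open(path, "wt") as f:
            for key, value in records:
                if hash(key) % num_partitions == p:
                    f.write("%s\t%s\n" % (key, value))
        paths.append(path)
    return paths

def read_partition(path):
    """Read one shuffle partition back as a list of (key, value) tuples."""
    with gzip.open(path, "rt") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f]
```

Nothing changes in the application code when the directory moves between a RAM disk and a regular disk, which is the point of exposing the RAM through a file system.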
It is worth mentioning that there are at least two levels of buffering when making disk I/Os: application-level buffers and OS-level buffers. With appropriate configuration at both levels, it is possible to approach the performance of a RAM disk. However, the appropriate configuration is not always achievable: application buffer capacity differs from one application to another and from one workload to another, while the OS buffers are affected by other I/O operations from the same or other applications. So a dedicated, adaptable buffer for volatile temporary data would more reliably assure the required performance level.
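The two buffering levels can be made concrete with Python's file API, a minimal sketch rather than a tuning recommendation: the application-level buffer size is set via `open()`'s `buffering` argument, `flush()` hands data to the OS-level buffers (the page cache), and `os.fsync()` forces it down to the storage medium itself, the step that a RAM disk makes nearly free.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "buffered.dat")

# 1 MiB application-level buffer, configured per file handle.
with open(path, "wb", buffering=1024 * 1024) as f:
    f.write(b"x" * 4096)   # may still sit entirely in the application buffer
    f.flush()              # hand the bytes to the OS-level buffers (page cache)
    os.fsync(f.fileno())   # force them to the physical storage medium

print(os.path.getsize(path))  # 4096
```

The application controls only the first level; the page cache is shared with every other process doing I/O, which is why its effectiveness varies with the overall workload.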
Storing persistent data backed by persistent storage medium:
There are situations in which computing reusable data is very expensive in terms of computing resources. Moreover, such expensive data may be used frequently afterwards (for example, by other analytical tasks and low-latency jobs). Persisting this data therefore becomes mandatory, both to save computation resources and to assure real-time performance for its consumers.
To devise a scheme for persisting such data, a number of factors should be considered. For how long will the data be persisted? How frequently will it be consumed? Will it be shared across different applications? If the data is required for a relatively long period, a persistent storage medium is obligatory. If the data is consumed frequently, a reasonably fast storage medium is preferred. In addition, sharing data efficiently across applications is not straightforward. A storage medium that satisfies all three aspects of persistence, usage frequency, and sharing would be perfect for this case.
All these aspects can be satisfied by combining different storage types in a tiered storage model: the RAM disk resides at the top tier, backed at lower tiers by persistent storage. The RAM disk assures high performance while the persistent storage assures persistence. If data in the RAM disk is lost, a suitable failover plan can restore it from the persistent storage. Also, sharing data among applications becomes a matter of reading and writing files, which simplifies the implementation of those applications.
A simple implementation of the tiered storage model is to save the data in two locations: one copy on the RAM disk and another copy on the persistent disk. A consumer application first tries to read the data from the RAM disk. If it is not found there, the application reads it from the persistent disk and at the same time rewrites it to the RAM disk for subsequent reads.
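This read-through scheme can be sketched in a few lines. The two mount points are assumptions for illustration: `RAM_TIER` would be a RAM disk (e.g. a tmpfs mount) and `DISK_TIER` a persistent disk; ordinary temp directories stand in for both so the sketch is runnable anywhere.

```python
import os
import tempfile

# Stand-ins for the two tiers; on a real system RAM_TIER would be a tmpfs
# mount and DISK_TIER a directory on persistent storage.
RAM_TIER = tempfile.mkdtemp(prefix="ram-")
DISK_TIER = tempfile.mkdtemp(prefix="disk-")

def save(name, data):
    """Write one copy of the data to each tier."""
    for tier in (RAM_TIER, DISK_TIER):
        with open(os.path.join(tier, name), "wb") as f:
            f.write(data)

def load(name):
    """Read from the RAM tier first; on a miss, fall back to the persistent
    tier and repopulate the RAM tier for subsequent reads."""
    ram_path = os.path.join(RAM_TIER, name)
    if os.path.exists(ram_path):
        with open(ram_path, "rb") as f:
            return f.read()
    with open(os.path.join(DISK_TIER, name), "rb") as f:
        data = f.read()
    with open(ram_path, "wb") as f:  # rewrite to the RAM disk
        f.write(data)
    return data
```

Losing the RAM tier (e.g. after a reboot) then costs only the first read of each file, which is the failover behavior the tiered model aims for.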
It is worth mentioning that Tachyon is an in-memory distributed file system backed by HDFS. Also, at the time of writing this article, HDFS is being developed in phases to support heterogeneous storage types (like RAM disk, regular disk, SSD, ...) and different data storage policies (like hot data, warm data, cold data, ...). Together, these two features allow the implementation of the tiered storage system we seek.
- RAM disk on Wikipedia (http://en.wikipedia.org/wiki/RAM_drive)
- A collection of RAM disk drivers for different OSes (http://en.wikipedia.org/wiki/List_of_RAM_drive_software)
- tmpfs: in-memory file system backed by swap space for Unix-like OSs (http://en.wikipedia.org/wiki/Tmpfs)
- tmpfs vs. ramfs:
- Creating a RAM disk in Linux (http://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux)
- HDFS heterogeneous storage types and storage policies: