Notes on VAST Data
I/O servers (CNodes)
- Stateless worker nodes that can be scaled without bringing the system down
- Each CNode mounts every JBOF and accesses it via RDMA
- CNodes hold no state, so an I/O server failure does not affect the rest of the system
Storage servers (DNodes)
- SCM (storage-class memory: persistent, not RAM) and QLC (flash)
- SCM
- Stages incoming writes and groups them into large sequential writes optimized for flash
- QLC
- Cheap and optimized for reads and sequential writes
- 300 PB limit
- if JBOF or SCM fails, then it
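The write-staging idea in the SCM notes above (buffer small writes, then emit one large sequential write to flash) can be sketched roughly as follows. This is a toy illustration only, not VAST's implementation: the `WriteStager` class, the `stripe_size` parameter, and the `flush_fn` callback are all hypothetical names introduced for the example.

```python
class WriteStager:
    """Toy write buffer: collects small writes, flushes one sequential stripe.

    Hypothetical sketch; names and behavior are assumptions, not VAST's API.
    """

    def __init__(self, stripe_size, flush_fn):
        self._stripe_size = stripe_size  # bytes to accumulate before flushing
        self._flush_fn = flush_fn        # called with (stripe_bytes, index)
        self._pending = []               # list of (key, data) in arrival order
        self._pending_bytes = 0

    def write(self, key, data):
        """Stage a small write; flush once a full stripe has accumulated."""
        self._pending.append((key, data))
        self._pending_bytes += len(data)
        if self._pending_bytes >= self._stripe_size:
            self.flush()

    def flush(self):
        """Concatenate staged writes into one stripe and hand it to flush_fn."""
        if not self._pending:
            return
        index = {}   # key -> (offset, length) inside the stripe
        offset = 0
        for key, data in self._pending:
            index[key] = (offset, len(data))  # a repeated key keeps its last write
            offset += len(data)
        stripe = b"".join(data for _, data in self._pending)
        self._flush_fn(stripe, index)
        self._pending = []
        self._pending_bytes = 0
```

With a stripe size of 8 bytes, two 4-byte writes produce a single 8-byte sequential write plus an index recording where each piece landed.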
Network
- NVMe-oF over Ethernet or InfiniBand (IB)
- Very sensitive to topology
Metadata
- Distributed across DNodes via a consistent hashing algorithm
- Redirect-on-Write (ROW): updates are written to new locations, and metadata is atomically updated only after the writes are complete.
- File system structure (e.g., file-to-block mapping).
- Erasure coding and striping information.
- Extent information (location of blocks on NVMe devices).
- Data compression information.
- User-defined metadata (in S3 mode).
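As a rough illustration of how consistent hashing can spread metadata across DNodes, here is a minimal hash-ring sketch. It assumes a generic textbook ring with virtual nodes; the `HashRing` class, the `vnodes` parameter, and the `dnode-N` names are hypothetical and are not taken from VAST's actual design.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring (illustrative, not VAST's algorithm)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes  # virtual points per node, smooths the distribution
        self._ring = []        # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place vnodes points for this node on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        """Drop all of a node's points; other nodes' points are untouched."""
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        """Return the node owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]  # wrap around at the end


if __name__ == "__main__":
    ring = HashRing(["dnode-1", "dnode-2", "dnode-3"])
    print(ring.lookup("/projects/data/file.bin"))
```

The property that matters here is that removing a node only remaps the keys that node owned; every other key keeps its owner, which limits how much metadata has to move after a failure or a resize.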
Failure
- RAID 5: requires at least 3 drives and can tolerate 1 drive failure
- Erasure coding
- Reed-Solomon
- The overhead of erasure coding can be calculated using the formula: overhead = (n_data + n_parity) / n_data - 1. For example, in a configuration with 17 data chunks and 3 parity chunks (often written as EC 17,3), the overhead would be (17 + 3) / 17 - 1 ≈ 0.18, or 18% [1](https://www.reddit.com/r/DataHoarder/comments/99ubi0/erasure_coding_overhead/)
- storage overhead
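To make the overhead formula concrete, the small sketch below evaluates it for the two layouts mentioned in these notes. The function names `ec_overhead` and `usable_fraction` are made up for the example, and treating a 3-drive RAID 5 as 2 data + 1 parity is an assumption used only for comparison.

```python
def ec_overhead(n_data, n_parity):
    """Extra raw capacity per unit of user data: (n_data + n_parity) / n_data - 1."""
    if n_data <= 0 or n_parity < 0:
        raise ValueError("n_data must be positive and n_parity non-negative")
    return (n_data + n_parity) / n_data - 1


def usable_fraction(n_data, n_parity):
    """Fraction of raw capacity that holds user data."""
    if n_data <= 0 or n_parity < 0:
        raise ValueError("n_data must be positive and n_parity non-negative")
    return n_data / (n_data + n_parity)


if __name__ == "__main__":
    layouts = [
        ("RAID 5, 3 drives (2 data + 1 parity)", 2, 1),
        ("EC 17,3 (17 data + 3 parity)", 17, 3),
    ]
    for label, n_data, n_parity in layouts:
        print(
            f"{label}: overhead {ec_overhead(n_data, n_parity):.1%}, "
            f"usable {usable_fraction(n_data, n_parity):.1%}"
        )
```

Wider stripes lower the overhead: the 3-drive RAID 5 layout costs 50% extra raw capacity for one tolerated failure, while the 17+3 layout costs about 18% while carrying three parity chunks.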
https://xenon.com.au/products-and-solutions/vast-data/
https://www.supermicro.com/en/support/resources/aoc/aiom?show=SELECT&type=Hyper
https://cse.buffalo.edu/faculty/tkosar/cse710/papers/lustre-whitepaper.pdf
https://www.nas.nasa.gov/hecc/support/kb/lustre-basics_224.html
https://wiki.lustre.org/Understanding_Lustre_Internals