Notes on VAST Data

I/O servers (CNodes)
- Stateless worker nodes that can be scaled without bringing the system down
- Each CNode mounts every JBOF and accesses it via RDMA
- Because CNodes hold no state, a CNode failure doesn't affect the rest of the system

Storage servers (DNodes)
- SCM (storage-class memory, persistent rather than volatile RAM) and QLC (flash)
- SCM
  - Stages writes and groups them into large sequential writes optimized for flash
- QLC
  - Cheap and optimized for reads and sequential writes
- 300 PB limit
- If a JBOF or SCM fails, then it ...

Network
- NVMe-oF over Ethernet or InfiniBand (IB)
- Very sensitive to network topology

Metadata
- Distributed across DNodes via a consistent hashing algorithm
- Redirect-on-Write (ROW): updates are written to new locations, and metadata is atomically updated only after the writes are complete
- Metadata includes:
  - File system structure (e.g., file-to-block mapping)
  - Erasure coding and striping information
  - Extent information (location of blocks on NVMe devices)
  - Data compression information
  - User-defined metadata (in S3 mode)

Failure
- RAID 5: needs at least 3 drives and can tolerate 1 drive failure
- Erasure coding
  - Reed-Solomon
  - The storage overhead of erasure coding can be calculated using the formula: overhead = (n_data + n_parity) / n_data - 1
  - For example, in a configuration with 17 data chunks and 3 parity chunks (often written as EC 17,3), the overhead would be: (17 + 3) / 17 - 1 ≈ 0.176, or about 18% [1](https://www.reddit.com/r/DataHoarder/comments/99ubi0/erasure_coding_overhead/)

- https://xenon.com.au/products-and-solutions/vast-data/
- https://www.supermicro.com/en/support/resources/aoc/aiom?show=SELECT&type=Hyper
- https://cse.buffalo.edu/faculty/tkosar/cse710/papers/lustre-whitepaper.pdf
- https://www.nas.nasa.gov/hecc/support/kb/lustre-basics_224.html
- https://wiki.lustre.org/Understanding_Lustre_Internals
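
The note that metadata is distributed across DNodes via consistent hashing can be illustrated with a generic hash ring. This is only a minimal sketch of the general technique, not VAST's actual scheme, which these notes don't describe. The `HashRing` class, the `dnode-N` names, the virtual-node count, and the file paths are all assumptions made up for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Adding a fourth node should only move keys onto the new node.
ring3 = HashRing(["dnode-1", "dnode-2", "dnode-3"])
ring4 = HashRing(["dnode-1", "dnode-2", "dnode-3", "dnode-4"])
keys = [f"/data/file-{i}" for i in range(1000)]
moved = [k for k in keys if ring3.owner(k) != ring4.owner(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding dnode-4")
```

The point of the design is that growing the cluster relocates only the keys that land on the new node, rather than reshuffling everything as a plain `hash(key) % n` scheme would.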
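
The Redirect-on-Write behaviour noted above (write to a new location first, then update metadata atomically) can be sketched with a toy in-memory store. This is an illustration of the idea under simplifying assumptions, not how VAST implements it: `RowStore` and its fields are hypothetical, and a single dictionary assignment stands in for the atomic metadata update.

```python
class RowStore:
    """Toy redirect-on-write store: data is never overwritten in place."""

    def __init__(self):
        self.blocks = {}     # physical location -> bytes
        self.metadata = {}   # logical name -> physical location
        self._next_loc = 0

    def write(self, name: str, data: bytes) -> None:
        # Step 1: write the new data to a fresh location.
        loc = self._next_loc
        self._next_loc += 1
        self.blocks[loc] = data
        # Step 2: only once the data is in place, flip the metadata pointer.
        # (One dict assignment stands in for the atomic metadata update.)
        self.metadata[name] = loc

    def read(self, name: str) -> bytes:
        return self.blocks[self.metadata[name]]


store = RowStore()
store.write("file-a", b"version 1")
old_loc = store.metadata["file-a"]
store.write("file-a", b"version 2")

print(store.read("file-a"))    # the reader sees the new version
print(store.blocks[old_loc])   # the old block is still intact at its old location
```

Because the old block is untouched until the pointer flips, a failure partway through a write leaves readers with the previous consistent version.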
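
The erasure-coding overhead formula in the notes above is easy to check numerically. The sketch below applies it to the EC 17,3 example and, for comparison, to a minimal 3-drive RAID 5 treated as 2 data + 1 parity. The function names are made up for illustration.

```python
def ec_overhead(n_data: int, n_parity: int) -> float:
    """Extra raw capacity needed per unit of user data."""
    if n_data <= 0 or n_parity < 0:
        raise ValueError("need n_data > 0 and n_parity >= 0")
    return (n_data + n_parity) / n_data - 1


def usable_fraction(n_data: int, n_parity: int) -> float:
    """Share of raw capacity that holds user data."""
    return n_data / (n_data + n_parity)


print(f"EC 17,3 overhead:       {ec_overhead(17, 3):.3f}")      # 0.176
print(f"EC 17,3 usable:         {usable_fraction(17, 3):.2f}")  # 0.85
print(f"3-drive RAID 5 overhead: {ec_overhead(2, 1):.3f}")      # 0.500
```

Wide stripes are what keep the overhead low: 3 parity chunks spread over 17 data chunks cost about 18%, while 1 parity drive over 2 data drives costs 50%.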