Notes on VAST Data
VAST is premised on the Disaggregated Shared Everything (DASE) architecture. This essentially means that the I/O and storage functions run on separate hardware, for example a diskless server handling I/O requests while the data itself lives in shared storage enclosures.
# Big Ideas
- The VAST DASE (Disaggregated Shared Everything Architecture) disaggregates (separates) the storage media from the CPUs that manage that media and provide storage services. This disaggregated storage, including all the system metadata, is shared by all the VAST Servers in the cluster. DASE allows users to scale the capacity of a VAST Cluster independently from the compute resources of the cluster, adding enclosures for capacity and servers for performance.
- Metadata operations are quick
- Compute and storage are independently scalable
- Shared nothing is out of date now that networks (with RDMA) are as fast as or faster than local storage
- the gap between network and local storage performance keeps narrowing
# Architecture
- CNodes
- DNodes
- SCM (storage class memory) and QLC flash
- Network
- NVMeoF via Ethernet or IB
- Metadata
- Fault Tolerance
I/O servers (CNodes)
- Stateless worker nodes that can be scaled without bringing the system down
- Each CNode mounts every JBOF and accesses it via RDMA
- Because CNodes hold no state, a CNode failure doesn't affect the system
Storage servers (DNodes)
- SCM (storage class memory) and QLC flash
- SCM
- Stages writes and groups them into large sequential, flash-friendly writes (see the sketch after this list)
- QLC
- Cheap and optimized for reads and sequential writes
- 300 PB limit
- if a JBOF or an SCM device fails, data is rebuilt from the erasure-coded stripes on the surviving devices (see Failure below)
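
A minimal sketch of the SCM write-staging idea from the DNode bullets above: small writes accumulate in a buffer and are flushed to flash as one large sequential stripe. The class name, the 1 MiB stripe size, and the list-backed "flash" are assumptions for illustration, not VAST's implementation.

```python
# Hypothetical sketch: stage small writes in SCM, then flush them to QLC flash
# as one large sequential stripe. Names and the stripe size are assumed.

STRIPE_SIZE = 1 << 20  # flush once ~1 MiB has accumulated (illustrative value)

class WriteBuffer:
    def __init__(self, flash):
        self.flash = flash      # stands in for QLC; anything with .append() works
        self.staged = []        # small writes held in SCM until flush
        self.staged_bytes = 0

    def write(self, data: bytes):
        # The write is acknowledged once it lands in (persistent) SCM.
        self.staged.append(data)
        self.staged_bytes += len(data)
        if self.staged_bytes >= STRIPE_SIZE:
            self.flush()

    def flush(self):
        # Group buffered writes into a single sequential, flash-friendly write.
        self.flash.append(b"".join(self.staged))
        self.staged.clear()
        self.staged_bytes = 0

flash = []                      # receives large sequential stripes
buf = WriteBuffer(flash)
for _ in range(300):
    buf.write(b"x" * 4096)      # many small 4 KiB writes
print(len(flash), "stripe(s) flushed")   # -> 1 stripe(s) flushed
```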
Network
- NVMeoF via Ethernet or IB
- Very sensitive to topology
Metadata
- Distributed across DNodes via a consistent hashing algorithm (see the sketch after this list)
- Redirect-on-Write (ROW): updates are written to new locations, and metadata is atomically updated only after the writes complete
- File system structure (e.g., file-to-block mapping).
- Erasure coding and striping information.
- Extent information (location of blocks on NVMe devices).
- Data compression information.
- User-defined metadata (in S3 mode).
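
To make the consistent-hashing bullet concrete, here is a generic sketch that maps a metadata key (e.g. a file or object id) to a DNode. The ring construction, virtual-node count, and node names are assumptions, not VAST's actual placement scheme.

```python
# Illustrative consistent hashing: map a metadata key to a DNode.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to smooth the distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["dnode-1", "dnode-2", "dnode-3"])
print(ring.lookup("/datasets/train/part-0001"))  # -> one of the DNodes
```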
Failure
- RAID 5: requires at least 3 drives and can tolerate 1 drive failure
- Erasure coding
- Reed-Solomon
- Erasure coding overhead: overhead = (n_data + n_parity) / n_data - 1. For example, with 17 data chunks and 3 parity chunks (EC 17+3): (17 + 3) / 17 - 1 ≈ 0.18, or about 18% overhead [1](https://www.reddit.com/r/DataHoarder/comments/99ubi0/erasure_coding_overhead/) (see the helper after this list)
- storage overhead
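
A tiny helper that reproduces the overhead arithmetic above; the EC 17+3 numbers come from the example, and the 2+1 line mirrors the minimal RAID 5 case.

```python
# Storage overhead of an erasure-coded stripe: extra parity relative to data.
def ec_overhead(n_data: int, n_parity: int) -> float:
    return (n_data + n_parity) / n_data - 1

print(f"{ec_overhead(17, 3):.0%}")   # EC 17+3 -> 18% overhead
print(f"{ec_overhead(2, 1):.0%}")    # RAID-5-like 2+1 -> 50% overhead
```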
https://xenon.com.au/products-and-solutions/vast-data/
https://www.supermicro.com/en/support/resources/aoc/aiom?show=SELECT&type=Hyper
https://cse.buffalo.edu/faculty/tkosar/cse710/papers/lustre-whitepaper.pdf
https://www.nas.nasa.gov/hecc/support/kb/lustre-basics_224.html
https://wiki.lustre.org/Understanding_Lustre_Internals
building blocks
performance
cost
limitations
disaggregated shared everything architecture
[[Big Idea]]
https://www.youtube.com/watch?v=hobvps-H38o
A DASE cluster includes four primary logical networks.
- The NVMe fabric, or back-end network, connects CNodes to DNodes. VAST clusters use NVMe over RDMA for CNode<->DNode communication over 100 Gbps Ethernet or InfiniBand, with Ethernet as the default.
- The host network, or front-end network, carries file, object, or database requests from client hosts to the cluster's CNodes.
- The management network carries management traffic to the cluster, including DNS and authentication traffic.
- The IPMI network is used for managing and monitoring the hardware in the cluster.
4 reasons shared nothing is losing popularity:
Shared nothing emerged when networks were slower than local disks, so it made sense for compute nodes to contain their own storage. With NVMe-oF, the network is as fast as (or faster than) local disk, so a shared-everything model makes sense.
When a VAST Server (CNode) receives a read request, that CNode accesses the VAST DataStore's persistent metadata from shared Storage Class Memory to find where the requested data actually resides. It then reads that data directly from hyperscale flash (or from SCM if the data has not yet been migrated from the write buffer) and forwards the data to the requesting client. For write requests, the VAST Server writes both data and metadata directly to multiple SSDs and then acknowledges the write.
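
The toy model below mimics that read/write path with in-memory dictionaries standing in for SCM and flash; it is an illustrative sketch under those assumptions, not VAST code.

```python
# Toy model of the CNode I/O path described above.
scm_metadata = {}    # path -> {"device": "scm" | "flash", "location": key}
scm_buffer = {}      # SCM write buffer contents, keyed by location
flash = {}           # hyperscale-flash contents, keyed by location

def handle_write(path: str, data: bytes):
    # Write data and metadata to (what stands in for) SCM, then acknowledge.
    location = f"buf:{path}"
    scm_buffer[location] = data
    scm_metadata[path] = {"device": "scm", "location": location}
    return "ack"

def migrate(path: str):
    # Later, asynchronously: move buffered data to flash and repoint metadata.
    meta = scm_metadata[path]
    flash[meta["location"]] = scm_buffer.pop(meta["location"])
    scm_metadata[path] = {"device": "flash", "location": meta["location"]}

def handle_read(path: str) -> bytes:
    # Consult metadata in SCM, then read from flash or the write buffer.
    meta = scm_metadata[path]
    store = flash if meta["device"] == "flash" else scm_buffer
    return store[meta["location"]]

handle_write("/a/file", b"hello")
assert handle_read("/a/file") == b"hello"   # served from the write buffer
migrate("/a/file")
assert handle_read("/a/file") == b"hello"   # now served from flash
```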
[[Metadata]]