Attaca: Git-esque Version Control for Absurd Quantities of Data (in Rust)
Version control is an indispensable component of modern software development. However, today’s version control tool of choice, Git, is not designed to handle extremely large datasets or large quantities of binary data. There are many situations in which a user might want to put such data under version control. For example, the developers of a system built on machine learning may want to version its dataset alongside the code that uses it. Game developers work with large 3D models and image files on a regular basis. Scientists whose datasets change constantly or undergo post-processing would benefit from the ability to version their data. In most cases, the “solution” of choice is to store such files separately from version-controlled source code.
Built in Rust, Attaca is a new Git-like version control system designed from the ground up to be efficient with multi-gigabyte files and multi-terabyte repositories. It shares most of its data structures with Git, enabling Git users to work productively with it right away, but borrows clustering and resilience properties from NoSQL databases. Like Git, Attaca enables any authorized user to verify the integrity of a project's history back to its beginning. Like a NoSQL database, Attaca enables horizontal scaling and replication of data across multiple data centers. This talk and demonstration will walk through the crucial differences between Git and Attaca, weigh the pros and cons of using Rust, and then show Attaca running on a large, live scientific dataset and storage cluster.
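
To make the integrity claim concrete, here is a minimal sketch of a Git-style hash chain, in which each commit is identified by a hash of its contents and those contents include the identifier of its parent. This is an illustration of the general technique, not Attaca's actual object format, which this abstract does not specify. The names `Store`, `Commit`, `digest`, and `verify` are hypothetical, and the standard library's `DefaultHasher` stands in for a cryptographic hash only to keep the example dependency-free; it is not collision-resistant and would not be suitable in a real system.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// ASSUMPTION: a 64-bit non-cryptographic hash stands in for a real
// cryptographic digest (Git, for comparison, uses SHA-1 by default).
type ObjectId = u64;

fn digest(bytes: &[u8]) -> ObjectId {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

// Hypothetical commit object: points at its parent and at a data snapshot.
struct Commit {
    parent: Option<ObjectId>,
    data: ObjectId,
    message: String,
}

impl Commit {
    // Serialize the commit so that its id covers every field,
    // including the parent id.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        match self.parent {
            Some(parent) => {
                out.push(1);
                out.extend_from_slice(&parent.to_be_bytes());
            }
            None => out.push(0),
        }
        out.extend_from_slice(&self.data.to_be_bytes());
        out.extend_from_slice(self.message.as_bytes());
        out
    }
}

// Hypothetical in-memory object store keyed by content hash.
struct Store {
    commits: HashMap<ObjectId, Commit>,
}

impl Store {
    fn new() -> Self {
        Store { commits: HashMap::new() }
    }

    // Record a commit and return its content-derived id.
    fn commit(&mut self, parent: Option<ObjectId>, payload: &[u8], message: &str) -> ObjectId {
        let commit = Commit {
            parent,
            data: digest(payload),
            message: message.to_string(),
        };
        let id = digest(&commit.encode());
        self.commits.insert(id, commit);
        id
    }

    // Walk from `head` back to the root, re-hashing each commit and
    // comparing the result with the id it is stored under.
    fn verify(&self, head: ObjectId) -> bool {
        let mut current = Some(head);
        while let Some(id) = current {
            match self.commits.get(&id) {
                Some(commit) if digest(&commit.encode()) == id => current = commit.parent,
                _ => return false, // missing or altered object
            }
        }
        true
    }
}

fn main() {
    let mut store = Store::new();
    let first = store.commit(None, b"dataset v1", "initial import");
    let second = store.commit(Some(first), b"dataset v2", "post-processing");
    println!("history intact: {}", store.verify(second));

    // Altering any ancestor changes its hash, so verification from the head fails.
    store.commits.get_mut(&first).unwrap().message.push('!');
    println!("history intact after tampering: {}", store.verify(second));
}
```

Because every identifier depends on the parent's identifier, trusting the head commit's hash is enough to check the whole chain behind it, which is the property the abstract attributes to both Git and Attaca.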