Big Data

Kastor, the Scalable, Cost-Efficient Data Lakehouse

Jammond Hayes-Ruffin

August 27, 2024

We are excited to introduce Kastor, our data lakehouse solution. Kastor tackles the high cost and complexity of traditional data warehouses and data lakes by pairing blob storage with an efficient embedded query engine, offering a scalable and cost-efficient alternative for data management.

Background

As organizations grow, their data becomes fragmented across multiple sources, such as transactional databases, spreadsheets, and other storage systems. This fragmentation leads to operational inefficiencies and significant integration challenges. Departments managing their own data, rather than relying on centralized services, may find this particularly daunting.

Kastor addresses these issues by consolidating diverse data sources into a secure and scalable system, enhanced with robust data governance to streamline processes and simplify management.

Kastor offers data integration, managed Iceberg tables, and an embedded query engine within a single application to streamline data processing and analytics. We are committed to enabling organizations to quickly deploy a data lakehouse without complex documentation or intricate settings. We strive for seamless integration, minimal setup time, and speed. Essentially, we are building a data platform that is straightforward and user-friendly, eliminating the need for specialized skills to operate.

Solution

Kastor's architecture is strongly influenced by the principles set forth in the Composable Data Management System Manifesto by Wes McKinney and co-authors, and by Davis Treybig's vision of S3 as the universal infrastructure backend. Although we use Google Cloud Storage (GCS) instead of S3, the core idea is the same: use blob storage as our primary hot storage to balance cost efficiency with performance. With this model, Kastor accepts a slight increase in latency in exchange for substantial cost savings.

DataFusion

We chose Apache DataFusion as the query engine for its high performance and extensibility. Written in Rust and built on the Apache Arrow in-memory format, DataFusion can offer substantial performance advantages over traditional JVM-based engines like Apache Spark. Its modular architecture supports significant customization and scalability across a range of analytical and data-processing tasks.

While DataFusion is still evolving and may not yet match Spark's feature set, its active development and growing community suggest the gap will continue to narrow.

Iceberg Tables

Kastor uses Apache Iceberg tables for data management. Iceberg is an open-source, high-performance table format that lets multiple writers and readers work concurrently with ACID transactions, providing reliable data consistency that combines the best aspects of data warehouses and data lakes.
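
To make the concurrent-writer guarantee more concrete, here is a minimal Python sketch of optimistic concurrency, the general approach Iceberg takes to commits: each writer prepares a new table version against the version it read, and the commit succeeds only if the table has not changed in the meantime. The `ToyCatalog` class, the `append` helper, and the row values are hypothetical names used for illustration; they are not Iceberg's or Kastor's API.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical stand-in for a catalog tracking one table's current version."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.rows = ()

    def read(self):
        # Readers get an immutable view plus the version it belongs to.
        with self._lock:
            return self.version, self.rows

    def commit(self, base_version, new_rows):
        # Atomic compare-and-swap: succeed only if nobody committed since base_version.
        with self._lock:
            if base_version != self.version:
                raise CommitConflict(
                    f"expected v{base_version}, table is at v{self.version}"
                )
            self.rows = tuple(new_rows)
            self.version += 1
            return self.version


def append(catalog, row, max_retries=5):
    # Writers retry on conflict by re-reading the latest version.
    for _ in range(max_retries):
        base, rows = catalog.read()
        try:
            return catalog.commit(base, rows + (row,))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
v, rows = catalog.read()                 # writer A reads version 0
append(catalog, "from-writer-b")         # writer B commits first
try:
    catalog.commit(v, rows + ("from-writer-a",))  # A's stale commit is rejected
except CommitConflict:
    append(catalog, "from-writer-a")     # A retries on top of B's commit
```

In this toy model neither write is lost: the stale commit is refused rather than silently overwriting the other writer's data, and the retry lands on top of the latest version.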

Along with ACID transactions, Iceberg offers considerable benefits compared to legacy alternatives:

Efficient Data Management: Apache Iceberg's table format and metadata management support faster query performance and more efficient data operations.
Scalable Architecture: Iceberg handles large volumes of data while maintaining high performance.
Flexible Schema Evolution: Iceberg supports schema evolution without disrupting ongoing operations, making it easy to handle evolving data structures.
Time Travel and Rollbacks: Iceberg offers built-in time travel capabilities to access historical data snapshots and roll back changes if needed.
Interoperability: Iceberg integrates smoothly with a wide range of data processing engines and tools.
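
The time travel and rollback items above can be illustrated with a small Python sketch. It models a table as an append-only log of immutable snapshots, which is the general idea behind snapshot-based table formats. The `ToyTable` class and its methods are hypothetical and heavily simplified; this is not the Iceberg API, and real tables track data files and metadata rather than rows in memory.

```python
class ToyTable:
    """Hypothetical append-only snapshot log; not the Iceberg API."""

    def __init__(self):
        self._snapshots = [()]   # snapshot 0 is the empty table
        self.current_id = 0

    def append(self, *rows):
        # Each write produces a new immutable snapshot; older ones are kept.
        self._snapshots.append(self.scan() + tuple(rows))
        self.current_id = len(self._snapshots) - 1
        return self.current_id

    def scan(self, snapshot_id=None):
        # Time travel: read any retained snapshot by id (default: current).
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self._snapshots[sid]

    def rollback_to(self, snapshot_id):
        # Rollback only moves the current pointer; no data is rewritten.
        if not 0 <= snapshot_id < len(self._snapshots):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.append("a", "b")        # first write
s2 = table.append("bad-row")       # a write we later regret
old = table.scan(snapshot_id=s1)   # time travel to the earlier snapshot
table.rollback_to(s1)              # make the earlier snapshot current again
```

After the rollback, a plain `table.scan()` returns the rows from the first write, while the later snapshot remains readable by id for auditing.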

Binary Large Object Storage

Kastor employs binary large object (BLOB) storage as its primary hot storage solution, carefully balancing cost-efficiency with performance. By choosing BLOB storage, we prioritize affordability, accepting minimal increases in latency in exchange for significantly lower costs.

BLOB storage is universally available, secure, highly scalable, and capable of efficiently managing diverse data types. It offers several benefits for handling large and complex data.

Here are some of the key advantages:

Scalability: BLOB storage systems are designed to handle large volumes of data efficiently.
Flexibility: BLOB storage can manage various data types, including unstructured data like images, videos, and documents.
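Cost-Effectiveness: BLOB storage delivers significantly lower storage costs in exchange for a minimal increase in latency.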
Accessibility: BLOB storage systems can be accessed from anywhere with an internet connection. This allows for easy data sharing and retrieval across different locations and devices.
Durability and Redundancy: BLOB storage includes built-in redundancy and data replication features.

What's Next

We invite you to join our Lighthouse Customer Program as we refine and enhance Kastor. Participants in this program will receive a discounted startup package and have the opportunity to influence our product roadmap to meet their specific needs.

Discover more and join the waitlist at www.kastordata.io or schedule time to talk with us.
