Prepping for Snowflake Polaris Part 1: Introducing Iceberg

Preston Blackburn

Before we dive into the Polaris catalog, it’s important to understand the Iceberg table format. Organizations are focusing on this format not to combat global warming or melting icebergs, but to address rising data warehouse costs. 

Iceberg is an open-source table format designed for handling large analytic datasets. It facilitates features like schema evolution, partitioning, and versioning, making data management more efficient and reliable.  

Before we go any deeper, let’s take a brief look at how Polaris and other data catalogs work with Iceberg’s table format. 

Understanding Iceberg Catalogs 

Iceberg is a table format. But simply having Iceberg-formatted data doesn’t get you very far. In fact, data in Iceberg format is more difficult to access than plain CSV or Parquet files.  

Catalogs, including the new Polaris Catalog from Snowflake, help centralize and organize Iceberg-formatted tables. With a catalog, we can access Iceberg data more easily. What’s more, it lets us take our first steps toward an open-source, provider-agnostic data warehouse that supports operations such as creating, dropping, and renaming tables.   

Since the catalog features that Polaris will provide overlap with core data warehouse functionality, you may be wondering why Snowflake would so graciously give us Iceberg table organization and tracking without asking for anything in return. 

Well, there is one key piece of the data warehouse missing: the compute engine.  

The compute engine interfaces with the data catalog and provides the resources needed to actually work with your data. Snowflake already provides a compute engine for Iceberg, and the compute engine is where the bulk of data warehouse spend goes. Simply put, Snowflake needs Iceberg to stay competitive. 

Furthermore, there is a growing market of Iceberg query engines and catalogs, such as:  

  • Nessie 
  • Dremio 
  • Tabular 
  • Trino 
  • And others 

Now that we’ve discussed the what, let’s look at how all of this works together. 

Part 1: Using Iceberg with PyIceberg  

To get hands-on with the most basic implementation of an Iceberg “stack,” we’ll use PyIceberg. PyIceberg, as you might have already guessed from its name, can use Python as the compute engine, SQLite as the catalog, and local storage for the Iceberg tables.  

[Diagram: a local Iceberg stack, with Python as the compute engine, SQLite as the catalog, and local file storage]

For this basic example we will: 

  1. Load some source data (parquet format),  
  2. Create the catalog with PyIceberg,    
  3. Load the data into the catalog using PyIceberg, then  
  4. Query the data using PyIceberg and your local machine as the compute engine.  

We can do all of this in under 50 lines of Python using PyIceberg. Below is a walkthrough of the code. 

The only two dependencies for this are PyArrow and PyIceberg: 

pip install pyarrow 

pip install pyiceberg[pyarrow] 

Then we can load our example parquet data into a PyArrow table.  

import pyarrow.parquet as pq 
from pyiceberg.catalog.sql import SqlCatalog 

# Create a PyArrow table from a parquet file 
df = pq.read_table("example.parquet") 

We’ll also create the SQLite catalog. 
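
A minimal sketch of that step with PyIceberg’s SqlCatalog (the catalog name “default” and the local paths are illustrative choices that line up with the folders described below): 

import os 

# Create a local "warehouse" folder to hold the catalog database and the table files 
warehouse_path = os.path.abspath("warehouse") 
os.makedirs(warehouse_path, exist_ok=True) 

# A SQLite-backed Iceberg catalog; tables will be stored under the warehouse folder 
catalog = SqlCatalog( 
    "default", 
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", 
    warehouse=f"file://{warehouse_path}", 
) 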

This will create a SQLite database “pyiceberg_catalog.db” in the “warehouse” folder. 

Now that we have our data and the catalog, we can populate it with the PyIceberg API. 

# Create a namespace and table 
catalog.create_namespace("default") 
table = catalog.create_table( 
    "default.customer", 
    schema=df.schema, 
) 

# Load the Iceberg table with the PyArrow data 
table.append(df) 
 

This creates a “default.db/customer” directory with our table data and table metadata in the warehouse directory.  
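
Concretely, the warehouse folder ends up looking roughly like this (the exact file names are generated identifiers and will differ from run to run): 

warehouse/ 
  pyiceberg_catalog.db 
  default.db/ 
    customer/ 
      data/ 
        <generated-name>.parquet         (table data) 
      metadata/ 
        <generated-name>.metadata.json   (table metadata) 
        snap-<generated-name>.avro       (manifest list) 
        <generated-name>.avro            (manifest) 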

If we drop and re-create the table and namespace, we can see that additional versions of the data and metadata are created in the /data and /metadata folders. The dropped tables will not show up in normal queries but can be recovered or permanently deleted later. 

# Drop the table if it exists 
namespace_tables = catalog.list_tables("default") 
if ("default", "customer") in namespace_tables: 
    catalog.drop_table("default.customer") 
    catalog.drop_namespace("default") 

# Create a namespace and table 
catalog.create_namespace("default") 
table = catalog.create_table( 
    "default.customer", 
    schema=df.schema, 
) 

# Load the Iceberg table with the PyArrow data 
table.append(df) 

Finally, PyIceberg offers some simple query features, so we can query the Iceberg tables efficiently. 

Great! We have made interacting with local data more complicated. Really, though, we want to run this as our data warehouse, meaning in the cloud on highly available infrastructure. 

# ----- query the Iceberg table ----- 
# Get the number of records loaded 
total_records = len(table.scan().to_arrow()) 
print(total_records) 

# Select 10 records from the table 
table_sample = table.scan( 
    limit=10, 
).to_arrow() 
print(table_sample) 
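
Scans can also push down row filters and column selections. Here’s a small sketch (the column names are hypothetical and would need to match your own parquet schema): 

from pyiceberg.expressions import EqualTo 

# Select two columns for rows matching a filter 
# (hypothetical column names; substitute columns from your own schema) 
filtered = table.scan( 
    row_filter=EqualTo("region", "midwest"), 
    selected_fields=("customer_id", "region"), 
    limit=10, 
).to_arrow() 
print(filtered) 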

Now that we have the basics handled, we’ll want somewhere to store our Iceberg tables, an Iceberg catalog, and a query engine. The catalog will be Polaris, but we still need cloud file storage and a compute engine. 

[Diagram: an Iceberg stack with the Polaris catalog, cloud file storage, and a compute engine]

In part 2 of our Prepping for Polaris series, we’ll look at how to get value out of using Iceberg with a Snowflake-based architecture, where Snowflake is the compute engine.  

Data stacks based on Apache Iceberg offer greater infrastructure flexibility than traditional data warehouses, but they may require more specialized engineering to implement. 

If you have any questions about whether an Apache Iceberg-based stack is right for your organization, reach out to 7Rivers. Connect with our experts today! 

contact@7riversinc.com 
