Pandas is a powerful open-source Python library widely used for data manipulation, analysis, and cleaning. It provides intuitive data structures like DataFrames and Series, enabling users to handle structured data efficiently with operations like filtering, grouping, and reshaping. However, as datasets grow in size and complexity, pandas runs into limitations because of its in-memory constraint and single-threaded nature. Snowflake, a robust and scalable cloud-native data warehousing platform, complements pandas by addressing these scalability challenges. Together, pandas and Snowflake provide a powerful toolset for professionals to streamline data workflows, enabling efficient handling and analysis of large and complex datasets.
This blog explores running pandas on Snowflake using Modin, which implements the pandas API and distributes operations across multiple cores or a distributed computing framework, making it well suited to large-scale data processing.
Outline
- Setting up Environment
- Connecting pandas to Snowflake
- Practical Use Cases
- When to Use pandas on Snowflake
- Wrap Up
- Resources
Setting Up Environment
When working with pandas and Snowflake, setting up the environment correctly can be cumbersome. While the Snowflake documentation provides quickstart guides, this section streamlines the setup process as much as possible to smooth the experience, especially when preparing for large-scale or iterative workflows.
Key Components:
- Python Environment: A clean Python environment is highly recommended to avoid conflicts. Use tools like venv, conda, or docker. Here is an example of how to set up venv: Python venv: How To Create, Activate, Deactivate, And Delete
- Required Libraries: To set up the environment for integrating Snowflake and pandas (via Modin), the following library is necessary:
  - snowflake-snowpark-python[modin]: This also installs the pandas and Modin dependencies used with Snowflake.
- Optional: python-dotenv. This library is useful for securely managing environment variables such as Snowflake credentials in a .env file (see the sample .env file after this list).
- Optional: Jupyter Notebooks. These provide an interactive, visual environment for developing, testing, and documenting your Snowflake-pandas workflows. More information can be found here: Jupyter Notebooks in VS Code
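If you opt for python-dotenv, a minimal .env file might look like the sketch below; the variable names are illustrative and only need to match whatever keys your connection code reads.
# .env - keep this file out of version control
SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_ROLE=your_role
SNOWFLAKE_WAREHOUSE=your_warehouse
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_SCHEMA=your_schema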
Installation Guide
Step 1. Create a Virtual Environment (recommended for isolating dependencies):
python -m venv Snowflake_env
source Snowflake_env/bin/activate # On Windows: Snowflake_env\Scripts\activate
Step 2. Install Required Libraries: Use pip to install the required packages (the quotes keep shells such as zsh from interpreting the brackets in the extras specifier):
pip install "snowflake-snowpark-python[modin]"
pip install python-dotenv
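As a quick sanity check (assuming the virtual environment created above is active), you can verify that the packages import cleanly before going any further:
# Confirm that Modin and Snowpark are importable in the active environment
python -c "import modin.pandas; import snowflake.snowpark; print('Snowpark pandas environment is ready')"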
Setting up this environment may seem straightforward, but the choices you make here can ripple through your workflow. For instance, if you’re working with small datasets, pandas is often sufficient. For workflows requiring high scalability or frequent updates, investing time in Modin setup or even Snowflake-native tools like Snowpark might save you headaches down the line.
While many developers overlook notebooks in favor of scripts, their ability to provide an iterative, interactive, and visual workflow makes them indispensable for data exploration. However, as with all tools, you must consider their limitations; production-grade pipelines often require more robust solutions.
Connecting pandas to Snowflake
Snowflake provides a robust platform with the scalability and speed to query large datasets, while pandas is unmatched in flexibility and ease of use for manipulating data once it is in memory. Coupling pandas with Snowflake allows you to:
- Query Snowflake data into pandas DataFrames for analysis and transformation
- Process large datasets with Modin for parallelized operations
- Write transformed data back to Snowflake for further use or sharing
Here’s a breakdown of the connection process between pandas (or Modin) and Snowflake, highlighting the nuances of using Snowflake with pandas and the Snowpark API.
Step 1: Set up connection parameters
connection_parameters = {
"account": "your_account",
"user": "your_username",
"password": "your_password",
"role": "your_role",
"warehouse": "your_warehouse",
"database": "your_database",
"schema": "your_schema"
}
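If you stored your credentials in a .env file as suggested earlier, a small sketch like the following builds the same dictionary without hard-coding secrets; the environment variable names are illustrative and should match your .env keys.
# Build the connection parameters from environment variables loaded via python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

connection_parameters = {
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "user": os.getenv("SNOWFLAKE_USER"),
    "password": os.getenv("SNOWFLAKE_PASSWORD"),
    "role": os.getenv("SNOWFLAKE_ROLE"),
    "warehouse": os.getenv("SNOWFLAKE_WAREHOUSE"),
    "database": os.getenv("SNOWFLAKE_DATABASE"),
    "schema": os.getenv("SNOWFLAKE_SCHEMA"),
}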
Step 2: Create a Snowpark session
# Import the Snowpark pandas plugin for Modin.
import snowflake.snowpark.modin.plugin
from snowflake.snowpark.session import Session
# Create a Snowpark session connection.
session = Session.builder.configs(connection_parameters).create()
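Before running any workloads, it can be worth confirming that the session points at the context you expect; this optional check simply echoes the active warehouse, database, and schema:
# Optional: verify the session context
print(session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()").collect())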
Step 3: Query data into pandas Modin dataframe
# QUERY DATA INTO PANDAS MODIN DATAFRAME
import modin.pandas as pd
query = "SELECT * FROM YOUR_TABLE"  # YOUR_TABLE is a placeholder for your table name
# Query Snowflake and load the result into a pandas DataFrame
df = session.sql(query).to_pandas()
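Note that session.sql(...).to_pandas() materializes the query result as a regular in-memory pandas DataFrame. If you want the data to stay in Snowflake and work with a Snowpark pandas (Modin) DataFrame instead, the Snowpark pandas API provides pd.read_snowflake; a minimal sketch, with the table name as a placeholder:
# Read a Snowflake table directly into a Snowpark pandas (Modin) DataFrame;
# operations on it are pushed down to Snowflake rather than run locally.
df = pd.read_snowflake("YOUR_TABLE")  # YOUR_TABLE is a placeholder for your table name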
Step 4: Write data back to Snowflake
# WRITING DATA BACK TO SNOWFLAKE
session.write_pandas(df, "YOUR_TABLE", auto_create_table=False, overwrite=True)  # YOUR_TABLE is a placeholder for your target table
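If you are holding a Snowpark pandas (Modin) DataFrame rather than a native pandas one, the Snowpark pandas API also exposes a to_snowflake method on the DataFrame itself; a rough sketch, with the table name as a placeholder and arguments worth double-checking against the current documentation:
# Save a Snowpark pandas (Modin) DataFrame as a Snowflake table
df.to_snowflake("YOUR_TABLE", if_exists="replace", index=False)  # YOUR_TABLE is a placeholder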
Practical Use Cases
When working with Snowflake data in pandas (or Modin), common scenarios include data cleaning, transformation, and aggregation. This basic example, adapted from the Snowflake documentation, shows how Snowpark and Modin can be leveraged to handle a messy dataset and extract insights.
1. Set up example data:
# CREATE A SNOWPARK PANDAS DATAFRAME WITH SAMPLE DATA.
import modin.pandas as pd
df = pd.DataFrame(
[
[1, 'Big Bear', 8],
[2, 'Big Bear', 10],
[3, 'Big Bear', None],
[1, 'Tahoe', 3],
[2, 'Tahoe', None],
[3, 'Tahoe', 13],
[1, 'Whistler', None],
['Friday', 'Whistler', 40],
[3, 'Whistler', 25]
],
columns=["DAY", "LOCATION", "SNOWFALL"]
)
2. Clean and correct data:
# Inspect the DataFrame with an error in the "DAY" column
print(df)
# In-place point update to fix the data error
df.loc[df["DAY"] == "Friday", "DAY"] = 2
# Inspect the DataFrame after the update
print(df)
3. Handle Missing Data:
# Drop rows with null values
df_cleaned = df.dropna()
print(df_cleaned)
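Dropping rows is only one option; depending on the analysis, you may prefer to fill the gaps instead. A small alternative sketch using standard pandas semantics:
# Alternatively, fill missing snowfall values with 0 instead of dropping the rows
df_filled = df.fillna({"SNOWFALL": 0})
print(df_filled)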
4. Aggregate data for insights:
# Compute the average daily snowfall across locations
df1 = df.groupby("LOCATION", as_index=False)["SNOWFALL"].mean()
print(df1)
The example above is performed directly within Snowflake, utilizing the snowflake-snowpark-python[modin] library for processing. Transformations such as updates, dropped rows, and aggregations are executed on Snowflake's infrastructure. This minimizes data movement and allows the full power of Snowflake's compute resources to handle the operations. Since data is not pulled into local memory, you also retain the security and compliance benefits of keeping data within Snowflake.
When to Use pandas on Snowflake
For In-Database Transformations: Snowpark with Modin allows you to execute pandas-like transformations directly within Snowflake. This comes in handy when you need the flexibility of pandas but still want to leverage Snowflake's compute power. It also helps when processing large datasets that would otherwise overwhelm local memory in traditional pandas workflows.
For Exploratory Workflows: Snowpark with Modin provides the same, or a very similar, experience to pandas, making it an easy choice for prototyping data transformations and analysis if you are already familiar with the library.
For Large-Scale Data Manipulation: Since all operations are performed in Snowflake, you can handle datasets that exceed local machine capacity. Similarly, aggregations, groupings, and filtering are distributed across Snowflake’s infrastructure, ensuring scalability.
Wrap Up
The integration of pandas and Snowflake, leveraging Snowflake's compute and pandas' flexibility, opens up a new avenue of possibilities for data professionals. This coupling allows you to execute complex data transformations, clean datasets, and derive insights without sacrificing performance or ease of use. With proper planning, this setup offers a seamless and intuitive way to manage and analyze data. Partner with 7Rivers to unlock the full power of tools like Snowflake and pandas to make smarter business decisions and drive innovation.