This tutorial demonstrates Xorq’s caching with runnable code examples. You’ll execute the same query twice and observe the performance difference between cached and uncached execution.
You’ll learn how to add .cache() to expressions and understand when Xorq reuses cached results.
Why caching matters
Running the same query twice shouldn’t mean doing the work twice. Xorq caches expression results, so a repeated query reads its result from the cache instead of recomputing it.
Caching provides the biggest benefit for expensive operations:
Loading large datasets from remote databases
Training machine learning models
Calling external APIs
Running complex aggregations
Tip: Smart caching
Xorq uses content-addressed hashing to determine if an expression matches cached results. Xorq generates a hash from your expression structure. Identical expressions produce identical hashes, triggering a cache hit.
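To make the idea of content-addressed hashing concrete, here is a minimal sketch using only the Python standard library. It is not Xorq's implementation: the dictionary-based "expression" and the `expression_key` helper are hypothetical stand-ins chosen for illustration. The point is only that a key derived from an expression's structure is the same whenever the structure is the same, and different as soon as any part of it changes.

```python
import hashlib
import json


def expression_key(expr: dict) -> str:
    """Hash a toy expression description into a stable cache key.

    Hypothetical helper for illustration; Xorq's real hashing differs.
    """
    # sort_keys makes the serialization independent of dict insertion order,
    # so structurally identical descriptions serialize identically.
    canonical = json.dumps(expr, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two descriptions of the same filter, written in a different key order.
expr_a = {"source": "iris", "op": "filter", "column": "sepal_length", "gt": 6}
expr_b = {"op": "filter", "gt": 6, "column": "sepal_length", "source": "iris"}

# The same filter with a different threshold.
expr_c = {"source": "iris", "op": "filter", "column": "sepal_length", "gt": 5}

print("a and b share a key:", expression_key(expr_a) == expression_key(expr_b))
print("a and c share a key:", expression_key(expr_a) == expression_key(expr_c))
```

In this sketch, `expr_a` and `expr_b` would map to the same cache entry, while `expr_c` would need its own.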
How to follow along
Each code example includes complete setup, so you can run any section independently. For best learning, run them in sequence.
Run the code using:
Python interactive shell: Open a terminal, run python, copy and paste each code block
Jupyter notebook: Run each code block in a separate cell
Python script: Copy code blocks into a .py file and run with python script.py
Set up caching
You’ll start by connecting to a backend and setting up a cache storage location.
    import xorq.api as xo
    from xorq.caching import SourceCache

    # 1. Connect to the embedded backend where cached data is stored.
    con = xo.connect()

    # 2. Create a SourceCache object that manages the cache.
    storage = SourceCache.from_kwargs(source=con)

    print(f"Connected to: {con}")
    print("Cache storage ready!")
SourceCache stores cached results as tables in your backend using create_table(). This works with any backend that supports table creation, such as the embedded backend, DuckDB, pandas, PostgreSQL, and Snowflake.
Cache your first expression
Now you’ll build an expression and add caching to it.
The .cache() method tells Xorq to store results from this expression. On the first run, Xorq computes and caches the results. On subsequent runs, it retrieves them directly from cache.
Observe cache miss
You’ll execute the expression for the first time. This will be a cache miss because no cached results exist yet.
The second execution is faster because Xorq retrieves cached results instead of recomputing the filter.
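One way to observe the difference between a miss and a hit is to time each call. The sketch below uses only the standard library; `timed` and `slow_square` are hypothetical names, and `slow_square` is a stand-in for an expensive query with a dictionary playing the role of the cache. It does not call Xorq.

```python
import time


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


_store = {}  # toy cache: maps an input to its computed result


def slow_square(n=12):
    """Hypothetical stand-in for an expensive query."""
    if n not in _store:
        time.sleep(0.2)  # simulate the expensive work done on a cache miss
        _store[n] = n * n
    return _store[n]


first, t_first = timed(slow_square)    # cache miss: pays the simulated cost
second, t_second = timed(slow_square)  # cache hit: skips the simulated cost

print(f"first run:  {t_first:.3f}s")
print(f"second run: {t_second:.3f}s")
```

The same helper accepts any zero-argument callable, so with the `cached_expr` built in the complete example at the end of this tutorial you could call `timed(cached_expr.execute)` twice and compare the two timings.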
Note
Xorq computes a hash from your expression’s structure and data sources. If the expression is identical, then the hash matches, and you get a cache hit.
Understand cache invalidation
What happens if you change the expression? You’ll modify the filter and see cache invalidation in action.
The first execution is a cache miss. The second and third executions are cache hits and complete faster. This shows how caching eliminates redundant computation.
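The hit and miss pattern around a modified expression can be sketched with a toy cache. This is an illustration only, not Xorq's mechanism: `run`, `cache`, and `log` are hypothetical names, the key is a hash of a string describing the filter, and the data source is left out of the key for brevity.

```python
import hashlib

cache = {}  # toy cache: key -> filtered rows
log = []    # records "hit" or "miss" for each call


def run(threshold, rows):
    """Filter rows above threshold, reusing a cached result when the key matches."""
    # The key depends only on the expression text, so changing the
    # threshold produces a different key and therefore a miss.
    key = hashlib.sha256(f"filter:sepal_length>{threshold}".encode()).hexdigest()
    if key in cache:
        log.append("hit")
    else:
        log.append("miss")
        cache[key] = [r for r in rows if r > threshold]
    return cache[key]


rows = [4.9, 5.8, 6.3, 7.1]

run(6, rows)  # original expression, nothing cached yet
run(6, rows)  # identical expression
run(5, rows)  # modified filter: new key
run(5, rows)  # modified filter again
run(5, rows)  # and once more

print(log)
```

Changing the threshold does not delete the entry stored for the original filter. The modified expression simply looks up a different key, finds nothing, and computes a fresh result.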
Warning: Cache storage
SourceCache keeps cached data in your backend as tables. Make sure you have enough storage space for cached results, especially with large datasets.
Chain cached expressions
You can cache multiple steps in a pipeline. Each cached expression can reuse results from previous runs.
    import xorq.api as xo
    from xorq.caching import SourceCache

    # Setup (from previous steps)
    con = xo.connect()
    storage = SourceCache.from_kwargs(source=con)
    iris = xo.examples.iris.fetch(backend=con)

    # 1. Cache the filtered dataset.
    step1 = iris.filter(xo._.sepal_length > 5).cache(cache=storage)

    # 2. Build on the cached result and cache the aggregation too.
    step2 = step1.group_by("species").agg(
        avg_width=xo._.sepal_width.mean()
    ).cache(cache=storage)

    # 3. First execution caches both steps.
    print("First execution of step2...")
    result_a = step2.execute()

    # 4. Second execution hits cache for both steps.
    print("\nSecond execution of step2...")
    result_b = step2.execute()

    print("\nBoth steps now cached!")
    print(result_a)
When you cache multiple steps, each cached step returns its stored result on re-execution. Xorq doesn’t recompute steps that are already cached.
Use persistent ParquetCache
So far, you’ve used SourceCache, which stores cached data in your backend. But what if you want a cache that persists across Python sessions? That’s where ParquetCache comes in.
Depending on your backend configuration, data cached with SourceCache may not persist when you restart Python.
ParquetCache writes cached results as .parquet files to disk. These files persist across Python sessions.
You’ll see actual .parquet files in the xorq_cache directory. These files contain your cached results and persist even after you close Python.
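To check what is on disk, you can list the files yourself. The sketch below uses only `pathlib` and assumes the cache directory is named `xorq_cache` and sits under your current working directory; adjust the path if your cache lives elsewhere. `list_parquet_files` is a hypothetical helper, not part of Xorq.

```python
from pathlib import Path


def list_parquet_files(cache_dir="xorq_cache"):
    """Return (path, size_in_bytes) pairs for .parquet files under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        # Nothing has been cached here yet (or the path is wrong).
        return []
    # rglob searches subdirectories too, in case files are nested.
    return sorted((p, p.stat().st_size) for p in root.rglob("*.parquet"))


for path, size in list_parquet_files():
    print(f"{path}  {size} bytes")
```

If the loop prints nothing, either no cached expression has been executed yet or the directory is somewhere other than the assumed location.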
Tip: When to use each cache type
SourceCache: Fast, session-scoped. Use for temporary caching during development.
ParquetCache: Persistent across sessions. Use when you want cache to survive Python restarts (local development, iterative analysis).
Now when you restart Python and rerun the same expression with ParquetCache pointing to the same directory, Xorq finds the cached .parquet files and returns results instantly without recomputation.
Complete example
Here’s a full caching workflow in one place:
    import xorq.api as xo
    from xorq.caching import SourceCache

    # Set up connection and load data
    con = xo.connect()
    storage = SourceCache.from_kwargs(source=con)
    iris = xo.examples.iris.fetch(backend=con)

    # Build cached expression
    cached_expr = (
        iris
        .filter(xo._.sepal_length > 6)
        .cache(cache=storage)
    )

    # First run: cache miss
    result1 = cached_expr.execute()
    print("First run complete (cached)")

    # Second run: cache hit
    result2 = cached_expr.execute()
    print("Second run complete (from cache)")
Next steps
Now that you understand how caching works, continue with these tutorials: