Train your first model
This tutorial walks you through training your first classification model with Xorq. You’ll use the Iris dataset to build a flower species classifier and see how deferred execution works.
After completing this tutorial, you’ll know how to wrap scikit-learn pipelines with Xorq and make predictions using deferred execution.
Prerequisites
You need:
- Xorq installed (see Install Xorq)
- Basic familiarity with scikit-learn
How to follow along
Run the code examples sequentially. Each section builds on previous ones. You can use any of these:
- Python interactive shell (recommended): Open a terminal, run python, then copy and paste each code block in order
- Jupyter notebook: Create a new notebook and run each code block in a separate cell
- Python script: Copy code blocks into train_classifier.py and run after each section
Variables like iris, xorq_pipeline, and fitted_pipeline are created in earlier blocks and used in later ones.
When you call .fit() in Xorq, training doesn’t happen immediately. Xorq builds a computation graph. Training only runs when you call .execute(). This lets Xorq cache trained models and reuse them across runs.
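To make the idea concrete, here is a toy sketch in plain Python of "describe now, run later". The Deferred class and the load and fit_mean functions are hypothetical names invented for this illustration. This is not how Xorq is implemented, and the cache here lives only inside one Python process.

```python
# Toy illustration of deferred execution (not Xorq's implementation).
class Deferred:
    """Records a function and its inputs; runs nothing until execute()."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self._result = None
        self._done = False

    def execute(self):
        # Run upstream nodes first, then this node; reuse a cached result.
        if not self._done:
            args = [i.execute() if isinstance(i, Deferred) else i for i in self.inputs]
            self._result = self.func(*args)
            self._done = True
        return self._result


calls = []  # records when work actually happens


def load():
    calls.append("load")
    return [1.0, 2.0, 3.0]


def fit_mean(values):
    calls.append("fit")
    return sum(values) / len(values)


data = Deferred(load)
model = Deferred(fit_mean, data)  # like .fit(): only describes the work
print(calls)                      # [] because nothing has run yet
print(model.execute())            # 2.0, loading and fitting run now
model.execute()                   # second call reuses the cached result
print(calls)                      # ['load', 'fit']
```

Each function ran exactly once even though execute() was called twice, which is the behavior the caching remark above refers to in spirit.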
This tutorial trains and evaluates on the same dataset to focus on the mechanics of wrapping scikit-learn pipelines. In production, always split your data into separate training and test sets. See Split data for training to learn proper data splitting.
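As a generic illustration of holding out a test set, the sketch below shuffles row indices with a fixed seed and sets 20% aside. It uses only the standard library and is not Xorq's API; the linked guide covers the supported approach. The 150-row count comes from this tutorial, while the 20% ratio and the seed are arbitrary assumptions.

```python
import random

n_rows = 150        # size of the Iris dataset used in this tutorial
test_percent = 20   # assumed ratio, chosen only for illustration

indices = list(range(n_rows))
random.Random(42).shuffle(indices)  # fixed seed so the split is repeatable

n_test = n_rows * test_percent // 100
test_idx = sorted(indices[:n_test])
train_idx = sorted(indices[n_test:])

print(len(train_idx), len(test_idx))        # 120 30
print(set(train_idx).isdisjoint(test_idx))  # True: no row is in both sets
```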
Load data and define your target
In this section, you’ll load the Iris dataset and separate features from the target.
# train_classifier.py
import xorq.api as xo
iris = xo.examples.iris.fetch()
target = "species"
features = tuple(iris.drop(target).schema())
print(f"Loaded {iris.count().execute()} rows")
print(f"Target: {target}")
print(f"Features: {len(features)} columns")
- Load the Iris dataset. This returns an expression, not actual data.
- Define what you’re predicting (target) and what the model uses (features).
- Check what you loaded.
Expected output:
Loaded 150 rows
Target: species
Features: 4 columns
The iris variable is an expression. It’s a description of data, not the data itself. Nothing executes until you call .execute(). This is deferred execution in action.
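The features tuple above is every column name except the target. The sketch below shows the same selection in plain Python, with a dict standing in for the table schema. The column names and types are assumed for illustration, so check iris.schema() for the real ones.

```python
# Stand-in for a table schema: column name -> type (assumed names and types).
schema = {
    "sepal_length": "float64",
    "sepal_width": "float64",
    "petal_length": "float64",
    "petal_width": "float64",
    "species": "string",
}

target = "species"
# Iterating a mapping yields its keys, so this keeps every name but the target.
features = tuple(name for name in schema if name != target)

print(features)
print(f"Features: {len(features)} columns")  # Features: 4 columns
```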
Once you’ve loaded your data, move on to building the pipeline.
Build and wrap your pipeline
Create a scikit-learn pipeline and wrap it with Xorq’s deferred execution layer.
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline as SklearnPipeline
from xorq.expr.ml import Pipeline
sklearn_pipeline = SklearnPipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
print("Pipeline wrapped with Xorq!")
- Create a standard scikit-learn pipeline. It normalizes features, then classifies with k-nearest neighbors.
- Wrap it with Pipeline.from_instance(). This adds deferred execution.
K-nearest neighbors is distance-based. If one feature ranges from 0 to 100 and another from 0 to 1, the first feature dominates distance calculations. StandardScaler normalizes all features to the same scale.
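The effect is easy to see on a tiny made-up dataset. This standard-library sketch compares nearest neighbors before and after standardizing each column to zero mean and unit population standard deviation. The rows and the standardize helper are invented for illustration; in the tutorial itself, StandardScaler does this work inside the pipeline.

```python
import math
from statistics import mean, pstdev

# Tiny made-up dataset: column 0 spans 0-100, column 1 spans 0-1.
rows = [(0.0, 0.0), (50.0, 0.0), (52.0, 1.0), (60.0, 0.0), (100.0, 1.0)]
query, a, b = rows[1], rows[2], rows[3]


def standardize(rows):
    """Rescale each column to zero mean and unit population standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sigmas)) for r in rows]


scaled = standardize(rows)
sq, sa, sb = scaled[1], scaled[2], scaled[3]

raw_nearest = "a" if math.dist(query, a) < math.dist(query, b) else "b"
scaled_nearest = "a" if math.dist(sq, sa) < math.dist(sq, sb) else "b"

print(raw_nearest)     # a: units of the 0-100 column dominate the raw distance
print(scaled_nearest)  # b: after scaling, the gap in the 0-1 column counts too
```

Point a sits at the opposite end of the 0-1 column from the query, yet it looks nearest in raw units. Once both columns share a scale, b becomes the nearer neighbor.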
Train the model
When you call .fit(), you’re describing the training operation. Actual training doesn’t happen until you call .execute().
fitted_pipeline = xorq_pipeline.fit(
    iris,
    features=features,
    target=target
)
print("Training described (but hasn't run yet)!")
- Describe the training operation. Xorq builds a graph node, but doesn’t execute.
The key insight: You’ve told Xorq what to do, but it hasn’t done it yet. Training only happens when you call .execute().
Understanding this timing helps you optimize workflows. You can describe complex pipelines, then execute once.
Make predictions and see results
Make predictions and trigger execution.
def as_struct(expr, name=None):
    struct = xo.struct({c: expr[c] for c in expr.columns})
    if name:
        struct = struct.name(name)
    return struct
ORIGINAL_ROW = "original_row"
predictions_expr = (
    iris.mutate(as_struct(iris, name=ORIGINAL_ROW))
    .pipe(fitted_pipeline.predict)
    .drop(target)
    .unpack(ORIGINAL_ROW)
)
predictions = predictions_expr.execute()
print("\nFirst 10 predictions:")
print(predictions[["species", "predicted"]].head(10))
- Create a helper that packages data into a struct. This preserves original values.
- Build the prediction expression. Still deferred.
- Execute. Training and prediction happen now.
Expected output:
First 10 predictions:
species predicted
0 setosa setosa
1 setosa setosa
2 setosa setosa
3 setosa setosa
...
In this sample, the predictions match the actual species. When you called .execute(), Xorq trained the pipeline and made predictions in one execution.
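The pack, predict, drop, unpack sequence can be pictured with plain dicts. This sketch assumes the prediction step consumes the feature columns and returns the remaining columns plus a predicted column. That behavior is an assumption inferred from why the struct is needed, not something this tutorial states. The toy_predict function, its threshold rule, and the two sample rows are all hypothetical.

```python
# Plain-dict illustration of the "pack, predict, drop, unpack" pattern.
rows = [
    {"sepal_length": 5.1, "petal_length": 1.4, "species": "setosa"},
    {"sepal_length": 6.3, "petal_length": 4.9, "species": "versicolor"},
]
features = ("sepal_length", "petal_length")
target = "species"


def toy_predict(row):
    """Hypothetical predict step: consumes feature columns, adds 'predicted'."""
    label = "setosa" if row["petal_length"] < 2.5 else "versicolor"  # made-up rule
    kept = {k: v for k, v in row.items() if k not in features}
    return {**kept, "predicted": label}


results = []
for row in rows:
    packed = {**row, "original_row": dict(row)}  # like mutate(as_struct(...))
    out = toy_predict(packed)                    # like pipe(predict)
    out.pop(target)                              # like drop(target)
    original = out.pop("original_row")           # like unpack(ORIGINAL_ROW)
    results.append({**original, **out})

for r in results:
    print(r)
```

Without the packed copy, the feature values would be gone after the prediction step. Dropping the target first avoids having the same column twice once the struct is unpacked.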
Next, you’ll check how accurate your model is.
Check your accuracy
Check how well your model performed.
accuracy_expr = (
    predictions_expr
    .mutate(correct=xo._.species == xo._.predicted)
    .agg(
        total=xo._.species.count(),
        correct_count=xo._.correct.sum().cast("int64"),
    )
    .mutate(accuracy=xo._.correct_count / xo._.total)
)
result = accuracy_expr.execute()
accuracy = result["accuracy"][0]
correct = result["correct_count"][0]
total = result["total"][0]
print(f"\nAccuracy: {accuracy:.1%}")
print(f"Got {correct} out of {total} correct")
- Build an accuracy calculation. Create a boolean for correct predictions, count them, compute the ratio.
- Execute and display results.
Expected output:
Accuracy: 96.7%
Got 145 out of 150 correct
Why this pattern matters: Evaluation is also deferred. You describe metrics, then execute once, so training, prediction, and scoring run from a single expression.
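The metric itself is ordinary arithmetic. As a cross-check, this plain-Python sketch repeats the same three steps (a boolean per row, a count, a ratio) on a handful of made-up label pairs, then confirms that the figures reported above agree with each other.

```python
# Plain-Python version of the same metric on a few made-up label pairs.
actual = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

correct_flags = [a == p for a, p in zip(actual, predicted)]  # boolean per row
correct_count = sum(correct_flags)                           # True counts as 1
total = len(actual)
accuracy = correct_count / total

print(f"Accuracy: {accuracy:.1%}")                    # Accuracy: 80.0%
print(f"Got {correct_count} out of {total} correct")  # Got 4 out of 5 correct

# The tutorial's reported counts and percentage are consistent:
print(f"{145 / 150:.1%}")                             # 96.7%
```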
Complete example
Here’s the complete workflow in one file:
# train_classifier.py
import xorq.api as xo
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline as SklearnPipeline
from xorq.expr.ml import Pipeline
# Load data
iris = xo.examples.iris.fetch()
target = "species"
features = tuple(iris.drop(target).schema())
# Build and wrap pipeline
sklearn_pipeline = SklearnPipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
# Train (deferred)
fitted_pipeline = xorq_pipeline.fit(iris, features=features, target=target)
# Predict (deferred)
def as_struct(expr, name=None):
    struct = xo.struct({c: expr[c] for c in expr.columns})
    if name:
        struct = struct.name(name)
    return struct
ORIGINAL_ROW = "original_row"
predictions_expr = (
    iris.mutate(as_struct(iris, name=ORIGINAL_ROW))
    .pipe(fitted_pipeline.predict)
    .drop(target)
    .unpack(ORIGINAL_ROW)
)
# Evaluate (deferred)
accuracy_expr = (
    predictions_expr
    .mutate(correct=xo._.species == xo._.predicted)
    .agg(
        total=xo._.species.count(),
        correct_count=xo._.correct.sum().cast("int64"),
    )
    .mutate(accuracy=xo._.correct_count / xo._.total)
)
# Execute and show results
result = accuracy_expr.execute()
accuracy = result["accuracy"][0]
correct = result["correct_count"][0]
total = result["total"][0]
print(f"Accuracy: {accuracy:.1%}")
print(f"Got {correct} out of {total} correct")
# Show sample predictions
predictions = predictions_expr.execute()
print("\nSample predictions:")
print(predictions[["species", "predicted"]].head(10))
Run it:
python train_classifier.py
You built multiple deferred operations (training, predictions, accuracy), then executed them. Xorq processed the entire graph.
What you learned
You built a complete ML workflow with deferred execution. Here’s what you accomplished:
- Loaded data as expressions (deferred)
- Wrapped a scikit-learn pipeline with Xorq
- Trained a model without immediate execution
- Made predictions using the struct pattern
- Evaluated accuracy with deferred metrics
The key insight? Deferred execution lets you describe complex workflows, then Xorq handles optimization and caching automatically.
Next steps
Now that you’ve trained your first model, continue learning:
- Compare model performance — Learn how to evaluate and compare multiple models
- Deploy your first model — Deploy your trained model to production