Full AutoML Pipeline Guide

A complete walkthrough: from a raw CSV to a trained, saved model — using only Vllama.

Overview

raw_data.csv
     ↓
vllama data     → cleaned data + visualizations
     ↓
vllama train    → 9 trained models + leaderboard + best_model.pkl
     ↓
results/report.html

Example Dataset

We'll use the classic Titanic survival dataset to demonstrate.

# Download the dataset
curl -O https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
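
Before preprocessing, it can help to see where the raw file has gaps. The following is a minimal sketch that assumes pandas is installed; `missing_summary` is a hypothetical helper name and not part of Vllama.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have gaps."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

For the downloaded file, `missing_summary(pd.read_csv("titanic.csv"))` lists the columns that will need imputation or removal in Step 1.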

Step 1: Preprocess the Data

vllama data --path titanic.csv --target Survived --test_size 0.2

What Vllama does automatically:

Detects that Survived is a binary classification target
Drops irrelevant columns (Name, Ticket, Cabin)
Fills missing Age values using KNN imputation
Encodes Sex and Embarked as numeric features
Scales numeric features
Splits into 80% train / 20% test

After it finishes, you'll see something like:

✓ Loaded dataset: 891 rows × 12 columns
✓ Removed duplicates: 0
✓ Handled missing values: Age (177), Cabin (687), Embarked (2)
✓ Encoded categoricals: Sex, Embarked
✓ Scaled features: Age, Fare, SibSp, Parch
✓ Feature selection: removed 2 low-variance features
✓ Saved train/test split to: output_folder_20240101_120000/
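
The steps in this log can be approximated with pandas and scikit-learn. The sketch below is an assumption about what such a pipeline might look like, not Vllama's implementation: the `preprocess` function, the one-hot encoding, the neighbor count, and the choice to scale before imputing (so that KNN distances are not dominated by large-valued columns) are all illustrative. It expects free-text columns such as Name, Ticket, and Cabin to have been dropped beforehand.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame, target: str, test_size: float = 0.2, seed: int = 42):
    """Rough stand-in for the preprocessing steps; not Vllama's actual code."""
    df = df.drop_duplicates()
    y = df[target]
    X = df.drop(columns=[target])

    # Encode categoricals as numeric columns (one-hot here; Vllama's encoding may differ)
    X = pd.get_dummies(X, drop_first=True, dtype=float)

    # Split first so the scaler and imputer are fitted on training rows only
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # StandardScaler ignores NaN when fitting and passes it through,
    # so the KNN imputer sees features on a comparable scale
    steps = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5))
    X_train = pd.DataFrame(
        steps.fit_transform(X_train), columns=X.columns, index=X_train.index
    )
    X_test = pd.DataFrame(
        steps.transform(X_test), columns=X.columns, index=X_test.index
    )
    return X_train, X_test, y_train, y_test
```

Fitting the scaler and imputer after the split keeps test rows from leaking into the training statistics.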

Step 2: Train All Models

# Use the folder name printed in Step 1
vllama train --path ./output_folder_20240101_120000 --target Survived

Vllama trains all 9 classification models (each with hyperparameter tuning via RandomizedSearchCV):

Logistic Regression
Random Forest
XGBoost
LightGBM
CatBoost
SVM
KNN
MLP
Naive Bayes

Training time depends on your machine and on the size of the dataset (see Tips below for rough figures).
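
To make the tuning step concrete, here is a minimal sketch of a RandomizedSearchCV loop that produces a small leaderboard. It covers only two of the nine model families, and the search spaces, the `tune_all` name, and the F1 scoring choice are assumptions for illustration rather than Vllama's actual settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Two model families with small, illustrative search spaces (assumed, not Vllama's)
CANDIDATES = {
    "Logistic Regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100, 200], "max_depth": [None, 4, 8]},
    ),
}


def tune_all(X, y, n_iter=4, cv=3, scoring="f1"):
    """Tune each candidate and return (leaderboard DataFrame, fitted best models)."""
    rows, fitted = [], {}
    for name, (estimator, space) in CANDIDATES.items():
        search = RandomizedSearchCV(
            estimator, space, n_iter=n_iter, cv=cv, scoring=scoring, random_state=0
        )
        search.fit(X, y)
        fitted[name] = search.best_estimator_
        rows.append(
            {"model": name, "cv_score": search.best_score_, "params": search.best_params_}
        )
    leaderboard = (
        pd.DataFrame(rows).sort_values("cv_score", ascending=False).reset_index(drop=True)
    )
    return leaderboard, fitted
```

Each search samples `n_iter` parameter combinations and cross-validates them, which is why total training time grows with both dataset size and the number of models.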

Step 3: Review Results

Open the generated report:

# macOS
open results/report.html

# Linux
xdg-open results/report.html

# Windows
start results\report.html

The report contains:

Leaderboard — all models ranked by accuracy / F1
Per-model metrics — precision, recall, AUC, confusion matrices
Best model — highlighted at the top
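
For reference, the metrics named in the report can be computed with scikit-learn as sketched below. This only illustrates what each number means for a binary target; `classification_report_row` is a hypothetical helper, and the report itself may compute or average these differently.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def classification_report_row(y_true, y_pred, y_score):
    """Compute binary-classification metrics for one model.

    y_score is the predicted probability of the positive class (needed for AUC).
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


# Tiny hand-made example: 3 of 4 predictions are correct
metrics = classification_report_row(
    y_true=[0, 0, 1, 1],
    y_pred=[0, 1, 1, 1],
    y_score=[0.1, 0.6, 0.7, 0.9],
)
print(metrics)
```

In this toy case accuracy is 0.75, precision is 2/3, and recall is 1.0, which shows why a leaderboard ranked on accuracy alone can hide differences between models.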

You can also check the leaderboard in the terminal:

cat results/model_summary.csv
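
The same file can be loaded in Python for sorting and filtering. The column layout of model_summary.csv is not documented here, so the sketch below takes the metric column as a parameter and reports the available columns if the name is wrong. `top_models` and the stand-in data are made up for illustration.

```python
import io

import pandas as pd


def top_models(csv_source, metric, n=3):
    """Return the n best rows of a leaderboard CSV, sorted by the given metric column."""
    board = pd.read_csv(csv_source)
    if metric not in board.columns:
        raise KeyError(f"{metric!r} not found; available columns: {list(board.columns)}")
    return board.sort_values(metric, ascending=False).head(n)


# Stand-in data with made-up column names and scores, only to show the call
sample = io.StringIO("model,f1\nA,0.71\nB,0.78\nC,0.74\n")
print(top_models(sample, metric="f1", n=2))
```

With the real file, pass the path `results/model_summary.csv` and whichever metric column its header actually contains.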

Step 4: Use the Best Model

import joblib
import pandas as pd

# Load the best model
model = joblib.load("results/best_model.pkl")

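# Predict on new data. Unless the saved model bundles its own preprocessing,
# the columns must match the preprocessed training features from Step 1
# (same names, same encoding, same scaling).
new_data = pd.DataFrame({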
    "Pclass": [3], "Sex": [1], "Age": [22],
    "SibSp": [1], "Parch": [0], "Fare": [7.25]
})
prediction = model.predict(new_data)
print("Survived:", prediction[0])

Tips

On a small dataset (< 1000 rows)? Training finishes in under a minute. Larger datasets (50k+ rows) may take 10–20 minutes due to hyperparameter search.

Is the best model's score poor? Check the visualizations in output_folder_*/visualizations/: the correlation matrix and mutual information plots can show whether the features carry any signal about the target.

Want to reproduce the preprocessing later? transformation_metadata.json in the output folder stores all encoders and scalers. You can load them to preprocess new inference data the same way.
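
The schema of transformation_metadata.json is not described in this guide, so a reasonable first step is to inspect it before writing any loading code. The sketch below uses only the standard library and makes no assumption about the keys. `describe_metadata` is a hypothetical helper name.

```python
import json
from pathlib import Path


def describe_metadata(path):
    """Load a JSON metadata file and map each top-level key to the type of its value."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}
```

Point it at the transformation_metadata.json inside your output folder to see which sections exist before deciding how to apply them to new data.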