`vllama data`: Data Preprocessing

Automatically cleans, encodes, scales, and splits a tabular dataset, then generates visualizations of the data.

Syntax

vllama data --path <dataset> [--target <column>] [--test_size <float>] [--output_dir <dir>]

Parameters

Parameter	Short	Default	Description
`--path`	(none)	(required)	Path to the dataset file
`--target`	(none)	auto-detect	Name of the target column
`--test_size`	`-t`	`0.2`	Fraction of the data to hold out for testing
`--output_dir`	`-o`	current directory	Directory in which to save output files

Supported input formats: CSV, Excel (.xlsx), JSON, Parquet.

What It Does

When you run vllama data, it performs the following steps automatically:

1. Loads and inspects the dataset: shape, dtypes, and missing-value rates
2. Removes duplicate rows and rows with excessive missing data
3. Handles missing values: KNN imputation for numeric columns, mode filling for categorical columns
4. Handles outliers: detects and caps extreme values using the interquartile range (IQR)
5. Encodes categorical columns: label, one-hot, or frequency encoding, chosen by cardinality
6. Scales numeric features using RobustScaler (outlier-resistant)
7. Selects features: removes zero-variance features and highly correlated (>0.95) features
8. Splits the data into train and test sets
9. Generates visualizations: missing-values heatmap, correlation matrix, target distribution, and mutual information scores
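Two of the steps above, IQR-based outlier capping and the train/test split, can be sketched in plain Python. This is only an illustration of the general techniques, not vllama's actual implementation. The function names, the 1.5 × IQR fences, the inclusive quantile method, and the seeded shuffle are all assumptions made for the example.

```python
import random
import statistics


def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional fence width; whether vllama uses it is an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows and hold out a test_size fraction (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


prices = [10, 12, 11, 13, 12, 11, 10, 500]
capped = cap_outliers_iqr(prices)
train, test = split_rows(capped, test_size=0.25)

print(capped)                 # the extreme value 500 is capped to 14.5
print(len(train), len(test))  # 6 2
```

Capping keeps the row and only pulls the extreme value back to the fence, which is why it can run before scaling without shrinking the dataset.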

Examples

# Minimal — auto-detect target column
vllama data --path sales_data.csv

# Specify target column and 25% test split
vllama data --path housing.csv --target price --test_size 0.25

# Custom output directory
vllama data --path data.csv --target label -t 0.3 -o ./processed
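When preprocessing is driven from a script, for example over several datasets, the same invocations can be assembled programmatically. The sketch below uses only the flags documented above. The helper name `build_data_command` and the check that `test_size` lies strictly between 0 and 1 are assumptions, not part of vllama.

```python
import shlex


def build_data_command(path, target=None, test_size=None, output_dir=None):
    """Assemble the argument list for `vllama data` (hypothetical helper)."""
    cmd = ["vllama", "data", "--path", str(path)]
    if target is not None:
        cmd += ["--target", target]
    if test_size is not None:
        # Assumed validation: the docs describe test_size as a fraction.
        if not 0 < test_size < 1:
            raise ValueError("test_size must be a fraction between 0 and 1")
        cmd += ["--test_size", str(test_size)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    return cmd


cmd = build_data_command("housing.csv", target="price", test_size=0.25)
print(shlex.join(cmd))
# To run the command: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when dataset paths contain spaces.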

Output Structure

All outputs go into a timestamped folder:

output_folder_YYYYMMDD_HHMMSS/
├── train_data.csv              ← Training set (80% by default)
├── test_data.csv               ← Test set (20% by default)
├── processed_full_data.csv     ← Full preprocessed dataset
├── preprocessing_log.json      ← Detailed JSON log of every step
├── preprocessing_log.txt       ← Human-readable log
├── summary_report.json         ← Summary statistics
├── transformation_metadata.json  ← Encoders & scalers (for inference later)
└── visualizations/
    ├── 01_missing_initial.png
    ├── 02_dtypes.png
    ├── 03_corr_processed.png
    ├── 04_target_processed.png
    └── 05_mi.png
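To inspect the results from Python, the CSV and JSON files listed above can be read with the standard library. This is a minimal sketch: the helper names are hypothetical, and because the schema of `summary_report.json` is not described here, the report is returned as parsed without assuming any keys.

```python
import csv
import json
from pathlib import Path


def load_outputs(folder):
    """Read the train/test CSVs and the summary report from an output folder."""
    folder = Path(folder)
    with open(folder / "train_data.csv", newline="") as f:
        train = list(csv.DictReader(f))
    with open(folder / "test_data.csv", newline="") as f:
        test = list(csv.DictReader(f))
    # The report's schema is not documented here, so it is returned as-is.
    summary = json.loads((folder / "summary_report.json").read_text())
    return train, test, summary


def holdout_fraction(train, test):
    """Share of all rows that ended up in the test set."""
    total = len(train) + len(test)
    return len(test) / total if total else 0.0


# Example (replace the placeholder with a real folder name):
# train, test, summary = load_outputs("output_folder_YYYYMMDD_HHMMSS")
# print(holdout_fraction(train, test))
```

Comparing `holdout_fraction` against the `--test_size` value you passed is a quick sanity check that the split came out as expected.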

Save the folder name

The timestamped output folder is what you pass to vllama train. Copy it after vllama data finishes.
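If you script the workflow instead of copying the name by hand, the most recent folder can be located by its name. The sketch below assumes the folder prefix is literally `output_folder_`, as shown in the tree above, and that the folders sit directly under the directory you search. The helper name is hypothetical.

```python
from pathlib import Path


def latest_output_folder(base="."):
    """Return the newest output_folder_* directory under base, or None if there is none."""
    candidates = [p for p in Path(base).glob("output_folder_*") if p.is_dir()]
    # YYYYMMDD_HHMMSS timestamps sort chronologically when compared as text.
    return max(candidates, key=lambda p: p.name, default=None)


# Example: folder = latest_output_folder("."), then pass it to `vllama train --path`.
```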

Next Step

Pass the output folder to vllama train to automatically train and compare ML models on your preprocessed data.

vllama train --path ./output_folder_YYYYMMDD_HHMMSS --target price
