vllama data: Data Preprocessing
Automatically cleans, encodes, scales, and splits any tabular dataset, and generates visualizations of the data.
Syntax
vllama data --path <file> [--target <column>] [--test_size <fraction>] [--output_dir <dir>]
Parameters
| Parameter | Short | Default | Description |
|---|---|---|---|
| `--path` | | required | Path to dataset file |
| `--target` | | auto-detect | Name of the target column |
| `--test_size` | `-t` | 0.2 | Fraction of data to hold out for testing |
| `--output_dir` | `-o` | current dir | Where to save output files |
Supported input formats: CSV, Excel (.xlsx), JSON, Parquet.
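As a quick pre-flight check, a script could confirm that a file looks like one of the supported formats before calling vllama data. The sketch below is only an illustration: the extension mapping is an assumption (the list above names only `.xlsx` explicitly), and `detect_format` is a hypothetical helper, not part of vllama.

```python
from pathlib import Path

# Assumed extension-to-format mapping; only ".xlsx" is named in the docs.
SUPPORTED_FORMATS = {
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".json": "JSON",
    ".parquet": "Parquet",
}


def detect_format(path):
    """Return the format name for a supported file extension, else None."""
    suffix = Path(path).suffix.lower()
    return SUPPORTED_FORMATS.get(suffix)


if __name__ == "__main__":
    for name in ("sales_data.csv", "report.XLSX", "notes.txt"):
        print(name, "->", detect_format(name))
```

A return value of `None` signals that the file should probably be converted before it is passed to `--path`.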
What It Does
When you run vllama data, it performs the following steps automatically:
- Loads and inspects the dataset — shape, dtypes, missing value rates
- Removes duplicates and rows with excessive missing data
- Handles missing values — KNN imputation for numerics, mode filling for categoricals
- Handles outliers — detects and caps extreme values using IQR
- Encodes categoricals — label encoding, one-hot encoding, or frequency encoding based on cardinality
- Scales numeric features using RobustScaler (outlier-resistant)
- Selects features by removing zero-variance and highly correlated (>0.95) features
- Splits into train / test sets
- Generates visualizations — missing values heatmap, correlation matrix, target distribution, mutual information scores
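Two of the steps above, IQR-based outlier capping and removal of zero-variance or highly correlated features, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not vllama's actual implementation. The 1.5 × IQR fence multiplier, the quartile method, the use of Pearson correlation, and the rule of keeping the first column of a correlated pair are all assumptions. Only the 0.95 threshold comes from the list above. The function names are hypothetical.

```python
import math
import statistics


def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]. k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_redundant_features(columns, threshold=0.95):
    """Return the names of columns kept after filtering.

    columns maps a feature name to a list of numbers. Zero-variance columns
    are dropped; of any pair with |r| > threshold, the later one is dropped
    (an assumed tie-breaking rule).
    """
    kept = {}
    for name, col in columns.items():
        if len(set(col)) <= 1:
            continue  # zero variance: carries no information
        if any(abs(pearson(col, other)) > threshold for other in kept.values()):
            continue  # nearly duplicates a feature that is already kept
        kept[name] = col
    return list(kept)


if __name__ == "__main__":
    print(cap_outliers_iqr([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
    features = {
        "a": [1, 2, 3, 4],
        "b": [2, 4, 6, 8.1],   # almost a multiple of "a"
        "c": [5, 5, 5, 5],     # constant
        "d": [4, 1, 3, 2],
    }
    print(drop_redundant_features(features))
```

Capping keeps every row and only pulls extreme values back to the fence, which matches the "detects and caps" wording rather than dropping outlier rows.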
Examples
# Minimal — auto-detect target column
vllama data --path sales_data.csv
# Specify target column and 25% test split
vllama data --path housing.csv --target price --test_size 0.25
# Custom output directory
vllama data --path data.csv --target label -t 0.3 -o ./processed
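To see what a given `--test_size` means in rows, the arithmetic can be sketched as follows. This assumes the rounding convention used by scikit-learn's `train_test_split` (the test count is rounded up and the remainder goes to training). Whether vllama rounds the same way is not stated here, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up, as scikit-learn does.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


if __name__ == "__main__":
    # 1,000 rows with --test_size 0.25
    print(split_sizes(1000, 0.25))  # (750, 250)
```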
Output Structure
All outputs go into a timestamped folder:
output_folder_YYYYMMDD_HHMMSS/
├── train_data.csv ← Training set (80% by default)
├── test_data.csv ← Test set (20% by default)
├── processed_full_data.csv ← Full preprocessed dataset
├── preprocessing_log.json ← Detailed JSON log of every step
├── preprocessing_log.txt ← Human-readable log
├── summary_report.json ← Summary statistics
├── transformation_metadata.json ← Encoders & scalers (for inference later)
└── visualizations/
    ├── 01_missing_initial.png
    ├── 02_dtypes.png
    ├── 03_corr_processed.png
    ├── 04_target_processed.png
    └── 05_mi.png
Save the folder name
The timestamped output folder is what you pass to vllama train. Copy it after vllama data finishes.
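If copying the folder name by hand is inconvenient, a small script can find the most recent output folder. The sketch below assumes that `output_folder_` is the literal prefix and that the suffix follows the `YYYYMMDD_HHMMSS` pattern shown above. `latest_output_folder` is a hypothetical helper, not a vllama command.

```python
from datetime import datetime
from pathlib import Path

PREFIX = "output_folder_"          # assumed literal folder prefix
STAMP_FORMAT = "%Y%m%d_%H%M%S"     # YYYYMMDD_HHMMSS


def latest_output_folder(base_dir="."):
    """Return the newest timestamped output folder in base_dir, or None."""
    candidates = []
    for path in Path(base_dir).iterdir():
        if not (path.is_dir() and path.name.startswith(PREFIX)):
            continue
        try:
            stamp = datetime.strptime(path.name[len(PREFIX):], STAMP_FORMAT)
        except ValueError:
            continue  # prefix matches but the suffix is not a timestamp
        candidates.append((stamp, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    folder = latest_output_folder(".")
    print(folder if folder else "no output folder found")
```

The returned path is the folder to hand to vllama train. The exact arguments of that command are not covered in this section.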
Next Step
Pass the output folder to vllama train to automatically train and compare ML models on your preprocessed data.