PICO-LLM Research Pipeline
Modular LLM research pipeline (NYU CSCI-GA 2565) for training and evaluating K-Gram MLP, LSTM, and KV-cache Transformer architectures with 22+ experiment configs and rigorous cross-run analysis.
What this project proves
Experimentation and evaluation pipeline engineering
Modular LLM research pipeline spanning K-Gram MLP, LSTM, and KV-cache Transformer architectures with cross-run analysis.
Core challenge
Compare multiple language-model architectures under the same data, metrics, and analysis, rather than through ad hoc, one-off runs.
Evaluation lens
Experiment reproducibility, metrics analysis, and model-comparison rigor.
A reusable research pipeline with strong evaluation visibility and reproducible results.
Overview
A modular research pipeline for training and evaluating small-scale language model architectures, built for NYU’s CSCI-GA 2565 Machine Learning course. The pipeline supports K-Gram MLP, LSTM, and KV-cache Transformer models with systematic cross-run analysis across 22+ experiment configurations.
The value of this project is not only that it trained several models: it produced a reusable experimentation and evaluation pipeline for comparing architectures under the same dataset, metrics, and analysis framework.
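To make the simplest of the three architectures concrete: a K-Gram model predicts each token from the k tokens before it. The sketch below builds those (context, target) pairs in plain Python. The function name `kgram_examples`, the left-padding scheme, and `pad_id` are assumptions for illustration, not the pipeline's actual data loader.

```python
def kgram_examples(token_ids, k, pad_id=0):
    """Build (context, target) pairs where context is the k tokens before target.

    Positions near the start are left-padded with pad_id (an assumed convention)
    so that every context has exactly k entries.
    """
    examples = []
    for i, target in enumerate(token_ids):
        context = token_ids[max(0, i - k):i]
        context = [pad_id] * (k - len(context)) + context
        examples.append((context, target))
    return examples


if __name__ == "__main__":
    for context, target in kgram_examples([5, 6, 7, 8], k=2):
        print(context, "->", target)
```

An MLP would then embed or one-hot encode each fixed-width context and output a distribution over the vocabulary for the target position.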
What I Owned
- Built the modular training and evaluation pipeline
- Implemented support for K-Gram MLP, LSTM, and KV-cache Transformer variants
- Added experiment tracking and structured cross-run analysis
- Produced the comparison tooling used to interpret tradeoffs across architectures and hyperparameters
Best Results (KV-Cache Transformer)
| Metric | Score |
|---|---|
| Validation Loss | 1.665 |
| Perplexity | 6.389 |
| Token Accuracy | 73.21% |
See the README for the full interpretation of these results.
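For readers unfamiliar with the three metrics, here is a minimal sketch of how they are conventionally computed from per-token log-probabilities: mean cross-entropy loss, perplexity as the exponential of that loss, and token accuracy as the fraction of positions where the most likely token matches the target. The function name and input layout are assumptions; the pipeline's exact averaging and evaluation split are not stated here, so the figures in the table need not follow from this sketch.

```python
import math


def evaluate_predictions(log_probs, targets):
    """Return (mean_loss, perplexity, token_accuracy) for a sequence of tokens.

    log_probs: one list per position of natural-log probabilities over the vocab.
    targets:   the gold token id for each position.
    """
    if not targets or len(log_probs) != len(targets):
        raise ValueError("log_probs and targets must be non-empty and equal in length")
    total_nll = 0.0
    correct = 0
    for dist, gold in zip(log_probs, targets):
        total_nll -= dist[gold]  # negative log-likelihood of the gold token
        predicted = max(range(len(dist)), key=dist.__getitem__)  # argmax token
        correct += int(predicted == gold)
    mean_loss = total_nll / len(targets)
    return mean_loss, math.exp(mean_loss), correct / len(targets)


if __name__ == "__main__":
    # Toy distributions over a 2-token vocabulary (made-up numbers).
    dists = [[math.log(0.6), math.log(0.4)], [math.log(0.25), math.log(0.75)]]
    print(evaluate_predictions(dists, targets=[0, 1]))
```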
Hard Problems Solved
- Compare unlike architectures fairly: the pipeline needed a common harness across MLP, recurrent, and Transformer-style models
- Make experiment results legible: raw training logs are not enough; the project adds Pareto frontiers, heatmaps, residual analysis, and correlation views
- Preserve reproducibility: deterministic config-driven experimentation is what makes the conclusions useful rather than anecdotal
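As an illustration of the Pareto frontier view mentioned above, the sketch below keeps only the runs that no other run beats on both cost and loss. The field names `param_count` and `val_loss` and the sample runs are hypothetical; the pipeline's own axes and record format may differ.

```python
def pareto_frontier(runs, cost_key="param_count", loss_key="val_loss"):
    """Return the runs that are not dominated (lower is better on both keys).

    Sorting by cost and sweeping once means a run survives only if its loss
    is strictly lower than every cheaper-or-equal run seen so far.
    """
    frontier = []
    best_loss = float("inf")
    for run in sorted(runs, key=lambda r: (r[cost_key], r[loss_key])):
        if run[loss_key] < best_loss:
            frontier.append(run)
            best_loss = run[loss_key]
    return frontier


if __name__ == "__main__":
    # Made-up runs for illustration only.
    runs = [
        {"name": "a", "param_count": 1, "val_loss": 3.0},
        {"name": "b", "param_count": 2, "val_loss": 2.5},
        {"name": "c", "param_count": 3, "val_loss": 2.7},  # dominated by "b"
        {"name": "d", "param_count": 4, "val_loss": 2.0},
    ]
    print([run["name"] for run in pareto_frontier(runs)])
```

Exact duplicates on both axes are collapsed to a single frontier entry, which is usually what a plot wants.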
Model Comparisons
Hyperparameter Analysis
Key Features
- Multi-Architecture Support: K-Gram MLP, LSTM, and KV-cache Transformer training loops with a shared evaluation harness.
- 22+ Experiment Configs: Systematic hyperparameter sweeps and architecture comparisons with deterministic config tracking and multi-model batch training.
- Rich Logging: 20+ tracked fields per run (loss, val_loss, perplexity, token_accuracy, gradient norms, learning rate, hyperparameters).
- Advanced Evaluation: Pareto frontier analysis, embedding similarity metrics, regression insights, and cross-run statistical analysis.
- Configurable Sampling: Greedy, top-p, and repetition-penalty decoding; monosemantic token probing; configurable generation pipelines.
- Reproducible: Deterministic seeding and config-driven experimentation throughout.
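The decoding options listed above can be sketched in a few lines of dependency-free Python. This is an illustrative version, not the pipeline's sampler: the function name, its parameters, and the divide-or-multiply form of the repetition penalty are assumptions. Passing a seeded `random.Random` makes the draw reproducible, in the spirit of the deterministic seeding noted above.

```python
import math
import random


def sample_next_token(logits, generated=(), top_p=1.0, repetition_penalty=1.0,
                      greedy=False, rng=None):
    """Choose the next token id from raw logits."""
    rng = rng or random.Random()
    adjusted = list(logits)
    for tok in set(generated):
        # Divide positive logits and multiply negative ones, so the penalty
        # always lowers the score of an already-generated token.
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    if greedy:
        return max(range(len(adjusted)), key=adjusted.__getitem__)
    # Numerically stable softmax.
    peak = max(adjusted)
    weights = [math.exp(x - peak) for x in adjusted]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus: smallest set of top tokens whose probability mass reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, mass = [], 0.0
    for tok in ranked:
        nucleus.append(tok)
        mass += probs[tok]
        if mass >= top_p:
            break
    # Sample inside the nucleus in proportion to the original probabilities.
    threshold = rng.random() * mass
    running = 0.0
    for tok in nucleus:
        running += probs[tok]
        if running >= threshold:
            return tok
    return nucleus[-1]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for a repeatable draw
    print(sample_next_token([1.0, 3.0, 2.0], top_p=0.9, rng=rng))
```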
Why It Matters
This project shows research engineering discipline: not just training a model once, but building the tooling needed to compare architectures rigorously and understand why one setup wins.
Tech Stack
- ML: PyTorch, custom Transformer implementation with KV-cache
- Experiment Tracking: Weights & Biases (wandb)
- Analysis: NumPy, Pandas, Matplotlib
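The KV-cache idea named in the stack can be shown without any framework. During generation, the keys and values of earlier positions do not change, so they can be stored and reused instead of recomputed at every step. The sketch below is a single-head, plain-Python illustration under that assumption; the class and function names are hypothetical, and the project's PyTorch implementation is not shown here. In a real Transformer the query, key, and value vectors would come from learned projections of the hidden state.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Holds the keys and values of past positions for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Append this position's key and value, then attend over the whole cache."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def causal_attention_full(queries, keys, values):
    """Reference path: recompute causal attention for every position from scratch."""
    return [attend(q, keys[:t + 1], values[:t + 1]) for t, q in enumerate(queries)]


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    cache = KVCache()
    incremental = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
    print(incremental[-1])
    print(causal_attention_full(qs, ks, vs)[-1])
```

Both paths give the same outputs; the cached path does work proportional to the current sequence length per new token, whereas recomputing every position at each step repeats work already done.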