Research 2025-04

PICO-LLM Research Pipeline

Modular LLM research pipeline (NYU CSCI-GA 2565) for training and evaluating K-Gram MLP, LSTM, and KV-cache Transformer architectures with 22+ experiment configs and rigorous cross-run analysis.

Token Accuracy 73.21%
PyTorch, LLM, Transformers, LSTM, KV-Cache, wandb, Research

What this project proves

Experimentation and evaluation pipeline engineering

Modular LLM research pipeline spanning K-Gram MLP, LSTM, and KV-cache Transformer architectures with cross-run analysis.

Core challenge

Compare multiple language-model architectures without ad hoc experimentation or weak analysis.

Evaluation lens

Experiment reproducibility, metrics analysis, and model-comparison rigor.

A reusable research pipeline with strong evaluation visibility and reproducible results.

Overview

A modular research pipeline for training and evaluating small-scale language model architectures, built for NYU’s CSCI-GA 2565 Machine Learning course. The pipeline supports K-Gram MLP, LSTM, and KV-cache Transformer models with systematic cross-run analysis across 22+ experiment configurations.

The value of this project is not just that it trained several models. It created a reusable experimentation and evaluation pipeline for comparing architectures under the same dataset, metrics, and analysis framework.

What I Owned

  • Built the modular training and evaluation pipeline
  • Implemented support for K-Gram MLP, LSTM, and KV-cache Transformer variants
  • Added experiment tracking and structured cross-run analysis
  • Produced the comparison tooling used to interpret tradeoffs across architectures and hyperparameters

Best Results (KV-Cache Transformer)

Metric            Score
Validation Loss   1.665
Perplexity        6.389
Token Accuracy    73.21%

See full result interpretation in the README ->
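
As a minimal sketch of how these three metrics are commonly defined (not the project's actual evaluation code): validation loss is the mean per-token cross-entropy, perplexity is the exponential of that mean, and token accuracy is the share of positions where the arg-max prediction matches the target. The function name `evaluate_tokens` and the plain-list inputs are assumptions for illustration; the pipeline itself works on PyTorch tensors, and its reported figures may be aggregated differently.

```python
import math


def evaluate_tokens(logits, targets):
    """Return (mean_loss, perplexity, token_accuracy) for a set of positions.

    logits:  list of per-position score lists, one score per vocabulary entry.
    targets: list of correct token ids, same length as logits.
    """
    total_nll = 0.0
    correct = 0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-likelihood of the correct token under softmax(scores).
        total_nll += log_norm - scores[target]
        # Greedy prediction is the arg-max over the vocabulary.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == target)
    n = len(targets)
    mean_loss = total_nll / n
    return mean_loss, math.exp(mean_loss), correct / n
```

Averaging the loss over tokens before exponentiating matters: exponentiating per batch and then averaging gives a different, larger number.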

Hard Problems Solved

  • Compare unlike architectures fairly: the pipeline needed a common harness across MLP, recurrent, and Transformer-style models
  • Make experiment results legible: raw training logs are not enough; the project adds Pareto frontiers, heatmaps, residual analysis, and correlation views
  • Preserve reproducibility: deterministic config-driven experimentation is what makes the conclusions useful rather than anecdotal
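
One way to make "deterministic, config-driven" concrete is to treat each run as a frozen config object, derive the run name from a hash of its contents, and draw all randomness from a generator seeded by that config. The sketch below uses only the standard library; `RunConfig`, its fields, `config_id`, and `seeded_rng` are hypothetical names, since the project's real config schema is not shown here.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    # Illustrative fields only; the real schema is an assumption.
    model_type: str = "kvcache_transformer"
    embed_size: int = 128
    block_size: int = 64
    activation: str = "gelu"
    seed: int = 0


def config_id(cfg: RunConfig) -> str:
    """Stable short hash of the config, usable as a run name."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def seeded_rng(cfg: RunConfig) -> random.Random:
    """Private RNG so a run never depends on global random state."""
    return random.Random(cfg.seed)
```

Two runs with identical configs then share an identifier and a random stream, and any changed field yields a new identifier. In a PyTorch training loop the same seed would typically also be passed to `torch.manual_seed`.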

Model Comparisons

Left: Token accuracy vs validation loss across all runs; KV-cache Transformers dominate the high-accuracy region. Right: Pareto frontier overlay identifying the 10 Pareto-optimal runs.
Perplexity vs token accuracy (log scale) — the KV-cache Transformer achieves the lowest perplexity at highest accuracy, with LSTM and K-Gram MLP trailing significantly.
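
A Pareto frontier over runs can be sketched as follows, assuming the two objectives are lower validation loss and higher token accuracy (the axes of the left-hand plot). A run is kept if no other run is at least as good on both objectives and strictly better on one. The function name and the dictionary keys are assumptions for illustration.

```python
def pareto_front(runs):
    """Return the runs that no other run dominates.

    Each run is a dict with "val_loss" (lower is better) and
    "token_accuracy" (higher is better). Input order is preserved.
    """
    front = []
    for cand in runs:
        dominated = any(
            # "other" is at least as good on both objectives ...
            other["val_loss"] <= cand["val_loss"]
            and other["token_accuracy"] >= cand["token_accuracy"]
            # ... and strictly better on at least one of them.
            and (
                other["val_loss"] < cand["val_loss"]
                or other["token_accuracy"] > cand["token_accuracy"]
            )
            for other in runs
        )
        if not dominated:
            front.append(cand)
    return front
```

The pairwise check is quadratic in the number of runs, which is negligible for a few dozen experiment configs.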

Hyperparameter Analysis

Hyperparameter correlation matrix: the strongest correlation is between token_accuracy and val_loss (−0.80); embed_size and block_size each correlate positively with inner_layers and batch_size.
Validation loss heatmaps (Embed Size vs k) per model type — larger embedding sizes consistently dominate for both K-Gram MLP and KV-cache Transformer.
Left: LSTM validation loss heatmap — higher k and embed size reduce loss but performance lags behind the Transformer. Right: Validation loss vs embed size by activation function — GELU consistently outperforms ReLU across all model types.
Regression residuals vs predicted validation loss: LSTM runs show the largest systematic bias, while KV-cache Transformer residuals cluster most tightly around zero, meaning the regression predicts validation loss most accurately for those runs.
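
The two statistics behind the correlation matrix and the residual plot can be sketched in plain Python. This assumes Pearson correlation for the matrix entries and, as a simplification, a least-squares fit on a single predictor for the residuals; the project's regression presumably uses several hyperparameters at once. Both function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def linear_residuals(xs, ys):
    """Fit y = slope * x + intercept by least squares.

    Returns (predicted, residuals), where residual = observed - predicted.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    predicted = [slope * x + intercept for x in xs]
    residuals = [y - p for y, p in zip(ys, predicted)]
    return predicted, residuals
```

Plotting residuals against predicted values, grouped by model type, is what exposes a systematic bias: residuals that sit consistently above or below zero for one architecture mean the fit misses something specific to it.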

Key Features

  • Multi-Architecture Support: K-Gram MLP, LSTM, and KV-cache Transformer training loops with a shared evaluation harness.
  • 22+ Experiment Configs: Systematic hyperparameter sweeps and architecture comparisons with deterministic config tracking and multi-model batch training.
  • Rich Logging: 20+ tracked fields per run (loss, val_loss, perplexity, token_accuracy, gradient norms, learning rate, hyperparameters).
  • Advanced Evaluation: Pareto frontier analysis, embedding similarity metrics, regression insights, and cross-run statistical analysis.
  • Configurable Sampling: Greedy, top-p, and repetition-penalty decoding; monosemantic token probing; configurable generation pipelines.
  • Reproducible: Deterministic seeding and config-driven experimentation throughout.
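
The decoding options listed above can be illustrated with a single sampling step. The sketch below is an assumption about how such a step might look, not the project's code: it applies a repetition penalty using the common convention of dividing positive logits and multiplying negative ones for tokens already generated, then samples from the smallest set of tokens whose cumulative probability reaches `top_p`. The function name and default values are hypothetical.

```python
import math
import random


def sample_top_p(logits, generated, top_p=0.9, repetition_penalty=1.2, rng=None):
    """Pick the next token id from one position's logits."""
    rng = rng or random.Random()

    # Repetition penalty: make already-generated tokens less likely.
    adjusted = list(logits)
    for tok in set(generated):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # Softmax with max subtraction for numerical stability.
    peak = max(adjusted)
    exps = [math.exp(v - peak) for v in adjusted]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: most probable tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample inside the nucleus; choices() renormalises the weights.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Passing a seeded `random.Random` as `rng` keeps generation reproducible, and a very small `top_p` reduces the nucleus to the single most likely token, which is equivalent to greedy decoding.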

Why It Matters

This project shows research engineering discipline: not just training a model once, but building the tooling needed to compare architectures rigorously and understand why one setup wins.

Tech Stack

  • ML: PyTorch, custom Transformer implementation with KV-cache
  • Experiment Tracking: Weights & Biases (wandb)
  • Analysis: NumPy, Pandas, Matplotlib
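
The idea behind a KV-cache can be shown with a single attention head in plain Python. During generation, the keys and values of earlier positions never change, so they can be stored and reused instead of being recomputed at every step. This is a simplified sketch under stated assumptions: the query, key, and value vectors are passed in directly (a real Transformer layer would produce them with learned projections), and the names `attend`, `KVCache`, and `full_recompute` are hypothetical rather than taken from the project.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)
    ]


class KVCache:
    """Stores the keys and values of past positions for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the new position, then attend over everything seen so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_recompute(queries, keys, values):
    """Reference: causal attention rebuilt from scratch at every position."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]
```

Feeding positions one at a time through `KVCache.step` gives the same outputs as `full_recompute`, but each new token only requires computing its own key and value rather than rebuilding them for the whole prefix.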