experiment_design

Perfect. Adding a baseline strengthens the study considerably, because it lets you tell whether an effect is specific to transformers or a general property of distributional learning. Here’s the refined plan with the extra baseline experiment included:

Research Plan

Research Questions

How do the learning dynamics of embeddings differ between high-frequency and low-frequency words during LLM training?
Do LLMs acquire stable representations for frequent words earlier than for infrequent words?
How do embedding trajectories vary across model layers, from input (static embeddings) to deeper contextual layers?
Are these dynamics specific to transformers, or do they also appear in simpler embedding models like Word2Vec?

Experimental Setup

Primary Model (Transformer)

Train a DistilBERT model from scratch on a medium corpus (≈ 10–20M tokens).
Training budget: 5–10 epochs within one day on a MacBook M3 Pro.
Save checkpoints after each epoch.

Baseline Model (Distributional)

Train a skip-gram Word2Vec (or CBOW) model on the same corpus.
Train for the same number of epochs and save the embedding vectors at each checkpoint.
This baseline isolates frequency effects in a model with no contextual layers, no attention, and no depth.

Word Sampling

Define two groups of words (shared across both models):
- High-frequency group: 10–50 words drawn from the most frequent words in the corpus.
- Low-frequency group: 10–50 words whose counts sit just above the vocabulary inclusion cutoff.
Control for length, morphology, and part of speech when possible.
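
As a rough illustration of the sampling step, here is a minimal sketch that picks both groups from a tokenized corpus. The function name, the `group_size` default, and the `min_count` cutoff are assumptions made for the example, not values fixed by the plan, and the sketch does not apply the length, morphology, or part-of-speech controls.

```python
from collections import Counter

def sample_frequency_groups(tokens, group_size=10, min_count=5):
    """Return (high_freq, low_freq) word lists from a list of tokens.

    min_count is an assumed vocabulary inclusion cutoff (illustrative value).
    """
    counts = Counter(tokens)
    # Keep only words frequent enough to be included in the vocabulary.
    vocab = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by descending count; break ties alphabetically for reproducibility.
    vocab.sort(key=lambda wc: (-wc[1], wc[0]))
    high = [w for w, _ in vocab[:group_size]]
    # The tail of the sorted vocabulary sits closest to the cutoff.
    low = [w for w, _ in vocab[-group_size:] if w not in high]
    return high, low
```

Using the same two lists for both models keeps the comparison aligned; with a subword tokenizer you would additionally want to check how each sampled word is split.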

Embedding Extraction

DistilBERT

Extract:
1. Static embeddings (layer 0 embedding matrix).
2. Contextual embeddings at each of the 6 transformer layers.
  - Average contextual embeddings over multiple token occurrences.
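
The averaging step could look roughly like the sketch below. It assumes you have already collected hidden states into an array of shape `(num_layers, num_tokens, dim)` alongside the matching token ids, and that the target word maps to a single token id; words split into several subword pieces would need an extra pooling step. The function name is hypothetical.

```python
import numpy as np

def mean_embedding_per_layer(hidden_states, token_ids, target_id):
    """Average a token's vectors over all of its occurrences, per layer.

    hidden_states: array-like of shape (num_layers, num_tokens, dim)
    token_ids:     array-like of shape (num_tokens,)
    Returns an array of shape (num_layers, dim), or None if the token is absent.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    # Boolean mask marking every position where the target token occurs.
    mask = np.asarray(token_ids) == target_id
    if not mask.any():
        return None
    # Select the occurrences along the token axis, then average them.
    return hidden_states[:, mask, :].mean(axis=1)
```

Running this once per checkpoint gives one vector per word, layer, and epoch, which is the input the trajectory analysis needs.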

Word2Vec

Extract word embeddings directly from the learned vectors at each checkpoint.

Analysis

Trajectory Analysis

Compute distances (Euclidean / cosine) to measure embedding movement:
- Across epochs (learning time).
- Across layers (transformer depth).
Plot learning curves separately for frequent and infrequent word groups.
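
To make the movement measure concrete, here is a small sketch that computes step-to-step Euclidean and cosine distances along one word's trajectory. The function name and the input layout (one row per checkpoint) are assumptions for illustration.

```python
import numpy as np

def step_distances(trajectory):
    """Distances between consecutive points of one embedding trajectory.

    trajectory: array-like of shape (num_points, dim), ordered by epoch (or layer).
    Returns (euclidean, cosine_distance), each of length num_points - 1.
    """
    traj = np.asarray(trajectory, dtype=float)
    prev, curr = traj[:-1], traj[1:]
    euclidean = np.linalg.norm(curr - prev, axis=1)
    norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1)
    # Guard against division by zero if a vector happens to be all zeros.
    cos_sim = np.sum(prev * curr, axis=1) / np.maximum(norms, 1e-12)
    return euclidean, 1.0 - cos_sim
```

The same function applies to movement across layers if the rows are ordered by layer instead of by epoch. Averaging the outputs within each frequency group gives the two learning curves to plot.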

Stability & Convergence

Define a stability criterion: for example, the displacement between consecutive checkpoints stays below a fixed threshold for several successive epochs.
Compare stabilization speed across frequency groups and across models.
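
One possible way to operationalize the criterion is sketched below. The `threshold` and `patience` parameters are assumptions you would need to tune (and probably normalize per model, since DistilBERT and Word2Vec vectors live on different scales); the function name is hypothetical.

```python
def stabilization_step(displacements, threshold, patience=2):
    """Find where a word's embedding first becomes stable.

    displacements: per-step movement, where entry i is the distance between
                   checkpoint i and checkpoint i + 1.
    Returns the index of the first step that starts a run of `patience`
    consecutive displacements below `threshold`, or None if there is no such run.
    """
    run = 0
    for i, d in enumerate(displacements):
        # Extend the current run of small moves, or reset it on a large move.
        run = run + 1 if d < threshold else 0
        if run == patience:
            return i - patience + 1
    return None
```

Requiring a run of small moves rather than a single one avoids declaring a word stable because of one quiet epoch followed by further drift.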

Visualization

Use PCA / t-SNE / UMAP to project embeddings into 2D/3D.
Animate trajectories over training epochs for frequent vs. infrequent words.
Compare DistilBERT vs. Word2Vec plots side by side.
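
For the PCA option, a minimal sketch follows. It fits a single projection on all words and all checkpoints together, so that every trajectory is drawn in the same 2D axes; fitting a separate projection per epoch would make positions incomparable across frames. The function name and input layout are assumptions.

```python
import numpy as np

def project_trajectories(trajectories, n_components=2):
    """Project embedding trajectories into a shared low-dimensional PCA space.

    trajectories: array-like of shape (num_words, num_checkpoints, dim)
    Returns an array of shape (num_words, num_checkpoints, n_components).
    """
    traj = np.asarray(trajectories, dtype=float)
    num_words, num_ckpts, dim = traj.shape
    # Stack every (word, checkpoint) vector into one matrix and center it.
    flat = traj.reshape(-1, dim)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return projected.reshape(num_words, num_ckpts, n_components)
```

Because the two models have unrelated embedding spaces, it seems safest to project each model separately and compare the shapes of the trajectories side by side rather than their absolute coordinates.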

Evaluation & Interpretation

Hypotheses

- Frequent words stabilize earlier in both models.
- In transformers, infrequent words may benefit more from contextualization (later layers reduce variability).
- Word2Vec embeddings may show clearer frequency effects due to lack of deep contextual processing.

Metrics

- Magnitude of movement: Euclidean / cosine distance.
- Rate of stabilization: epoch when displacement falls below threshold.
- Cross-model comparison: quantify differences in stabilization rates and final embedding distances.

Qualitative Case Studies

- Visualize trajectories of select pairs: common function word (“the”) vs. rare noun.
- Contrast behaviors in Word2Vec (purely co-occurrence-based) vs. DistilBERT (contextualized).

Adding the Word2Vec baseline makes the interpretation stronger:

If frequency effects show up in both models → they’re likely a general property of distributional learning.
If the effects differ between the two models → the frequency effects in DistilBERT are probably shaped by the transformer architecture or by contextualization depth.

👉 Do you want me to also suggest a third, ultra-light baseline (like FastText) that handles subword units? That might help disentangle frequency effects from morphological generalization.