experiment_design
Perfect. Adding a baseline strengthens your study considerably, because it lets you tell whether what you find is transformer-specific or a general property of distributional learning. Here’s the refined plan with the baseline experiment included:
Research Plan
Research Questions
- How do the learning dynamics of embeddings differ between high-frequency and low-frequency words during LLM training?
- Do LLMs acquire stable representations for frequent words earlier than for infrequent words?
- How do embedding trajectories vary across model layers, from input (static embeddings) to deeper contextual layers?
- Are these dynamics specific to transformers, or do they also appear in simpler embedding models like Word2Vec?
Experimental Setup
Primary Model (Transformer)
- Train a DistilBERT model from scratch on a medium corpus (≈ 10–20M tokens).
- Training budget: 5–10 epochs within one day on a MacBook M3 Pro.
- Save checkpoints after each epoch.
Baseline Model (Distributional)
- Train a skip-gram Word2Vec (or CBOW) model on the same corpus.
- Train for the same number of epochs and save the embedding vectors at each epoch checkpoint.
- This baseline isolates frequency effects in a model with no contextual layers, attention, or representational depth.
Word Sampling
- Define two groups of words (shared across both models):
  - High-frequency group: the 10–50 most frequent words.
  - Low-frequency group: 10–50 words just above the vocabulary inclusion cutoff.
- Control for length, morphology, and part of speech when possible.
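To make the word sampling concrete, here is a minimal sketch of how the two groups could be drawn from raw token counts. The function name `sample_frequency_groups` and the parameters `k` and `min_count` are illustrative assumptions, not part of the plan above; `min_count` stands in for whatever inclusion cutoff your vocabulary uses.

```python
from collections import Counter


def sample_frequency_groups(tokens, k=10, min_count=5):
    """Return (high_freq, low_freq) word lists drawn from a token stream.

    k and min_count are hypothetical knobs: k is the group size and
    min_count is the assumed vocabulary inclusion cutoff.
    """
    counts = Counter(tokens)
    # Rank by descending count; break ties alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    high = [word for word, _ in ranked[:k]]
    # Keep only words that survive the cutoff, then take the rarest k of them.
    kept = [(word, count) for word, count in ranked if count >= min_count]
    rarest = sorted(kept, key=lambda kv: (kv[1], kv[0]))[:k]
    # Drop any overlap with the high-frequency group (possible on tiny corpora).
    low = [word for word, _ in rarest if word not in high]
    return high, low


tokens = ["the"] * 10 + ["cat"] * 6 + ["sat"] * 3 + ["mat"] * 2 + ["zebra"]
high, low = sample_frequency_groups(tokens, k=2, min_count=2)
print(high, low)
```

This does not control for length, morphology, or part of speech; those filters would have to be applied to the candidate lists afterwards.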
Embedding Extraction
DistilBERT
- Extract:
  - Static embeddings (layer 0 embedding matrix).
  - Contextual embeddings at each of the 6 transformer layers.
- Average contextual embeddings over multiple token occurrences.
Word2Vec
- Extract word embeddings directly from the learned vectors at each checkpoint.
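The averaging step for the DistilBERT contextual embeddings could look roughly like the sketch below. It assumes the per-layer hidden states have already been extracted into one array (for example by running the model with hidden-state output enabled) and that each target word corresponds to a single token; words split into several WordPiece tokens would need to be pooled first. The function name `average_by_word` is hypothetical.

```python
import numpy as np


def average_by_word(hidden_states, tokens, target_words):
    """Average contextual vectors over all occurrences of each target word.

    hidden_states: array of shape (n_layers, n_tokens, dim)
    tokens:        list of n_tokens strings aligned with axis 1
    Returns {word: array of shape (n_layers, dim)}; unseen words are skipped.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    tokens = np.asarray(tokens)
    averaged = {}
    for word in target_words:
        mask = tokens == word  # boolean mask over token positions
        if mask.any():
            # Select the word's occurrences, then average them per layer.
            averaged[word] = hidden_states[:, mask, :].mean(axis=1)
    return averaged


# Toy input: 2 layers, 3 tokens, 2 dimensions.
states = np.arange(12).reshape(2, 3, 2)
result = average_by_word(states, ["the", "cat", "the"], ["the", "dog"])
print(result["the"])
```

Running this once per checkpoint gives one vector per word, layer, and epoch, which is the same shape of data the Word2Vec checkpoints provide for a single "layer".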
Analysis
Trajectory Analysis
- Compute distances (Euclidean / cosine) to measure embedding movement:
  - Across epochs (learning time).
  - Across layers (transformer depth).
- Plot learning curves separately for frequent and infrequent word groups.
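As a sketch of the movement computation, the function below takes embeddings stacked along a leading axis and returns the Euclidean and cosine distance between consecutive steps. The name `displacement` and the array layout `(n_steps, n_words, dim)` are assumptions made for illustration.

```python
import numpy as np


def displacement(checkpoints):
    """Distance each word moves between consecutive steps.

    checkpoints: array of shape (n_steps, n_words, dim), where a step is
    an epoch (or a layer, if you stack layers instead of epochs).
    Returns (euclidean, cosine), each of shape (n_steps - 1, n_words).
    """
    X = np.asarray(checkpoints, dtype=float)
    prev, curr = X[:-1], X[1:]
    euclidean = np.linalg.norm(curr - prev, axis=-1)
    norms = np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1)
    # Clamp the denominator so all-zero vectors do not divide by zero.
    cos_sim = np.sum(prev * curr, axis=-1) / np.maximum(norms, 1e-12)
    return euclidean, 1.0 - cos_sim


# One word tracked over three checkpoints.
X = np.array([[[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 2.0]]])
euclidean, cosine = displacement(X)
print(euclidean.ravel(), cosine.ravel())
```

Averaging each output over the words in a frequency group gives one curve per group. Cosine distance is probably the safer choice for cross-model comparisons, since Euclidean distance depends on embedding dimension and norm, which differ between Word2Vec and DistilBERT.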
Stability & Convergence
- Define a stability criterion: movement below a threshold across successive epochs.
- Compare stabilization speed across frequency groups and across models.
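One possible way to operationalize the stability criterion is sketched below. The `patience` parameter (how many consecutive below-threshold steps count as "stable") is an assumption added here; the plan itself only says "successive epochs".

```python
def stabilization_step(displacements, threshold, patience=2):
    """Index of the first step that begins `patience` consecutive
    displacements below `threshold`, or None if that never happens.

    displacements[i] is the movement between checkpoint i and i + 1.
    """
    run = 0
    for i, d in enumerate(displacements):
        # Extend the current below-threshold run, or reset it.
        run = run + 1 if d < threshold else 0
        if run == patience:
            return i - patience + 1
    return None


moves = [0.9, 0.5, 0.05, 0.2, 0.04, 0.03, 0.01]
print(stabilization_step(moves, threshold=0.1, patience=2))
```

Applying this per word and then comparing the distribution of stabilization steps between the two frequency groups, and between the two models, gives the stabilization-speed comparison. The threshold would need to be chosen per model and per distance measure, because displacement scales are not directly comparable.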
Visualization
- Use PCA / t-SNE / UMAP to project embeddings into 2D/3D.
- Animate trajectories over training epochs for frequent vs. infrequent words.
- Compare DistilBERT vs. Word2Vec plots side by side.
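For trajectories to be comparable across epochs, all checkpoints need to share one coordinate system; t-SNE or UMAP fitted separately per epoch will not give aligned plots. A minimal sketch, assuming PCA fitted jointly on every checkpoint of one model (the function name `project_trajectories` is hypothetical):

```python
import numpy as np


def project_trajectories(checkpoints, n_components=2):
    """Project all checkpoints with a single PCA fitted on the pooled data.

    checkpoints: array of shape (n_epochs, n_words, dim)
    Returns an array of shape (n_epochs, n_words, n_components).
    """
    X = np.asarray(checkpoints, dtype=float)
    n_epochs, n_words, dim = X.shape
    flat = X.reshape(-1, dim)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    return coords.reshape(n_epochs, n_words, n_components)


rng = np.random.default_rng(0)
coords = project_trajectories(rng.normal(size=(4, 3, 5)))
print(coords.shape)
```

Each `coords[:, w, :]` is then the 2D path of word `w` over training, ready to plot or animate. DistilBERT and Word2Vec would each get their own projection, so the side-by-side plots compare the shapes of trajectories rather than absolute positions.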
Evaluation & Interpretation
Hypotheses
- Frequent words stabilize earlier in both models.
- In transformers, infrequent words may benefit more from contextualization (later layers reduce variability).
- Word2Vec embeddings may show clearer frequency effects due to lack of deep contextual processing.
Metrics
- Magnitude of movement: Euclidean / cosine distance.
- Rate of stabilization: epoch when displacement falls below threshold.
- Cross-model comparison: quantify differences in stabilization rates and final embedding distances.
Qualitative Case Studies
- Visualize trajectories of selected pairs, e.g. a common function word (“the”) vs. a rare noun.
- Contrast behaviors in Word2Vec (purely co-occurrence-based) vs. DistilBERT (contextualized).
Adding the Word2Vec baseline makes the interpretation stronger:
- If frequency effects show up in both models → they’re likely a general property of distributional learning.
- If they diverge → frequency effects in DistilBERT might be influenced by transformer architecture or contextualization depth.
👉 Do you want me to also suggest a third, ultra-light baseline (like FastText) that handles subword units? That might help disentangle frequency effects from morphological generalization.