Semantic role labeling with DistilBERT
In this notebook, we’ll fine-tune DistilBERT for the task of Semantic Role Labeling (SRL), using the English Universal PropBank 1.0 datasets.
The files resulting from the fine-tuning process are available here:
- Model: https://drive.google.com/drive/folders/1vdgy-pglPSYwsL5Vc5_acVNGip3O3Z-b?usp=share_link
- Tokenizer: https://drive.google.com/drive/folders/1NmYWIbMGLwvTyC9bexHocifTXUe6slFc?usp=sharing
- Trainer: https://drive.google.com/drive/folders/1gcbTuuVvFVRceP6nIgJEa6vCcwCeHarv?usp=sharing
Import libraries
import time
import pandas as pd
import transformers
import numpy as np
import torch
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset
from utils import read_data_as_sentence,map_labels_in_dataframe,tokenize_and_align_labels,get_label_mapping,get_labels_from_map,load_srl_model,load_dataset,compute_metrics,write_predictions_to_csv,compute_evaluation_metrics_from_csv, print_sentences
from bert_srl import main, define_args
Step 1: Preprocess data
Unlike traditional token labeling methods, which assign labels to individual words in isolation, BERT performs sequence labeling. This means BERT assigns labels to individual tokens while taking the full sentence context into consideration.
The English Universal PropBank 1.0 dataset is structured in CoNLL-U Plus format, in which lines represent individual tokens. So before you can train the model, you need to extract sentences and labels from the datasets, and preprocess the sentences by removing non-argument labels.
To preprocess the datasets and save the resulting DataFrame to a file, call the read_data_as_sentence() function, including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (string) | ✅️ | The filepath for the CoNLL-U dataset. |
| positional 2 (string) | ✅ | The filepath to write the preprocessed DataFrame to. |
train_data = read_data_as_sentence('data/en_ewt-up-train.conllu', 'data/en_ewt-up-train.preprocessed.csv')
dev_data = read_data_as_sentence('data/en_ewt-up-dev.conllu', 'data/en_ewt-up-dev.preprocessed.csv')
test_data = read_data_as_sentence('data/en_ewt-up-test.conllu', 'data/en_ewt-up-test.preprocessed.csv')
The read_data_as_sentence() function returns DataFrames, where each row represents a sentence from the dataset passed to the function. Each sentence has been expanded based on its predicates, resulting in multiple copies of the same sentence, each focused on a different predicate.
The DataFrame has two columns:
- input_form: a list of strings, where each string represents a word in the sentence, followed by two additional elements:
  - A special token ([SEP]), which denotes the separation between the words of the sentence and the predicate form.
  - The predicate form, which corresponds to the argument values for the same row in the DataFrame.
- argument: a list of strings, representing the arguments associated with each word in the sentence. The length of each list is equal to the number of words in the sentence, plus two additional elements, for the special token and predicate form. The arguments match the predicate appended to the input_form for the same row in the DataFrame.
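The expansion described above can be sketched in plain Python. This is a minimal illustration, not the code of read_data_as_sentence(): the function name expand_by_predicate and the shape of its predicates argument are assumptions made for the example.

```python
def expand_by_predicate(words, predicates):
    """Return one row per predicate of a sentence (hypothetical helper).

    words: list of word forms of one sentence.
    predicates: list of (predicate_index, argument_labels) tuples, where
        argument_labels holds one label per word ('_' for non-arguments).
    """
    rows = []
    for pred_index, labels in predicates:
        # Append the separator and the predicate form to the words.
        input_form = words + ["[SEP]", words[pred_index]]
        # The two appended elements carry no argument label.
        argument = labels + [None, None]
        rows.append({"input_form": input_form, "argument": argument})
    return rows


words = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"]
predicates = [(3, ["_", "_", "ARG1", "_", "_", "ARG2", "_"])]
rows = expand_by_predicate(words, predicates)
print(rows[0]["input_form"])
print(rows[0]["argument"])
```

A sentence with several predicates would yield several rows that share the same words but differ in the appended predicate form and in the argument labels.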
Explore the DataFrame
Before you continue to tokenize the sentences and fine-tune the BERT model, it’s time to get more familiar with our data.
To explore the DataFrame, start by printing a summary of the preprocessed test DataFrame:
print(test_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3971 entries, 0 to 3970
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 input_form 3971 non-null object
1 argument 3971 non-null object
dtypes: object(2)
memory usage: 62.2+ KB
None
The Non-Null count for both columns should match, indicating there are as many lists of input_form values as there are lists of argument values, namely one for each sentence.
Next, print the words and their argument labels for the first 5 sentences of the test dataset:
print_sentences(test_data[:5])
form: What argument: _
form: if argument: _
form: Google argument: ARG1
form: Morphed argument: _
form: Into argument: _
form: GoogleOS argument: ARG2
form: ? argument: _
----------------------------------------
form: [SEP] argument: None
form: Morphed argument: None
========================================
form: What argument: _
form: if argument: _
form: Google argument: ARG0
form: expanded argument: _
form: on argument: _
form: its argument: _
form: search argument: _
form: - argument: _
form: engine argument: _
form: ( argument: _
form: and argument: _
form: now argument: _
form: e-mail argument: _
form: ) argument: _
form: wares argument: ARG1
form: into argument: _
form: a argument: _
form: full argument: _
form: - argument: _
form: fledged argument: _
form: operating argument: _
form: system argument: ARG4
form: ? argument: _
----------------------------------------
form: [SEP] argument: None
form: expanded argument: None
========================================
form: ( argument: _
form: And argument: _
form: , argument: _
form: by argument: _
form: the argument: _
form: way argument: ARGM-DIS
form: , argument: _
form: is argument: _
form: anybody argument: ARG1
form: else argument: _
form: just argument: _
form: a argument: _
form: little argument: _
form: nostalgic argument: ARG2
form: for argument: _
form: the argument: _
form: days argument: _
form: when argument: _
form: that argument: _
form: was argument: _
form: a argument: _
form: good argument: _
form: thing argument: _
form: ? argument: _
form: ) argument: _
----------------------------------------
form: [SEP] argument: None
form: is argument: None
========================================
form: ( argument: _
form: And argument: _
form: , argument: _
form: by argument: _
form: the argument: _
form: way argument: _
form: , argument: _
form: is argument: _
form: anybody argument: _
form: else argument: _
form: just argument: _
form: a argument: _
form: little argument: _
form: nostalgic argument: _
form: for argument: _
form: the argument: _
form: days argument: ARGM-TMP
form: when argument: R-ARGM-TMP
form: that argument: ARG1
form: was argument: _
form: a argument: _
form: good argument: _
form: thing argument: ARG2
form: ? argument: _
form: ) argument: _
----------------------------------------
form: [SEP] argument: None
form: is argument: None
========================================
form: This argument: _
form: BuzzMachine argument: ARG2
form: post argument: _
form: argues argument: _
form: that argument: _
form: Google argument: _
form: 's argument: _
form: rush argument: _
form: toward argument: _
form: ubiquity argument: _
form: might argument: _
form: backfire argument: _
form: -- argument: _
form: which argument: _
form: we argument: _
form: 've argument: _
form: all argument: _
form: heard argument: _
form: before argument: _
form: , argument: _
form: but argument: _
form: it argument: _
form: 's argument: _
form: particularly argument: _
form: well argument: _
form: - argument: _
form: put argument: _
form: in argument: _
form: this argument: _
form: post argument: _
form: . argument: _
----------------------------------------
form: [SEP] argument: None
form: post argument: None
========================================
As you can see, the sequence of word forms runs parallel to the sequence of argument labels. This means that for every index of input_form, the same index of argument gives its argument label.
Argument labels are:
- ‘_’ for tokens that are not an argument (in the current predicate sense of the sentence).
- The token’s respective PropBank label for tokens that are an argument, e.g. ARG1.
- None for the special separator token ([SEP]) and the predicate token that follows the separator.
For example, in the first sentence of the test data printed above (“What if Google Morphed Into GoogleOS?”), the predicate ‘Morphed’ evokes the frame morph.01. The frame’s arguments are:
- ARG0-PAG: causer of transformation
- ARG1-PPT: thing changing
- ARG2-PRD: end state
- ARG3-VSP: start state
In this example, the ARG1 label is assigned to ‘Google’, and the ARG2 label is assigned to ‘GoogleOS’, which indicates ‘Google’ is the thing that is changing and ‘GoogleOS’ is its end state.
Step 2: Initialize a tokenizer
Now that you have extracted sentences and labels from the datasets, you need to prepare the sentences for the BERT model by tokenizing them.
Use HuggingFace’s AutoTokenizer to construct a DistilBERT tokenizer, which is based on the WordPiece algorithm.
# Set the model ID to use
model_id = "distilbert-base-uncased"
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Check the assertion that the tokenizer is an instance of transformers.PreTrainedTokenizerFast
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
To test the tokenizer, tokenize the first sentence of the test data, including:
- add_special_tokens set to True to add a [CLS] token to the start and a [SEP] token to the end of every sentence.
- is_split_into_words set to True because the sentence is already split into words (based on the Universal PropBank 1.0 dataset).
# Tokenize the first example in the test data
example = test_data['input_form'][0]
tokenized_input = tokenizer(example, add_special_tokens=True, is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
# Print the example tokens and their corresponding IDs
for token, token_id in zip(tokens, tokenized_input["input_ids"]):
    print(f"{token:>10} {token_id}")
[CLS] 101
what 2054
if 2065
google 8224
mor 22822
##ph 8458
##ed 2098
into 2046
google 8224
##os 2891
? 1029
[SEP] 102
mor 22822
##ph 8458
##ed 2098
[SEP] 102
You’ve successfully tokenized the sample sentence, splitting words up into subword tokens and fetching their token IDs from DistilBERT’s vocabulary.
Note: the special tokens [CLS] and [SEP] are mapped to the IDs 101 and 102. These are fixed entries in the model’s vocabulary: [CLS] marks the start of the input and [SEP] marks a segment boundary.
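To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, the matching strategy WordPiece tokenizers apply to each word. The toy vocabulary is an assumption chosen to cover the example sentence; the real DistilBERT vocabulary is far larger, and the real tokenizer also handles normalization and punctuation splitting.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption), just large enough for the example sentence.
toy_vocab = {"what", "if", "google", "into", "mor", "##ph", "##ed", "##os", "?"}
words = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"]
subwords = [piece for word in words for piece in wordpiece(word.lower(), toy_vocab)]
print(subwords)
```

With this toy vocabulary, the sketch splits ‘Morphed’ into mor, ##ph, ##ed and ‘GoogleOS’ into google, ##os, in line with the tokenizer output shown above.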
Step 3: Prepare the input for training
Before training the model, map the labels in the datasets to numerical values. This ensures consistency and facilitates the training process.
To get the label mapping, call get_label_mapping(), including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (DataFrame) | ✅️ | The training dataset for which to extract the label mapping. |
| positional 2 (DataFrame) | ✅ | The test dataset for which to extract the label mapping. |
| positional 3 (DataFrame) | ✅ | The dev dataset for which to extract the label mapping. |
label_map = get_label_mapping(train_data, test_data, dev_data)
The get_label_mapping() function returns an alphabetically-ordered dictionary mapping:
- _ to 0.
- String labels to integers, e.g. ARG0 to 1.
- None to None, to preserve the labels for special tokens and predicates. (You will replace None with -100 later to mask these tokens from being labeled.)
print(label_map)
{'_': 0, 'ARG0': 1, 'ARG1': 2, 'ARG1-DSP': 3, 'ARG2': 4, 'ARG3': 5, 'ARG4': 6, 'ARG5': 7, 'ARGA': 8, 'ARGM-ADJ': 9, 'ARGM-ADV': 10, 'ARGM-CAU': 11, 'ARGM-COM': 12, 'ARGM-CXN': 13, 'ARGM-DIR': 14, 'ARGM-DIS': 15, 'ARGM-EXT': 16, 'ARGM-GOL': 17, 'ARGM-LOC': 18, 'ARGM-LVB': 19, 'ARGM-MNR': 20, 'ARGM-MOD': 21, 'ARGM-NEG': 22, 'ARGM-PRD': 23, 'ARGM-PRP': 24, 'ARGM-PRR': 25, 'ARGM-REC': 26, 'ARGM-TMP': 27, 'C-ARG0': 28, 'C-ARG1': 29, 'C-ARG1-DSP': 30, 'C-ARG2': 31, 'C-ARG3': 32, 'C-ARG4': 33, 'C-ARGM-ADV': 34, 'C-ARGM-COM': 35, 'C-ARGM-CXN': 36, 'C-ARGM-DIR': 37, 'C-ARGM-EXT': 38, 'C-ARGM-GOL': 39, 'C-ARGM-LOC': 40, 'C-ARGM-MNR': 41, 'C-ARGM-PRP': 42, 'C-ARGM-PRR': 43, 'C-ARGM-TMP': 44, 'R-ARG0': 45, 'R-ARG1': 46, 'R-ARG2': 47, 'R-ARG3': 48, 'R-ARG4': 49, 'R-ARGM-ADJ': 50, 'R-ARGM-ADV': 51, 'R-ARGM-CAU': 52, 'R-ARGM-COM': 53, 'R-ARGM-DIR': 54, 'R-ARGM-GOL': 55, 'R-ARGM-LOC': 56, 'R-ARGM-MNR': 57, 'R-ARGM-TMP': 58, None: None}
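A mapping of this shape can be built in a few lines. The sketch below is an assumption about the general approach, not the implementation of get_label_mapping(): the name build_label_map is hypothetical, and it takes plain label lists instead of DataFrames to stay self-contained.

```python
def build_label_map(*label_sequences):
    """Map '_' to 0, the other labels to 1..n alphabetically, and None to None."""
    labels = {
        label
        for sequence in label_sequences
        for label in sequence
        if label is not None
    }
    labels.discard("_")  # '_' always gets index 0
    label_map = {"_": 0}
    for index, label in enumerate(sorted(labels), start=1):
        label_map[label] = index
    label_map[None] = None  # placeholder for [SEP] and the predicate form
    return label_map


toy_map = build_label_map(
    ["_", "ARG1", "ARG2", None, None],
    ["ARG0", "_", "ARGM-TMP", None, None],
)
print(toy_map)

# Applying the map to one label sequence:
mapped = [toy_map[label] for label in ["_", "ARG1", "ARG2", None, None]]
print(mapped)
```

Collecting labels from all three splits before numbering them keeps the integer IDs consistent between training, development, and test data.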
Next, apply the label mapping to the datasets, adding the column mapped_labels to the DataFrames. This column contains arrays of integers representing the labels, based on the label mapping.
To apply the label mapping, call map_labels_in_dataframe(), including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 | ✅️ | The DataFrame for which to convert the argument labels. |
| positional 2 | ✅ | The label mapping, created with get_label_mapping(). |
train_data = map_labels_in_dataframe(train_data, label_map)
dev_data = map_labels_in_dataframe(dev_data, label_map)
test_data = map_labels_in_dataframe(test_data, label_map)
As you can see, for each row in the DataFrame, the values in mapped_labels and argument correspond to the mapping in label_map:
test_data.head()
| | input_form | argument | mapped_labels |
|---|---|---|---|
| 0 | [What, if, Google, Morphed, Into, GoogleOS, ?,... | [_, _, ARG1, _, _, ARG2, _, None, None] | [0, 0, 2, 0, 0, 4, 0, None, None] |
| 1 | [What, if, Google, expanded, on, its, search, ... | [_, _, ARG0, _, _, _, _, _, _, _, _, _, _, _, ... | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ... |
| 2 | [(, And, ,, by, the, way, ,, is, anybody, else... | [_, _, _, _, _, ARGM-DIS, _, _, ARG1, _, _, _,... | [0, 0, 0, 0, 0, 15, 0, 0, 2, 0, 0, 0, 0, 4, 0,... |
| 3 | [(, And, ,, by, the, way, ,, is, anybody, else... | [_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | [This, BuzzMachine, post, argues, that, Google... | [_, ARG2, _, _, _, _, _, _, _, _, _, _, _, _, ... | [0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
Now that you have initialized and tested the tokenizer() and added mapped labels to the DataFrames, it’s time to tokenize (and pad) all sentences.
Since WordPiece tokenization potentially breaks words up into subword tokens, the tokens and their labels have to be re-aligned. The tokenize_and_align_labels() function you’ll call for this iterates over each token and determines the appropriate label based on the provided dataset.
Special tokens are assigned a label of -100 to indicate they should be ignored in the loss function. Labels for the first token of each word are set accordingly, while labels for subsequent tokens within the same word are determined based on the label_all_tokens flag.
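The alignment logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of tokenize_and_align_labels(): the name align_labels is hypothetical, and the word_ids list is written by hand to mimic what a fast tokenizer's BatchEncoding.word_ids() returns for the first test sentence.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True, ignore_index=-100):
    """Give every subword token a label based on the word it belongs to."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and the final [SEP] belong to no word.
            aligned.append(ignore_index)
        else:
            label = word_labels[word_id]
            if label is None:
                # The appended separator and predicate form are masked.
                aligned.append(ignore_index)
            elif word_id != previous or label_all_tokens:
                # First subword of a word, or every subword if requested.
                aligned.append(label)
            else:
                aligned.append(ignore_index)
        previous = word_id
    return aligned


# [CLS] what if google mor ##ph ##ed into google ##os ? [SEP] mor ##ph ##ed [SEP]
word_ids = [None, 0, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 8, None]
word_labels = [0, 0, 2, 0, 0, 4, 0, None, None]

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)
print(first_only)
```

With label_all_tokens=True, both google and ##os receive the label 4 (ARG2); with False, only google does and ##os is masked with -100.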
To tokenize the sentences and align the labels, call tokenize_and_align_labels(), including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (transformers AutoTokenizer) | ✅️ | The tokenizer for the pre-trained model. |
| positional 2 (DataFrame) | ✅ | The preprocessed dataset. |
| label_all_tokens (boolean) | Optional (defaults to True) | Whether all tokens should receive their own label, accounting for words split into subtokens. |
tokenized_test = tokenize_and_align_labels(tokenizer, test_data, label_all_tokens=True)
tokenized_train = tokenize_and_align_labels(tokenizer, train_data, label_all_tokens=True)
tokenized_dev = tokenize_and_align_labels(tokenizer, dev_data, label_all_tokens=True)
Now that you have tokenized all three datasets, let’s examine the result.
The tokenized_* datasets are of the type transformers.tokenization_utils_base.BatchEncoding and hold three arrays per sentence:
- input_ids: an array of token IDs for the tokenized sentence. Starts with the token ID for the [CLS] token, followed by the tokenized sentence, the [SEP] token, the predicate, and a final [SEP] token.
- attention_mask: an array representing the attention mask for the sentence (1 for real tokens, 0 for padding).
- labels: an array with numerical labels, aligned with the tokens.
Note: all three arrays are padded so that every sample per dataset is of equal length.
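As a rough sketch of what padding does to a single sample, the hypothetical helper below (pad_example is not part of utils) extends the three arrays to a common length. The padding values 0, 0, and -100 match the output printed in this section; the target length of 20 is chosen only to keep the example short.

```python
def pad_example(input_ids, labels, max_length, pad_id=0, ignore_index=-100):
    """Pad one tokenized sample to max_length (hypothetical helper)."""
    n_pad = max_length - len(input_ids)
    return {
        # [PAD] token IDs fill the remainder of the sequence.
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 for real tokens, 0 for padding, so attention skips the padding.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        # Padding positions are masked so they do not contribute to the loss.
        "labels": labels + [ignore_index] * n_pad,
    }


input_ids = [101, 2054, 2065, 8224, 22822, 8458, 2098, 2046, 8224, 2891,
             1029, 102, 22822, 8458, 2098, 102]
labels = [-100, 0, 0, 2, 0, 0, 0, 0, 4, 4, 0, -100, -100, -100, -100, -100]
padded = pad_example(input_ids, labels, max_length=20)
for key, value in padded.items():
    print(key, len(value), value[-6:])
```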
print(type(tokenized_test))
print(tokenized_test.keys())
print(tokenizer.convert_ids_to_tokens(tokenized_test["input_ids"][0]))
for key in tokenized_test.keys():
    print(f"{key}: {tokenized_test[key][0]}")
<class 'transformers.tokenization_utils_base.BatchEncoding'>
dict_keys(['input_ids', 'attention_mask', 'labels'])
['[CLS]', 'what', 'if', 'google', 'mor', '##ph', '##ed', 'into', 'google', '##os', '?', '[SEP]', 'mor', '##ph', '##ed', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
input_ids: tensor([ 101, 2054, 2065, 8224, 22822, 8458, 2098, 2046, 8224, 2891,
1029, 102, 22822, 8458, 2098, 102, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0])
attention_mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0])
labels: [-100, 0, 0, 2, 0, 0, 0, 0, 4, 4, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
To confirm that you have padded all sentences in the tokenized_test dataset to be of equal length, let’s check the length of all three arrays for the first 10 sentences:
for i in range(10):
    print(f"sentence {i}:", "input_ids:", len(tokenized_test["input_ids"][i]), "\tlabels:", len(tokenized_test["labels"][i]), "\tattention_mask:", len(tokenized_test["attention_mask"][i]))
sentence 0: input_ids: 97 labels: 97 attention_mask: 97
sentence 1: input_ids: 97 labels: 97 attention_mask: 97
sentence 2: input_ids: 97 labels: 97 attention_mask: 97
sentence 3: input_ids: 97 labels: 97 attention_mask: 97
sentence 4: input_ids: 97 labels: 97 attention_mask: 97
sentence 5: input_ids: 97 labels: 97 attention_mask: 97
sentence 6: input_ids: 97 labels: 97 attention_mask: 97
sentence 7: input_ids: 97 labels: 97 attention_mask: 97
sentence 8: input_ids: 97 labels: 97 attention_mask: 97
sentence 9: input_ids: 97 labels: 97 attention_mask: 97
Converting the tokenized data to datasets format with the function load_dataset
Now that you have tokenized and padded the sentences, and aligned the labels with the tokens, you’re ready to transform the tokenized datasets into Hugging Face’s datasets.arrow_dataset.Dataset.
To transform the tokenized datasets into Dataset objects, call the load_dataset() function, which calls the Dataset.from_dict() method, including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (transformers.tokenization_utils_base.BatchEncoding) | ✅️ | The tokenized dataset. |
dataset_train = load_dataset(tokenized_train)
dataset_dev = load_dataset(tokenized_dev)
dataset_test = load_dataset(tokenized_test)
Let’s print the type of the resulting dataset, to confirm the transformation into datasets.arrow_dataset.Dataset:
print(type(dataset_test))
<class 'datasets.arrow_dataset.Dataset'>
Step 4: Fine-tune the model
Finally, the sentences have been transformed from CoNLL-U Plus format to Hugging Face Dataset objects: it’s time to fine-tune BERT!
Fine-tuning a BERT model on the full dataset can be computationally expensive. To speed up experimentation, create subsets of the three datasets with 1000 randomly selected samples each:
small_train_dataset = dataset_train.shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset_dev.shuffle(seed=42).select(range(1000))
small_test_dataset = dataset_test.shuffle(seed=42).select(range(1000))
To map the numerical labels back to their string representations, you need to convert the label_map dictionary to a list of labels (as strings).
To convert the label_map to a list of labels (as strings), call the get_labels_from_map() function, including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (dictionary) | ✅️ | The dictionary mapping labels as strings to their numerical representation. |
label_list = get_labels_from_map(label_map)
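Conceptually, this conversion inverts the dictionary so that the position of each label in the list equals its integer ID. A minimal sketch, assuming the None entry is simply dropped (the name labels_from_map is hypothetical and may differ from how get_labels_from_map() works internally):

```python
def labels_from_map(label_map):
    """Return label strings ordered by their integer ID (hypothetical helper)."""
    pairs = [
        (index, label)
        for label, index in label_map.items()
        if label is not None  # the None placeholder has no integer ID
    ]
    return [label for index, label in sorted(pairs)]


toy_map = {"_": 0, "ARG0": 1, "ARG1": 2, "ARG2": 3, None: None}
toy_labels = labels_from_map(toy_map)
print(toy_labels)
print(toy_labels[2])  # label string for the predicted ID 2
```

With such a list, decoding a predicted ID is a single index lookup, which is how label_list is used in the list comprehensions later in this notebook.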
Next, load the pretrained DistilBERT model using the AutoModelForTokenClassification.from_pretrained() method from the transformers library, together with the model name (distilbert-base-uncased), and the TrainingArguments necessary for training.
To get the model, model name and TrainingArguments, call the load_srl_model() function, including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (string) | ✅️ | The model identifier. |
| positional 2 (list of strings) | ✅️ | The list of argument labels as strings. |
| batch_size (integer) | Optional (defaults to 16) | The batch size for training and inference. |
model, args = load_srl_model(model_id, label_list)
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/Users/krisstallenberg/anaconda3/envs/srl-with-bert/lib/python3.13/site-packages/transformers/training_args.py:1594: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
Now that you have a DistilBERT model, it’s time to fine-tune it for the task of semantic role labeling (SRL).
To fine-tune your model, instantiate a Trainer object from the transformers library, passing the model, args, tokenizer, and the datasets for training and evaluation. Then, call the Trainer.train() method to start the fine-tuning process. The cell below trains on the full dataset_train and evaluates on the full dataset_dev; to shorten the run, pass small_train_dataset and small_eval_dataset instead.
Note: this process may take up to several hours, depending on your hardware.
trainer = Trainer(
    model,
    args,
    train_dataset=dataset_train,
    eval_dataset=dataset_dev,
    tokenizer=tokenizer,
    compute_metrics=lambda p: compute_metrics(*p, label_list)
)
trainer.train()
/var/folders/d9/p0hwqj9x1sx30sdq622dyn1r0000gn/T/ipykernel_21922/2598124579.py:1: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = Trainer(
[2096/2096 12:45, Epoch 1/1]
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.169300 | 0.182447 | 0.304727 | 0.262154 | 0.270974 | 0.951440 |
/Users/krisstallenberg/anaconda3/envs/srl-with-bert/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
TrainOutput(global_step=2096, training_loss=0.2462493722675411, metrics={'train_runtime': 766.5975, 'train_samples_per_second': 43.727, 'train_steps_per_second': 2.734, 'total_flos': 1549860696773034.0, 'train_loss': 0.2462493722675411, 'epoch': 1.0})
Now that you have fine-tuned the model, let’s evaluate its performance on the eval_dataset that you set when constructing the Trainer instance.
To evaluate the fine-tuned model, call the Trainer.evaluate() method.
metrics = trainer.evaluate()
print(metrics)
{'eval_loss': 0.1824471801519394, 'eval_precision': 0.3047272131930285, 'eval_recall': 0.26215390976526504, 'eval_f1': 0.2709737056104543, 'eval_accuracy': 0.9514398703135809, 'eval_runtime': 60.303, 'eval_samples_per_second': 68.703, 'eval_steps_per_second': 4.295, 'epoch': 1.0}
Now that you have fine-tuned your DistilBERT model for semantic role labeling, and evaluated its performance on the development dataset, it’s time to infer the argument labels of the test dataset and compute a summary of the performance metrics.
First, call the Trainer.predict() method, passing the test dataset. The method returns a tuple consisting of the model’s predictions on the test dataset, the labels, and metrics.
To compute a summary of the model’s performance metrics on the test dataset, call the compute_metrics() function, including:
| Parameter name | Required | Parameter description |
|---|---|---|
| positional 1 (np.ndarray) | ✅️ | The array of predictions as returned from the Trainer.predict() method. |
| positional 2 (np.ndarray) | ✅️ | The array of argument labels as returned from the Trainer.predict() method. |
| positional 3 (list of strings) | ✅️ | The list of argument labels as strings. |
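To give an idea of what such a metrics function has to do, the sketch below takes the highest-scoring label per token, drops positions labeled -100, and computes accuracy and macro-averaged F1. It is an illustration only: the helper names are hypothetical, and macro averaging is an assumption, since the averaging mode used by compute_metrics() is not stated in this notebook.

```python
def argmax(scores):
    """Index of the highest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


def flatten_predictions(logits, labels, ignore_index=-100):
    """Collect gold and predicted label IDs for all non-masked tokens."""
    gold, pred = [], []
    for sentence_logits, sentence_labels in zip(logits, labels):
        for token_scores, label in zip(sentence_logits, sentence_labels):
            if label != ignore_index:
                gold.append(label)
                pred.append(argmax(token_scores))
    return gold, pred


def accuracy_and_macro_f1(gold, pred):
    """Token accuracy and the unweighted mean of per-label F1 scores."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    f1_scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        n_pred = sum(p == label for p in pred)
        n_gold = sum(g == label for g in gold)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        f1_scores.append(2 * precision * recall / (precision + recall) if tp else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# One toy sentence, two labels (0 = '_', 1 = an argument label).
toy_logits = [[[0.1, 0.9], [2.0, 0.1], [1.5, 0.2], [3.0, 0.0], [0.8, 0.3], [0.0, 1.0]]]
toy_gold = [[-100, 0, 0, 0, 1, -100]]
gold, pred = flatten_predictions(toy_logits, toy_gold)
accuracy, macro_f1 = accuracy_and_macro_f1(gold, pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

In the toy data the model predicts ‘_’ everywhere, which yields an accuracy of 0.75 but a macro F1 of about 0.43. If the scores are indeed macro-averaged, a similar effect could contribute to the gap between the high accuracy and the much lower F1 reported during training, because ‘_’ dominates the labels.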
predictions, labels, _ = trainer.predict(dataset_test)
argmax_predictions = np.argmax(predictions, axis=2)
Because the tokenizer operates at the subword level, the predictions are for subword tokens. However, we’re interested in word-level argument labels. To obtain those, iterate over all subwords and recombine them into words. Words that span multiple subwords have multiple predicted labels (one per subword), so these labels need to be reconciled into a single label per word. Several strategies are possible; in this notebook, we assign each word the label of its first subword.
# Keep only positions with a real label (-100 marks special, predicate, and padding tokens)
predicted_labels = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(argmax_predictions, labels)
]
gold_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(argmax_predictions, labels)
]
tokens = []
for i, (sentence_pred, sentence_gold) in enumerate(zip(predicted_labels, gold_labels)):
    subword_tokens = tokenizer.convert_ids_to_tokens(dataset_test[i]["input_ids"], skip_special_tokens=True)
    word_tokens = []
    word_labels_gold = []
    word_labels_pred = []
    current_word = ""
    current_gold_label = None
    current_pred_label = None
    # zip() stops at the shortest sequence, so the trailing predicate subwords (which have no label) are dropped
    for subword, gold, pred in zip(subword_tokens, sentence_gold, sentence_pred):
        if subword.startswith("##"):  # Continuation of a word
            current_word += subword[2:]
        else:  # New word starts
            if current_word:  # Save the previous word and its label
                word_tokens.append(current_word)
                word_labels_gold.append(current_gold_label)
                word_labels_pred.append(current_pred_label)
            current_word = subword  # Start new word
            current_gold_label = gold  # Take the first subword's label
            current_pred_label = pred  # Take the first subword's label
    if current_word:  # Save the last word of the sentence
        word_tokens.append(current_word)
        word_labels_gold.append(current_gold_label)
        word_labels_pred.append(current_pred_label)
    tokens.extend(zip(word_tokens, word_labels_gold, word_labels_pred))
# Create a DataFrame and write it to CSV
df = pd.DataFrame(tokens, columns=["word", "gold_label", "pred_label"])
df.to_csv("predictions.csv", index=False)
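For comparison, the sketch below contrasts the first-subword strategy with one possible alternative, a majority vote over all subword labels of a word. The majority vote is not used in this notebook; it is shown only to illustrate that the strategies can disagree. The helper name and the toy labels are assumptions.

```python
from collections import Counter


def merge_subword_labels(subwords, labels, strategy="first"):
    """Recombine WordPiece subwords into words with one label per word."""
    words, grouped = [], []
    for subword, label in zip(subwords, labels):
        if subword.startswith("##") and words:
            # Continuation piece: extend the current word and collect its label.
            words[-1] += subword[2:]
            grouped[-1].append(label)
        else:
            words.append(subword)
            grouped.append([label])
    if strategy == "first":
        merged = [group[0] for group in grouped]
    else:
        # Majority vote; ties go to the label that was seen first.
        merged = [Counter(group).most_common(1)[0][0] for group in grouped]
    return list(zip(words, merged))


subwords = ["google", "mor", "##ph", "##ed", "into", "google", "##os"]
toy_pred = ["ARG1", "_", "ARG2", "ARG2", "_", "ARG2", "ARG2"]
print(merge_subword_labels(subwords, toy_pred, strategy="first"))
print(merge_subword_labels(subwords, toy_pred, strategy="majority"))
```

In this toy prediction, ‘morphed’ receives ‘_’ under the first-subword strategy but ‘ARG2’ under the majority vote, so the choice of strategy can change word-level scores.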
Now that we have a dataframe with word-level labels, it’s time to evaluate the model’s performance.
Let’s create:
- Classification report
- Confusion matrix
def create_classification_report(gold_labels, predictions, label_set):
    """
    Create a classification report.
    :param gold_labels: The gold labels.
    :param predictions: The predictions.
    :param label_set: The set of labels.
    """
    # Build the report both as a dictionary and as printable text
    report_dict = classification_report(gold_labels, predictions, digits=3, target_names=label_set, output_dict=True, zero_division=0.0)
    report = classification_report(gold_labels, predictions, digits=3, target_names=label_set, zero_division=0.0)
    # Print the classification report.
    print(report)
    return report_dict, report, label_set

def plot_confusion_matrix(gold_labels, predictions, label_set):
    """
    Plot the confusion matrix using ConfusionMatrixDisplay.
    :param gold_labels: The gold labels.
    :param predictions: The predictions.
    :param label_set: The set of labels.
    """
    # Create a confusion matrix.
    cf_matrix = confusion_matrix(gold_labels, predictions, labels=label_set)
    # Create a display for the confusion matrix.
    display = ConfusionMatrixDisplay(confusion_matrix=cf_matrix, display_labels=label_set)
    # Create a plot for the confusion matrix.
    fig, ax = plt.subplots(figsize=(15, 15))
    # Display the confusion matrix.
    display.plot(ax=ax)
    plt.xticks(rotation=90)  # Rotate x-axis labels for readability
    plt.title("Confusion Matrix")
    plt.show()
    return cf_matrix

# Load your predictions CSV
df = pd.read_csv("predictions.csv")
# Get sorted unique labels
label_set = sorted(df["gold_label"].unique())
# Create a classification report and confusion matrix
report_dict, report, label_set = create_classification_report(df["gold_label"], df["pred_label"], label_set)
cf_matrix = plot_confusion_matrix(df["gold_label"], df["pred_label"], label_set)
              precision    recall  f1-score   support
        ARG0      0.807     0.828     0.818      1742
        ARG1      0.761     0.824     0.791      3307
    ARG1-DSP      0.000     0.000     0.000         4
        ARG2      0.640     0.635     0.637      1145
        ARG3      0.000     0.000     0.000        76
        ARG4      0.800     0.143     0.242        56
        ARG5      0.000     0.000     0.000         1
        ARGA      0.000     0.000     0.000         2
    ARGM-ADJ      0.675     0.704     0.689       230
    ARGM-ADV      0.661     0.425     0.518       496
    ARGM-CAU      0.000     0.000     0.000        46
    ARGM-COM      0.000     0.000     0.000        14
    ARGM-CXN      0.000     0.000     0.000        12
    ARGM-DIR      0.381     0.170     0.235        47
    ARGM-DIS      0.634     0.560     0.595       182
    ARGM-EXT      0.809     0.686     0.742       105
    ARGM-GOL      0.000     0.000     0.000        24
    ARGM-LOC      0.507     0.493     0.500       219
    ARGM-LVB      0.721     0.638     0.677        69
    ARGM-MNR      0.490     0.329     0.394       152
    ARGM-MOD      0.895     0.949     0.921       468
    ARGM-NEG      0.846     0.952     0.896       392
    ARGM-PRD      0.000     0.000     0.000        44
    ARGM-PRP      0.509     0.377     0.433        77
    ARGM-PRR      0.000     0.000     0.000        69
    ARGM-TMP      0.700     0.730     0.715       571
      C-ARG0      0.000     0.000     0.000         3
      C-ARG1      0.000     0.000     0.000        52
  C-ARG1-DSP      0.000     0.000     0.000         1
      C-ARG2      0.000     0.000     0.000         7
      C-ARG3      0.000     0.000     0.000         2
  C-ARGM-CXN      0.000     0.000     0.000         5
  C-ARGM-LOC      0.000     0.000     0.000         1
      R-ARG0      0.828     0.791     0.809        67
      R-ARG1      0.722     0.750     0.736        52
      R-ARG2      0.000     0.000     0.000         1
  R-ARGM-ADJ      0.000     0.000     0.000         1
  R-ARGM-ADV      0.000     0.000     0.000         1
  R-ARGM-DIR      0.000     0.000     0.000         1
  R-ARGM-LOC      0.000     0.000     0.000         9
  R-ARGM-MNR      0.000     0.000     0.000         8
  R-ARGM-TMP      0.000     0.000     0.000         2
           _      0.980     0.984     0.982     78121
    accuracy                          0.955     87884
   macro avg      0.311     0.278     0.287     87884
weighted avg      0.950     0.955     0.952     87884
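The large gap between the macro average (F1 0.287) and the weighted average (F1 0.952) in the report above reflects class imbalance: the macro average gives every label equal weight, while the weighted average weights each label by its support, so the dominant `_` class drives it. The sketch below reproduces the effect on made-up per-class scores. The numbers are illustrative only and are not taken from the model.

```python
# Per-class F1 and support for a toy, heavily imbalanced label set.
# These values are invented for illustration.
per_class = {
    "_":    {"f1": 0.98, "support": 900},
    "ARG0": {"f1": 0.80, "support": 80},
    "ARG3": {"f1": 0.00, "support": 20},
}

total_support = sum(c["support"] for c in per_class.values())

# Macro average: plain mean over classes, ignoring how frequent each one is.
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# Weighted average: mean over classes, weighted by support.
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total_support

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

A single rare label with an F1 of zero pulls the macro average down sharply but barely moves the weighted one, which is why both are worth reading for a label set like this.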
Next, save the tokenizer, trainer, and model to disk using their built-in methods. Pass each method the directory in which to store the object and its configuration.
This lets you reload the tokenizer and model later with their built-in from_pretrained() methods.
# Save the tokenizer, trainer, and model:
tokenizer.save_pretrained("tokenizer.save_pretrained.distillbert-base-uncased-finetuned-srl")
trainer.save_model("trainer.save_model.distillbert-base-uncased-finetuned-srl")
model.save_pretrained("model.save_pretrained.distillbert-base-uncased-finetuned-srl")
Finally, let’s create a standalone function that classifies individual sentences. We’ll use it to run a CheckList experiment on this model.
def create_input_sequence(sentence, predicate_position, argument_labels):
    """
    Creates a DataFrame with columns 'input_form' and 'argument' for a single sentence.
    Parameters:
    - sentence (list of str): The words in the sentence.
    - predicate_position (list of int): One-hot encoding indicating the predicate position.
    - argument_labels (list of str): The argument labels for each token in the sentence.
    Returns:
    - DataFrame with two columns: 'input_form' and 'argument'.
    """
    # Ensure input lengths match
    assert len(sentence) == len(predicate_position) == len(argument_labels), "Input lists must have the same length."
    # Determine the predicate form based on the one-hot encoding
    predicate_index = predicate_position.index(1)
    predicate_form = sentence[predicate_index]
    # Append special tokens to input_form and argument lists
    input_form = sentence + ['[SEP]', predicate_form]
    argument = argument_labels + [None, None]
    # Create a DataFrame
    df = pd.DataFrame([{"input_form": input_form, "argument": argument}])
    return df
def map_labels_to_words(predicted_labels, gold_labels, dataset):
    """
    Map the predicted and gold labels to the corresponding words in the sentence.
    Relies on the global `tokenizer` to convert input ids back to subword tokens.
    Args:
        predicted_labels (list): List of predicted labels for each token.
        gold_labels (list): List of gold labels for each token.
        dataset: The dataset containing the input sequences.
    Returns:
        A DataFrame with the columns 'word', 'gold_label' and 'predicted_label'.
    """
    tokens = []
    for i, (sentence_preds, sentence_gold) in enumerate(zip(predicted_labels, gold_labels)):
        subword_tokens = tokenizer.convert_ids_to_tokens(dataset[i]["input_ids"], skip_special_tokens=True)
        word_tokens = []
        word_labels_gold = []
        word_labels_pred = []
        current_word = ""
        current_gold_label = None
        current_pred_label = None
        for subword, gold, pred in zip(subword_tokens, sentence_gold, sentence_preds):
            if subword.startswith("##"):  # Continuation of a word
                current_word += subword[2:]  # Append the subword to the current word
            else:  # New word starts
                if current_word:  # Save the previous word and its labels
                    word_tokens.append(current_word)
                    word_labels_gold.append(current_gold_label)
                    word_labels_pred.append(current_pred_label)
                current_word = subword  # Start new word
                current_gold_label = gold  # Take the first subword's label
                current_pred_label = pred  # Take the first subword's label
        if current_word:  # Save the last word of the sentence
            word_tokens.append(current_word)
            word_labels_gold.append(current_gold_label)
            word_labels_pred.append(current_pred_label)
        tokens.extend(zip(word_tokens, word_labels_gold, word_labels_pred))
    # Create a dataframe with one row per word
    df = pd.DataFrame(tokens, columns=["word", "gold_label", "predicted_label"])
    return df
def classify_sentence_bert(sentence, predicate_location, argument_labels, predicate_sense, trainer, tokenizer):
    """
    The standalone function that takes a sentence and predicts the argument labels, using DistilBERT.
    Args:
        sentence (list): A list of words.
        predicate_location (list): A one-hot vector, indicating the location of the predicate.
        argument_labels (list): A list of argument labels.
        predicate_sense (str): The sense label of the predicate. Unused here; kept to match the interface of the logistic regression model.
        trainer: The HuggingFace Trainer instance to predict labels with.
        tokenizer: The HuggingFace Tokenizer to tokenize input sequences with.
    Returns:
        A DataFrame with one row per word and the columns 'word', 'gold_label' and 'predicted_label'.
    """
    # Build the model input: the sentence followed by [SEP] and the predicate form.
    inference_input = create_input_sequence(sentence, predicate_location, argument_labels)
    label_map = {'_': 0, 'ARG0': 1, 'ARG1': 2, 'ARG1-DSP': 3, 'ARG2': 4, 'ARG3': 5, 'ARG4': 6, 'ARG5': 7, 'ARGA': 8, 'ARGM-ADJ': 9, 'ARGM-ADV': 10, 'ARGM-CAU': 11, 'ARGM-COM': 12, 'ARGM-CXN': 13, 'ARGM-DIR': 14, 'ARGM-DIS': 15, 'ARGM-EXT': 16, 'ARGM-GOL': 17, 'ARGM-LOC': 18, 'ARGM-LVB': 19, 'ARGM-MNR': 20, 'ARGM-MOD': 21, 'ARGM-NEG': 22, 'ARGM-PRD': 23, 'ARGM-PRP': 24, 'ARGM-PRR': 25, 'ARGM-REC': 26, 'ARGM-TMP': 27, 'C-ARG0': 28, 'C-ARG1': 29, 'C-ARG1-DSP': 30, 'C-ARG2': 31, 'C-ARG3': 32, 'C-ARG4': 33, 'C-ARGM-ADV': 34, 'C-ARGM-COM': 35, 'C-ARGM-CXN': 36, 'C-ARGM-DIR': 37, 'C-ARGM-EXT': 38, 'C-ARGM-GOL': 39, 'C-ARGM-LOC': 40, 'C-ARGM-MNR': 41, 'C-ARGM-PRP': 42, 'C-ARGM-PRR': 43, 'C-ARGM-TMP': 44, 'R-ARG0': 45, 'R-ARG1': 46, 'R-ARG2': 47, 'R-ARG3': 48, 'R-ARG4': 49, 'R-ARGM-ADJ': 50, 'R-ARGM-ADV': 51, 'R-ARGM-CAU': 52, 'R-ARGM-COM': 53, 'R-ARGM-DIR': 54, 'R-ARGM-GOL': 55, 'R-ARGM-LOC': 56, 'R-ARGM-MNR': 57, 'R-ARGM-TMP': 58, None: None}
    inference_data = map_labels_in_dataframe(inference_input, label_map)
    tokenized_input = tokenize_and_align_labels(tokenizer, inference_data, label_all_tokens=True)
    dataset_inference_sample = load_dataset(tokenized_input)
    label_list = get_labels_from_map(label_map)
    # Predict, then take the highest-scoring label id at each position.
    predictions, labels, _ = trainer.predict(dataset_inference_sample)
    argmax_predictions = np.argmax(predictions, axis=2)
    # Skip positions labeled -100 and map the remaining ids to label strings.
    predicted_labels = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(argmax_predictions, labels)
    ]
    gold_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]
    results = map_labels_to_words(predicted_labels, gold_labels, dataset_inference_sample)
    return results
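The decoding step inside `classify_sentence_bert` can be illustrated without a model. The sketch below assumes, as the filter in the function does, that positions labeled `-100` are the ones to skip (the usual Hugging Face convention for positions ignored by the loss, such as special tokens). The scores and the three-label list are invented, and a pure-Python argmax stands in for `np.argmax`.

```python
IGNORE_INDEX = -100  # label id marking positions to skip

label_list = ["_", "ARG0", "ARG1"]  # toy label inventory, not the real one

# One sentence, five positions, one score per label (invented numbers).
scores = [
    [0.1, 0.2, 0.3],  # [CLS] -> ignored
    [2.0, 0.1, 0.0],  # the
    [0.2, 3.1, 0.4],  # boy
    [1.5, 0.3, 0.2],  # ran
    [0.0, 0.0, 0.9],  # [SEP] -> ignored
]
gold_ids = [IGNORE_INDEX, 0, 1, 0, IGNORE_INDEX]


def argmax(row):
    """Index of the largest value in a list (stand-in for np.argmax)."""
    return max(range(len(row)), key=row.__getitem__)


predicted_ids = [argmax(row) for row in scores]

# Keep only positions with a real gold label, then map ids to label strings.
predicted = [label_list[p] for p, g in zip(predicted_ids, gold_ids) if g != IGNORE_INDEX]
gold = [label_list[g] for g in gold_ids if g != IGNORE_INDEX]

print(predicted)
print(gold)
```

Filtering on the gold ids rather than on the predictions keeps both lists the same length, which the word-level mapping relies on.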
Let’s predict the role of “boy” for the first predicate (“ran”), which should be the proto-agent role (ARG0).
sentence = ["The", "boy", "ran", "and", "the", "man", "fell", "."]
predicate_location = [0, 0, 1, 0, 0, 0, 0, 0]
argument_labels = ['_', 'ARG0', '_', '_', '_', '_', '_', '_']
predicate_sense = "run.01"
tokenizer = AutoTokenizer.from_pretrained("tokenizer.save_pretrained.distillbert-base-uncased-finetuned-srl")
model = AutoModelForTokenClassification.from_pretrained("model.save_pretrained.distillbert-base-uncased-finetuned-srl")
training_args = TrainingArguments(output_dir="trainer.save_model.distillbert-base-uncased-finetuned-srl")
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer
)
result = classify_sentence_bert(sentence, predicate_location, argument_labels, predicate_sense, trainer, tokenizer)
print(result)
   word gold_label predicted_label
0   the          _               _
1   boy       ARG0            ARG0
2   ran          _               _
3   and          _               _
4   the          _               _
5   man          _               _
6  fell          _               _
7     .          _               _
Next, let’s predict the role of “man” for the second predicate (“fell”), which should be the proto-patient role (ARG1).
sentence = ["The", "boy", "ran", "and", "the", "man", "fell", "."]
predicate_location = [0, 0, 0, 0, 0, 0, 1, 0]
argument_labels = ['_', '_', '_', '_', '_', 'ARG1', '_', '_']
predicate_sense = "fall.01"
result = classify_sentence_bert(sentence, predicate_location, argument_labels, predicate_sense, trainer, tokenizer)
print(result)
   word gold_label predicted_label
0   the          _               _
1   boy          _               _
2   ran          _               _
3   and          _               _
4   the          _               _
5   man       ARG1            ARG1
6  fell          _               _
7     .          _               _
It works! The standalone function uses the predicate position to look up the predicate form, which it appends to the sentence to encode the input: [CLS] The boy ran and the man fell. [SEP] fell [SEP].
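As a rough illustration of that encoding, the sketch below builds the same kind of string by hand. `encode_with_predicate` is a hypothetical helper, and the special tokens are written out literally here; in the notebook the outer `[CLS]` and `[SEP]` are presumably added by the tokenizer.

```python
def encode_with_predicate(sentence, predicate_position):
    """Return the sentence followed by [SEP] and the predicate form."""
    # The one-hot vector marks exactly one token as the predicate.
    predicate_form = sentence[predicate_position.index(1)]
    tokens = ["[CLS]"] + sentence + ["[SEP]", predicate_form, "[SEP]"]
    return " ".join(tokens)


sentence = ["The", "boy", "ran", "and", "the", "man", "fell", "."]

encoded_ran = encode_with_predicate(sentence, [0, 0, 1, 0, 0, 0, 0, 0])
encoded_fell = encode_with_predicate(sentence, [0, 0, 0, 0, 0, 0, 1, 0])

print(encoded_ran)
print(encoded_fell)
```

Because the two inputs differ only in the token after the first `[SEP]`, that suffix is what tells the model which predicate's arguments to label.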