
Semantic role labeling with DistilBERT

In this notebook, we’ll fine-tune DistilBERT for the task of Semantic Role Labeling (SRL), using the English Universal PropBank 1.0 dataset.

The files resulting from the fine-tuning process are available here:

Import libraries

import time
import pandas as pd
import transformers
import numpy as np
import torch
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset
from utils import read_data_as_sentence,map_labels_in_dataframe,tokenize_and_align_labels,get_label_mapping,get_labels_from_map,load_srl_model,load_dataset,compute_metrics,write_predictions_to_csv,compute_evaluation_metrics_from_csv, print_sentences
from bert_srl import main, define_args

Step 1: Preprocess data

Unlike traditional token labeling methods, which assign labels to individual words in isolation, BERT performs sequence labeling. This means BERT assigns labels to individual tokens while taking the full sentence context into consideration.

The English Universal PropBank 1.0 dataset is structured in CoNLL-U Plus format, in which each line represents an individual token. So before you can train the model, you need to extract sentences and labels from the datasets, and preprocess the sentences by removing non-argument labels.
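To make the token-per-line layout concrete, here is a minimal sketch of how token lines could be grouped into sentences. It is not the implementation of read_data_as_sentence(): the function name group_into_sentences and the three-column sample are hypothetical, and the sketch only assumes the usual CoNLL-U conventions (tab-separated columns, comment lines starting with #, and a blank line between sentences).

```python
def group_into_sentences(lines):
    """Group token lines into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):   # sentence-level comment, e.g. "# text = ..."
            continue
        if not line.strip():       # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:                    # the input may not end with a blank line
        sentences.append(current)
    return sentences


# Hypothetical, heavily simplified sample (real files have many more columns).
sample = [
    "# text = Google morphed",
    "1\tGoogle\tARG1",
    "2\tmorphed\t_",
    "",
    "1\tHello\t_",
]
sentences = group_into_sentences(sample)
print(len(sentences))   # 2
print(sentences[0])     # [['1', 'Google', 'ARG1'], ['2', 'morphed', '_']]
```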

To preprocess the datasets and save the resulting DataFrame to a file, call the read_data_as_sentence() function, including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (string) | ✅️ | The filepath for the CoNLL-U dataset. |
| positional 2 (string) | | The filepath to write the preprocessed DataFrame to. |

train_data = read_data_as_sentence('data/en_ewt-up-train.conllu', 'data/en_ewt-up-train.preprocessed.csv')
dev_data = read_data_as_sentence('data/en_ewt-up-dev.conllu', 'data/en_ewt-up-dev.preprocessed.csv')
test_data = read_data_as_sentence('data/en_ewt-up-test.conllu', 'data/en_ewt-up-test.preprocessed.csv')

The read_data_as_sentence() function returns DataFrames, where each row represents a sentence from the dataset passed to the function. Each sentence has been expanded based on its predicates, resulting in multiple copies of the same sentence, each focused on a different predicate.

The DataFrame has two columns:

  • input_form: a list of strings, where each string represents a word in the sentence, followed by two additional elements:
    1. A special token ([SEP]), which separates the words of the sentence from the predicate form.
    2. The predicate form, to which the argument values in the same row of the DataFrame apply.
  • argument: a list of strings, representing the arguments associated with each word in the sentence. The length of each list is equal to the number of words in the sentence, plus two additional elements, for the special token and predicate form. The arguments match the predicate appended to the input_form for the same row in the DataFrame.
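The expansion described above can be sketched as follows. This is an illustration under assumptions, not the code behind read_data_as_sentence(): the function expand_sentence and its arguments are hypothetical, and the sketch assumes you already have one list of argument labels per predicate.

```python
def expand_sentence(words, predicate_forms, argument_columns):
    """Return one (input_form, argument) row per predicate in the sentence.

    words:            the word forms of the sentence
    predicate_forms:  the form of every predicate in the sentence
    argument_columns: one list of labels per predicate, parallel to `words`
    """
    rows = []
    for predicate_form, labels in zip(predicate_forms, argument_columns):
        # Append the separator and the predicate form to the words ...
        input_form = words + ["[SEP]", predicate_form]
        # ... and pad the labels with None for these two extra elements.
        argument = labels + [None, None]
        rows.append((input_form, argument))
    return rows


words = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"]
rows = expand_sentence(
    words,
    predicate_forms=["Morphed"],
    argument_columns=[["_", "_", "ARG1", "_", "_", "ARG2", "_"]],
)
input_form, argument = rows[0]
print(input_form)
print(argument)
```

A sentence with two predicates would produce two rows that share the same words but differ in the appended predicate form and in the argument labels.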

Explore the DataFrame

Before you continue to tokenize the sentences and fine-tune the BERT model, it’s time to get more familiar with our data.

To explore the DataFrame, start by printing a summary of the preprocessed DataFrame:

print(test_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3971 entries, 0 to 3970
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   input_form  3971 non-null   object
 1   argument    3971 non-null   object
dtypes: object(2)
memory usage: 62.2+ KB
None

The Non-Null count for both columns should match, indicating there are as many lists of input_form values as there are lists of argument values, namely one for each sentence.

Next, print the words and their argument labels for the first 5 sentences of the test dataset:

print_sentences(test_data[:5])
form: What            argument: _
form: if              argument: _
form: Google          argument: ARG1
form: Morphed         argument: _
form: Into            argument: _
form: GoogleOS        argument: ARG2
form: ?               argument: _
----------------------------------------
form: [SEP]           argument: None
form: Morphed         argument: None

========================================

form: What            argument: _
form: if              argument: _
form: Google          argument: ARG0
form: expanded        argument: _
form: on              argument: _
form: its             argument: _
form: search          argument: _
form: -               argument: _
form: engine          argument: _
form: (               argument: _
form: and             argument: _
form: now             argument: _
form: e-mail          argument: _
form: )               argument: _
form: wares           argument: ARG1
form: into            argument: _
form: a               argument: _
form: full            argument: _
form: -               argument: _
form: fledged         argument: _
form: operating       argument: _
form: system          argument: ARG4
form: ?               argument: _
----------------------------------------
form: [SEP]           argument: None
form: expanded        argument: None

========================================

form: (               argument: _
form: And             argument: _
form: ,               argument: _
form: by              argument: _
form: the             argument: _
form: way             argument: ARGM-DIS
form: ,               argument: _
form: is              argument: _
form: anybody         argument: ARG1
form: else            argument: _
form: just            argument: _
form: a               argument: _
form: little          argument: _
form: nostalgic       argument: ARG2
form: for             argument: _
form: the             argument: _
form: days            argument: _
form: when            argument: _
form: that            argument: _
form: was             argument: _
form: a               argument: _
form: good            argument: _
form: thing           argument: _
form: ?               argument: _
form: )               argument: _
----------------------------------------
form: [SEP]           argument: None
form: is              argument: None

========================================

form: (               argument: _
form: And             argument: _
form: ,               argument: _
form: by              argument: _
form: the             argument: _
form: way             argument: _
form: ,               argument: _
form: is              argument: _
form: anybody         argument: _
form: else            argument: _
form: just            argument: _
form: a               argument: _
form: little          argument: _
form: nostalgic       argument: _
form: for             argument: _
form: the             argument: _
form: days            argument: ARGM-TMP
form: when            argument: R-ARGM-TMP
form: that            argument: ARG1
form: was             argument: _
form: a               argument: _
form: good            argument: _
form: thing           argument: ARG2
form: ?               argument: _
form: )               argument: _
----------------------------------------
form: [SEP]           argument: None
form: is              argument: None

========================================

form: This            argument: _
form: BuzzMachine     argument: ARG2
form: post            argument: _
form: argues          argument: _
form: that            argument: _
form: Google          argument: _
form: 's              argument: _
form: rush            argument: _
form: toward          argument: _
form: ubiquity        argument: _
form: might           argument: _
form: backfire        argument: _
form: --              argument: _
form: which           argument: _
form: we              argument: _
form: 've             argument: _
form: all             argument: _
form: heard           argument: _
form: before          argument: _
form: ,               argument: _
form: but             argument: _
form: it              argument: _
form: 's              argument: _
form: particularly    argument: _
form: well            argument: _
form: -               argument: _
form: put             argument: _
form: in              argument: _
form: this            argument: _
form: post            argument: _
form: .               argument: _
----------------------------------------
form: [SEP]           argument: None
form: post            argument: None

========================================

As you can see, the sequence of word forms runs parallel to the sequence of argument labels. This means that for every index of input_form, the same index of argument gives its argument label.

Argument labels are:

  • ’_’ for tokens that are not an argument (in the current predicate sense of the sentence).
  • The token’s respective Propbank label for tokens that are an argument, e.g. ARG1
  • None for the special separator token ([SEP]) and the predicate token that follows the separator.
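Because the two lists run in parallel, reading off the arguments of a row is a matter of zipping them together. The helper below is a hypothetical sketch (extract_arguments is not part of utils) and assumes the row layout described above: words, then [SEP], then the predicate form.

```python
def extract_arguments(input_form, argument):
    """Return the predicate form and the (word, label) pairs that are arguments."""
    sep = input_form.index("[SEP]")      # position of the separator token
    predicate = input_form[sep + 1]      # the predicate form follows the separator
    roles = [
        (word, label)
        for word, label in zip(input_form[:sep], argument[:sep])
        if label != "_"                  # skip tokens that are not an argument
    ]
    return predicate, roles


input_form = ["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?", "[SEP]", "Morphed"]
argument = ["_", "_", "ARG1", "_", "_", "ARG2", "_", None, None]
predicate, roles = extract_arguments(input_form, argument)
print(predicate)  # Morphed
print(roles)      # [('Google', 'ARG1'), ('GoogleOS', 'ARG2')]
```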

For example, in the first sentence of the test data printed above (“What if Google Morphed Into GoogleOS?”), the predicate ‘Morphed’ evokes the frame morph.01. The frame’s arguments are:

  • ARG0-PAG: causer of transformation
  • ARG1-PPT: thing changing
  • ARG2-PRD: end state
  • ARG3-VSP: start state

In this example, the ARG1 label is assigned to ‘Google’, and the ARG2 label is assigned to ‘GoogleOS’, which indicates ‘Google’ is the thing that is changing and ‘GoogleOS’ is its end state.

Step 2: Initialize a tokenizer

Now that you have extracted sentences and labels from the datasets, you need to prepare the sentences for the BERT model by tokenizing them.

Use Hugging Face’s AutoTokenizer to construct a DistilBERT tokenizer, which is based on the WordPiece algorithm.

# Set the model ID to use
model_id = "distilbert-base-uncased"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Check the assertion that the tokenizer is an instance of transformers.PreTrainedTokenizerFast
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

To test the tokenizer, tokenize the first sentence of the test data, including:

  • add_special_tokens set to True to add a [CLS] token to the start of every sentence and a [SEP] token to the end.
  • is_split_into_words set to True because the sentence is already split into words (based on the Universal PropBank 1.0 dataset).

# Tokenize the first example in the test data
example = test_data['input_form'][0]
tokenized_input = tokenizer(example,add_special_tokens=True, is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

# Print the example tokens and their corresponding IDs
for token, token_id in zip(tokens, tokenized_input["input_ids"]):
    print(f"{token:>10} {token_id}")
     [CLS] 101
      what 2054
        if 2065
    google 8224
       mor 22822
      ##ph 8458
      ##ed 2098
      into 2046
    google 8224
      ##os 2891
         ? 1029
     [SEP] 102
       mor 22822
      ##ph 8458
      ##ed 2098
     [SEP] 102

You’ve successfully tokenized the sample sentence, splitting words up into subword tokens and fetching their token IDs from DistilBERT’s vocabulary.

Note: the special tokens [CLS] and [SEP] are mapped to the IDs 101 and 102. These are fixed entries in DistilBERT’s vocabulary: [CLS] marks the start of the input and [SEP] marks a segment boundary.
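To see why ‘Morphed’ ends up as mor, ##ph and ##ed, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece uses when splitting a single word. The toy vocabulary is an assumption chosen to reproduce the output above; the real tokenizer also lowercases the input (for the uncased model) and handles punctuation, which this sketch leaves out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into subword pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:                 # pieces inside a word carry the ## prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                      # try a shorter candidate
        if piece is None:                 # no piece matches: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption), just large enough for this example.
toy_vocab = {"what", "google", "mor", "##ph", "##ed", "##os"}
print(wordpiece("morphed", toy_vocab))   # ['mor', '##ph', '##ed']
print(wordpiece("googleos", toy_vocab))  # ['google', '##os']
```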

Step 3: Prepare the input for training

Before training the model, map the labels in the datasets to numerical values. This ensures consistency and facilitates the training process.

To get the label mapping, call get_label_mapping(), including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (DataFrame) | ✅️ | The training dataset for which to extract the label mapping. |
| positional 2 (DataFrame) | | The test dataset for which to extract the label mapping. |
| positional 3 (DataFrame) | | The dev dataset for which to extract the label mapping. |

label_map = get_label_mapping(train_data, test_data, dev_data)

The get_label_mapping() function returns an alphabetically-ordered dictionary mapping:

  • _ to 0.
  • String labels to integers, e.g. ARG0 to 1.
  • None to None, to preserve the labels for special tokens and predicates. (You will replace None with -100 later to mask these tokens from being labeled.)
print(label_map)
{'_': 0, 'ARG0': 1, 'ARG1': 2, 'ARG1-DSP': 3, 'ARG2': 4, 'ARG3': 5, 'ARG4': 6, 'ARG5': 7, 'ARGA': 8, 'ARGM-ADJ': 9, 'ARGM-ADV': 10, 'ARGM-CAU': 11, 'ARGM-COM': 12, 'ARGM-CXN': 13, 'ARGM-DIR': 14, 'ARGM-DIS': 15, 'ARGM-EXT': 16, 'ARGM-GOL': 17, 'ARGM-LOC': 18, 'ARGM-LVB': 19, 'ARGM-MNR': 20, 'ARGM-MOD': 21, 'ARGM-NEG': 22, 'ARGM-PRD': 23, 'ARGM-PRP': 24, 'ARGM-PRR': 25, 'ARGM-REC': 26, 'ARGM-TMP': 27, 'C-ARG0': 28, 'C-ARG1': 29, 'C-ARG1-DSP': 30, 'C-ARG2': 31, 'C-ARG3': 32, 'C-ARG4': 33, 'C-ARGM-ADV': 34, 'C-ARGM-COM': 35, 'C-ARGM-CXN': 36, 'C-ARGM-DIR': 37, 'C-ARGM-EXT': 38, 'C-ARGM-GOL': 39, 'C-ARGM-LOC': 40, 'C-ARGM-MNR': 41, 'C-ARGM-PRP': 42, 'C-ARGM-PRR': 43, 'C-ARGM-TMP': 44, 'R-ARG0': 45, 'R-ARG1': 46, 'R-ARG2': 47, 'R-ARG3': 48, 'R-ARG4': 49, 'R-ARGM-ADJ': 50, 'R-ARGM-ADV': 51, 'R-ARGM-CAU': 52, 'R-ARGM-COM': 53, 'R-ARGM-DIR': 54, 'R-ARGM-GOL': 55, 'R-ARGM-LOC': 56, 'R-ARGM-MNR': 57, 'R-ARGM-TMP': 58, None: None}
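A mapping with this shape could be built along the following lines. This is a sketch, not the code of get_label_mapping(): build_label_mapping is a hypothetical name, and the sketch assumes each dataset is given as a list of label lists. Note that `_` has to be placed first explicitly, because it sorts after uppercase letters in Python’s default string ordering.

```python
def build_label_mapping(*datasets):
    """Map '_' to 0, the remaining labels to 1..n alphabetically, and None to None."""
    labels = set()
    for label_lists in datasets:
        for label_list in label_lists:
            labels.update(label_list)

    # Sort only the real argument labels; '_' and None are handled separately.
    named = sorted(label for label in labels if label not in ("_", None))

    mapping = {"_": 0}
    for index, label in enumerate(named, start=1):
        mapping[label] = index
    mapping[None] = None   # kept as None here, replaced by -100 during alignment
    return mapping


train = [["_", "ARG1", "_", None, None], ["ARG0", "_", "ARGM-TMP", None, None]]
test = [["_", "_", "ARG1", "_", "_", "ARG2", "_", None, None]]
mapping = build_label_mapping(train, test)
print(mapping)

# Applying the mapping to one list of labels:
mapped = [mapping[label] for label in test[0]]
print(mapped)  # [0, 0, 2, 0, 0, 3, 0, None, None]
```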

Next, apply the label mapping to the datasets, adding the column mapped_labels to the DataFrames. This column contains arrays of integers representing the labels, based on the label mapping.

To apply the label mapping, call map_labels_in_dataframe(), including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 | ✅️ | The DataFrame for which to convert the argument labels. |
| positional 2 | | The label mapping, created with get_label_mapping(). |

train_data = map_labels_in_dataframe(train_data, label_map)
dev_data = map_labels_in_dataframe(dev_data, label_map)
test_data = map_labels_in_dataframe(test_data, label_map)

As you can see, for each row in the DataFrame, the values in mapped_labels and argument correspond according to the mapping in label_map:

test_data.head()
| | input_form | argument | mapped_labels |
| --- | --- | --- | --- |
| 0 | [What, if, Google, Morphed, Into, GoogleOS, ?,... | [_, _, ARG1, _, _, ARG2, _, None, None] | [0, 0, 2, 0, 0, 4, 0, None, None] |
| 1 | [What, if, Google, expanded, on, its, search, ... | [_, _, ARG0, _, _, _, _, _, _, _, _, _, _, _, ... | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ... |
| 2 | [(, And, ,, by, the, way, ,, is, anybody, else... | [_, _, _, _, _, ARGM-DIS, _, _, ARG1, _, _, _,... | [0, 0, 0, 0, 0, 15, 0, 0, 2, 0, 0, 0, 0, 4, 0,... |
| 3 | [(, And, ,, by, the, way, ,, is, anybody, else... | [_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | [This, BuzzMachine, post, argues, that, Google... | [_, ARG2, _, _, _, _, _, _, _, _, _, _, _, _, ... | [0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |

Now that you have initialized and tested the tokenizer() and added mapped labels to the DataFrames, it’s time to tokenize (and pad) all sentences.

Since WordPiece tokenization potentially breaks words up into subword tokens, the tokens and their labels have to be re-aligned. The tokenize_and_align_labels() function you’ll call for this iterates over each token and determines the appropriate label based on the provided dataset.

Special tokens are assigned a label of -100 to indicate they should be ignored in the loss function. Labels for the first token of each word are set accordingly, while labels for subsequent tokens within the same word are determined based on the label_all_tokens flag.
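The alignment rule just described can be sketched in a few lines. This is not the code of tokenize_and_align_labels(): align_labels is a hypothetical helper, and the word_ids list below is written by hand to resemble what a fast tokenizer’s word_ids() method returns (None for special tokens, otherwise the index of the word a subword belongs to).

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to one label per subword token."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:                      # special token such as [CLS]
            aligned.append(IGNORE_INDEX)
        else:
            label = word_labels[word_id]
            if label is None:                    # separator or predicate form
                aligned.append(IGNORE_INDEX)
            elif word_id != previous_word or label_all_tokens:
                aligned.append(label)            # first subword, or label every subword
            else:
                aligned.append(IGNORE_INDEX)     # later subword, not labeled
        previous_word = word_id
    return aligned


# [CLS] what if google mor ##ph ##ed into google ##os ? [SEP] mor ##ph ##ed [SEP]
word_ids = [None, 0, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 8, 8, None]
word_labels = [0, 0, 2, 0, 0, 4, 0, None, None]

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)  # [-100, 0, 0, 2, 0, 0, 0, 0, 4, 4, 0, -100, -100, -100, -100, -100]
print(first_only)  # [-100, 0, 0, 2, 0, -100, -100, 0, 4, -100, 0, -100, -100, -100, -100, -100]
```

With label_all_tokens=True, both subwords of ‘GoogleOS’ receive the label 4 (ARG2); with False, only the first subword is labeled and the second is ignored in the loss.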

To tokenize the sentences and align the labels, call tokenize_and_align_labels(), including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (transformers AutoTokenizer) | ✅️ | The tokenizer for the pre-trained model. |
| positional 2 (DataFrame) | | The preprocessed dataset. |
| label_all_tokens (boolean) | Optional (defaults to True) | Whether all tokens should receive their own label, accounting for words split into subtokens. |

tokenized_test = tokenize_and_align_labels(tokenizer, test_data, label_all_tokens=True)
tokenized_train = tokenize_and_align_labels(tokenizer, train_data, label_all_tokens=True)
tokenized_dev = tokenize_and_align_labels(tokenizer, dev_data, label_all_tokens=True)

Now that you have tokenized all three datasets, let’s examine the result.

The tokenized_ datasets are of the type transformers.tokenization_utils_base.BatchEncoding and have three keys, each holding one array per sentence:

  1. input_ids: an array of token IDs for the tokenized sentence. Starts with the token ID for the [CLS] token, followed by the tokenized sentence, the [SEP] token, the predicate, and a final [SEP] token.
  2. attention_mask: an array representing the attention mask for the sentence.
  3. labels: an array with numerical labels, aligned with the tokens.

Note: all three arrays are padded so that every sample per dataset is of equal length.
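As a rough illustration of what padding does to each of the three arrays, consider the sketch below. It pads to the longest sample in the list it receives; pad_batch is a hypothetical helper, and the values 0 for [PAD] and -100 for ignored labels are taken from the output shown in this notebook.

```python
def pad_batch(batch, pad_id=0, ignore_index=-100):
    """Pad input_ids, attention_mask and labels to the longest sample."""
    max_len = max(len(item["input_ids"]) for item in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for item in batch:
        length = len(item["input_ids"])
        n_pad = max_len - length
        padded["input_ids"].append(item["input_ids"] + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to.
        padded["attention_mask"].append([1] * length + [0] * n_pad)
        # Padding positions never contribute to the loss.
        padded["labels"].append(item["labels"] + [ignore_index] * n_pad)
    return padded


batch = [
    {"input_ids": [101, 2054, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 2054, 2065, 8224, 102], "labels": [-100, 0, 0, 2, -100]},
]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [101, 2054, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(padded["labels"][0])          # [-100, 0, -100, -100, -100]
```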

print(type(tokenized_test))
print(tokenized_test.keys())
print(tokenizer.convert_ids_to_tokens(tokenized_test["input_ids"][0]))
for key in tokenized_test.keys():
    print(f"{key}: {tokenized_test[key][0]}")
<class 'transformers.tokenization_utils_base.BatchEncoding'>
dict_keys(['input_ids', 'attention_mask', 'labels'])
['[CLS]', 'what', 'if', 'google', 'mor', '##ph', '##ed', 'into', 'google', '##os', '?', '[SEP]', 'mor', '##ph', '##ed', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
input_ids: tensor([  101,  2054,  2065,  8224, 22822,  8458,  2098,  2046,  8224,  2891,
         1029,   102, 22822,  8458,  2098,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])
attention_mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0])
labels: [-100, 0, 0, 2, 0, 0, 0, 0, 4, 4, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]

To confirm that you have padded all sentences in the tokenized_test dataset to be of equal length, let’s check the length of all three arrays for the first 10 sentences:

for i in range(10):
    print(f"sentence {i}:", "input_ids:", len(tokenized_test["input_ids"][i]), "\tlabels:", len(tokenized_test["labels"][i]), "\tattention_mask:", len(tokenized_test["attention_mask"][i]))
sentence 0: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 1: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 2: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 3: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 4: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 5: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 6: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 7: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 8: input_ids: 97 	labels: 97 	attention_mask: 97
sentence 9: input_ids: 97 	labels: 97 	attention_mask: 97

Converting the tokenized data to datasets format with the function load_dataset

Now that you have tokenized and padded the sentences, and aligned the labels with the tokens, you’re ready to transform the tokenized datasets into Hugging Face’s datasets.arrow_dataset.Dataset.

To transform the tokenized datasets into Dataset objects, call the load_dataset() function, which calls the Dataset.from_dict() method, including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (transformers.tokenization_utils_base.BatchEncoding) | ✅️ | The tokenized dataset. |

dataset_train = load_dataset(tokenized_train)
dataset_dev = load_dataset(tokenized_dev)
dataset_test = load_dataset(tokenized_test)

Let’s print the type of the resulting dataset, to confirm the transformation into datasets.arrow_dataset.Dataset:

print(type(dataset_test))
<class 'datasets.arrow_dataset.Dataset'>

Step 4: Fine-tune the model

Finally, the sentences have been transformed from CoNLL-U Plus format to Hugging Face Dataset objects: it’s time to fine-tune BERT!

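Fine-tuning a BERT model on the full dataset can be computationally expensive. To speed up the process, you can create subsets of the three datasets with 1000 randomly selected samples each and pass them to the Trainer in place of the full datasets. Note that the training run shown in this notebook uses the full datasets.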

small_train_dataset = dataset_train.shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset_dev.shuffle(seed=42).select(range(1000))
small_test_dataset = dataset_test.shuffle(seed=42).select(range(1000))

To map the numerical labels back to their string representations, you need to convert the label_map dictionary to a list of labels (as strings).

To convert the label_map to a list of labels (as strings), call the get_labels_from_map() function, including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (dictionary) | ✅️ | The dictionary mapping labels as strings to their numerical representation. |

label_list = get_labels_from_map(label_map)

Next, load the pretrained DistilBERT model (distilbert-base-uncased) using the AutoModelForTokenClassification.from_pretrained() method from the transformers library, together with the TrainingArguments necessary for training.

To get the model and TrainingArguments, call the load_srl_model() function, including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (string) | ✅️ | The model identifier. |
| positional 2 (list of strings) | ✅️ | The list of labels as strings. |
| batch_size (integer) | Optional (defaults to 16) | The batch size for training and inference. |

model, args = load_srl_model(model_id, label_list)
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/Users/krisstallenberg/anaconda3/envs/srl-with-bert/lib/python3.13/site-packages/transformers/training_args.py:1594: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(

Now that you have a DistilBERT model, it’s time for fine-tuning the model for the task of semantic role labeling (SRL).

To fine-tune your model, instantiate a Trainer object from the transformers library, passing the model, args, tokenizer and datasets for training and inference. Then, call the Trainer.train() method to start the fine-tuning process.

Note: this process may take up to several hours, depending on your hardware.

trainer = Trainer(
        model,
        args,
        train_dataset=dataset_train,
        eval_dataset=dataset_dev,
        tokenizer=tokenizer,
        compute_metrics=lambda p: compute_metrics(*p, label_list)
    )
trainer.train()
/var/folders/d9/p0hwqj9x1sx30sdq622dyn1r0000gn/T/ipykernel_21922/2598124579.py:1: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(

[2096/2096 12:45, Epoch 1/1]

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.169300 | 0.182447 | 0.304727 | 0.262154 | 0.270974 | 0.951440 |

/Users/krisstallenberg/anaconda3/envs/srl-with-bert/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

TrainOutput(global_step=2096, training_loss=0.2462493722675411, metrics={'train_runtime': 766.5975, 'train_samples_per_second': 43.727, 'train_steps_per_second': 2.734, 'total_flos': 1549860696773034.0, 'train_loss': 0.2462493722675411, 'epoch': 1.0})

Now that you have fine-tuned the model, let’s evaluate its performance on the eval_dataset that you set when constructing the Trainer instance.

To evaluate the fine-tuned model, call the Trainer.evaluate() method.

metrics = trainer.evaluate()
print(metrics)
{'eval_loss': 0.1824471801519394, 'eval_precision': 0.3047272131930285, 'eval_recall': 0.26215390976526504, 'eval_f1': 0.2709737056104543, 'eval_accuracy': 0.9514398703135809, 'eval_runtime': 60.303, 'eval_samples_per_second': 68.703, 'eval_steps_per_second': 4.295, 'epoch': 1.0}
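The gap between the high accuracy and the much lower precision, recall and F1 is worth a closer look. One plausible explanation, assuming compute_metrics() averages the per-label scores without weighting (macro averaging, which the notebook does not show), is class imbalance: most tokens carry the `_` label, so a model can score well on accuracy while rare labels pull the average F1 down. The toy example below illustrates the effect with made-up labels; it says nothing about the actual predictions of this model.

```python
def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores over the gold labels."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Made-up data: 18 non-argument tokens, 2 argument tokens, and a model
# that predicts '_' everywhere.
gold = ["_"] * 18 + ["ARG0", "ARG1"]
pred = ["_"] * 20

print(f"accuracy: {accuracy(gold, pred):.3f}")  # accuracy: 0.900
print(f"macro F1: {macro_f1(gold, pred):.3f}")  # macro F1: 0.316
```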


Now that you have fine-tuned your DistilBERT model for semantic role labeling, and evaluated its performance on the development dataset, it’s time to infer the argument labels of the test dataset and compute a summary of the performance metrics.

First, call the Trainer.predict() method passing the test dataset. The method returns a tuple consisting of the model’s predictions on the test dataset, the labels, and metrics.

To compute a summary of the model’s performance metrics on the test dataset, call the compute_metrics() function, including:

| Parameter name | Required | Parameter description |
| --- | --- | --- |
| positional 1 (np.ndarray) | ✅️ | The array of predictions as returned from the Trainer.predict() method. |
| positional 2 (np.ndarray) | ✅️ | The array of argument labels as returned from the Trainer.predict() method. |
| positional 3 (list of strings) | ✅️ | The list of argument labels as strings. |

predictions, labels, _ = trainer.predict(dataset_test)
argmax_predictions = np.argmax(predictions, axis=2)
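The predictions returned by Trainer.predict() are scores per label for every token, which is why the argmax is taken over the last axis. For a single sentence, the step from scores to label strings can be sketched in plain Python as follows; decode is a hypothetical helper, and the scores and the three-label list are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def decode(logits, label_ids, label_list, ignore_index=-100):
    """Turn per-token scores into label strings, skipping ignored positions."""
    predicted, gold = [], []
    for token_logits, label_id in zip(logits, label_ids):
        if label_id == ignore_index:      # special, predicate or padding token
            continue
        predicted.append(label_list[argmax(token_logits)])
        gold.append(label_list[label_id])
    return predicted, gold


label_list = ["_", "ARG0", "ARG1"]        # made-up, shortened label list
logits = [
    [0.1, 0.2, 0.3],   # [CLS], ignored because its label is -100
    [2.0, 0.1, 0.3],   # highest score at index 0 -> "_"
    [0.2, 0.1, 1.5],   # highest score at index 2 -> "ARG1"
    [0.9, 0.3, 0.2],   # highest score at index 0 -> "_"
]
label_ids = [-100, 0, 2, 1]

predicted, gold = decode(logits, label_ids, label_list)
print(predicted)  # ['_', 'ARG1', '_']
print(gold)       # ['_', 'ARG1', 'ARG0']
```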

Because the tokenizer operates at the subword level, the predictions are for subword tokens. However, we’re interested in word-level argument labels. To obtain those, iterate over all the subwords and recombine them into words. Words that span multiple subwords have multiple predicted labels associated with them (one for every subword), so these labels need to be reconciled into one label per word. This can be done according to multiple strategies; in this notebook we apply the label of a word’s first subword to the word.

predicted_labels = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(argmax_predictions, labels)
    ]
gold_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(argmax_predictions, labels)
    ]

tokens = []

for i, (sentence_predictions, sentence_gold) in enumerate(zip(predicted_labels, gold_labels)):
    subword_tokens = tokenizer.convert_ids_to_tokens(dataset_test[i]["input_ids"], skip_special_tokens=True)

    word_tokens = []
    word_labels_gold = []
    word_labels_pred = []

    current_word = ""
    current_gold_label = None
    current_pred_label = None

    # zip() stops at the shortest sequence. Assuming every sentence subword carries
    # a label (label_all_tokens=True), the unlabeled predicate tokens are dropped here.
    for subword, gold, pred in zip(subword_tokens, sentence_gold, sentence_predictions):
        if subword.startswith("##"):  # Continuation of a word
            current_word += subword[2:]
        else:  # New word starts
            if current_word:  # Save the previous word and its label
                word_tokens.append(current_word)
                word_labels_gold.append(current_gold_label)
                word_labels_pred.append(current_pred_label)
            
            current_word = subword  # Start new word
            current_gold_label = gold  # Take the first subword's label
            current_pred_label = pred  # Take the first subword's label

    if current_word:
        word_tokens.append(current_word)
        word_labels_gold.append(current_gold_label)
        word_labels_pred.append(current_pred_label)

    tokens.extend(zip(word_tokens, word_labels_gold, word_labels_pred))

# Create a dataframe and write it to CSV
df = pd.DataFrame(tokens, columns=["word", "gold_label", "pred_label"])
df.to_csv("predictions.csv", index=False)
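
The first-subword rule used above is only one possible reconciliation strategy. As a point of comparison, the sketch below groups WordPiece subwords into words and resolves each word's label either by the first subword or by majority vote. The function name `merge_subword_labels` and the toy tokens and labels are illustrative assumptions; they are not part of the notebook's `utils` module and are not model output.

```python
from collections import Counter


def merge_subword_labels(subwords, labels, strategy="first"):
    """Recombine WordPiece subwords into words with one label per word.

    strategy="first":    keep the label of the word's first subword.
    strategy="majority": keep the most frequent label among the word's
                         subwords; ties resolve to the tied label that
                         occurs earliest in the word.
    """
    groups = []  # list of (word, [labels of its subwords])
    for subword, label in zip(subwords, labels):
        if subword.startswith("##") and groups:
            # Continuation piece: extend the previous word and collect its label.
            word, word_labels = groups[-1]
            groups[-1] = (word + subword[2:], word_labels + [label])
        else:
            groups.append((subword, [label]))

    merged = []
    for word, word_labels in groups:
        if strategy == "majority":
            counts = Counter(word_labels)
            top = max(counts.values())
            # Scanning in order picks the earliest label that reaches the top count.
            chosen = next(lab for lab in word_labels if counts[lab] == top)
        else:
            chosen = word_labels[0]
        merged.append((word, chosen))
    return merged


# Toy example (hypothetical tokens and labels).
subwords = ["the", "sky", "##scr", "##aper", "collapsed"]
labels = ["_", "_", "ARG1", "ARG1", "_"]

print(merge_subword_labels(subwords, labels, "first"))
print(merge_subword_labels(subwords, labels, "majority"))
```

In this toy case the two strategies disagree on "skyscraper": the first subword carries `_`, while two of its three subwords carry `ARG1`. Which behaviour is preferable depends on how the gold labels were aligned during preprocessing.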

Now that we have a dataframe with word-level labels, it’s time to evaluate the model’s performance.

Let’s create:

  • Classification report
  • Confusion matrix
def create_classification_report(gold_labels, predictions, label_set):
    """
    Create a classification report.
    
    :param gold_labels: The gold labels.
    :param predictions: The predictions.
    :param label_set: The set of labels.
    """

    # Build the report twice: as a dict for programmatic access and as text for printing.
    # Passing labels=label_set keeps the report rows aligned with target_names.
    report_dict = classification_report(gold_labels, predictions, labels=label_set, digits=3, target_names=label_set, output_dict=True, zero_division=0.0)
    report = classification_report(gold_labels, predictions, labels=label_set, digits=3, target_names=label_set, zero_division=0.0)

    # Print the classification report.
    print(report)

    return report_dict, report, label_set

def plot_confusion_matrix(gold_labels, predictions, label_set):
    """
    Plot the confusion matrix using ConfusionMatrixDisplay.
    
    :param gold_labels: The gold labels.
    :param predictions: The predictions.
    :param label_set: The set of labels.
    """
    # Create a confusion matrix.
    cf_matrix = confusion_matrix(gold_labels, predictions, labels=label_set) 

    # Create a display for the confusion matrix.
    display = ConfusionMatrixDisplay(confusion_matrix=cf_matrix, display_labels=label_set)

    # Create a plot for the confusion matrix.
    fig, ax = plt.subplots(figsize=(15, 15)) 

    # Display the confusion matrix.
    display.plot(ax=ax) 
    plt.xticks(rotation=90)  # Rotate x-axis labels for readability
    plt.title("Confusion Matrix")
    plt.show() 
    
    return cf_matrix
# Load your predictions CSV
df = pd.read_csv("predictions.csv")

# Get sorted unique labels
label_set = sorted(df["gold_label"].unique())

# Create a classification report and confusion matrix
report_dict, report, label_set = create_classification_report(df["gold_label"], df["pred_label"], label_set)
cf_matrix = plot_confusion_matrix(df["gold_label"], df["pred_label"], label_set)
              precision    recall  f1-score   support

        ARG0      0.807     0.828     0.818      1742
        ARG1      0.761     0.824     0.791      3307
    ARG1-DSP      0.000     0.000     0.000         4
        ARG2      0.640     0.635     0.637      1145
        ARG3      0.000     0.000     0.000        76
        ARG4      0.800     0.143     0.242        56
        ARG5      0.000     0.000     0.000         1
        ARGA      0.000     0.000     0.000         2
    ARGM-ADJ      0.675     0.704     0.689       230
    ARGM-ADV      0.661     0.425     0.518       496
    ARGM-CAU      0.000     0.000     0.000        46
    ARGM-COM      0.000     0.000     0.000        14
    ARGM-CXN      0.000     0.000     0.000        12
    ARGM-DIR      0.381     0.170     0.235        47
    ARGM-DIS      0.634     0.560     0.595       182
    ARGM-EXT      0.809     0.686     0.742       105
    ARGM-GOL      0.000     0.000     0.000        24
    ARGM-LOC      0.507     0.493     0.500       219
    ARGM-LVB      0.721     0.638     0.677        69
    ARGM-MNR      0.490     0.329     0.394       152
    ARGM-MOD      0.895     0.949     0.921       468
    ARGM-NEG      0.846     0.952     0.896       392
    ARGM-PRD      0.000     0.000     0.000        44
    ARGM-PRP      0.509     0.377     0.433        77
    ARGM-PRR      0.000     0.000     0.000        69
    ARGM-TMP      0.700     0.730     0.715       571
      C-ARG0      0.000     0.000     0.000         3
      C-ARG1      0.000     0.000     0.000        52
  C-ARG1-DSP      0.000     0.000     0.000         1
      C-ARG2      0.000     0.000     0.000         7
      C-ARG3      0.000     0.000     0.000         2
  C-ARGM-CXN      0.000     0.000     0.000         5
  C-ARGM-LOC      0.000     0.000     0.000         1
      R-ARG0      0.828     0.791     0.809        67
      R-ARG1      0.722     0.750     0.736        52
      R-ARG2      0.000     0.000     0.000         1
  R-ARGM-ADJ      0.000     0.000     0.000         1
  R-ARGM-ADV      0.000     0.000     0.000         1
  R-ARGM-DIR      0.000     0.000     0.000         1
  R-ARGM-LOC      0.000     0.000     0.000         9
  R-ARGM-MNR      0.000     0.000     0.000         8
  R-ARGM-TMP      0.000     0.000     0.000         2
           _      0.980     0.984     0.982     78121

    accuracy                          0.955     87884
   macro avg      0.311     0.278     0.287     87884
weighted avg      0.950     0.955     0.952     87884

[Figure: Confusion Matrix (gold versus predicted labels)]
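
The report above is dominated by the `_` (no argument) class, which accounts for 78,121 of the 87,884 tokens. That is why accuracy and the weighted average sit near 0.95 while the macro average is only 0.287. To see how a model does on argument labels alone, the per-label scores can be averaged with `_` left out. The sketch below does this in plain Python on toy labels so the arithmetic is visible. The helper names `per_label_f1` and `macro_f1_excluding` are hypothetical; in the notebook, a similar result could presumably be obtained by passing a filtered `labels` list to scikit-learn's `precision_recall_fscore_support`.

```python
def per_label_f1(gold, pred, label):
    """Compute F1 for a single label from parallel gold/pred sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1_excluding(gold, pred, excluded=("_",)):
    """Unweighted mean of per-label F1 over gold labels not in `excluded`."""
    labels = sorted(set(gold) - set(excluded))
    if not labels:
        return 0.0
    return sum(per_label_f1(gold, pred, lab) for lab in labels) / len(labels)


# Toy data (hypothetical, not taken from predictions.csv).
gold = ["_", "ARG0", "_", "ARG1", "_", "_", "ARG1", "_"]
pred = ["_", "ARG0", "_", "_", "_", "_", "ARG1", "ARG1"]

print(round(macro_f1_excluding(gold, pred), 3))  # 0.75
```

Here ARG0 scores an F1 of 1.0 and ARG1 scores 0.5 (one hit, one miss, one false alarm), so the macro average over argument labels is 0.75.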

Next, save the tokenizer, trainer, and model to disk using their built-in methods. Each call takes a string naming the directory in which to store the object and its configuration.

This lets you restore the tokenizer and model later with their from_pretrained() methods.

# Save the tokenizer, trainer, and model:
tokenizer.save_pretrained("tokenizer.save_pretrained.distillbert-base-uncased-finetuned-srl")
trainer.save_model("trainer.save_model.distillbert-base-uncased-finetuned-srl")
model.save_pretrained("model.save_pretrained.distillbert-base-uncased-finetuned-srl")

Finally, let’s create a standalone function that classifies individual sentences. It is intended for running a CheckList experiment on this model.

def create_input_sequence(sentence, predicate_position, argument_labels):
    """
    Creates a DataFrame with columns 'input_form' and 'argument' for a single sentence.

    Parameters:
    - sentence (list of str): The words in the sentence.
    - predicate_position (list of int): One-hot encoding indicating the predicate position.
    - argument_labels (list of str): The argument labels for each token in the sentence.

    Returns:
    - DataFrame with two columns: 'input_form' and 'argument'.
    """
    # Ensure input lengths match
    assert len(sentence) == len(predicate_position) == len(argument_labels), "Input lists must have the same length."
    
    # Determine the predicate form based on the one-hot encoding
    predicate_index = predicate_position.index(1)
    predicate_form = sentence[predicate_index]
    
    # Append '[SEP]' and the predicate form; neither gets an argument label (None)
    input_form = sentence + ['[SEP]', predicate_form]
    argument = argument_labels + [None, None]
    
    # Create a DataFrame
    df = pd.DataFrame([{"input_form": input_form, "argument": argument}])
    return df

def map_labels_to_words(predicted_labels, gold_labels, dataset):
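    """
    Map the predicted and gold labels to the corresponding words in the sentence.
    
    Note: relies on the module-level `tokenizer` to convert input IDs back to subwords.
    
    Args:
        predicted_labels (list): Predicted labels, one list per sentence.
        gold_labels (list): Gold labels, one list per sentence.
        dataset: The dataset containing the input sequences.
    
    Returns:
        DataFrame with columns 'word', 'gold_label' and 'predicted_label'.
    """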
    
    tokens = []
    # The loop variables get their own names so they don't shadow the gold_labels argument.
    for i, (sentence_preds, sentence_golds) in enumerate(zip(predicted_labels, gold_labels)):
        subword_tokens = tokenizer.convert_ids_to_tokens(dataset[i]["input_ids"], skip_special_tokens=True)
        
        word_tokens = []
        word_labels_gold = []
        word_labels_pred = []
    
        current_word = ""
        current_gold_label = None
        current_pred_label = None
    
        for subword, gold, pred in zip(subword_tokens, sentence_golds, sentence_preds):
            if subword.startswith("##"):  # Continuation of a word
                current_word += subword[2:]  # Append the subword to the current word
            else:  # New word starts
                if current_word:  # Save the previous word and its labels
                    word_tokens.append(current_word)
                    word_labels_gold.append(current_gold_label)
                    word_labels_pred.append(current_pred_label)
                
                current_word = subword  # Start new word
                current_gold_label = gold  # Take the first subword's label
                current_pred_label = pred  # Take the first subword's label
    
        if current_word:
            word_tokens.append(current_word)
            word_labels_gold.append(current_gold_label)
            word_labels_pred.append(current_pred_label)
    
        tokens.extend(zip(word_tokens, word_labels_gold, word_labels_pred))
    
    # Create a dataframe with one row per word
    df = pd.DataFrame(tokens, columns=["word", "gold_label", "predicted_label"])
    return df

def classify_sentence_bert(sentence, predicate_location, argument_labels, predicate_sense, trainer, tokenizer):
    """
    The standalone function that takes a sentence and predicts the argument labels, using DistilBERT.
    
    Args:
        sentence (list): A list of words.
        predicate_location (list): A one-hot vector, indicating the location of the predicate.
        argument_labels (list): A list of argument labels.
        predicate_sense (str): The sense label of the predicate. Unused here; kept so the interface matches that of the logistic regression model.
        trainer: The HuggingFace Trainer instance to predict labels with.
        tokenizer: The HuggingFace Tokenizer to tokenize input sequences with.
    
    Returns:
        DataFrame with columns 'word', 'gold_label' and 'predicted_label'.
    """
    inference_input = create_input_sequence(sentence, predicate_location, argument_labels)
    label_map = {'_': 0, 'ARG0': 1, 'ARG1': 2, 'ARG1-DSP': 3, 'ARG2': 4, 'ARG3': 5, 'ARG4': 6, 'ARG5': 7, 'ARGA': 8, 'ARGM-ADJ': 9, 'ARGM-ADV': 10, 'ARGM-CAU': 11, 'ARGM-COM': 12, 'ARGM-CXN': 13, 'ARGM-DIR': 14, 'ARGM-DIS': 15, 'ARGM-EXT': 16, 'ARGM-GOL': 17, 'ARGM-LOC': 18, 'ARGM-LVB': 19, 'ARGM-MNR': 20, 'ARGM-MOD': 21, 'ARGM-NEG': 22, 'ARGM-PRD': 23, 'ARGM-PRP': 24, 'ARGM-PRR': 25, 'ARGM-REC': 26, 'ARGM-TMP': 27, 'C-ARG0': 28, 'C-ARG1': 29, 'C-ARG1-DSP': 30, 'C-ARG2': 31, 'C-ARG3': 32, 'C-ARG4': 33, 'C-ARGM-ADV': 34, 'C-ARGM-COM': 35, 'C-ARGM-CXN': 36, 'C-ARGM-DIR': 37, 'C-ARGM-EXT': 38, 'C-ARGM-GOL': 39, 'C-ARGM-LOC': 40, 'C-ARGM-MNR': 41, 'C-ARGM-PRP': 42, 'C-ARGM-PRR': 43, 'C-ARGM-TMP': 44, 'R-ARG0': 45, 'R-ARG1': 46, 'R-ARG2': 47, 'R-ARG3': 48, 'R-ARG4': 49, 'R-ARGM-ADJ': 50, 'R-ARGM-ADV': 51, 'R-ARGM-CAU': 52, 'R-ARGM-COM': 53, 'R-ARGM-DIR': 54, 'R-ARGM-GOL': 55, 'R-ARGM-LOC': 56, 'R-ARGM-MNR': 57, 'R-ARGM-TMP': 58, None: None}
    inference_data = map_labels_in_dataframe(inference_input, label_map)
    tokenized_input = tokenize_and_align_labels(tokenizer, inference_data, label_all_tokens=True)
    dataset_inference_sample = load_dataset(tokenized_input)
    label_list = get_labels_from_map(label_map)
    predictions, labels, _ = trainer.predict(dataset_inference_sample)
    argmax_predictions = np.argmax(predictions, axis=2)

    predicted_labels = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(argmax_predictions, labels)
    ]
    gold_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(argmax_predictions, labels)
    ]
    
    results = map_labels_to_words(predicted_labels, gold_labels, dataset_inference_sample)
    return results

Let’s predict the role of “boy” with respect to the first predicate (ran). It should receive the proto-agent role (ARG0).

sentence = ["The", "boy", "ran", "and", "the", "man", "fell", "."]

predicate_location = [0, 0, 1, 0, 0, 0, 0, 0] 
argument_labels = ['_', 'ARG0', '_', '_', '_', '_', '_', '_']
predicate_sense = "run.01"

tokenizer = AutoTokenizer.from_pretrained("tokenizer.save_pretrained.distillbert-base-uncased-finetuned-srl")
model = AutoModelForTokenClassification.from_pretrained("model.save_pretrained.distillbert-base-uncased-finetuned-srl")
training_args = TrainingArguments(output_dir="trainer.save_model.distillbert-base-uncased-finetuned-srl")

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer  # deprecated in newer transformers releases in favor of processing_class
)

result = classify_sentence_bert(sentence, predicate_location, argument_labels, predicate_sense, trainer, tokenizer)
print(result)
/var/folders/d9/p0hwqj9x1sx30sdq622dyn1r0000gn/T/ipykernel_8427/2896619161.py:11: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(

   word gold_label predicted_label
0   the          _               _
1   boy       ARG0            ARG0
2   ran          _               _
3   and          _               _
4   the          _               _
5   man          _               _
6  fell          _               _
7     .          _               _

Next, let’s predict the role of “man” with respect to the second predicate (fell). It should receive the proto-patient role (ARG1).

sentence = ["The", "boy", "ran", "and", "the", "man", "fell", "."]
predicate_location = [0, 0, 0, 0, 0, 0, 1, 0]
argument_labels = ['_', '_', '_', '_', '_', 'ARG1', '_', '_']
predicate_sense = "fall.01"

result = classify_sentence_bert(sentence, predicate_location, argument_labels, predicate_sense, trainer, tokenizer)
print(result)
   word gold_label predicted_label
0   the          _               _
1   boy          _               _
2   ran          _               _
3   and          _               _
4   the          _               _
5   man       ARG1            ARG1
6  fell          _               _
7     .          _               _

It works! The standalone function uses the predicate position to look up the predicate form and appends it to the sentence, so the encoded input is: [CLS] The boy ran and the man fell. [SEP] fell [SEP].
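
To make that input format concrete, the sketch below rebuilds the sequence by hand from a sentence and its one-hot predicate vector. It is only an illustration: in the notebook the tokenizer itself inserts `[CLS]` and the final `[SEP]`, and it also lowercases the words and splits them into subwords. The helper name `render_model_input` is hypothetical.

```python
def render_model_input(sentence, predicate_location):
    """Return the word-level sequence '[CLS] sentence [SEP] predicate [SEP]'.

    sentence:           list of words.
    predicate_location: one-hot list marking the predicate's position.
    """
    if len(sentence) != len(predicate_location):
        raise ValueError("sentence and predicate_location must have the same length")
    if sum(predicate_location) != 1:
        raise ValueError("predicate_location must mark exactly one predicate")

    # Look up the predicate's surface form from its one-hot position.
    predicate_form = sentence[predicate_location.index(1)]

    # Written out by hand here; the real tokenizer adds [CLS] and the last [SEP].
    return ["[CLS]", *sentence, "[SEP]", predicate_form, "[SEP]"]


sentence = ["The", "boy", "ran", "and", "the", "man", "fell", "."]

print(" ".join(render_model_input(sentence, [0, 0, 1, 0, 0, 0, 0, 0])))
# [CLS] The boy ran and the man fell . [SEP] ran [SEP]
print(" ".join(render_model_input(sentence, [0, 0, 0, 0, 0, 0, 1, 0])))
# [CLS] The boy ran and the man fell . [SEP] fell [SEP]
```

The same sentence therefore yields a different input for each predicate, which is what lets the model assign ARG0 to "boy" in one call and ARG1 to "man" in the other.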