Unsupervised Subword Tokenizers vs. Morphology

Let’s explore how unsupervised tokenizers, commonly used in deep learning, relate to linguistic morphology. Your task is to tweak the code to see whether subword tokenization can serve as a proxy for real morphological analysis.

Things you may need to do before running the code

Install the NLTK and Tokenizers packages:

pip install tokenizers
pip install nltk

Download the Brown Corpus from NLTK:

import nltk
nltk.download('brown')

# !pip install tokenizers
!pip install nltk

import nltk
from nltk.corpus import brown

count = 0
vocab = set()

# A context manager makes sure the file is flushed and closed
# before the tokenizers read it during training.
with open("brown-corpus.txt", "w") as corpus_f:
    for s in brown.sents():
        corpus_f.write(" ".join(s) + "\n")

        # `s` is already a list of word tokens.
        count += len(s)
        vocab.update(s)

print("No. of words:", count)
print("No. of unique words:", len(vocab))

No. of words: 1161192
No. of unique words: 56057

Tokenizers

from tokenizers import Tokenizer

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace

from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

VOCAB_SIZE = 4000   # Vary this value to see how vocabulary size changes the segmentations

Byte-Pair Encoding (BPE) tokenization

BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

trainer = BpeTrainer(vocab_size=VOCAB_SIZE, 
                     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

BPE_tokenizer.pre_tokenizer = Whitespace()    # Optional: split on whitespace and punctuation so tokens never span words

files = ["brown-corpus.txt"]

BPE_tokenizer.train(files, trainer)

BPE_tokenizer.save("BPE-tokenizer.json")
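
To make the merge procedure behind BPE more concrete, here is a minimal pure-Python sketch of BPE training on a toy word-frequency table. It is not the Tokenizers implementation: the function name `learn_bpe`, the toy counts, and the tie-breaking rule (largest pair in lexicographic order) are assumptions made for illustration, and the library may break ties differently.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} table."""
    # Every word starts as a sequence of single characters.
    splits = {w: tuple(w) for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += word_freqs[w]
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically largest pair
        # (an arbitrary but deterministic choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the merge to every word.
        new_splits = {}
        for w, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_splits[w] = tuple(merged)
        splits = new_splits
    return merges, splits

# Hypothetical counts, not taken from the Brown Corpus.
toy_freqs = {"assist": 5, "assisted": 3, "assisting": 2}
merges, splits = learn_bpe(toy_freqs, num_merges=5)
print(merges)
print(splits["assisted"])   # ('assist', 'e', 'd')
```

With these counts the shared stem is fully merged after five steps, while the rarer suffixes are still split into characters. The same effect explains why the number of merges (and therefore the vocabulary size) decides which affixes survive as separate tokens.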

WordPiece tokenization

WP_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

WP_trainer = WordPieceTrainer(vocab_size=VOCAB_SIZE,
                              special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

WP_tokenizer.pre_tokenizer = Whitespace()    # Optional: split on whitespace and punctuation so tokens never span words

files = ["brown-corpus.txt"]

WP_tokenizer.train(files, WP_trainer)

WP_tokenizer.save("WP-tokenizer.json")
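
At encoding time, WordPiece segments a word greedily, always taking the longest vocabulary entry that matches at the current position, and marks word-internal pieces with a `##` prefix. The sketch below illustrates that lookup in plain Python. The function `wordpiece_encode` and the toy vocabulary are hypothetical; the vocabulary actually learned from the Brown Corpus will differ.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # word-internal pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary: "assist" exists only as a word-initial piece.
toy_vocab = {"assist", "un", "##ass", "##isted", "##ed"}
print(wordpiece_encode("assisted", toy_vocab))    # ['assist', '##ed']
print(wordpiece_encode("unassisted", toy_vocab))  # ['un', '##ass', '##isted']
```

Because word-initial and word-internal pieces are separate vocabulary entries, a stem that is recognised at the start of a word may be split differently after a prefix.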

Unigram tokenization

UG_tokenizer = Tokenizer(Unigram())

UG_trainer = UnigramTrainer(vocab_size=VOCAB_SIZE,
                            unk_token="[UNK]",   # same unknown token as in special_tokens
                            special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

UG_tokenizer.pre_tokenizer = Whitespace()    # Optional: split on whitespace and punctuation so tokens never span words

files = ["brown-corpus.txt"]

UG_tokenizer.train(files, UG_trainer)

UG_tokenizer.save("UG-tokenizer.json")
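
A Unigram model assigns a probability to every piece in its vocabulary and segments a word by choosing the split whose pieces have the highest total log-probability. The following is a minimal Viterbi sketch of that search. The function `unigram_segment` and the probabilities are made up for illustration; they are not the values learned by the trainer above.

```python
import math

def unigram_segment(word, logp):
    """Return the highest-scoring segmentation of `word`, or None if none exists."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None
    # Follow the back-pointers from the end of the word.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical piece probabilities.
toy_logp = {
    "assist": math.log(0.05),
    "ant": math.log(0.02),
    "assistant": math.log(0.002),
    "s": math.log(0.1),
}
print(unigram_segment("assistant", toy_logp))   # ['assistant']
print(unigram_segment("assistants", toy_logp))  # ['assistant', 's']
```

With these numbers the whole-word piece (probability 0.002) beats "assist" + "ant" (0.05 × 0.02 = 0.001), so a frequent derived word can stay unsplit even when its stem and suffix are both in the vocabulary.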

Let’s compare the tokenizers

Your task here is to use a small evaluation corpus to compare how the different algorithms perform against one another while varying the vocabulary size above.

Feel free to add other words to see how they are segmented (you need to provide a gold segmentation for each word for the comparison to work).
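
One way to organise the vocabulary-size experiment is a small helper that rebuilds a segmenter for each size and counts exact matches against the gold segmentations. The sketch below is only a scaffold: `exact_match_sweep`, `toy_builder`, and `demo_corpus` are hypothetical names, and `toy_builder` is a stand-in that does no training so the example runs on its own.

```python
def exact_match_sweep(vocab_sizes, build_segmenter, corpus):
    """Return {vocab_size: number of words segmented exactly like the gold}."""
    scores = {}
    for size in vocab_sizes:
        # build_segmenter(size) must return a function: word -> list of pieces.
        segment = build_segmenter(size)
        scores[size] = sum(1 for word, gold in corpus if segment(word) == gold)
    return scores

def toy_builder(size):
    """Stand-in for training: splits off a final 's' only for larger sizes."""
    def segment(word):
        if size >= 2000 and len(word) > 1 and word.endswith("s"):
            return [word[:-1], "s"]
        return [word]
    return segment

# Same [word, gold segmentation] layout as the evaluation corpus.
demo_corpus = [["assist", ["assist"]], ["hoards", ["hoard", "s"]]]
scores = exact_match_sweep([1000, 4000], toy_builder, demo_corpus)
print(scores)   # {1000: 1, 4000: 2}
```

In the real experiment, `build_segmenter` would train one of the tokenizers with `vocab_size=size`, as in the cells above, and return a function that maps a word to its list of pieces.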

# Some data extracted from https://github.com/sigmorphon/2022SegmentationST
test_corpus = [
    ["assistant", ["assist","ant"]],
    ["assistants", ["assist","ant","s"]],
    ["assist", ["assist"]],
    ["assisted",["assist","ed"]],
    ["assisting", ["assist","ing"]],
    ["assistance",["assist", "ance"]],
    ["assistive", ["assist","ive"]],
    ["assistful", ["assist","ful"]],
    ["assister", ["assist","er"]],
    ["unassisted", ["un","assist","ed"]],
    ["coassistance", ["co","assist","ance"]],
    ["coassists", ["co","assist","s"]],
    ["overassisting",["over","assist","ing"]],
    ["entaming", ["en", "tame", "ing"]],
    ["hoarders", ["hoard", "er", "s"]],
    ["visitorship", ["visit","or","ship"]],
    ["reorganises", ["re","organise","s"]],
    ["wargamer", ["war","game","er"]],               
    ["encodability", ["en","code","ability"]],
    ["healthy", ["health","y"]],
    ["buildings", ["build","ing","s"]],
    ["socioeconomy", ["socio","economy"]],
]

for instance in test_corpus:
    print(instance)

['assistant', ['assist', 'ant']]
['assistants', ['assist', 'ant', 's']]
['assist', ['assist']]
['assisted', ['assist', 'ed']]
['assisting', ['assist', 'ing']]
['assistance', ['assist', 'ance']]
['assistive', ['assist', 'ive']]
['assistful', ['assist', 'ful']]
['assister', ['assist', 'er']]
['unassisted', ['un', 'assist', 'ed']]
['coassistance', ['co', 'assist', 'ance']]
['coassists', ['co', 'assist', 's']]
['overassisting', ['over', 'assist', 'ing']]
['entaming', ['en', 'tame', 'ing']]
['hoarders', ['hoard', 'er', 's']]
['visitorship', ['visit', 'or', 'ship']]
['reorganises', ['re', 'organise', 's']]
['wargamer', ['war', 'game', 'er']]
['encodability', ['en', 'code', 'ability']]
['healthy', ['health', 'y']]
['buildings', ['build', 'ing', 's']]
['socioeconomy', ['socio', 'economy']]

count_wp, count_bpe, count_ug = 0, 0, 0

report = ""
for word, morphs in test_corpus:

    # Decode each encoding back into a list of subword pieces.
    # WordPiece marks word-internal pieces with "##", which is stripped here.
    wp = WP_tokenizer.decode(WP_tokenizer.encode(word).ids).replace("#", "").split()
    bpe = BPE_tokenizer.decode(BPE_tokenizer.encode(word).ids).split()
    ug = UG_tokenizer.decode(UG_tokenizer.encode(word).ids).split()

    # A segmentation counts only if it matches the gold morphs exactly.
    if wp == morphs:
        count_wp += 1
    if bpe == morphs:
        count_bpe += 1
    if ug == morphs:
        count_ug += 1

    report += "GOLD: " + " ".join(morphs) + "\n"
    report += "Wordpiece: " + " ".join(wp) + "\n"
    report += "BPE: " + " ".join(bpe) + "\n"
    report += "Unigram: " + " ".join(ug) + "\n"
    report += "------------------------------------------\n"


print("\n")
print("------------------------------------------")
print("RESULTS:")
print("------------------------------------------")
print("Wordpiece:", count_wp)
print("BPE:", count_bpe)
print("Unigram:", count_ug)
print("------------------------------------------")
print("\n\n")
print(report)

------------------------------------------
RESULTS:
------------------------------------------
Wordpiece: 8
BPE: 8
Unigram: 11
------------------------------------------



GOLD: assist ant
Wordpiece: assist ant
BPE: assist ant
Unigram: assistant
------------------------------------------
GOLD: assist ant s
Wordpiece: assist ants
BPE: assist ants
Unigram: assistant s
------------------------------------------
GOLD: assist
Wordpiece: assist
BPE: assist
Unigram: assist
------------------------------------------
GOLD: assist ed
Wordpiece: assist ed
BPE: assist ed
Unigram: assist ed
------------------------------------------
GOLD: assist ing
Wordpiece: assist ing
BPE: assist ing
Unigram: assist ing
------------------------------------------
GOLD: assist ance
Wordpiece: assistance
BPE: assistance
Unigram: as sistance
------------------------------------------
GOLD: assist ive
Wordpiece: assist ive
BPE: assist ive
Unigram: assist ive
------------------------------------------
GOLD: assist ful
Wordpiece: assist ful
BPE: assist ful
Unigram: assist ful
------------------------------------------
GOLD: assist er
Wordpiece: assist er
BPE: ass ister
Unigram: assist er
------------------------------------------
GOLD: un assist ed
Wordpiece: un ass isted
BPE: un assist ed
Unigram: un assist ed
------------------------------------------
GOLD: co assist ance
Wordpiece: co ass ist ance
BPE: co assistance
Unigram: co as sistance
------------------------------------------
GOLD: co assist s
Wordpiece: co ass ists
BPE: co ass ists
Unigram: co assist s
------------------------------------------
GOLD: over assist ing
Wordpiece: over ass ist ing
BPE: over assist ing
Unigram: over assist ing
------------------------------------------
GOLD: en tame ing
Wordpiece: ent am ing
BPE: ent aming
Unigram: ent a m ing
------------------------------------------
GOLD: hoard er s
Wordpiece: ho ard ers
BPE: ho ard ers
Unigram: h o ard ers
------------------------------------------
GOLD: visit or ship
Wordpiece: visit ors hip
BPE: vis itor ship
Unigram: visit or ship
------------------------------------------
GOLD: re organise s
Wordpiece: re or gan ises
BPE: re organ ises
Unigram: re or gan is e s
------------------------------------------
GOLD: war game er
Wordpiece: war g ame r
BPE: war g am er
Unigram: war game r
------------------------------------------
GOLD: en code ability
Wordpiece: enc od ability
BPE: en c od ability
Unigram: en co d abilit y
------------------------------------------
GOLD: health y
Wordpiece: health y
BPE: he al thy
Unigram: health y
------------------------------------------
GOLD: build ing s
Wordpiece: buildings
BPE: buildings
Unigram: building s
------------------------------------------
GOLD: socio economy
Wordpiece: soc io ec onom y
BPE: soci o economy
Unigram: s o c i o econom y
------------------------------------------
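
The exact-match counts in the results give no credit to near misses such as "assist ants" against the gold "assist ant s". A softer measure compares boundary positions and reports precision, recall, and F1. The sketch below is one possible way to do this. The helper names are hypothetical, and boundaries are taken as cumulative character offsets within each segmentation, which is only approximate for gold entries whose morphs do not concatenate to the surface word (for example "en tame ing" for "entaming").

```python
def boundaries(segments):
    """Character offsets at which a segmentation places a split."""
    positions, offset = set(), 0
    for seg in segments[:-1]:
        offset += len(seg)
        positions.add(offset)
    return positions

def boundary_prf(gold, pred):
    """Boundary precision, recall, and F1 of `pred` against `gold`."""
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    # With no boundaries on one side, treat that side as trivially perfect.
    precision = tp / len(p) if p else 1.0
    recall = tp / len(g) if g else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Gold and WordPiece/BPE output for "assistants", as listed in the report.
precision, recall, f1 = boundary_prf(["assist", "ant", "s"], ["assist", "ants"])
print(precision, recall, round(f1, 3))   # 1.0 0.5 0.667
```

Here the predicted boundary after "assist" is correct (precision 1.0), but the boundary before the plural "s" is missing (recall 0.5), a distinction the exact-match count cannot express.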