All FastHugs code can be found in my FastHugs GitHub repo

Things You Might Like (❤️ ?)

FastHugsTokenizer: A tokenizer wrapper that can be used with fastai-v2's tokenizer.

FastHugsModel: A model wrapper over the HF models, more or less the same as the wrappers from the HF fastai-v1 articles mentioned below

Padding: Padding settings for the padding token index and for whether the transformer prefers left or right padding

Model Splitters: Functions to split the classification head from the model backbone, in line with fastai-v2's new definition of Learner (in splitters.py)

Housekeeping

Pretrained Transformers only for now 😐

Initially, this notebook will only deal with fine-tuning HuggingFace's pretrained models. It covers the BERT, DistilBERT, RoBERTa and ALBERT pretrained classification models only. These are the core transformer architectures to which HuggingFace have added a classification head. HuggingFace also provides other versions of these architectures, such as the bare core models and the language-modelling versions.

If you'd like to try training a model from scratch, HuggingFace recently published an article on How to train a new language model from scratch using Transformers and Tokenizers. It's well worth reading to see how their tokenizers library can be used independently of their pretrained transformer models.

Read these first 👇

This notebook heavily borrows from this notebook, which in turn is based on this tutorial and accompanying article. Huge thanks to Melissa Rajaram and Maximilien Roberti for these great resources; if you're not familiar with the HuggingFace library, please give them a read first as they are quite comprehensive.

fastai-v2 ✌️2️⃣

This paper introduces the v2 version of the fastai library, and you can follow and contribute to v2's progress on the forums. This notebook uses the small IMDB dataset and is based on the fastai-v2 ULMFiT tutorial. Huge thanks to Jeremy, Sylvain, Rachel and the fastai community for making this library what it is. I'm super excited about the additional flexibility v2 brings. 🎉

Dependencies 📥

If you haven't already, install HuggingFace's transformers library with: pip install transformers

#collapse
from fastai2.text.all import *  # fastai-v2 text API (untar_data, Tokenizer, Datasets, Learner, ...)
from transformers import AutoModelForSequenceClassification, AutoConfig, AutoTokenizer
import pandas as pd

path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
df = pd.read_csv(path/'texts.csv')

FastHugs Tokenizer

This tokenizer wrapper is initialised with the pretrained HF tokenizer; you can also specify max_seq_len if you want longer or shorter sequences. Given a text it returns the tokens and adds the special tokens (CLS/SEP) in the order expected by the model type being used.

#collapse
class FastHugsTokenizer():
    """ 
        transformer_tokenizer : takes the tokenizer that has been loaded from the tokenizer class
        model_name : model type set by the user
        max_seq_len : override default sequence length, typically 512 for bert-like models
        sentence_pair : whether a single sentence (sequence) or pair of sentences are used
    """
    def __init__(self, transformer_tokenizer=None, model_name = 'roberta', max_seq_len=None, 
                 sentence_pair=False, **kwargs): 
        self.tok, self.max_seq_len=transformer_tokenizer, max_seq_len
        if self.max_seq_len:
            if self.max_seq_len > self.tok.max_len: 
                print('WARNING: max_seq_len is larger than the model default transformer_tokenizer.max_len')
        if sentence_pair: self.max_seq_len=ifnone(max_seq_len, self.tok.max_len_sentences_pair) 
        else: self.max_seq_len=ifnone(max_seq_len, self.tok.max_len_single_sentence)
        self.model_name = model_name
        
    def do_tokenize(self, o:str):
        """Limit the sequence length and add the model's special tokens"""
        CLS, SEP = self.tok.cls_token, self.tok.sep_token
        
        # Add a prefix space, depending on the model selected
        if 'roberta' in self.model_name: tokens = self.tok.tokenize(o, add_prefix_space=True)[:self.max_seq_len]
        else: tokens = self.tok.tokenize(o)[:self.max_seq_len]
        
        # Order of 'tokens', 'SEP' and 'CLS' depends on the model
        if 'xlnet' in self.model_name: return tokens + [SEP] + [CLS]
        else: return [CLS] + tokens + [SEP]

    def __call__(self, items): 
        for o in items: yield self.do_tokenize(o)
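
A quick usage sketch (assuming the HF tokenizer and model_name defined in the HuggingFace section further down have already been created; the exact sub-word splits depend on the tokenizer):

fh_tok = FastHugsTokenizer(transformer_tokenizer=tokenizer, model_name=model_name)
toks = next(iter(fh_tok(['I loved this movie'])))
# For a RoBERTa tokenizer this should start with '<s>' (CLS) and end with '</s>' (SEP)
toks[0], toks[-1], len(toks)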

FastHugs Model

This nn.Module wraps the pretrained transformer model and initialises it with its config file.

The forward of this module is taken straight from Melissa's notebook mentioned above; its purpose is to create the attention mask and grab only the logits from the model's output (HuggingFace transformer models return a tuple that can also include things like the loss and hidden states).

#collapse
class FastHugsModel(nn.Module):
    'Inspired by https://www.kaggle.com/melissarajaram/roberta-fastai-huggingface-transformers/data'
    def __init__(self, transformer_cls, config_dict, n_class, pretrained=True):
        super(FastHugsModel, self).__init__()
        self.config = config_dict  
        self.config._num_labels = n_class
        # load model
        if pretrained: self.transformer = transformer_cls.from_pretrained(model_name, config=self.config)
        else: self.transformer = transformer_cls.from_config(config=self.config)
        
    def forward(self, input_ids, attention_mask=None):
        # The mask is built against token id 1, RoBERTa's padding id; other models
        # (e.g. BERT, which pads with 0) would need tokenizer.pad_token_id here instead
        attention_mask = (input_ids!=1).type(input_ids.type()) 
        logits = self.transformer(input_ids, attention_mask = attention_mask)[0] 
        return logits
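
As the comment above notes, the hard-coded 1 matches RoBERTa's padding token id. Purely as an illustration (this variant is not part of the FastHugs repo), the wrapper could instead take the tokenizer's pad token id as an argument:

class FastHugsModelFlexiblePad(nn.Module):
    'Sketch only: the same wrapper, but with a configurable padding token id'
    def __init__(self, transformer_cls, config_dict, n_class, pad_token_id=1, pretrained=True):
        super().__init__()
        self.config, self.pad_token_id = config_dict, pad_token_id
        self.config._num_labels = n_class
        if pretrained: self.transformer = transformer_cls.from_pretrained(model_name, config=self.config)
        else: self.transformer = transformer_cls.from_config(config=self.config)

    def forward(self, input_ids, attention_mask=None):
        # Mask out whatever id the tokenizer actually uses for padding
        attention_mask = (input_ids != self.pad_token_id).type(input_ids.type())
        return self.transformer(input_ids, attention_mask=attention_mask)[0]

You would then construct it with, e.g., pad_token_id=tokenizer.pad_token_id, and it should behave identically to FastHugsModel for RoBERTa.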

The HuggingFace bit

Define HuggingFace Model + Config

  • AutoModelForSequenceClassification will define our model. When this is passed to the FastHugsModel class above, the model will be instantiated and the weights downloaded (if you are using a pretrained model)
  • AutoConfig will define the model architecture and settings
  • model_name is the model architecture (and optionally model weights) you'd like to use.
    • Models tested: bert-base-uncased, roberta-base, distilbert-base-cased, albert-base-v2
    • You can find all of HuggingFace's models at https://huggingface.co/models, although not all of them are supported by AutoModel, AutoConfig and AutoTokenizer
model_name = 'roberta-base' 
model_class = AutoModelForSequenceClassification
config_dict = AutoConfig.from_pretrained(model_name)

HuggingFace Config changes

Some config settings can be changed even when using pretrained weights. For example, in the FastHugsModel class above, _num_labels is set when the model (pretrained or not) is instantiated, depending on how many classes your dataloader has.

When creating a non-pretrained model you can load a config with:

config_dict = AutoConfig.for_model(model_name)

Alternatively you could load a pretrained config and modify that. For example, if you are not using a pretrained model you can change the size of your input embeddings by setting config_dict.max_position_embeddings = 1024. (This won't work when using pretrained models, as the pretrained weights need the default max_position_embeddings size.)
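
Putting those two steps together, a minimal sketch of the from-scratch case (illustrative values only):

# Only relevant when training from scratch (pretrained=False in FastHugsModel)
config_dict = AutoConfig.for_model('roberta')   # un-pretrained RoBERTa-style config
config_dict.max_position_embeddings = 1024      # larger input embedding size than the default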

HuggingFace Tokenizer & Vocab

  • AutoTokenizer will load our tokenizer and enable us to grab our vocab

fastai expects the vocab to be a list; however, HuggingFace's get_vocab returns a token : index dict. We need to convert this dict to a list, ordered by index, to be able to use it in fastai.

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_vocab=tokenizer.get_vocab() 
tokenizer_vocab_ls = [k for k, v in sorted(tokenizer_vocab.items(), key=lambda item: item[1])]
len(tokenizer_vocab_ls)
50265

The Fastai bit

Get fastai model splitter function

In order to be able to fine-tune our classifier head we first need to split the HuggingFace model's classifier head from the body. These functions are architecture-specific and can be found in splitters.py of this repo.

splitter_nm = model_name.split('-')[0] + '_cls_splitter'
model_splitter = splitters[splitter_nm]
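
For reference, a splitter for the wrapped RoBERTa model might look roughly like the sketch below. This is only an illustration (it assumes the HF model exposes a .roberta backbone and a .classifier head); see splitters.py in the repo for the actual functions.

def roberta_cls_splitter_sketch(m):
    'Group the transformer backbone separately from the classification head'
    backbone = nn.Sequential(m.transformer.roberta.embeddings, m.transformer.roberta.encoder)
    head = nn.Sequential(m.transformer.classifier)
    return L(backbone, head).map(params)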

fasthugstok and our tok_fn

Let's incorporate the tokenizer from HuggingFace into fastai-v2's framework by specifying a function called fasthugstok that we can then pass on to Tokenizer.from_df. (Note: .from_df is the only method I have tested.)

Max Sequence Length

max_seq_len is the longest sequence our tokenizer will output. We can set the maximum sequence length for the tokenizer by changing max_seq_len; when left as None it uses the tokenizer's default, typically 512. 1024 or even 2048 can also be used, depending on your GPU memory. Note that when using pretrained models you won't be able to use a max_seq_len larger than the default.

max_seq_len = None  
sentence_pair=False

fasthugstok = partial(FastHugsTokenizer, transformer_tokenizer=tokenizer, model_name=model_name, 
                      max_seq_len=max_seq_len, sentence_pair=sentence_pair)

Set up fastai's Tokenizer.from_df; we pass rules=[] to override fastai's default text-processing rules.

fastai_tokenizer = Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok, rules=[])

Setup Dataloaders

Create Dataset

Let's add our custom tokenizer (fastai_tokenizer) and transformer vocab (tokenizer_vocab_ls) here

splits = ColSplitter()(df)
x_tfms = [attrgetter("text"), fastai_tokenizer, Numericalize(vocab=tokenizer_vocab_ls)]
dsets = Datasets(df, splits=splits, tfms=[x_tfms, [attrgetter("label"), Categorize()]], dl_type=SortedDL)

Padding

We need to make sure our padding is done correctly, as some transformer models prefer padding on the left while others prefer it on the right. tokenizer.padding_side will tell us which side is correct; e.g. BERT and RoBERTa prefer padding on the right, so we set pad_first=False.

#collapse
def transformer_padding(tokenizer=None, max_seq_len=None, sentence_pair=False): 
    if tokenizer.padding_side == 'right': pad_first=False
    else: pad_first=True
    max_seq_len = ifnone(max_seq_len, tokenizer.max_len) 
    return partial(pad_input_chunk, pad_first=pad_first, pad_idx=tokenizer.pad_token_id, seq_len=max_seq_len)

Dataloaders

bs = 4
padding=transformer_padding(tokenizer)
dls = dsets.dataloaders(bs=bs, before_batch=[padding])
dls.show_batch(max_n=3, trunc_at=60)
text category
0 <s> ĠI Ġwas Ġfortunate Ġenough Ġto Ġmeet ĠGeorge ĠPal Ġ( and Ġstill Ġhave Ġmy ĠDS : TM OB Ġposter Ġaut ographed Ġby Ġhim ) Ġat Ġa Ġconvention Ġshortly Ġafter Ġthe Ġrelease , Ġand Ġasked Ġhim Ġwhy Ġhe Ġchose Ġto Ġdo Ġthe Ġfilm Ġ" camp ". ĠBefore Ġhe Ġcould Ġanswer , Ġtwo Ġstudio Ġfl acks Ġintercepted Ġand Ġlect ured Ġme Ġon negative
1 <s> ĠD ressed Ġto ĠKill Ġstarts Ġoff Ġwith ĠKate ĠMiller Ġ( Ang ie ĠDickinson ) Ġhaving Ġa Ġsexually Ġexplicit Ġnightmare , Ġlater Ġon Ġthat Ġday Ġshe Ġvisits Ġher Ġpsychiatrist ĠDr . ĠRobert ĠElliott Ġ( Michael ĠC aine ) Ġfor Ġa Ġsession Ġin Ġwhich Ġshe Ġadmits Ġto Ġbe Ġsexually Ġfrustrated Ġ& Ġun ful filled Ġin Ġher Ġcurrent Ġmarriage . ĠKate Ġthen positive
2 <s> ĠSHALL OW ĠG RA VE Ġbegins Ġwith Ġeither Ġa Ġtribute Ġor Ġa Ġrip Ġoff Ġof Ġthe Ġshower Ġscene Ġin ĠPS Y CHO . Ġ( I 'm Ġleaning Ġtoward Ġrip Ġoff .) ĠAfter Ġthat Ġit Ġgets Ġworse Ġand Ġthen Ġsurprisingly Ġgets Ġbetter , Ġalmost Ġto Ġthe Ġpoint Ġof Ġbeing Ġoriginal . ĠBad Ġacting Ġand Ġamateur ish Ġdirecting Ġbog Ġdown Ġa negative

(Alternatively) Factory dataloader

Here we set:

  • tok_tfm=fastai_tokenizer to use our HF tokenizer
  • text_vocab=tokenizer_vocab_ls to load our pretrained vocab
  • before_batch=[padding] (i.e. our transformer_padding(tokenizer) function) to use our custom padding function
fct_dls = TextDataLoaders.from_df(df, text_col="text", tok_tfm=fastai_tokenizer, text_vocab=tokenizer_vocab_ls,
                              before_batch=[padding], label_col='label', valid_col='is_valid', bs=bs)
fct_dls.show_batch(max_n=3, trunc_at=60)
text category
0 <s> ĠI Ġwas Ġfortunate Ġenough Ġto Ġmeet ĠGeorge ĠPal Ġ( and Ġstill Ġhave Ġmy ĠDS : TM OB Ġposter Ġaut ographed Ġby Ġhim ) Ġat Ġa Ġconvention Ġshortly Ġafter Ġthe Ġrelease , Ġand Ġasked Ġhim Ġwhy Ġhe Ġchose Ġto Ġdo Ġthe Ġfilm Ġ" camp ". ĠBefore Ġhe Ġcould Ġanswer , Ġtwo Ġstudio Ġfl acks Ġintercepted Ġand Ġlect ured Ġme Ġon negative
1 <s> Ġ**** Don 't Ġread Ġthis Ġreview Ġif Ġyou Ġwant Ġthe Ġshocking Ġconclusion Ġof Ġ" The ĠCr ater ĠLake ĠMonster " Ġto Ġbe Ġa Ġtotal Ġsurprise **** < br Ġ/ >< br Ġ/> A Ġclay m ation Ġpl es ios aur Ġrises Ġfrom Ġthe Ġdepths Ġof ĠCr ater ĠLake Ġto Ġwre ak Ġhavoc Ġon Ġa Ġgroup Ġof Ġlocal Ġred ne negative
2 <s> ĠThis Ġis Ġthe Ġlast Ġof Ġfour Ġsw ash buck lers Ġfrom ĠFrance ĠI 've Ġscheduled Ġfor Ġviewing Ġduring Ġthis ĠChristmas Ġseason : Ġthe Ġothers Ġ( in Ġorder Ġof Ġviewing ) Ġwere Ġthe Ġun inspired ĠTHE ĠBLACK ĠT UL IP Ġ( 1964 ; Ġfrom Ġthe Ġsame Ġdirector Ġas Ġthis Ġone Ġbut Ġnot Ġnearly Ġas Ġgood ), Ġthe Ġsurprisingly Ġeffective ĠL positive

Create our learner

opt_func = partial(Adam, decouple_wd=True)
loss = LabelSmoothingCrossEntropy()

fasthugs_model = FastHugsModel(transformer_cls=model_class, config_dict=config_dict, n_class=dls.c, pretrained=True)

learn = Learner(dls, fasthugs_model, opt_func=opt_func, splitter=model_splitter, 
                loss_func=loss, metrics=[accuracy]).to_fp16()

Stage 1 training

Let's freeze the model backbone and only train the classifier head; freeze_to(1) means that only the classifier head is trainable.

learn.freeze_to(1)  

Let's find a learning rate to train our classifier head

learn.lr_find(suggestions=True)
SuggestedLRs(lr_min=9.999999747378752e-07, lr_steep=0.10000000149011612)
learn.recorder.plot_lr_find()
plt.vlines(9.999e-7, 0.65, 1.1)
plt.vlines(0.10, 0.65, 1.1)
learn.fit_one_cycle(3, lr_max=1e-3)
epoch train_loss valid_loss accuracy time
0 0.692172 0.653464 0.550000 00:07
1 0.573368 0.591558 0.635000 00:07
2 0.522324 0.533852 0.810000 00:07
learn.save('roberta-fasthugs-stg1-1e-3')
learn.recorder.plot_loss()

Stage 2 training

And now let's train the full model with differential learning rates

learn.unfreeze()
learn.lr_find(suggestions=True)
SuggestedLRs(lr_min=6.309573450380412e-08, lr_steep=0.03981071710586548)
learn.recorder.plot_lr_find()
plt.vlines(6.30e-8, 0.6, 1.2)
plt.vlines(0.039, 0.6, 1.2)
learn.fit_one_cycle(3, lr_max=slice(1e-5, 1e-4))
epoch train_loss valid_loss accuracy time
0 0.425518 0.354511 0.910000 00:31
1 0.278425 0.372734 0.910000 00:32
2 0.272590 0.366681 0.925000 00:31
learn.save('roberta-fasthugs-stg2-3e-5')
learn.recorder.plot_loss()

Let's look at the model's predictions

learn.predict("This was a really good movie, i loved it")
('positive', tensor(1), tensor([0.1498, 0.8502]))
from fastai2.interpret import *
#interp = Interpretation.from_learner(learn)
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(3)
input target predicted probability loss
0 <s> ĠThis Ġmovie Ġis Ġhorrible - Ġin Ġa Ġ' so Ġbad Ġit 's Ġgood ' Ġkind Ġof Ġway .< br Ġ/ >< br Ġ/> The Ġstoryline Ġis Ġre h ashed Ġfrom Ġso Ġmany Ġother Ġfilms Ġof Ġthis Ġkind , Ġthat ĠI 'm Ġnot Ġgoing Ġto Ġeven Ġbother Ġdescribing Ġit . ĠIt 's Ġa Ġsword / s or cery Ġpicture , Ġhas Ġa Ġkid Ġhoping Ġto Ġrealize Ġhow Ġimportant Ġhe Ġis Ġin Ġthis Ġworld , Ġhas Ġa Ġ" nom adic " Ġadventurer , Ġan Ġevil Ġaide / s orce rer , Ġa Ġprincess , Ġa Ġhairy Ġcreature .... you Ġget Ġthe Ġpoint .< br Ġ/ >< br Ġ/> The Ġfirst Ġtime ĠI Ġcaught Ġthis Ġmovie Ġwas Ġduring Ġa Ġvery Ġharsh Ġwinter . ĠI Ġdon 't Ġknow Ġwhy ĠI Ġdecided Ġto Ġcontinue Ġwatching Ġit Ġfor Ġan Ġextra Ġfive Ġminutes Ġbefore Ġturning Ġthe Ġchannel , Ġbut Ġwhen ĠI Ġcaught Ġsite Ġof ĠGulf ax positive negative 0.970687747001648 3.354750156402588
1 <s> ĠIn Ġ17 th ĠCentury ĠJapan , Ġthere Ġlived Ġa Ġsamurai Ġwho Ġwould Ġset Ġthe Ġstandard Ġfor Ġthe Ġages . ĠHis Ġname Ġwas ĠMay eda . ĠHe Ġis Ġsent Ġon Ġan Ġepic Ġjourney Ġacross Ġthe Ġworld Ġto Ġacquire Ġ5 , 000 Ġmus cats Ġfrom Ġthe ĠKing Ġof ĠSpain . ĠWhilst Ġat Ġsea Ġa Ġviolent Ġstorm Ġswall ows Ġtheir Ġprecious Ġgold Ġintended Ġto Ġbuy Ġthe Ġweapons Ġand Ġalmost Ġtakes Ġtheir Ġlives . ĠMay eda Ġmust Ġbattle Ġall Ġodds Ġto Ġsurvive Ġand Ġthe Ġsecure Ġthe Ġfate Ġof Ġhis Ġbeloved ĠJapan . ĠShogun ĠMay eda Ġis Ġa Ġmulti Ġmillion Ġdollar Ġaction Ġadventure Ġepic Ġset Ġacross Ġthree Ġcontinents .< br Ġ/ >< br Ġ/> Star ring Ġcinema Ġlegends ĠSho ĠKos ugi Ġ( T ench u : ĠStealth ĠAssassins ), ĠChristopher ĠLee Ġ( Star ĠWars , ĠLord Ġof Ġthe ĠRings ĠTrilogy ), ĠJohn ĠRh ys ĠDavies Ġ( Lord Ġof Ġthe ĠRings ĠTrilogy , ĠIndiana ĠJones negative positive 0.9588471055030823 3.033039093017578
2 <s> Ġ" How ĠTo ĠLose ĠFriends Ġ& ĠAlien ate ĠPeople " Ġis Ġnot Ġbased Ġon ĠTiger ĠWoods ' Ġinf idel ities . ĠIt Ġis Ġa Ġmediocre Ġromantic Ġcomedy Ġbased Ġon ĠToby ĠYoung 's Ġbook Ġon Ġhis Ġexperiences Ġworking Ġas Ġa Ġjournalist Ġcovering Ġcelebrities . ĠThe Ġfilm Ġstars ĠSimon ĠPe gg Ġas ĠSidney ĠYoung , Ġa Ġz any ĠBritish Ġjournalist Ġwho Ġtakes Ġa Ġjob Ġin Ġan Ġillustrious Ġcelebrity Ġmagazine Ġin ĠNew ĠYork . ĠYoung Ġis Ġrestless Ġin Ġgetting Ġcaught Ġup Ġall Ġtype Ġof Ġshenanigans Ġto Ġalien ate Ġall Ġaround Ġhim , Ġhence Ġmovie Ġtitle . ĠHe Ġis Ġupro arious , Ġdaring , Ġand Ġmor onic . ĠBut Ġnevertheless Ġfor Ġsome Ġvery Ġbizarre Ġreason , Ġhe Ġis Ġa Ġsomewhat Ġlik able Ġcharacter . ĠSidney Ġbe friends Ġa Ġfellow Ġjournalist , Ġthe Ġcomposed ĠAlison ĠOlsen , Ġplayed Ġquite Ġadm ir ably Ġby ĠKirst en ĠDun st . ĠHowever , ĠSidney Ġis Ġprimarily Ġlonging positive negative 0.942081868648529 2.7092723846435547