This post has been updated to show how to use the normalizer functions from HuggingFace's tokenizers library for your text pre-processing

In this post, I'll cover the following using the HuggingFace Datasets library:

  • Loading data, single or multiple files, csv, txt or dataframes, train/test splits
  • Processing data with 11 text processing functions
  • Tokenizing data for use with MobileBERT
  • Saving processed data to disk
  • Datasets tips and tricks along the way
    Note: Click the Colab button to open this notebook in Google Colab and run it end to end. This script was written with Transformers 3.3.1, Datasets 1.1 and PyTorch 1.6

I would love to hear your feedback on what could have been written better or clearer; let me know what you think on Twitter: @mcgenergy

!transformers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 3.3.1
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.7.5
- PyTorch version (GPU?): 1.4.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

HuggingFace Datasets Library

Why Should I use this "Datasets" library?

Let's see what the docs have to say:

  • Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2

  • Lightweight and fast with a transparent and pythonic API

  • Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.

  • Smart caching: never wait for your data to process several times

  • 🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

  • You can browse the full set of datasets with the live 🤗Datasets viewer

My fav

For me personally, I am irrationally fond of this library. It just has so many useful features for handling your text data! I have really enjoyed the speed of data processing and the fact that caching means that running your processing a second time is lightning fast! I've spent about 6 weeks working with it and I feel I've only scratched the surface of what it can do in some areas.

So, huge kudos to the team working on Datasets; the library and docs are now really great! But enough of what I think, let's get stuck into some data processing, woop woop!

Let's Go 🚦

Let's start our guide to using the Datasets library to get your data ready to train. Note that a couple of the examples in this post are taken from the 🤗 Datasets docs, because "why fix it if it ain't broken!".

To start, let's install the library with a handy-to-remember pip install:

!pip install datasets --upgrade

Loading our Data

Now we have the library, let's load a dataset. If we are loading from one or more .txt or .csv files, we can load them like so:

from datasets import load_dataset

# Single file
dataset = load_dataset('text', data_files='my_file.txt')
# Multiple files
dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])

Train/Test Split

Train/Test Split by File

If we would like to define our own train/test split there are a few different ways to do that. If your training data is already split by file, we can do the following:

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})

Splitting a Single File

Alternatively, we can split a single file ourselves. Let's grab some Shakespeare text from Andrej Karpathy. Because this is a single file, let's do an 80/20 train/test split.

# collapse-hide
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

We can see that after loading, this dataset is a DatasetDict with a single key called train, which in turn holds a Dataset object with a single column called text and 40,000 rows of text:

#collapse-hide
import datasets

full_ds = datasets.load_dataset('text', data_files='input.txt')
full_ds

DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 40000)})

To inspect the text, we'll have to index into the dictionary with the train key and the name of the column we'd like to look at:

full_ds['train'][:10]['text']
['First Citizen:',
 'Before we proceed any further, hear me speak.',
 '',
 'All:',
 'Speak, speak.',
 '',
 'First Citizen:',
 'You are all resolved rather to die than to famish?',
 '',
 'All:']

Tip: You can specify the cache_dir when loading a dataset if the default cache in your root directory has limited disk space. For example, when processing large files on Kaggle your working directory has a 5GB limit, however ../../tmp has a much higher limit which you can use for your active session.
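For example, a minimal sketch of this tip (the ../../tmp/hf_cache path here is just an illustrative Kaggle location; adjust it to your environment):

# Point the Datasets cache at a location with more disk space
dataset = load_dataset('text', data_files='input.txt', cache_dir='../../tmp/hf_cache')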

Loading only a small section of our data file

If we only want to take a small part of the dataset to enable us to develop rapidly, we can specify the number of rows we would like to load; let's take 400 rows for example. Here we use the ReadInstruction method; have a look through the docs for even more interesting ways to use this.

from datasets import ReadInstruction

mini_ds = load_dataset('text', data_files='input.txt', split=ReadInstruction('train', from_=0, to=400, unit='abs'))
mini_ds
Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 400)

80/20 Split

Since this is a single block of text, let's create an 80/20 train/test split for ourselves by specifying a split when loading the data, like so: split=['train[:80%]']. There are additional useful examples of splits, such as K-fold cross validation (see the sketch after the output below), in the docs here

train_ds = datasets.load_dataset('text', data_files='input.txt', split=['train[:80%]'])[0]
val_ds = datasets.load_dataset('text', data_files='input.txt', split=['train[80%:]'])[0]

train_ds, val_ds
(Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 32000),
 Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 8000))
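As an aside, the K-fold cross validation pattern mentioned above uses the same percent-slicing syntax. A sketch along the lines of the docs' example, producing 10 folds (the val_folds/train_folds names are just mine):

# 10 validation folds of 10% each, plus the matching 90% training folds
val_folds = datasets.load_dataset('text', data_files='input.txt',
                                  split=[f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)])
train_folds = datasets.load_dataset('text', data_files='input.txt',
                                    split=[f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)])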

Selecting Specific Row Indices

If we like, we can also specify the exact rows we would like to extract using select() on an already-loaded dataset. Here we select 50 random indices from the full dataset

import numpy as np

# 50 random integer row indices (np.random.rand would give floats between 0 and 1, not valid indices)
r = np.random.randint(0, len(full_ds['train']), 50).tolist()
rand_dataset = full_ds['train'].select(r)
rand_dataset
Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 50)

This covers some typical ways one might want to load data; however, there are many more options to explore, including loading from pandas dataframes (a quick sketch below) and creating your own loading script. See the docs for more.
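For example, loading from a pandas dataframe is a one-liner with Dataset.from_pandas (the df below is just a stand-in dataframe with a text column):

import pandas as pd
from datasets import Dataset

# Stand-in dataframe for illustration
df = pd.DataFrame({'text': ['First Citizen:', 'Speak, speak.']})
pandas_ds = Dataset.from_pandas(df)
pandas_ds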

Processing our Data

[UPDATE] See the next section below for how to use the normalizers from the HuggingFace tokenizers library to do some of this pre-processing even faster!

Now we have data loaded, let's take a look at some processing options. .map() will be the main tool we'll use to apply processing functions to our text. Note there are additional modifications you can make, including shuffling and sorting with .shuffle() and .sort() respectively, but I'll leave those to you to explore in the docs 🔎
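Just to give a quick flavour of those two before moving on (a minimal sketch; the seed and sort column here are arbitrary choices):

# Shuffle rows with a fixed seed, or sort by a column
shuffled_ds = train_ds.shuffle(seed=42)
sorted_ds = train_ds.sort('text')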

The map Function

map applies a function to our dataset. Below you can see how to lowercase our data by passing the lower_case function to map. When applying map you can choose to feed your function a batch of items (with batched=True) or a single item. You can also adjust the batch size; the default is 1000. Feeding batches can be handy when using functions like tokenizers that can efficiently process batches. Note that the processing functions below are structured to handle either case: each one listifies its input so the same loop works whether map passes a single example or a batch.
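These processing functions rely on a small _listify helper that isn't shown in this post; a minimal version that does what the functions need looks like this:

def _listify(x):
    # Wrap a single string in a list so the same loop handles both a
    # batch of examples (batched=True) and a single example (batched=False)
    return [x] if isinstance(x, str) else x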

def lower_case(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(e.lower())
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

train_ds = train_ds.map(lower_case, batched=True)
print(' '.join(train_ds['text'][200:203]))
note me this, good friend; your most grave belly was deliberate, not rash like his accusers, and thus answer'd:

11 Processing Functions Ready to Use with Datasets

Below are 11 useful text processing functions that you might need as part of your workflow: from HTML removal to punctuation fixes, replacing username handles (e.g. Twitter handles), dealing with emojis and more.

#collapse-hide

'''
  Below are a selection of often useful processing functions to apply to your text. 
  As currently written, these functions require that your text column in your dataset
  is called "text"

  The functions are written to be able to deal with either a batch of samples 
  being passed or a single sample being passed.

  Most pre-processing functions are taken from the covid-twitter-bert processing file, here:
      https://github.com/digitalepidemiologylab/covid-twitter-bert/blob/d5a87550bb9d2424672d1ea56c84786f462321a3/utils/preprocess.py
  or else from fastai's processing rules here:
      https://docs.fast.ai/text.core#Preprocessing-rules
'''

# imports needed for the regexes and processing functions below
import re
import html
import unicodedata

import emoji
import unidecode
from transformers import AutoTokenizer

# compile regexes
username_regex = re.compile(r'(^|[^@\w])@(\w{1,15})\b')
url_regex = re.compile(r'((www\.[^\s]+)|(https?://[^\s]+)|(http?://[^\s]+))')
control_char_regex = re.compile(r'[\r\n\t]+')

# Get unk character from your tokenizer of choice
tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased')
unk = tokenizer.special_tokens_map['unk_token']

# processing functions
def standardise_punc(example):
    transl_table = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-",  u"'''\"\"--")])
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(e.translate(transl_table))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_control_char(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(re.sub(control_char_regex, ' ', e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_remaining_control_chars(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(''.join(ch for ch in e if unicodedata.category(ch)[0] != 'C'))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_multi_space(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(' '.join(e.split()))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_accented_characters(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(unidecode.unidecode(e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def remove_unicode_symbols(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(''.join(ch for ch in e if unicodedata.category(ch)[0] != 'So'))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def lower_case(example):
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append(e.lower())
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def replace_usernames(example):
    filler,tmp_ls = '<user>',[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        occ = e.count('@')
        for _ in range(occ):
            e = e.replace('@<user>', f'{filler}')
            e = re.sub(username_regex, filler, e)    # replace other user handles by filler
            e = e.replace(filler, f' {filler} ')     #  add spaces between, and remove double spaces again
            e = ' '.join(e.split())
        tmp_ls.append(e)
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def replace_urls(example):
    filler,tmp_ls = '<url>',[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        occ = e.count('www.') + e.count('http:') + e.count('https:')
        for _ in range(occ):
            e = re.sub(url_regex, filler, e)    # replace other urls by filler
            e = e.replace(filler, f' {filler} ')    # add spaces between, and remove double spaces again
            e = ' '.join(e.split())
        tmp_ls.append(e)
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

def asciify_emojis(example):
    """
    Converts emojis into text aliases. E.g. 👍 becomes :thumbs_up:
    For a full list of text aliases see: https://www.webfx.com/tools/emoji-cheat-sheet/
    """
    tmp_ls = []
    example['text'] = _listify(example['text']) 
    for e in example['text']: 
        tmp_ls.append(emoji.demojize(e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

    
def fix_html(example):
    "From fastai: 'Fix messy things we've seen in documents'"
    tmp_ls = []
    example['text'] = _listify(example['text']) 
    for e in example['text']: 
        e = e.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace('nbsp;', ' ').replace(
        '#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace('<br />', "\n").replace(
        '\\"', '"').replace('<unk>',unk).replace(' @.@ ','.').replace(' @-@ ','-').replace('...',' …')
        tmp_ls.append(html.unescape(e))
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

Tip: To keep your code a little cleaner you could compose your processing functions together into a single function, so that you only have to apply map once instead of calling it multiple times. In the example below I use the compose function from the fastcore library.

# Lets add "yo!" to the beginning of each of our items
def add_yo(example):
    '''Add "yo! " to each example'''
    tmp_ls=[]
    example['text'] = _listify(example['text']) 
    for e in example['text']:
        tmp_ls.append('yo! ' + e)
    if len(tmp_ls) == 1: return {'text': tmp_ls[0]}
    else: return {'text': tmp_ls}

from fastcore.utils import compose

# Compose our lower_case and add_yo functions
my_processing_funcs = compose(*[lower_case, add_yo])

# Apply both functions with map
train_ds = train_ds.map(my_processing_funcs, batched=True)

# We have lowercased and added "yo!" to each item in a single call to map!
train_ds['text'][200:203]
['yo! yo! note me this, good friend;',
 'yo! yo! your most grave belly was deliberate,',
 "yo! yo! not rash like his accusers, and thus answer'd:"]

#collapse-hide

'''
   Do processing of the train and validation set
'''

do_batched = True

train_ds = train_ds.map(fix_html, batched=do_batched)
train_ds = train_ds.map(lower_case, batched=do_batched)
train_ds = train_ds.map(standardise_punc, batched=do_batched)
train_ds = train_ds.map(remove_control_char, batched=do_batched)
train_ds = train_ds.map(remove_remaining_control_chars, batched=do_batched)
train_ds = train_ds.map(remove_multi_space, batched=do_batched)
train_ds = train_ds.map(remove_accented_characters, batched=do_batched)
train_ds = train_ds.map(remove_unicode_symbols, batched=do_batched)
train_ds = train_ds.map(replace_usernames, batched=do_batched)
train_ds = train_ds.map(replace_urls, batched=do_batched)
train_ds = train_ds.map(asciify_emojis, batched=do_batched)    # 3-4x slower than the others

val_ds = val_ds.map(fix_html, batched=do_batched)
val_ds = val_ds.map(lower_case, batched=do_batched)
val_ds = val_ds.map(standardise_punc, batched=do_batched)
val_ds = val_ds.map(remove_control_char, batched=do_batched)
val_ds = val_ds.map(remove_remaining_control_chars, batched=do_batched)
val_ds = val_ds.map(remove_multi_space, batched=do_batched)
val_ds = val_ds.map(remove_accented_characters, batched=do_batched)
val_ds = val_ds.map(remove_unicode_symbols, batched=do_batched)
val_ds = val_ds.map(replace_usernames, batched=do_batched)
val_ds = val_ds.map(replace_urls, batched=do_batched)
val_ds = val_ds.map(asciify_emojis, batched=do_batched)
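As an aside, you could cut down on the repetition in the cell above by looping over the same functions; a roughly equivalent sketch:

processing_funcs = [fix_html, lower_case, standardise_punc, remove_control_char,
                    remove_remaining_control_chars, remove_multi_space,
                    remove_accented_characters, remove_unicode_symbols,
                    replace_usernames, replace_urls, asciify_emojis]

# Apply each processing function to both splits in turn
for fn in processing_funcs:
    train_ds = train_ds.map(fn, batched=do_batched)
    val_ds = val_ds.map(fn, batched=do_batched)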

[UPDATE] Using normalizers from the tokenizers library for your preprocessing

The day I originally published this article, Sylvain Gugger at HuggingFace also tweeted that the tokenizers library had been updated, including updated docs

The new docs outline a number of "normalizer" functions similar to the preprocessing functions above, such as lowercasing, stripping whitespace etc. It turns out they were already in the library but not documented! So here is a quick update on how to use these functions as part of your pre-processing workflow.

Available Normalizers

As of writing, the normalizers available according to the docs are:

  • NFD, NFKD, NFC: NFD, NFKD and NFC unicode normalization algorithms **

  • Lowercase: Replaces all uppercase to lowercase

  • Strip: Removes all whitespace characters on the specified sides (left, right or both) of the input

  • StripAccents: Removes all accent symbols in unicode (to be used with NFD for consistency)

  • Replace: Replaces a custom string or regexp and changes it with given content

** I'm not familiar with these normalizers, but if it is any help, the documentation uses NFD in their BERT tokenizer example

Applying a Normalizer to a string

We can apply a normalizer to a string by instantiating it and then calling .normalize_str, like so:

from tokenizers.normalizers import Lowercase

lc = Lowercase()
lc.normalize_str('ho wy iii KKKK')
'ho wy iii kkkk'

Applying normalizers to Datasets

Applying normalizers with .map is also quite straightforward. Note that we do not use map's batching functionality here as normalize_str requires that a string be passed to it.

tmp = train_ds.map(lambda e: {'text' : lc.normalize_str(e['text'])}, batched=False)
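If you do want to keep map's batching, you can loop over the batch yourself and call normalize_str on each string; roughly:

tmp = train_ds.map(lambda batch: {'text': [lc.normalize_str(t) for t in batch['text']]}, batched=True)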

Composing Normalizers and Applying to Datasets

Below we compose multiple normalizers: Lowercase, NFD Unicode normalization and StripAccents. To do this we simply pass a list of our normalizers to tokenizers.normalizers.Sequence, which applies each of the given normalizers in the order given.

import tokenizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents

# Compose our normalizers
normalizer = tokenizers.normalizers.Sequence([Lowercase(), NFD(), StripAccents()])

# Apply to string (example shamelessly copied from the tokenizers docs)
print(normalizer.normalize_str("HΓ©llΓ² hΓ΄w are ΓΌ?"))

# Apply to Dataset
tmp = train_ds.map(lambda e: {'text' : normalizer.normalize_str(e['text'])}, batched=False)
hello how are u?

We can then even attach this normalizer to our tokenizer!

tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased')

tokenizer.normalizer = normalizer

tokenizer.normalizer.normalize_str("HΓ©llΓ² hΓ΄w are ΓΌ?")
'hello how are u?'

After processing our data with all the pre-processing/normalizer functions above (click the button to show all funcs used), we're now ready for tokenization!

Tokenization

Combining HuggingFace's "Fast" tokenizers with the Datasets library is a real dream; the speed is something else! Here we'll instantiate a tokenizer compatible with the MobileBERT model.

Tip: HuggingFace’s AutoTokenizer class makes loading tokenizers super simple, removing the need to import the specific tokenizer class for each different model you use. AutoModel is the equivalent for model loading and we’ll use that in the next part of this series

tokenizer = AutoTokenizer.from_pretrained('google/mobilebert-uncased', return_dict=True)

'lambda' and 'map'

Here we use a lambda function with map to apply the tokenizer to the train and validation sets. With HuggingFace tokenizers we have options such as adding padding, truncating the text, setting a max_length and more. We use batched=True to take full advantage of our tokenizer's ability to handle batches.

Tip: In order to save precious GPU memory when training some of the -large transformer models, I found truncating the training text and setting a max length to be really useful. It’s worth experimenting with: if your text has very long sequences then truncation might degrade performance to an unacceptable level. In my case I was dealing with tweet data so I knew I wasn’t chopping too much from my texts. I didn’t truncate the validation text as the evaluation phase is generally less memory intensive than the training phase, so the model could handle the full text. When pursuing this strategy you’ll want to consider whether to validate against the full text or the truncated text.

Given the above, let's do our tokenization like so:

train_ds = train_ds.map(lambda e: tokenizer(e['text'], padding=False, truncation=True, max_length=200), batched=True)
val_ds = val_ds.map(lambda e: tokenizer(e['text'], padding=True, truncation=False), batched=True)

Set Format

After tokenization, our tokenized data are all in plain Python lists. To be able to use them in our model we need to encode the data as either PyTorch or TensorFlow tensors. Here we convert the relevant columns to PyTorch tensors; we could set type="tensorflow" (or "tf") if we were using TensorFlow instead (see the sketch below). You can see we also specify only a subset of our columns, as that is all that is needed for training our model.

train_ds.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
val_ds.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
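For completeness, the TensorFlow equivalent would look something like the following; we don't run it here since the rest of this post sticks with PyTorch:

# TensorFlow alternative (not run in this post)
train_ds.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask'])
val_ds.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask'])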

Sweet, Now Let Me Go Training!

Our data has been loaded, processed, tokenized and formatted, so you are now go for training, right? Well, one more thing you might want to think about before jumping into your modelling is whether you need to use your data on different machines...

Saving and Loading Data

If you typically only use one machine consistently there is probably no need to save your data as Datasets keeps a cache of everything you have done to it.

However, if your processing takes a significant amount of time and you need to move your data between machines, for example if you are using Kaggle notebooks, then I recommend saving your data for easy loading like so:

train_ds.save_to_disk('20M_processed_tokenized_pt_train_dataset')

You can then easily load your data again like so:

from datasets import load_from_disk

train_ds = load_from_disk('20M_processed_tokenized_pt_train_dataset')

Ready to Train 🎉

Now that our data is loaded, processed, tokenized and formatted we are ready to train! Check out the next part in this series to see how we fine-tune our Transformer Language Model!

Coming Up in Post 2: Training your Language Model Transformer with 🤗 Trainer

Coming up in Post 2:

  • Getting your data collator
  • Setting up all Training Arguments
  • Make sure Weights and Biases is tracking what you need
  • Training a MobileBERT model
  • Training on TPUs
  • Saving your model

Thanks for Reading This Far 🙏

As always, I would love to hear your feedback on what could have been written better or clearer; you can find me on Twitter: @mcgenergy