Introduction: What is Tokenization in NLP?
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods, such as the Count Vectorizer, and advanced deep learning architectures, such as Transformers.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be words, characters, or subwords. Hence, tokenization can be broadly classified into three types: word, character, and subword (n-gram character) tokenization.
For example, consider the sentence: “Never lose hope”.
The most common way of forming tokens is based on space. Assuming space as the delimiter, tokenizing the sentence results in three tokens: “Never”, “lose”, and “hope”. As each token is a word, this is an example of word tokenization.
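As a minimal sketch, this kind of whitespace-based word tokenization takes only a line of Python:

```python
sentence = "Never lose hope"

# str.split() with no arguments splits on any run of whitespace.
word_tokens = sentence.split()
print(word_tokens)  # ['Never', 'lose', 'hope']
```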
Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”:
Character tokens: s-m-a-r-t-e-r
Subword tokens: smart-er
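Here is a quick Python illustration of both. Note that the subword split below is hard-coded purely for this example; real subword tokenizers such as BPE or WordPiece learn their splits from a corpus:

```python
word = "smarter"

# Character tokenization: one token per character.
char_tokens = list(word)
print(char_tokens)  # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: a hand-picked split, shown for illustration only.
subword_tokens = ["smart", "er"]
print(subword_tokens)  # ['smart', 'er']
```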
As tokens are the building blocks of natural language, the most common way of processing raw text is at the token level.
For example, Transformer-based models, the state-of-the-art (SOTA) deep learning architectures in NLP, process raw text at the token level. Similarly, other popular deep learning architectures for NLP, such as RNNs, GRUs, and LSTMs, also process raw text at the token level.
Hence, tokenization is the foremost step when modeling text data. Tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary. The vocabulary is the set of unique tokens in the corpus. Remember that a vocabulary can be constructed either from every unique token in the corpus or from only the top K most frequently occurring words.
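Here is a minimal sketch of both vocabulary-building strategies, using Python’s collections.Counter on a made-up toy corpus:

```python
from collections import Counter

# Toy corpus; a real corpus would be far larger.
corpus = ["never lose hope", "never give up", "hope wins"]

# Tokenize each document on whitespace and count token frequencies.
counts = Counter(token for doc in corpus for token in doc.split())

# Strategy 1: the vocabulary is every unique token in the corpus.
full_vocab = set(counts)
print(full_vocab)  # e.g. {'never', 'lose', 'hope', 'give', 'up', 'wins'} (set order varies)

# Strategy 2: keep only the top K most frequent tokens (here K = 3).
K = 3
top_k_vocab = [token for token, _ in counts.most_common(K)]
print(top_k_vocab)  # ['never', 'hope', 'lose'] (ties keep first-seen order)
```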
Now, let’s look at how the vocabulary is used in traditional and advanced deep learning-based NLP methods.
Traditional NLP approaches such as the Count Vectorizer and TF-IDF use the vocabulary as features: each word in the vocabulary is treated as a unique feature.
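For instance, here is a minimal sketch using scikit-learn’s CountVectorizer (assuming scikit-learn is installed; the two-sentence corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["never lose hope", "never give up"]

# Fitting the vectorizer tokenizes the corpus and builds the vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Each vocabulary word becomes one feature (one column of the matrix).
print(vectorizer.get_feature_names_out())  # ['give' 'hope' 'lose' 'never' 'up']
print(X.toarray())
# [[0 1 1 1 0]
#  [1 0 0 1 1]]
```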
In advanced deep learning-based NLP architectures, the vocabulary is used to tokenize the input sentences, and the resulting tokens are passed as inputs to the model.
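As a rough sketch of that last step, the vocabulary can map each token to an integer ID before the sequence is fed to a model. The tiny vocabulary and the <unk> convention below are made up for illustration; in practice, libraries such as Hugging Face Tokenizers handle this:

```python
# Toy vocabulary mapping tokens to integer IDs; ID 0 is reserved for
# unknown tokens. A real vocabulary would be built from the training corpus.
vocab = {"<unk>": 0, "never": 1, "lose": 2, "hope": 3}

def encode(sentence: str) -> list[int]:
    """Tokenize on whitespace and map each token to its vocabulary ID."""
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.lower().split()]

print(encode("Never lose hope"))  # [1, 2, 3]
print(encode("Never give up"))    # [1, 0, 0]; unseen words map to <unk>
```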