Natural Language Processing (NLP) Simplified: A Step-by-Step Guide

By Dibyendu Banerjee



Quick introduction – What is NLP?

The field of study that focuses on the interactions between human language and computers is called Natural Language Processing or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics (Wikipedia).

Is NLP Artificial Intelligence, Machine Learning, or Deep Learning?

Here is the answer: the question itself is not quite right! People sometimes use the terms AI, ML and DL incorrectly. Let us clarify those first and then come back to it.

Clearing the Confusion: AI vs. Machine Learning vs. Deep Learning Differences

The beginnings of modern AI can be traced to classical philosophers’ attempts to describe human thinking as a symbolic system. But the field of AI wasn’t formally founded until 1956, at a conference at Dartmouth College, in Hanover, New Hampshire, where the term “artificial intelligence” was coined.

Figure: Timeline of when these terms were first introduced.

Now, let us take a concise look at what exactly AI, ML and Deep Learning are. The relationship among AI, ML and DL can be viewed as nested: deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence.

NLP: How Does NLP Fit into the AI World?

With a basic understanding of Artificial Intelligence, Machine Learning and Deep Learning, let’s revisit our very first question: is NLP Artificial Intelligence, Machine Learning, or Deep Learning?

The words AI, NLP, and ML (machine learning) are sometimes used almost interchangeably. However, there is an order to the madness of their relationship.

Hierarchically, natural language processing is often treated as a subset of machine learning, since most modern NLP systems are ML-based, while NLP and ML both fall under the larger category of artificial intelligence.

Natural Language Processing combines Artificial Intelligence (AI) and computational linguistics so that computers and humans can talk seamlessly.

NLP endeavours to bridge the divide between machines and people by enabling a computer to analyse what a user said (input speech recognition) and process what the user meant. This task has proven quite complex.

To converse with humans, a program must understand syntax (grammar), semantics (word meaning), morphology (tense), and pragmatics (conversation). The number of rules to track can seem overwhelming, which explains why early attempts at NLP led to disappointing results.

With a different approach in place, NLP slowly improved, moving from a cumbersome rule-based methodology to one based on learning patterns from data. Siri appeared on the iPhone in 2011. In 2012, the newly discovered use of graphics processing units (GPUs) improved digital neural networks and NLP.

NLP empowers computer programs to comprehend unstructured content by using AI and machine learning to make inferences and give context to language, much as human brains do. It is a tool for revealing and analysing the “signals” hidden in unstructured information. Organizations can then gain a deeper understanding of public perception of their products, services and brand, as well as those of their rivals.

Now Google has released its own neural-net-based engine for eight language pairs, closing much of the quality gap between its old system and a human translator and fuelling increasing interest in the technology. Computers today can already produce an eerie echo of human language if fed with the appropriate material.

Over the past few years, Deep Learning (DL) architectures and algorithms have made impressive advances in fields such as image recognition and speech processing.

Their application to Natural Language Processing (NLP) was less impressive at first, but has now proven to make significant contributions, yielding state-of-the-art results for some common NLP tasks. Named entity recognition (NER), part of speech (POS) tagging or sentiment analysis are some of the problems where neural network models have outperformed traditional approaches. The progress in machine translation is perhaps the most remarkable among all.

NLP: Game changers in our daily life, examples for Businesses

 

NLP Is Not Just About Creating Intelligent Bots

NLP is a tool for computers to analyse, comprehend, and derive meaning from natural language in an intelligent and useful way. This goes way beyond the most recently developed chatbots and smart virtual assistants. In fact, natural language processing algorithms are everywhere, from search and online translation to spam filters and spell checking.

So, by using NLP, developers can organize and structure the mass of unstructured data to perform tasks such as intelligent:

Below are some of the widely used application areas of NLP.

Components of NLP

NLP can be divided into two basic components.

  • Natural Language Understanding
  • Natural Language Generation

Natural Language Understanding (NLU)

NLU is generally considered harder than NLG. Why? Let’s look at the challenges a machine faces when trying to understand language.

There is a lot of ambiguity involved in learning or interpreting a language.

Lexical Ambiguity occurs when a word has more than one sense, so the sentence containing it can be interpreted differently depending on which sense is intended. For example, “book” is a noun in “I read a book” and a verb in “Book a ticket”. Lexical ambiguity can be resolved to some extent using part-of-speech tagging techniques.

Syntactic Ambiguity occurs when a sequence of words can be parsed in more than one way and therefore has more than one meaning, as in “I saw the man with the telescope”. It is also called grammatical ambiguity.

Referential Ambiguity: Very often a text mentions an entity (something or someone) and then refers to it again, possibly in a different sentence, using another word. A pronoun causes ambiguity when it is not clear which noun it refers to.

Natural Language Generation (NLG)

It is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.

It involves −

  • Text planning − It includes retrieving the relevant content from the knowledge base.
  • Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
  • Text realization − It maps the sentence plan onto a sentence structure.
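
The three NLG stages can be sketched as a tiny template-based generator. This is only a toy illustration under stated assumptions: the knowledge base, the weather example, and the function names (plan_text, plan_sentence, realize, generate) are all made up for this sketch and do not come from any library.

```python
# Toy NLG pipeline: text planning -> sentence planning -> text realization.
# KNOWLEDGE_BASE and every function name here are hypothetical.

KNOWLEDGE_BASE = {
    "Kolkata": {"temp_c": 34, "sky": "sunny"},
    "London": {"temp_c": 12, "sky": "rainy"},
}


def plan_text(city):
    """Text planning: retrieve the relevant content from the knowledge base."""
    facts = KNOWLEDGE_BASE[city]
    return {"city": city, "temp_c": facts["temp_c"], "sky": facts["sky"]}


def plan_sentence(content):
    """Sentence planning: choose the words and set the tone."""
    feel = "hot" if content["temp_c"] >= 30 else "cool"
    return {"subject": content["city"], "sky": content["sky"],
            "feel": feel, "temp_c": content["temp_c"]}


def realize(plan):
    """Text realization: map the sentence plan onto a sentence structure."""
    template = "It is {sky} and {feel} in {subject} today, around {temp_c} degrees Celsius."
    return template.format(**plan)


def generate(city):
    return realize(plan_sentence(plan_text(city)))


print(generate("Kolkata"))
# It is sunny and hot in Kolkata today, around 34 degrees Celsius.
```

Real NLG systems replace each stage with far richer models, but the division of labour between the stages stays the same.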

Levels of NLP

In the previous sections we discussed different problems associated with NLP. Now let us look at the typical steps involved in performing NLP tasks. Keep in mind that the section below describes a standard workflow; real-life implementations may differ drastically depending on the problem statement or requirements.

The source of Natural Language could be speech (sound) or Text.

Phonological Analysis: This level applies only if the text originates from speech. It deals with the interpretation of speech sounds within and across words. Speech sounds can give a big hint about the meaning of a word or a sentence.

Phonology is the study of how sounds are organized systematically. It requires a broad discussion and is out of the scope of this note.

Morphological Analysis: Deals with understanding distinct words according to their morphemes (the smallest units of meaning). Take, for example, the word “unhappiness”. It can be broken down into three morphemes (prefix, stem, and suffix), each conveying some form of meaning: the prefix un- means “not”, while the suffix -ness means “a state of being”. The stem happy is considered a free morpheme since it is a “word” in its own right. Bound morphemes (prefixes and suffixes) require a free morpheme to which they can be attached, and therefore cannot appear as a “word” on their own.
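
The prefix, stem and suffix breakdown can be sketched in a few lines of Python. This is a minimal illustration, not a real morphological analyser: the affix lists, the stem set, and the function name split_morphemes are assumptions invented for this example.

```python
# Toy morpheme splitter. PREFIXES, SUFFIXES and STEMS are tiny hand-written
# lists for illustration only, not a real morphological lexicon.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ment", "ly")
STEMS = {"happy", "kind", "agree"}


def split_morphemes(word):
    """Return (prefix, stem, suffix); an empty string marks a missing affix."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    # Undo the spelling change y -> i before a suffix (happi -> happy).
    if stem not in STEMS and stem.endswith("i") and stem[:-1] + "y" in STEMS:
        stem = stem[:-1] + "y"
    return prefix, stem, suffix


print(split_morphemes("unhappiness"))
# ('un', 'happy', 'ness')
```

A naive splitter like this will wrongly strip "re" from a word such as "really", which is one reason practical systems rely on a proper lexicon.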

Lexical Analysis: It involves identifying and analysing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words. In order to perform lexical analysis, we often need lexicon normalization.

The most common lexicon normalization practices are stemming and lemmatization:

  • Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
  • Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammar relations).
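
To make the contrast concrete, here is a rough sketch that puts a naive suffix stripper next to a dictionary lookup. Both are simplified stand-ins: the suffix list mirrors the one above, while LEMMA_DICT and the function names naive_stem and lookup_lemma are assumptions made up for this example. In practice NLTK provides PorterStemmer and WordNetLemmatizer for these jobs.

```python
# Naive stemming vs. dictionary-based lemmatization.
# SUFFIXES and LEMMA_DICT are small hand-made examples, not NLTK data.
SUFFIXES = ("ing", "ly", "es", "s")
LEMMA_DICT = {"studies": "study", "better": "good",
              "running": "run", "mice": "mouse"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_DICT.get(word, word)


for w in ["studies", "running", "better", "mice"]:
    print(w, "->", naive_stem(w), "|", lookup_lemma(w))
# studies -> studi | study
# running -> runn | run
# better -> better | good
# mice -> mice | mouse
```

The stemmer produces non-words such as "studi" and "runn", while the lookup returns real dictionary forms but only for words it already knows.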

Syntactic Analysis: Deals with analysing the words of a sentence so as to uncover the grammatical structure of the sentence. E.g., "Colourless green idea." passes the syntactic check because it is a grammatically valid phrase, but it would be rejected by the semantic analysis, since "colourless green" doesn't make any sense.

Syntactic parsing involves analysing the words in the sentence for grammar and arranging them in a manner that shows the relationships among the words. Dependency grammar and part-of-speech tags are the important attributes of text syntax.

Semantic Analysis: Determines the possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentence. Some people may think this is the level that determines meaning, but in fact all the levels contribute. The semantic analyser rejects phrases such as “hot ice-cream”.

Discourse Integration: Focuses on the properties of the text as a whole that convey meaning by making connections between component sentences. It provides a sense of context: the meaning of any single sentence depends upon the sentences that precede it, and may in turn affect the meaning of the sentence that follows. For example, the word "that" in the sentence "He wanted that" depends upon the prior discourse context.

Pragmatic Analysis: Explains how extra meaning is read into texts without actually being encoded in them. This requires much world knowledge, including the understanding of intentions, plans, and goals. Consider the following two sentences:

  • The city police refused the demonstrators a permit because they feared violence.
  • The city police refused the demonstrators a permit because they advocated revolution.

The meaning of “they” differs between the two sentences: in the first it refers to the police, in the second to the demonstrators. To figure out the difference, world knowledge stored in knowledge bases and inference modules has to be used.

Pragmatic analysis helps users to discover this intended effect by applying a set of rules that characterize cooperative dialogues. E.g., "close the window?" should be interpreted as a request instead of an order.

Widely used NLP Libraries

There are many libraries, packages, and tools available. Each has its own pros and cons. Python is currently the language with the widest choice of NLP libraries. The table below gives a summarised view of the features of some widely used libraries. Most of them provide the basic NLP features discussed earlier. Each library was built with certain objectives in mind, so no single library provides a solution for everything. It is up to the developer to choose, and that is where experience and knowledge of when and where to use what matters.

NLP Hands on Using Python NLTK (Simple Examples)

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

Latest version at the time of writing: NLTK 3.5 (released April 2020), which adds support for Python 3.8 and drops support for Python 2.

NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/.

Before we start experimenting with some of the techniques widely used in Natural Language Processing tasks, let’s first go through the installation.

NLTK Installation

If you are using Windows or Linux or Mac, you can install NLTK using pip:

$ pip install nltk

Optionally you can also use Anaconda prompt.

$ conda install nltk

 

If the command completes without errors, the NLTK library has been installed successfully. Once NLTK is installed, you should also install the NLTK data packages.

Open your Jupyter Notebook, run import nltk, and then call nltk.download().

This will show the NLTK downloader, where you can choose which packages to install. You can install all packages if you wish, since most of them are small. Now let’s start the show.

 

 

Basic NLP Operations: Do It Yourself

Tokenize Text

Tokenization is the first step in NLP. It is the process of breaking a paragraph of text into smaller chunks such as words or sentences. A token is a single entity that serves as a building block for a sentence or paragraph.

A word (token) is the minimal unit that a machine can understand and process, so a text string cannot be processed further until it has been split into meaningful tokens. The complexity of tokenization varies according to the needs of the NLP application and the complexity of the language itself. For example, in English it can be as simple as choosing only words and numbers through a regular expression. For Chinese and Japanese, however, it is a much more complex task.

Sentence Tokenization

A sentence tokenizer breaks a paragraph of text into sentences.
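
As a rough illustration of the idea, the sketch below splits text into sentences with a regular expression from the Python standard library. It is not NLTK's implementation, and the function name split_sentences and the sample text are assumptions made for this example. With NLTK installed, nltk.sent_tokenize(text) does this job.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


text = "NLP is fun. It powers search and translation! Shall we try it?"
print(split_sentences(text))
# ['NLP is fun.', 'It powers search and translation!', 'Shall we try it?']
```

A rule this simple will wrongly split after abbreviations such as "Dr." followed by a name, which is why a trained sentence tokenizer is preferable for real text.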

 

Word Tokenization

A word tokenizer breaks a paragraph of text into words.
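
For English, the regular-expression approach mentioned earlier can be sketched as follows. This is a simplified stand-in, not NLTK's tokenizer: the function name tokenize_words and the pattern are assumptions chosen for this example. With NLTK installed, nltk.word_tokenize(text) is the usual call, and unlike this sketch it also keeps punctuation as separate tokens.

```python
import re


def tokenize_words(text):
    """Return runs of letters and digits, keeping inner apostrophes (didn't)."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


print(tokenize_words("Siri appeared on the iPhone in 2011, didn't it?"))
# ['Siri', 'appeared', 'on', 'the', 'iPhone', 'in', '2011', "didn't", 'it']
```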

Stopwords Removal

Stopwords are considered noise in the text. Text may contain stop words such as is, am, are, this, a, an, the, etc.

We would not want these words taking up space in our database or valuable processing time. We can remove them easily by storing a list of the words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships with stopword lists for 16 different languages.
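
A minimal sketch of stop-word filtering is shown here, assuming a small hand-written list rather than NLTK's own (which is available as stopwords.words('english') from nltk.corpus once the data is downloaded). The list, the function name remove_stop_words, and the sample sentence are made up for illustration.

```python
import re

# Hand-written stop-word list for illustration; NLTK's list is much longer.
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the", "my"}


def remove_stop_words(sentence):
    """Tokenize on letters and drop any token found in STOP_WORDS."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("Cricket is my favourite sport"))
# ['Cricket', 'favourite', 'sport']
```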

You can see that the words “is” and “my” have been removed from the sentence.

Part of speech tagging

In your childhood, you may have heard the term Part of Speech (POS). It can take a good amount of time to get the hang of what adjectives and adverbs actually are and how exactly they differ. Now think about building a system that encodes all this knowledge. It may look easy, but for many decades, encoding this knowledge in a machine learning model was a very hard NLP problem. Modern POS tagging algorithms can predict the POS of a given word with a high degree of accuracy. You can get the POS of individual words as tuples.

If you want to know the details of a POS tag, here is the way. Note that we might need to download the ‘tagset’ first. The example below shows that NN is a noun.

For better understanding, below are the other POS tags that we found in our example.

The meanings of all available POS codes are given below for your reference.

Now consider an interesting application of POS tagging to information retrieval. Suppose I have an article about cricket and want to see which countries are mentioned in it. Country names are proper nouns, so using POS tags I can easily filter the text and keep only the proper nouns. Apart from countries this may retrieve other proper nouns as well, but it makes the job easier because no country name will be missed, provided the tagger labels it correctly.
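
The filtering step can be sketched as below. To keep the example self-contained, the tagged tokens are hand-written sample data in the (word, tag) tuple format that nltk.pos_tag returns with Penn Treebank tags; they were not produced by running a tagger, and the function name proper_nouns is an assumption for this sketch.

```python
# Filter proper nouns out of POS-tagged tokens.
# `tagged` is hand-written sample data, not real tagger output.
tagged = [
    ("India", "NNP"), ("beat", "VBD"), ("Australia", "NNP"),
    ("in", "IN"), ("the", "DT"), ("final", "NN"),
    ("at", "IN"), ("Lords", "NNP"),
]


def proper_nouns(tagged_tokens):
    """Keep words tagged NNP (singular) or NNPS (plural proper noun)."""
    return [word for word, tag in tagged_tokens if tag in ("NNP", "NNPS")]


print(proper_nouns(tagged))
# ['India', 'Australia', 'Lords']
```

As expected, the result contains the two countries plus one extra proper noun, so a second pass (for example against a list of country names) would still be needed.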

Stemming and Lemmatization

Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

Depending on your needs, you can choose any of the lemmatizers below:

  • Wordnet Lemmatizer
  • Spacy Lemmatizer
  • TextBlob
  • CLiPS Pattern
  • Stanford CoreNLP
  • Gensim Lemmatizer
  • TreeTagger

Here is one quick example using Wordnet lemmatizer.

How to get Word Meanings, Synonyms and Antonyms

WordNet is a large lexical database of English. It is a widely used NLTK corpus. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

You can simply import it using:

from nltk.corpus import wordnet

In the simple example below, let us see how easily we can get the synonyms and antonyms of the word ‘love’. It’s really cool!

Word Frequency: Quick Visualization

In the example below, let’s read some text from a live URL and look at the frequencies of its words.
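
The counting step itself can be sketched with the standard library alone. To stay runnable offline, this sketch assumes the text has already been fetched and uses an inline sample string instead of a live URL; the function name word_frequencies and the sample text are made up for illustration. With NLTK, nltk.FreqDist offers the same counts plus a plot() method for a quick chart.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=2):
    """Lower-case the text, extract alphabetic words, return the top_n counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top_n)


sample = ("NLP helps machines read text. Machines learn from text, "
          "and text is everywhere.")
print(word_frequencies(sample))
# [('text', 3), ('machines', 2)]
```

Removing stop words before counting usually gives a more informative picture, since words such as "the" and "is" otherwise dominate the top of the list.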

 

NLP, what is the future?

As we have seen, NLP provides a wide set of techniques and tools that can be applied in many areas of life. Used well, they can improve both our everyday interactions with technology and the services offered to the people around us.

NLP techniques already improve how we communicate with machines and with one another, through search, translation, and voice assistants. Much of this happens without our even being aware of it.

Many tasks are faster and easier because we can now communicate with machines, thanks to natural language processing technology. Natural language processing has given major companies the ability to be flexible with their decisions, thanks to its insights into aspects such as customer sentiment and market shifts. Smart organizations now make decisions based not only on data, but on the intelligence derived from that data by NLP-powered machines.

As NLP becomes more mainstream in the future, there may be a massive shift toward this intelligence-driven way of decision making across global markets and industries.

If there is one thing we can guarantee will happen in the future, it is the integration of natural language processing in almost every aspect of life as we know it. The past five years have been a slow burn of what NLP can do, thanks to integration across all manner of devices, from computers and fridges to speakers and automobiles.

Humans, for one, have shown more enthusiasm than dislike for human-machine interaction. NLP-powered tools have also proven their abilities in a short time.

These factors are going to trigger increased integration of NLP: ever-growing amounts of data generated in business dealings worldwide, increasing smart device use and higher demand for elevated service by customers.

As regards natural language processing, the sky is the limit. The future is going to see some massive changes as the technology becomes more mainstream and further advances in its abilities are explored. As a major facet of artificial intelligence, natural language processing is also going to contribute to the proverbial invasion of robots in the workplace, so industries everywhere have to start preparing.


About the author

Dibyendu Banerjee

Dibyendu Banerjee is a Senior Architect at Cognizant’s AI and Analytics practice. He is a passionate technologist with proven experience across diverse technologies and in project management. With more than 14 years of IT experience, his current expertise is in Python, R, Java, and open source technologies. He likes to communicate the latest trends around cutting-edge technologies through blogs, whitepapers, etc.

Image by Roberto Iriondo from Pixabay 
