What is NLP (NATURAL LANGUAGE PROCESSING):-

Saurabhmirgane
11 min read · Oct 12, 2020

Before the evolution of NLP and AI, one of the most difficult tasks was understanding people's opinions, reviews, and other text at scale. Now it is far simpler: AI has changed the picture entirely, so let's start with a few live examples.

Things get tricky when it comes to solving complicated problems, where hypothetical situations demand the support of up-to-date technology. Thanks to AI and NLP, such tasks are now among the fastest any machine can perform. Let's cover one of the most important and interesting AI technologies: NLP.

Introduction to NLP:-

NLP allows machines to work with human language, interpreting and breaking down words.

NLP is one of the most quickly evolving AI technologies today, and it unlocks the potential of AI-powered human-to-machine communication.

Let's see some live examples of NLP use cases in both public and private organizations.

Updated on 20/10/2020.

1. NLP-Powered Epidemiological Investigation

2. Security Authentication With NLP

3. NLP-Based Brand Awareness and Market Research

4. Chatbots for Customer Support and Engagement

5. NLP-Powered Competitive Analysis

6. Report Auto-Generation With the Help of NLP

7. Real-Time Intelligence Gathering on Specific Financial Stocks

8. Defence Departments And Secret Services Using AI

  • But have you ever thought about why one should understand NLP?
  • What are its uses in everyday life?
  • How is it becoming more popular year over year?
  • What might be the reason behind this?

Let's work through each of these, with all the relevant information, today.

The best examples of NLP taking over in recent years are chatbots, spam filters, search engines, translation software, etc.

To be very clear, this article covers the following points:-

1. What is Natural Language Processing (NLP)?

2. How does it work?

3. What are the tools and techniques used to work with NLP?

4. NLP use cases and real-time examples.

What is Natural Language Processing (NLP):-

NLP combines the power of humans and machines to turn complex problems into simple solutions. It draws on computer science and the study of the rules of language to create intelligent systems (running on machine learning and NLP algorithms) capable of understanding, analyzing, and extracting meaning from text and speech.

NLP is a field of Artificial Intelligence (AI) that converts human language into a form machines can process. It works with different aspects of language such as syntax, semantics, pragmatics, and morphology.

Gmail sorts messages into Promotions, Social, Primary, or Spam. Every message is separated so neatly with the help of keyword extraction: keywords are extracted from the mail and used to find its relevant place, so a mail is classified from just a few words rather than the entire message.

Top advantages of natural language processing include:

Large-scale analysis.

NLP helps you work with huge amounts of unstructured text data, such as comments on Twitter about a particular product, customer support tickets, online reviews, and likes and dislikes of a product.

Automated processes in real time. NLP works quickly, efficiently, and accurately without human interaction, sorting and routing information using its tools.

Tailored to your industry. Natural language processing algorithms can be tailored to your needs and criteria, handling complex, industry-specific language, even sarcasm and misused words.

How Does Natural Language Processing Work?

Initial Stages of Text Processing

Tokenization:- This is the process of cutting a character sequence into word tokens. A few steps need to be followed in the tokenization and indexing process:-

Indexer Step 1: Token Sequence

Token Sequence:- Each token is paired with the ID of the document it appears in, producing a sequence of (term, docID) pairs.

Example of extracting the words:

Doc1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.

Doc 2:- So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Here each word is segregated, and we count how many times it repeats in each document: the word "I" is repeated twice in Doc 1, and "Caesar" is repeated twice in Doc 2.

Indexer Step 2: Sort by terms.

The (term, docID) pairs are sorted into alphabetical order: by term first, then by docID. This is also called the "core indexing step".

Indexer Step 3: Dictionary & Postings

  • Multiple term entries in a single document are merged.
  • Split into Dictionary and Postings.
  • Doc. frequency information is added.

Once the sorting is done, multiple entries of the same term within a single document are merged. The distinct terms are identified first and form the dictionary; each term points to a postings list of the documents it occurs in, and the document frequency information is added last.
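As a minimal sketch, here are the three indexer steps in Python (the build_index helper is hypothetical, written only for illustration, using Doc 1 and Doc 2 from above):

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def build_index(docs):
    # Step 1: token sequence -- pair every token with its document ID
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().replace(';', ' ').replace('.', ' ').split():
            pairs.append((token, doc_id))
    # Step 2: sort by term first, then by document ID
    pairs.sort()
    # Step 3: merge duplicate (term, docID) entries into postings lists;
    # the dictionary keeps each term with its document frequency
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    dictionary = {term: len(doc_ids) for term, doc_ids in postings.items()}
    return dictionary, postings

dictionary, postings = build_index(docs)
print(postings['brutus'])   # [1, 2] -- Brutus occurs in both documents
print(dictionary['caesar']) # 2      -- document frequency of 'caesar'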

Query Processing: AND

  • Consider processing the query Brutus AND Caesar, where the two words are connected with the gateway operator (AND): locate Brutus in the dictionary and retrieve its postings; do the same for Caesar.
  • Merge the two postings lists: intersecting the two document sets identifies exactly the documents containing both words, and those matches are returned as the output of the query.

The Merge:-

In statistics, or in any simple language, "merge" means combining two things together; here it means intersecting the two postings lists.

  • Walk through the two postings lists simultaneously, in time linear in the total number of postings entries: if the list lengths are x and y, the merge takes O(x+y) operations. In other words, it merges the document IDs that the two postings lists have in common.

Note: postings sorted by doc ID.
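A minimal sketch of that linear merge (assuming both postings lists are already sorted by doc ID, as the note above says; the example postings are illustrative):

def intersect(p1, p2):
    # walk the two sorted postings lists simultaneously: O(x + y)
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Brutus AND Caesar
print(intersect([1, 2, 4, 11, 31, 45, 173], [1, 2, 4, 5, 6, 16, 57, 132]))
# -> [1, 2, 4]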

Boolean Queries: Exact Match

  • The Boolean retrieval model lets us ask a query that is a Boolean expression.
  • Boolean queries use AND, OR and NOT to join query terms.
  • It views each document as a set of words.
  • It is precise: a document either matches the condition or it doesn't. This is the simplest model on which to build an IR (Information Retrieval) system.
  • Many search systems you still use are Boolean-like (email, library catalogs, Mac OS X Spotlight).

Query Optimization

What is the best order for query processing?

Consider a query that is an AND of n terms. For each of the n terms, get its postings, then AND them together. The best order is to process the terms in increasing document frequency: start with the rarest term, so every intermediate result stays as small as possible.
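A sketch of this heuristic, reusing the hypothetical intersect, dictionary and postings from the earlier sketches:

def process_and_query(terms, dictionary, postings):
    # process terms in increasing document frequency (rarest first)
    ordered = sorted(terms, key=lambda t: dictionary.get(t, 0))
    result = postings.get(ordered[0], [])
    for term in ordered[1:]:
        result = intersect(result, postings.get(term, []))
        if not result:  # early exit: no document can match anymore
            break
    return result

print(process_and_query(['brutus', 'caesar'], dictionary, postings))  # -> [1, 2]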


Phrase Queries

The concept of phrase queries has proven easy for users to understand; it is one of the few "advanced search" ideas that works. Moreover, many queries are implicit phrase queries.

Example:-

For the phrase query "Stanford university", the sentence "I went to university at Stanford" is not a match.

  • For this, it no longer suffices to store only ⟨term : docs⟩ entries.
  • We want to be able to answer queries such as "Stanford university" — as a phrase.
  • So a phrase query must match the words as a connected unit, not as independent terms sent separately to the search engine.

A First Attempt: Bi-word Indexes

It works in the following way:-

  • Index every consecutive pair of terms in the text as a phrase
  • For example, the text "Friends, Romans, Countrymen" would generate the bi-words "friends romans" and "romans countrymen".
  • Each of these bi-words is now a dictionary term.
  • Two-word phrase query-processing is now immediate.

It is as simple as combining each word with its next neighboring word.
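A minimal sketch of bi-word generation, using the "Friends, Romans, Countrymen" example:

def biwords(text):
    # pair each word with its next neighboring word
    tokens = [t.strip(',').lower() for t in text.split()]
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))
# -> ['friends romans', 'romans countrymen']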

Longer Phrase Queries

  • Longer phrases can be processed by breaking them down into bi-words.

Example:-

  • "Stanford university Palo Alto" can be broken into the Boolean query on bi-words: "stanford university" AND "university palo" AND "palo alto". Without examining the docs themselves, we cannot verify that the docs matching this Boolean query actually contain the phrase.

Issues of Bi-word Indexes

  • Index blowup due to bigger dictionary.
  • Infeasible for more than bi-words.
  • Bi-word indexes are not the standard solution (for all bi-words), but they can be part of a compound strategy.

Positional Indexes:-

In a positional index, the postings for each term store the positions at which its tokens appear in each document. For phrase queries, we use a merge algorithm recursively at the document level.
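A minimal sketch of a positional index and a two-word adjacency check (build_positional_index and phrase_match are hypothetical helpers; production systems merge position lists more carefully):

from collections import defaultdict

def build_positional_index(docs):
    # term -> {docID -> [positions at which the term occurs]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_match(index, w1, w2):
    # docs where w2 occurs at the position immediately after w1
    hits = []
    for doc_id, positions in index[w1].items():
        later = set(index[w2].get(doc_id, []))
        if any(p + 1 in later for p in positions):
            hits.append(doc_id)
    return hits

index = build_positional_index({1: "to be or not to be", 2: "be that as it may"})
print(phrase_match(index, "to", "be"))  # -> [1]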

Processing a Phrase Query

  • Extract inverted index entries for each distinct term: to, be, or, not.
  • Merge their doc:⟨position⟩ lists to enumerate all positions of "to be or not to be".

Ranked Retrieval

  • Thus far, our queries have all been Boolean: documents either match or they don't.
  • This is good for expert users with a precise understanding of their needs and the collection. It is also good for applications, which can easily consume thousands of results.
  • It is not good for the majority of users: most users are incapable of writing Boolean queries, and most don't want to wade through thousands of results.
  • The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
  • Relevance does not increase proportionally with term frequency.

Inverse Document Frequency (IDF)

  • Frequent terms are less informative than rare terms.
  • df(t) is the document frequency of t: the number of documents that contain t, so df(t) ≤ N for a collection of N documents. The inverse document frequency is idf(t) = log(N/df(t)).
  • IDF affects the ranking of documents for queries with at least two terms. For the query capricious person, IDF weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
  • IDF has no effect on the ranking of one-term queries.

TF-IDF Weighting

  • The tf-idf weight of a term is the product of its tf weight and its idf weight.
  • There are many variants, e.g. how "tf" is computed (with or without logs) and whether the terms in the query are also weighted; one common variant is sketched below.
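A minimal sketch of one common variant, log-scaled tf times idf (N, tf and df values below are illustrative, chosen to echo the capricious/person example above):

import math

def tf_idf(tf, df, N):
    # w(t, d) = (1 + log10(tf)) * log10(N / df); zero if the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000
print(tf_idf(1, 100, N))      # 4.0 -- 'capricious': rare, so weighted highly
print(tf_idf(1, 100_000, N))  # 1.0 -- 'person': frequent, so weighted low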

How do we deal with a token like "John's"? A state-of-the-art solution combines tokenization rules with the normalization steps below.

Normalization:-

Normalization is the process of converting words into the same form, so that indexed text and query terms map to the same format.

Example:- You want U.S.A. and USA to match

Stemming:-

Stemming is the process of reducing the different forms of a word so they match its root.

Example:-

authorize, authorization

Stop words:-

The easiest way is to omit very common words that appear in sentences, like [or, not, a, to, of], etc., for easier processing.
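A minimal plain-Python sketch of normalization plus stop-word removal (the tiny STOP_WORDS set mirrors the list above; stemming itself is demonstrated with NLTK near the end of the article):

STOP_WORDS = {'or', 'not', 'a', 'to', 'of'}

def normalize(term):
    # map variant surface forms to one format, e.g. U.S.A. -> usa
    return term.replace('.', '').lower()

tokens = "U.S.A. or a USA to".split()
terms = [normalize(t) for t in tokens if normalize(t) not in STOP_WORDS]
print(terms)  # -> ['usa', 'usa'], so U.S.A. and USA now match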

NLP in Python:-

This activity demonstrates an end-to-end, data science & natural language processing pipeline, starting with raw data and running through preparing, modeling, visualizing, and analyzing the data. We’ll touch on the following points:

Introduction of the dataset

In order to work with a dataset in NLP, one needs to follow the basic steps below to complete the process.

1. Basic text cleaning

2. Contraction

3. Remove special characters

4. Remove accented characters

5. Introduction to text processing with spaCy

6. Lemmatization

7. NER

8. POS Tagging

9. Sentence extraction

10. TF-IDF construction

11. Count Vectorizer

12. Clustering

These are the necessary points to consider before we start working on any NLP project.

import os
import random

import numpy as np
import pandas as pd

random.seed(123)

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

Why SpaCy?

There are many different libraries that can be used for text-related processing; let's work with SpaCy.

SpaCy is a free and open-source library developed by Explosion AI. It works well for simple to complex language understanding tasks and is designed specifically for production use.

SpaCy provides trained models for 48 different languages and has a model for multi-language as well.

Check this link for various English models : https://spacy.io/models/en

Before jumping in, let's have a look at the various features provided by popular NLP-related libraries and their performance in comparison to SpaCy. Feature-comparison and speed-comparison charts are available in the SpaCy docs (all charts referenced here are from the SpaCy docs).

SpaCy Installation

To get started with SpaCy, install the package using pip in the Terminal (for Mac) or Command Line (for Windows).

The language pre-trained model packages can be downloaded using the “spacy download” command.
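For example, a minimal setup (en_core_web_sm is the small English model from the models link above; any listed model can be substituted):

pip install spacy
python -m spacy download en_core_web_sm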

Text Preprocessing

Basic steps in text pre-processing are :

1. Tokenization

2. Stop Words removal

3. Lemmatization

4. Vectorization

Understanding SpaCy Objects

I. NLP Object

When we load the SpaCy model, it creates the SpaCy object. We define it with the variable name nlp.

This object contains the language-specific vocabulary, model weights, and processing pipeline, like tokenization rules, stop words, POS rules, etc.

You can look up the pipeline component names using the pipe_names attribute.

When we process some text with the nlp object, it creates a Doc object, short for document.

II. Doc Object

Token objects represent the word tokens in the document. To get a token at a specific position, simply index the Doc object like any Python sequence.

III. Span Object

A Span object is a slice of the document consisting of one or more tokens. To view a span, simply index with start and end positions separated by : like any Python sequence.
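A minimal sketch covering all three objects (assuming the en_core_web_sm model from the installation step):

import spacy

nlp = spacy.load('en_core_web_sm')  # I. the NLP object: vocabulary, weights, pipeline
print(nlp.pipe_names)               # pipeline component names, e.g. ['tagger', 'parser', 'ner']

doc = nlp("Apple is looking at buying a U.K. startup.")  # II. the Doc object
print(doc[0].text)    # 'Apple' -- indexing the Doc returns a single Token
print(doc[2:5].text)  # 'looking at buying' -- III. a Span is a slice doc[start:end]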

STOP WORDS

Words that occur so frequently in documents that they don't add any meaning or value are called stop words. It is best to remove these words, since they add little and consume a lot of resources.

Remember: Stop words are language & domain dependent.
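A minimal sketch using spaCy's built-in stop-word list via the is_stop token attribute:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("This is one of the best products I have ever bought")
print([token.text for token in doc if not token.is_stop])
# the very frequent function words are filtered out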

Lexical Attributes

Attributes that don't hold any contextual information are called lexical attributes.

Let’s explore other available token attributes :

i — index of the token within the parent document

text — the token text

is_alpha — token consists of alphabetic characters (True/False)

is_punct — token is punctuation (True/False)

like_num — token resembles a number (True/False)
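A minimal sketch printing these attributes for each token:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The ticket costs 20 dollars!")
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
# '20' gives like_num=True, '!' gives is_punct=True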

Lemmatization

Lemmas are the root forms of words. Lemmatization helps reduce the bag of words by mapping all similar kinds of words to the same root word.
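A minimal sketch using the lemma_ attribute:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("He was running and ate two apples")
print([(token.text, token.lemma_) for token in doc])
# e.g. 'was' -> 'be', 'running' -> 'run', 'ate' -> 'eat', 'apples' -> 'apple'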

Statistical Models

Now, let us look at some context-based attributes. All the information needed to make these predictions is loaded with the model.

Similarity

When we load the language model, it also loads 300-dimensional vector representations for the words. The vector representations have been computed using the Word2Vec algorithm on large Web text.

You will learn more about Word2Vec in Deep Learning Module.
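A minimal similarity sketch (word vectors ship with the larger models such as en_core_web_md, not with the small model):

import spacy

nlp = spacy.load('en_core_web_md')  # the medium model includes word vectors
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1.similarity(doc2))  # cosine similarity of the averaged word vectors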

I. Part of Speech (POS)

POS tagging means labeling words in a sentence as nouns, adjectives, verbs, tenses, etc. This is particularly helpful for identifying homophones in speech-to-text analysis, e.g. if you accidentally drank a bottle of fabric dye, you might die. [GOOD TO KNOW 😀]
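A minimal sketch using that dye/die example with the pos_ attribute:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("If you drank a bottle of dye, you might die")
print([(token.text, token.pos_) for token in doc])
# the homophones get different tags, e.g. 'dye' -> NOUN and 'die' -> VERB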

Named Entity Recognition

Named entities are “real world objects” that are assigned a name, such as people, places, things, locations, currencies, and more.

Named entities can be accessed using doc.ents, which returns an iterator of Span objects.
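A minimal sketch iterating over doc.ents:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple bought a London startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. 'Apple' ORG, 'London' GPE, '$1 billion' MONEY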

Expanding Contractions

Contractions are shortened versions of words or syllables. They exist in both written and spoken English, and are created by removing specific letters and sounds, often one of the vowels. Examples: "do not" becomes "don't" and "I would" becomes "I'd". Converting each contraction to its expanded, original form helps with text standardization.

We leverage a standard set of contractions available in the contractions.py file.
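That file isn't reproduced here; as a minimal sketch of the same idea, a tiny hypothetical CONTRACTION_MAP stands in for the full set:

import re

# tiny illustrative subset; the real contractions.py maps many more forms
CONTRACTION_MAP = {"don't": "do not", "i'd": "i would", "it's": "it is"}

def expand_contractions(text):
    pattern = re.compile('|'.join(re.escape(c) for c in CONTRACTION_MAP),
                         flags=re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("I'd say don't worry, it's fine"))
# -> 'i would say do not worry, it is fine'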

Removing accented characters

Usually in any text corpus, you might be dealing with accented characters/letters, especially if you only want to analyze the English language. Hence, we need to make sure that these characters are converted and standardized into ASCII characters. A simple example — converting é to e.

import unicodedata

def remove_accented_chars(text):
    # normalize to ASCII: converts accented characters, e.g. é -> e
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

# Work on only the larger blog posts: add a new column 'word_count' which
# specifies the number of tokens in each document.
data['word_count'] = [len(each_blog_text.split(' '))
                      for each_blog_text in data['Text']]

SpaCy also supports:

  • Sentence detection
  • Named entity detection
  • Part of speech tagging
  • Text normalization, like stemming/lemmatization and shape analysis

Others:- SpaCy exposes a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories:

  • Stopword
  • Punctuation
  • Whitespace

Stemming:-

Stemming refers to reducing a word to its root form. While performing natural language processing tasks, you will encounter various scenarios where you find different words with the same root. For instance, compute, computer, computing, computed, etc. You may want to reduce the words to their root form for the sake of uniformity. This is where stemming comes in to play.

It might be surprising, but spaCy doesn't contain any function for stemming, as it relies on lemmatization only. Therefore, in this section, we will use NLTK for stemming.
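A minimal sketch with NLTK's PorterStemmer, using the compute example above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['compute', 'computer', 'computing', 'computed']:
    print(word, '->', stemmer.stem(word))
# all four reduce to the same stem, 'comput'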

Hope this basic information gave you a brief introduction to NLP. Don't stop acquiring knowledge about related topics like:

https://saurabhmirgane007.medium.com/what-are-object-oriented-programming-in-python-63562574f94d

https://saurabhmirgane007.medium.com/what-is-apriori-method-in-machine-learning-3ad4994d2ef0

Happy Learning…
