At Bewgle, we read your customer reviews, so you don’t have to.
About 80–90% of potentially usable business information originates from unstructured data.
Unstructured data comes from many sources:
- Text (website data, review data, log data; over 500 new websites are created every minute of the day)
- Video and audio (over 200 billion HD movies exist worldwide; binge-watching them all would take 37 million years)
- Images (by 2015, a staggering 1 trillion photos had been captured)
1. We have generated more unstructured data in the past 3 years than in the entire history of the human race.
2. Less than 0.5% of the world’s unstructured data is ever analysed (What a shame).
With a mission to derive business value from unstructured data, this blog post aims to shine a light on our unique method of extracting usable information in the form of KeyPhrases.
For extracting contiguous information, generating word n-grams is the industry-standard method. N-grams are groups of N words that occur one after the other. Since people may express an opinion in two words, three words, or n words, it is a logically sound approach.
For the sentence “This camera is very good”,
the bigrams are: this camera, camera is, is very, very good.
Only one of these four bigrams, “very good”, is actually useful.
This is the biggest hurdle when working with ngrams or contiguous text.
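To make the bigram example concrete, here is a minimal sketch of contiguous n-gram generation. It assumes plain whitespace tokenization and lowercasing, and the function name `word_ngrams` is our own choice for illustration; a production pipeline would use a proper tokenizer.

```python
def word_ngrams(sentence, n):
    """Return the contiguous word n-grams of a sentence as strings."""
    # Assumption: whitespace tokenization is good enough for this illustration.
    tokens = sentence.lower().split()
    # Slide a window of n tokens across the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("This camera is very good", 2)
print(bigrams)  # ['this camera', 'camera is', 'is very', 'very good']
```

A sentence of k words yields k - n + 1 n-grams, and every one of them is kept, whether or not it carries any meaning.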
The number of n-grams grows rapidly as the number of reviews increases. For a list of 1,000 reviews, generating all n-grams yields over 40,000 bigrams or 20,000 trigrams.
This leads to methods for filtering out the useful, information-rich n-grams:
- Frequency-based filtering: some word pairs appear together very often without meaning anything significant (“This is” or “I am”), so ranking by frequency alone is often misleading.
- Pointwise Mutual Information (PMI): this method uses a measure of association to rank and filter n-grams, but it still lets too much junk through.
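The two baseline filters can be sketched in a few lines. The example below counts bigrams and scores each one with the standard PMI formula, log2(P(x, y) / (P(x) · P(y))). The four toy reviews and the function name `bigram_scores` are assumptions made up for illustration; they are not taken from the dataset discussed in this post.

```python
import math
from collections import Counter


def bigram_scores(sentences):
    """Map each bigram to a (frequency, PMI) pair."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / total_unigrams
        p_y = unigrams[w2] / total_unigrams
        # PMI compares the observed pair probability with what independence would predict.
        scores[(w1, w2)] = (count, math.log2(p_xy / (p_x * p_y)))
    return scores


# Hypothetical toy reviews, for illustration only.
reviews = [
    "this is a very good camera",
    "this is a good phone",
    "the battery life is very good",
    "this is bad",
]
scores = bigram_scores(reviews)
most_frequent = max(scores, key=lambda pair: scores[pair][0])
print(most_frequent)  # ('this', 'is')
```

On this toy input the most frequent bigram is the uninformative “this is”, while PMI favours pairs of rare words such as “battery life” but scores “the battery” just as highly. That is the kind of junk neither filter removes on its own.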
We created our very own pipeline to ingest raw review data, pre-process it to our standards, and extract meaningful phrases of varied lengths that can be used in business logic. Gone are the days of endless Excel files full of n-grams that no one can ever go through. Leveraging Explosion AI’s Sense2Vec repo for pre-processing and an intelligent rule-based filtration algorithm built on dependency parsing and POS tagging of phrases, we generate highly domain-specific phrases that are information-rich and adequate in volume.
Below is a comparison of the phrases extracted through previous methods and through our KeyPhrase Filtration Algorithm:
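To give a feel for what a rule-based filter over POS tags can look like, here is a deliberately simplified sketch. It is not the Bewgle algorithm, whose rules are not described in this post: it uses hand-assigned, hypothetical POS tags instead of a real tagger, ignores dependency parsing entirely, and applies a single made-up rule (keep a phrase only if every word is an adverb, adjective, or noun and the phrase ends in an adjective or noun).

```python
# Assumption: a tiny tag set in the style of Universal POS tags.
CONTENT_TAGS = {"ADV", "ADJ", "NOUN"}


def keep_phrase(tagged_phrase):
    """Illustrative rule: only content words, ending in an adjective or noun."""
    tags = [tag for _, tag in tagged_phrase]
    if any(tag not in CONTENT_TAGS for tag in tags):
        return False
    return tags[-1] in {"ADJ", "NOUN"}


def filtered_ngrams(tagged_tokens, n):
    """Return the n-grams of a tagged sentence that pass keep_phrase."""
    phrases = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if keep_phrase(window):
            phrases.append(" ".join(word for word, _ in window))
    return phrases


# Hand-tagged example sentence; a real pipeline would get tags from a POS tagger.
tagged = [
    ("this", "DET"),
    ("camera", "NOUN"),
    ("is", "AUX"),
    ("very", "ADV"),
    ("good", "ADJ"),
]
print(filtered_ngrams(tagged, 2))  # ['very good']
```

Even this single rule discards the three uninformative bigrams from the camera example and keeps “very good”, which is the general idea behind filtering on linguistic structure rather than on counts.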
Description of Dataset: 1009 Reviews scraped from a retail banking company website.
Previous Method — Generation of all Contiguous Ngrams followed by Filtration based on PMI Ranking
Our Method — Bewgle KeyPhrase Extraction and Filtration Algorithm
Using this methodology, we have been able to extract not just the interesting snippets but also the adjectives and sentiments associated with them. This enables brands to understand not only which topic was discussed, but also how satisfied customers were with that product attribute.
We work iteratively, so there is more to come from this catchy-phrases algorithm. Stay tuned.
To learn more, visit www.bewgle.com