While sentiment analysis remains the quintessential text classification problem, it is largely considered 'solved'. State of the Art (SOTA) results on binary sentiment analysis routinely cross 95–97% accuracy (see https://paperswithcode.com/sota/sentiment-analysis-on-imdb) and further improvements are likely only of academic interest. Far more interesting is aspect-based sentiment analysis, an area Bewgle actively focuses on, but that's a topic for another post.
Beyond this, in the domain Bewgle operates in (analysing multi-channel customer verbatim data), real-life text classification problems are far more complex. We call them ad-hoc text classification, i.e. classifying text for reasons which are not predetermined, e.g. 'What new features would customers like to see in our products?', 'Is this a benefit of using the product validated by the user?' and so on. The eventual goal would be a (near) real-time system for solving this, but I do not think NLP has reached that stage yet (caveats exist; notably, I do not have access to GPT-3 – more on this below). For now, we focus on the quickest possible turnaround to such questions with minimal or no human effort. Specifically, we do not have the luxury (time, effort, money) to spend on generating large swathes of tagged data.
As a side note, we would like to differentiate this problem from Information Retrieval (search and the like). Bewgle provides actionable insights based on the results of NLP analysis, so a generic text-retrieval approach only solves part of the problem. Search (semantic or otherwise) helps in data exploration, but is of limited use in drawing inferences leading to concrete decision-making for our customers.
For low-resource classification tasks (i.e. limited labeled data), here's a rundown of a few prominent approaches we have tried and their real-life outcomes so far.
Unsupervised Data Augmentation (UDA)
We were quite excited about UDA, which promises SOTA results on the IMDB sentiment classification dataset with just 20 training examples! https://github.com/google-research/uda With their code we were able to replicate the results as well. Their paper https://arxiv.org/pdf/1904.12848 is also intuitive and makes sense. Excited, we set off trying to generate new models with some of our tagged data. However, after many days of training with various hyperparameters, various combinations of unsupervised data, and a number of other attempts to make it work on our data, it did not produce good results. The model either did not converge, overfit, or performed extremely poorly on real-life data after training. We are not seeing the UDA approach being used much either, nor do we see many other papers referencing the UDA paper. Many GitHub issues also speak of failures in using UDA outside the use cases already demonstrated. Overall, we kicked and screamed but eventually gave up. 🙁
Zero Shot Learning (ZSL)
Meanwhile, Joe Davison wrote about the zero-shot way of doing NLP tasks using pretrained models, primarily inference-based models: https://joeddav.github.io/blog/2020/05/29/ZSL.html His approach is based on Yin et al., https://arxiv.org/pdf/1909.00161.pdf. This is certainly a major breakthrough in thinking about classification (look ma, no training!), and indeed, Hugging Face now has off-the-shelf pipelines to do zero-shot classification – see https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681 for details. Overall, this sets a high bar for any model to beat. We have definitely found it useful and continue to use it. In addition to the MNLI-trained model mentioned in the blog and the paper, also take a look at some SQuAD (question answering) trained models and see if those perform better. Depending on the task at hand, NLI, SQuAD or an ensemble should get you quite far.
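To make the mechanics concrete, here is a minimal sketch of the scoring step in the Yin et al. recipe: pose one NLI hypothesis per candidate label ("This text is about {label}.") and softmax the entailment logits across labels. The logits below are made-up numbers for illustration, not real model output.

```python
import numpy as np

def zero_shot_scores(entail_logits):
    """Single-label zero-shot classification a la Yin et al.:
    the premise is scored against one NLI hypothesis per candidate
    label, and the *entailment* logits are softmaxed across labels."""
    logits = np.asarray(entail_logits, dtype=float)
    exp = np.exp(logits - logits.max())  # shift for numerical stability
    return exp / exp.sum()

# Hypothetical entailment logits for a premise like "The battery dies in
# an hour" against the labels ["battery life", "price", "shipping"].
probs = zero_shot_scores([3.1, -0.4, -1.2])
print(probs.argmax())  # index 0 -> "battery life"
```

In practice the Hugging Face zero-shot pipeline wraps exactly this loop (hypothesis template, per-label NLI forward pass, softmax over entailment) so you rarely need to write it by hand.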
And yet, in real life this cannot be fully relied upon, for a few reasons:
- Hard to improve, train, etc
While this zero-shot method is a fantastic start, given that we have very limited training data, how do we further improve the model? We tried further fine-tuning the pretrained MNLI models, but this didn't really change the output: the MNLI dataset is massive, while our samples are few in number. (Joe acknowledges this fine-tuning challenge in his blog post as well.)
- Training data is generally not in the right format
The MNLI data is ternary and SQuAD data has its own representation, while the tagged data we have is in neither format. Of course, data wrangling at such a small scale is not a big deal, but will we keep receiving similarly tagged data on a continuous basis?
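As an illustration of the kind of wrangling this implies, here is a sketch that maps binary-labeled examples into MNLI-style (premise, hypothesis, label) records. The hypothesis template and field names are our own illustrative choices, not a fixed schema.

```python
def to_nli_records(examples, hypothesis="The customer mentions a product benefit."):
    """Turn (text, binary_label) pairs into NLI-style training records."""
    records = []
    for text, is_positive in examples:
        records.append({
            "premise": text,
            "hypothesis": hypothesis,
            # Binary tags map onto only two of MNLI's three classes;
            # 'neutral' has no natural counterpart in our data.
            "label": "entailment" if is_positive else "contradiction",
        })
    return records

records = to_nli_records([("Battery easily lasts two days", 1),
                          ("Shipping box was damaged", 0)])
print(records[0]["label"])  # entailment
```

The asymmetry in the comment is exactly the mismatch described above: a two-class tagging scheme never exercises the third NLI class.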
- RTE (Recognizing Textual Entailment) models: not a great outcome either
We also tried RTE models, which have the desirable property of being binary (as opposed to ternary, as is the case with MNLI/SNLI). This did not give any great improvement in accuracy either, likely because the RTE dataset is smaller than MNLI/SNLI, so the RTE model 'knows' less than an SNLI+MNLI model to begin with.
- Inconsistent models
However, by far the biggest issue with MNLI-trained models is how inconsistent they are. A trivial example:
Now, if you go by the method in Joe's blog, ignoring the neutral logits, you will find that "I like you" and "I love you" form a contradictory pair 🤦 (of course, "I like you" and "I hate you" are also a contradictory pair, to a stronger degree).
While this is a trivial example, many such examples abound where a sentence pair produces the same output with the hypothesis negated, and there is basically no way to fix this at scale on an MNLI-trained model. (As an aside – should the model _ever_ do this?)
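The failure mode above can be sketched with made-up logits (not real model output): a pair that the full three-way softmax would call neutral becomes 'contradiction' the moment the neutral logit is dropped.

```python
import numpy as np

LABELS = ["entailment", "neutral", "contradiction"]

def predict(logits, ignore_neutral=False):
    """Pick an NLI label, optionally dropping 'neutral' as in the
    zero-shot recipe."""
    logits = np.asarray(logits, dtype=float)
    if ignore_neutral:
        kept = logits[[0, 2]]  # keep only entailment and contradiction
        return ["entailment", "contradiction"][int(kept.argmax())]
    return LABELS[int(logits.argmax())]

# Hypothetical logits for premise "I like you", hypothesis "I love you":
# the model is mostly 'neutral' but slightly favours contradiction over
# entailment among the remaining two classes.
logits = [0.2, 2.5, 0.6]
print(predict(logits))                       # neutral
print(predict(logits, ignore_neutral=True))  # contradiction
```

Nothing in the two-class reduction forces the model to flip when the hypothesis is negated, which is why the inconsistency cannot be patched at inference time.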
Once again, this problem exists because we are operating with limited tagging data: taking just a language model and fine-tuning it on our downstream task does not work for us, as the limited samples lead to overfitting.
- Facebook ANLI
A note must be made about Facebook's ANLI (Adversarial NLI) https://arxiv.org/abs/1910.14599 here. The drawbacks of MNLI-trained models are obvious; the ANLI authors recognized this and proposed new methods to tackle MNLI/SNLI's drawbacks. (This is not the first such effort – as just one example, McCoy et al. point out the same issues and developed the HANS dataset: https://arxiv.org/abs/1902.01007.) The ANLI code is at https://github.com/facebookresearch/anli, and Yixin Nie has now also released pre-trained models on Hugging Face: https://huggingface.co/ynie
If you are going to use zero-shot methods, you might want to start with one of the ANLI-trained models first, as the outcome will likely be more robust.
Pattern-Exploiting Training (PET)
Timo Schick's Pattern-Exploiting Training (PET) https://github.com/timoschick/pet didn't work out for us in its earlier avatar, but I'm happy to report that it seems to be producing good results for us now. The papers are at https://arxiv.org/abs/2001.07676 and https://arxiv.org/abs/2009.07118, and there is a YouTube video explaining the approach: https://www.youtube.com/watch?v=01jRE9noSWw
Results vary based on the specific questions but overall seem promising (given the other options). One advantage of PET is that it can train an RTE-style model, so binary labeled data is handled well. Training is relatively fast if you turn off distillation (Colab + GPU works fine). We agree with the author: engineering PVPs (pattern–verbalizer pairs) remains a challenge, and base model selection matters too (the authors recommend the ALBERT flavor).
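To make 'PVP' concrete, here is a minimal sketch of a pattern–verbalizer pair for a hypothetical 'is this a product benefit?' question. The pattern text and verbalizer words below are our own illustrative choices, not examples from the PET paper.

```python
# Pattern: wrap the input in a cloze-style template with a [MASK] slot
# that the underlying masked language model will fill in.
def pattern(text):
    return f'"{text}" – this describes a benefit of the product? [MASK].'

# Verbalizer: map each task label to a single word the masked LM
# should predict at the [MASK] position.
VERBALIZER = {"benefit": "Yes", "not_benefit": "No"}

print(pattern("Battery easily lasts two days"))
```

The engineering challenge mentioned above is exactly here: small changes to the template wording or the verbalizer words can move accuracy noticeably, and there is no principled way to pick them other than validation.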
PET can potentially suffer from the same issue as the MNLI example above, where a pair of sentences and the same pair with the hypothesis negated produce the same result. The good part, however, is that you can (and should!) actually train the model both ways. Our finding is that this "2-way" approach (as we like to call it) produces a more robust model (though more validation is needed).
However, PET doesn't always beat ZSL, and this is still a far cry from the 'done' models we see in sentiment land. A long way to go, indeed!
- Can GPT-3 solve this? Likely it can do better; however, I am still on the waiting list for access, so if someone can try this out, please let us know! (And if you can whitelist me for access to GPT-3, even better! 😉 )
- Data augmentation methods certainly help, and we use them in all our approaches.
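As one example of the kind of lightweight augmentation that helps in low-resource settings, here is an EDA-style random word deletion (an illustration of the technique in general, not necessarily the exact method we use):

```python
import random

def random_deletion(text, p=0.1, seed=None):
    """EDA-style augmentation: drop each word with probability p,
    always keeping at least one word so the example stays non-empty."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)

print(random_deletion("the battery easily lasts two full days", p=0.2, seed=0))
```

Each pass with a different seed yields a slightly perturbed copy of the sentence, multiplying a small tagged set into a larger training set.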
- “Model ensembles are a pretty much guaranteed way to gain 2% of accuracy on anything” – Andrej Karpathy. ‘nuff said. 🙂
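A minimal sketch of the kind of ensembling meant here – averaging per-class probabilities from several models and taking the argmax of the mean (made-up model outputs used in place of real ones):

```python
import numpy as np

def ensemble_predict(prob_vectors):
    """Average the per-class probability vectors from several models
    and return the winning class index plus the mean distribution."""
    mean = np.mean(np.asarray(prob_vectors, dtype=float), axis=0)
    return int(mean.argmax()), mean

# Three hypothetical model outputs over two classes; models 1 and 3
# outvote model 2, so the ensemble picks class 0.
label, mean = ensemble_predict([[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]])
print(label)  # 0
```

Simple probability averaging is the most common variant; majority voting or weighting by validation accuracy are easy extensions of the same idea.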
- We are always looking at a number of other approaches. Unsupervised Commonsense Question Answering with Self-Talk https://arxiv.org/pdf/2004.05483.pdf seems promising, though we are yet to evaluate it.
It is also heartening to see that this area of NLP is getting more and more focus. See, for example, the "Limited Labeled Data" track https://lld-workshop.github.io/ at ICLR 2019 (it has links to the papers as well).
We have indeed come a long way from sentiment classification. NLP methods to extract deep meaning from data have seen tremendous interest since BERT, and it is exciting to see so much progress in so little time. This is a fast-evolving field, and Bewgle is committed to being at the forefront of its implementation and real-life use cases. We would love to talk to you! Do share your thoughts, point out things we missed, tell us if you'd like to join us, or just say hi – firstname.lastname@example.org