Semi-Supervised Insight Generation from Petabyte Scale Text Data

Tech Triveni speaker Samiran Roy

Samiran Roy

Senior Lead Data Scientist

Envestnet Yodlee

About Samiran Roy

Samiran Roy (Masters in Computer Science, IIT Bombay) is currently working as a Senior Lead Data Scientist at Envestnet Yodlee. I work on deploying products based on deep learning, reinforcement learning, and semi-supervised learning.

During my Masters, I spent most of my time developing deep learning, reinforcement learning, and computer vision software. I worked on a lot of cool AI agents for self-driving cars, robot soccer, Atari games, and Carrom. I also interned at Scilab, where I reverse engineered functions from the MATLAB Image Processing Toolbox.

As part of my MTech thesis, I developed a multi-armed bandit framework for autonomous agents to supervise their own training.

I have spoken on various data science topics at ODSC 2018/19. Here is a link to a recent video: https://www.youtube.com/watch?v=WqCfAgJ3oRw&t=1248s


Session

Existing state-of-the-art supervised methods in machine learning require large amounts of annotated data to achieve good performance and generalization. However, manually constructing such a training data set, for example one with sentiment labels, is labor-intensive and time-consuming. With the proliferation of data acquisition in domains such as images, text, and video, we acquire data faster than we can label it. Techniques that reduce the amount of labeled data needed to achieve competitive accuracy are therefore of paramount importance for deploying scalable, data-driven, real-world solutions.

At Envestnet | Yodlee, we have deployed several state-of-the-art machine learning solutions that process millions of data points daily under very stringent service level commitments. A key aspect of our natural language processing solutions is semi-supervised learning (SSL), a family of methods that also use unlabeled data for training, typically a small amount of labeled data combined with a large amount of unlabeled data. Purely supervised solutions fail to exploit the rich syntactic structure of the unlabeled data to improve decision boundaries. There is an abundance of published work in the field, but few papers have succeeded in showing significantly better results than state-of-the-art supervised learning. Often, methods rely on simplifying assumptions that fail to transfer to real-world scenarios, and there is a lack of practical guidelines for deploying effective SSL solutions. We attempt to bridge that gap by sharing what we have learned from successful SSL models deployed in production.
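
To make the idea of combining a small labeled set with a large unlabeled set concrete, here is a minimal sketch of self-training (pseudo-labeling), one common SSL technique. It is an illustration only and not the production approach covered in the talk. The nearest-centroid classifier, the toy 2-D points, and names such as `self_train` and `min_margin` are all assumptions chosen for brevity.

```python
import math

def centroids(points, labels):
    """Compute the mean point of each class."""
    sums = {}
    for (x, y), c in zip(points, labels):
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def predict(cents, p):
    """Return (label, margin) for point p.

    The margin is how much closer the nearest centroid is than the
    runner-up; it serves as a crude confidence score.
    Assumes at least two classes.
    """
    dists = sorted((math.dist(p, c), label) for label, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled points."""
    points = [p for p, _ in labeled]
    labels = [c for _, c in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, labels)
        confident, rest = [], []
        for p in pool:
            label, margin = predict(cents, p)
            (confident if margin >= min_margin else rest).append((p, label))
        if not confident:
            break  # nothing left that the model is sure about
        for p, label in confident:
            points.append(p)
            labels.append(label)
        pool = [p for p, _ in rest]
    return centroids(points, labels), pool

# Toy data (hypothetical): two labeled points, five unlabeled ones.
labeled = [((0.0, 0.0), "a"), ((4.0, 4.0), "b")]
unlabeled = [(0.5, 0.2), (1.0, 0.5), (3.5, 3.8), (3.0, 3.5), (2.0, 2.0)]

model, leftover = self_train(labeled, unlabeled)
print(predict(model, (0.8, 0.3))[0], predict(model, (3.2, 3.9))[0], leftover)
```

In this sketch the point halfway between the two classes never clears the confidence threshold, so it stays unlabeled instead of being forced into a class. Choosing that threshold is one of the practical decisions that determines whether pseudo-labels help or hurt.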
