Deepak John Reji

Natural Language Processing (NLP) has gained a lot of popularity in modern Data Science. – Deepak John Reji

Interview with Deepak John Reji

Deepak John Reji

NLP Engineer, Researcher, YouTuber
ERM : Environmental Resources Management
Kerala, India

Q1. Please share your educational and professional journey with us?

I graduated from Trivandrum with a bachelor’s degree in electronics and communications engineering. Early on, I had the opportunity to work with several startups, where I was exposed to R programming for analyzing users’ workflows in web applications, and that’s how my Data Science career began. Dr Brijesh Madhavan, my mentor, helped me steer my career path towards Machine Learning and AI. I collaborated on a startup idea in education analytics called “Stats Envision,” which aimed to improve student success by assessing different indicators of student growth using inferential statistics and predictive analytics. After that, I joined Arcadis Consulting, where I focused on core consulting in Data Science solutions for Natural Assets and Environment. My interest in Environmental Science and ESG led to my current position at ERM.

Q2. What is the role of NLP in a project, and where is it used? Is NLP an extension of Machine Learning and Deep Learning?

Natural Language Processing (NLP) has gained a lot of popularity in modern Data Science because businesses have started to understand and leverage the power of AI for data mining. The contextual understanding of NLP models, together with their adaptability to domain-specific data, makes them useful for a range of complex problems. With the advent of transformer models such as BERT, GPT-3, and their variants, businesses and developers gained confidence in these models’ ability to replicate what a human accomplishes on various language tasks. On several benchmarks, they have even outperformed humans.

The evolution of Natural Language Processing (NLP) from traditional rule-based linguistic analysis to modern contextual word vectors has touched on all facets of Machine Learning and Deep Learning. When word2vec was released, it opened up a new way of thinking about textual challenges. It’s a two-layer neural network trained to reconstruct the linguistic contexts of words. From there, a lot of work went into refining that method; later embeddings such as GloVe, fastText, InferSent, ELMo, and others significantly improved performance, and when Google released BERT in late 2018, it marked a defining moment in how we tackle NLP problems.

Because every organisation deals with a large amount of unstructured textual data, there is a lot of room for different approaches. In certain cases, even simple pattern-matching logic or frequency-based representations like TF-IDF/CountVectorizer can help. In many textual use cases, machine learning algorithms such as Naive Bayes, logistic regression, and others are still used. Problems involving human-level understanding (usually domain-related) are considered complex NLP problems, and they are solved using advanced neural network models built on top of these large embedding models.
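
As a rough illustration of the frequency-based approach mentioned above, here is a minimal TF-IDF sketch in plain Python. It is not taken from the interview: the function names (`tokenize`, `tfidf_vectors`, `cosine`), the whitespace tokenizer, the smoothed IDF formula, and the three sample sentences are all assumptions chosen for the example. A real project would more likely rely on a library vectorizer.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample sentences, for illustration only.
docs = [
    "soil contamination found at the site",
    "groundwater contamination found near the site",
    "quarterly revenue grew this year",
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_related > sim_unrelated)
```

The two contamination sentences share several terms and so score higher than the unrelated pair, which has no terms in common. Vectors like these could then feed a simple classifier such as Naive Bayes or logistic regression.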

Q3. What attracted you to Environmental Science and led you to pursue a PGD in Environmental & Sustainable Development?

When I first started working on NLP challenges for environmental remediation/due diligence, I was fascinated by the data I was working with. I had the opportunity to team up with SMEs who are Environmental Scientists. The model that had to be trained was tricky for me because it covered a variety of environmental remediation subjects, and validating the results was even more confusing. It was hard to understand why the embedding model weighted one sentence over another for a given topic. It was also difficult to visualize the bias in these sentences. The project was a success, with human effort reduced from 20 hours to 1-2 hours.

I realized the importance of domain knowledge in model development. This prompted me to conduct extensive research and study on the environment, particularly on environmental remediation, sustainability, and other related topics.

Q4. You have great experience in Data Science consultancy. What kind of hurdles did you face as a Data Science consultant?

The most challenging parts I faced were domain understanding and model interpretability. For example, when I worked on a project called Predicting Bridge Fatigue, I created an ML model that considered a variety of parameters that the literature associates with fatigue. But then I had to compare this model to existing models developed by structural engineers, who used various simulation techniques and domain knowledge to predict fatigue. When it came to explainability, I didn’t have much to say other than that I could predict fatigue given this A, B, C… data. That’s not a convincing explanation for a major problem like predicting fatigue and estimating maintenance costs. When it comes to solving real-world problems, we must often take a different approach to find solutions. It also requires a great deal of domain knowledge aligned with the business requirement.

Q5. Which is the hardest part of any NLP Data Science project?

The hardest part, in my opinion, is providing the right data to the model and fine-tuning it. Everything falls into place if the data is correct. Once that hurdle is overcome, the next step is to run the model and get feedback on its output from SMEs/testing teams. The amount of fine-tuning required is always difficult to determine. It’s always a trade-off between how much we can fine-tune and how much the business needs, taking into account its timeframe and budget.

Q6. Which type of Machine Learning is highly used to solve real-world problems?

Supervised learning approaches, with a little support from unsupervised learning, are generally applied in business, in my opinion. The relevance of model supervision and interpretability is always emphasized.

Q7. Which are the best online courses for NLP and Data Science? You also run a YouTube channel; could you tell us briefly about it?

Coursera, Udacity, and other online learning platforms offer excellent courses with a proper learning curriculum. However, there are also many free courses available on YouTube. Nowadays, tech stacks such as Keras, spaCy, Hugging Face, and others provide tutorials on their official pages or on YouTube.

I have a YouTube channel called “D4 Data,” which I launched last year and where I publish content about natural language processing tools, prototypes, and tutorials. I recently began a podcast series about AI and sustainability, and it has been well received.



Q8. Everyone is crazy about Data Science right now. Is the Data Scientist job just hype, or is there real demand in the market?

I believe that the importance of Data Science has been widely acknowledged in recent years. However, Data Science is merely one component of a larger product solution. Other dimensions to consider and recognize include Business Analysis, Data Engineering, Software Development, Testing, Cloud Engineering, and so on. I also believe that the term “Data Scientist” is interpreted differently across the industry.
