Rajdeep Pal

Supervised Learning Algorithms are still mostly used in the industry. – Rajdeep Pal

Interview with Rajdeep Pal

Data Science Engineer
Bengaluru, Karnataka, India

Q1. Please share your educational and professional journey with us.

I did my B.E. in Information Technology from PES Institute of Technology, Bangalore. Primarily self-taught in Data Science and Machine Learning, I pursued various internships and projects in those fields throughout my education. During placements, I started looking outside because the companies offered few data-science-related jobs.

I started my career at Stylumia, an AI-based fashion-retail startup in Bangalore. Stylumia helped hone my skills as a data scientist as well as an engineer. At Stylumia we solved various problems faced by the fashion retail industry, from building a smart supply chain to optimising the catalogues of leading fashion brands. Working at a startup helped me sharpen my skills across various verticals, from machine learning and DevOps to client interactions.

Later I joined Rakuten as a Data Scientist and Developer in their flagship CustomerDNA team, which helps analyze and profile customers in the Rakuten Ecosystem. At Rakuten, I had the opportunity to work with a tremendous amount of data across its various services.

Q2. What attracted you to the Data Science Domain?

I think the idea of a computer understanding a problem and being able to solve it was extremely exciting to me. Obviously, it’s not as simple as that; however, I still believe Data Science and Machine Learning have the ability and the potential to solve complex problems across the world, problems that may have seemed impossible to solve previously. Personally, I like how each problem is new, and something that worked for a previous problem might not work for this one. I feel this keeps things interesting and pushes me to try newer techniques and methods.

Q3. What is the importance of Data in any Data Science Project? Does clean data help to solve the problem fast and save time?

Of course, data is extremely important for a data science project, and clean data is always better. However, that is rarely the case in real-world applications; most of them need us to analyze the data and clean it ourselves. I think a data scientist should never try to skip or outsource the data cleaning part of the pipeline, as it’s fundamental to understanding the nuances of the dataset. In fact, in my view, getting our hands dirty and being able to understand the data is a more vital skill than knowing all the latest machine learning algorithms.

Q4. Which one thing do you want to change in the Data Science domain, and why?

I conduct a lot of interviews for Data Scientist and Engineer positions, and from this experience I have seen that most people concentrate only on the highly theoretical aspects of Data Science. However, even today, at least in India, such research-focused roles are very few, and most require experience or a PhD in the field.

Roles for applied data scientists, who spend close to 50% of their time on research and the rest on development, are much more readily available. Hence, learning to code is also an extremely important skill to have. Know your algorithms and data structures. Be well versed in API development and know how to handle databases.

Q5. Between Supervised and Unsupervised, which type of Machine Learning is used more to solve real-world problems?

Unsupervised learning is still not very deterministic. Obviously, we use clustering and other unsupervised algorithms, but supervised learning algorithms are still the ones mostly used in the industry, primarily due to the vast amount of data available and also due to their deterministic nature.

Q6. Which are the best Online Courses for Data Scientists?

I feel the series by Andrew Ng is still the gospel for anyone looking to learn about machine learning. Furthermore, once you are well versed in the basics, the material by Abhishek Thakur and others is good for a deeper understanding. I personally feel the PG degrees offered by various institutes are not useful; instead, learning by solving various real-life problems is more important. Start by solving Kaggle competitions. Then venture out and learn to collect data for unique problems.

Learn about web crawling and collect data. Apply what you have learned to new problems and see if it works. This will help you build the skill set to work with raw data and take you through the steps of cleaning data, performing EDA, feature engineering, etc. It will also help you build the intuition for what kind of model might work on what kind of data.

Q7. What skills can set your profile apart in the industry?

As mentioned before, knowing how to code is a critical skill to have. Experience in API development and SQL is sought after. Furthermore, experience with Hadoop and Spark is very valuable. Some roles also prefer candidates with DevOps experience. A strong foundation in computer science along with knowledge and experience in Data Science is a combination to look out for.

Q8. What is the hardest part of any Data Science Project?

There was this graph I had seen on the internet:

Photo Credit: Rajdeep Pal

I couldn’t agree more with this. More often than not, projects don’t require us to spend too much time on the actual ML algorithm. Thanks to the available libraries, it’s extremely simple to build the model. The majority of the time goes into building the pipeline and deploying the model in production. How will the data flow to and from the model? How do I scale it so it can handle the data load of my system? Does the model return the result in a format that I can consume? These questions take up much more time than the actual model building.

Q9. Which Machine Learning models, such as SVM or KNN, have you used the most in your career?

Well, that depends on the nature of the problem you are solving. When I was working primarily on computer vision problems, it was mostly CNNs and some clustering algorithms. Traditional image processing algorithms were also very widely used. When dealing with tabular data, tree-based ensemble learning methods such as XGBoost and CatBoost are my go-to algorithms.

Read more about Rajdeep Pal @:

LinkedIn: https://www.linkedin.com/in/rajdeeppal/