I wrote an article a while ago about econometrics (Econometrics 101 for Data Scientists). The article resonated well with readers, but it was an introductory piece for data science people who might not otherwise be familiar with the domain.
Inspired by the response to that article, today I’m taking it to the next level with a more comprehensive treatment. I’ll mostly focus on the methods, tools, and techniques used in econometrics that data scientists will benefit from.
Econometrics is a sub-domain of economics that applies mathematical and statistical models with economic theories to understand, explain…
You are not alone if you have had a hard time understanding what exactly regularization is and how it works. Regularization can be a very confusing term, and I’m attempting to clear up some of that confusion in this article.
In this article I’ll do three things: (a) define the problem that we want to tackle with regularization; then (b) examine how exactly regularization helps; and finally (c) explain how regularization works in action.
Data scientists take great care during the modeling process to make sure their models work well and are neither underfit nor overfit.
Let’s say you want to…
As a data scientist, if you are asked to find the average income of customers, how would you do that? Having ALL customer data is of course “good to have”, but in reality it almost never exists, nor is it feasible to collect.
Instead, you get a small sample, take measurements on it and make predictions about the whole population. But how confident are you that your sample statistics represent the population?
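One way to put a number on that confidence is a confidence interval around the sample mean. The sketch below is a minimal illustration, not a definitive method: the income figures are made up, the function name `mean_confidence_interval` is hypothetical, and it assumes a normal approximation (for a sample this small, a t-distribution would usually be the more careful choice).

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # z-value for a two-sided interval at the requested confidence level
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_err
    return mean, mean - margin, mean + margin


# Hypothetical incomes (in thousands) from a small sample of customers
incomes = [42, 55, 38, 61, 47, 52, 49, 58, 44, 50]

mean, lower, upper = mean_confidence_interval(incomes)
print(f"sample mean = {mean:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

A larger sample shrinks the standard error, so the interval narrows and the sample mean becomes a more trustworthy stand-in for the population mean.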
Statistical distributions play an important role in measuring such uncertainties and giving you that confidence. Simply speaking, a probability distribution is a function that describes the likelihood of a specific outcome (value) of…
Logistic regression is among the most popular algorithms used to solve classification problems in machine learning. It uses a logistic function to model how different input variables affect the probability of a binary outcome. The available literature often describes the technique in a convoluted way. This article describes the model in simple terms, focusing on building intuition and avoiding complex mathematical formulation as much as possible.
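To make the idea of a logistic function concrete, here is a minimal sketch in plain Python. The coefficients and input values are hypothetical, chosen only for illustration; in practice they would be estimated from data, and the helper names `logistic` and `predict_probability` are my own.

```python
import math


def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, slope):
    # A linear combination (as in linear regression) passed through
    # the logistic function so the result can be read as a probability.
    return logistic(intercept + slope * x)


# Hypothetical coefficients, not fitted to any real data
intercept, slope = -4.0, 0.1

for x in (20, 40, 60):
    p = predict_probability(x, intercept, slope)
    print(f"x = {x}: probability = {p:.3f}")
```

The linear part can take any value, while the logistic function squeezes it into the 0 to 1 range, which is what lets the output be interpreted as the probability of one of the two outcomes.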
I will start with a linear regression problem — which is relatively easy to understand — and build on that to get to logistic regression. Towards the…
Many sophisticated data science algorithms are built from simple building blocks. How quickly you level up your skills largely depends on how strong your foundation is. In the next few articles, I’ll touch upon a few such foundational topics. Hopefully, learning those topics will make your journey a pleasant and fun experience.
Today’s topic is Python lists.
Most people would learn tools first and then practice them with a few examples. I take the opposite route — focus on problems…
I had this thought a while back — is learning Matplotlib essential for beginners? Or can they get away with Seaborn? That thought came back recently while mentoring a cohort of data science students.
I know Matplotlib is a great library; it gives a lot of flexibility for customized data visualization.
But I also know it’s a complex library, especially for people who are new to data science and data visualization. It can be intimidating to some too!
In most data science curriculums Matplotlib and Seaborn are taught simultaneously, which, in my observation, creates quite a bit of confusion among…
If you want to get into the field of data science and machine learning, here’s your pathway:
Learn Data Science 101. Focus on what business problems it solves, what the different sub-disciplines are, and so on. Follow people in data science and see what they say. Maybe listen to some data science podcasts.
Pick a language you are comfortable with. I recommend Python, for various reasons I won’t go into today.
Get a superficial understanding of common data science and machine learning algorithms, how they work, and what business problems they solve:
When we have more money in our pockets, we tend to spend more; almost everyone knows that. But what’s often not known is the exact relationship between income and expenditure, i.e. how much people would spend on a known income.
An approximate solution is to build a statistical model by observing people’s income and expenditure. The more data there is, the better the model. We can then take this model and apply it to an unknown place or population with reasonable confidence.
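As a rough illustration of such a model, the sketch below fits a straight line to income and expenditure with ordinary least squares. The numbers are invented for the example, and a single straight line is an assumption on my part; the article itself may use a different specification.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations (in thousands), for illustration only
income = [30, 40, 50, 60, 70]
expenditure = [25, 31, 38, 44, 52]

intercept, slope = fit_line(income, expenditure)

# Apply the fitted model to an income that was not observed
new_income = 55
predicted = intercept + slope * new_income
print(f"expenditure = {intercept:.2f} + {slope:.2f} * income")
print(f"predicted expenditure at income {new_income}: {predicted:.2f}")
```

The slope is the part of each additional unit of income that the model expects to be spent, which is the relationship the paragraph above is after.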
But the model wouldn’t be able to make a 100% accurate prediction, because people’s…
In programming, a loop is a logical structure that repeats a sequence of instructions until certain conditions are met. Looping allows for repeating the same set of tasks on every item in an iterable object, until all items are exhausted or a looping condition is reached.
Looping is applied to iterables: objects that store a sequence of values in specific data formats, such as dictionaries. The beauty of loops is that you write the program once and use it on as many elements as needed.
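A minimal sketch of both ideas, using a made-up dictionary of prices: a `for` loop that runs until the items are exhausted, and a `while` loop that runs until a condition is reached.

```python
# Hypothetical data: item names mapped to prices
prices = {"apple": 1.20, "bread": 2.50, "milk": 0.99}

# for loop: the same instructions run once per item until none are left
total = 0.0
for item, price in prices.items():
    print(f"{item}: {price:.2f}")
    total += price
print(f"total: {total:.2f}")

# while loop: repeat until the looping condition is no longer true
count = 0
while count < len(prices):
    count += 1
print(f"counted {count} items")
```

The loop body is written once, and it would work unchanged whether the dictionary held three items or three thousand.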
The purpose of this article is to implement some intermediate looping challenges applied to four Python…