I wrote an article a while ago about econometrics (Econometrics 101 for Data Scientists). The article resonated with readers, but it was an introductory piece for data science people who might not otherwise be familiar with the domain.

Inspired by the response to that article, today I’m attempting to take it to the next level by making it a bit more comprehensive. I’ll mostly focus on the methods, tools, and techniques used in econometrics that data scientists will benefit from.

Econometrics is a sub-domain of economics that applies mathematical and statistical models, together with economic theory, to understand, explain, and measure causality in economic systems. …

Well, it’s time for another installment of time series analysis. This time I’m focusing on two things: a) converting a normal dataframe into the right format for analysis; and b) making sense of that data through visualization.

The first objective is quite essential. Wrangling and cleaning up data is a big part of data science, and even more so in time series analysis. Even for basic analysis, it is easier to work with data that is in good shape. …
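The first step described above can be sketched in a few lines of pandas. This is a minimal illustration with made-up column names and values, not the article’s actual dataset: parse the dates, set them as the index, and the dataframe is ready for time series tooling.

```python
import pandas as pd

# hypothetical "normal" dataframe with dates stored as plain strings
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "sales": [100, 120, 90],
})

# parse the date strings and make them the index
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").sort_index()

# the DatetimeIndex unlocks time series operations, e.g. resampling
weekly = df.resample("W").sum()
```

With the `DatetimeIndex` in place, slicing by date ranges, resampling, and rolling windows all become one-liners.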

I find that the key strength of Medium as a writing platform is its simplicity. Medium simplifies writing to such a point that it forces writers to focus on the content rather than font size and style.

I was going to write about data visualization, but I wanted to make the point that there is value in simplicity: it reduces barriers to entry and improves the quality of content.

A typical data visualization represents three things:

- **information** — what the figure tells us
- **aesthetics** — how beautiful it looks
- **style** — how it is structured

Among all these, the most value from a visualization actually comes from information — e.g. what the data tell us about the relationship between two or more variables — whereas aesthetics and style add relatively little. If we could quantify it, a hypothetical equation might look something like…

It was before the Stack Overflow era, so not much help was available online. Some people would print out cheatsheets of different kinds and hang them on the walls around their workstations. Having a couple of pages of frequently used code in front of the desk was an efficient way of correcting syntax errors.

Help is now at your fingertips, only a few clicks away. But an old-fashioned cheatsheet is still a valuable time-saving tool. It’s even more the case if you have to juggle multiple programming languages.

Data scientists spend most of their time on data wrangling, so being efficient at it is a valuable skill. The purpose of this article is to show how to build a “cheatsheet” for data wrangling following a typical analytics workflow. I am not going to write down all the code needed at every step of the way; rather, I’ll focus on how to compile a cheatsheet that serves your purpose, so you can spend more time coding and less time searching for the right syntax. …
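To make the idea concrete, here is a hypothetical cheatsheet entry for one workflow step ("inspect and clean") — the kind of short, task-oriented snippet worth collecting. The dataframe and column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "y"]})

df.info()                             # column types and non-null counts
df.isna().sum()                       # missing values per column
df = df.dropna(subset=["a"])          # drop rows missing 'a'
df = df.drop_duplicates()             # remove exact duplicate rows
df["b"] = df["b"].astype("category")  # compact categorical dtype
```

A page of such entries, grouped by workflow step, is usually all a working cheatsheet needs.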

Previously I wrote a couple of pieces on multivariate modeling, but they both focused on time series forecasting. If curious, go ahead and check out the posts on vector autoregression and panel data modeling. Writing on multivariate regression (i.e. multiple linear regression) was always on my list, but something else got in the way — I started a series on anomaly detection techniques! Once I started that series, I could not stop until I had written 11 consecutive posts.

Today I’m back with multiple regression and here’s the plan — first I’ll define in broad terms what multiple regression is, then list some real-world use cases as examples. The second part is going to be a rapid implementation of multiple regression in Python, just to give a broad intuition. In the third part, I’ll dive a bit deeper following the typical machine learning workflow. …
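A "rapid implementation" along these lines can be sketched with plain NumPy least squares (the article’s actual walkthrough may well use a library such as scikit-learn or statsmodels). The predictors and coefficients below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # three hypothetical predictors
# true relationship: y = 2*x1 - 1*x2 + 0.5*x3 + small noise
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# add an intercept column and solve ordinary least squares
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
# coef ≈ [0.0, 2.0, -1.0, 0.5] — the fit recovers the true coefficients
```

With only 100 samples and little noise, the estimated coefficients land very close to the values used to generate the data, which is the whole intuition of the method.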

Logistic regression is among the most popular algorithms used to solve classification problems in machine learning. It uses a logistic function to model how different input variables affect the probability of a binary outcome. As described in the available literature, the technique can seem quite convoluted. The purpose of this article is to describe the model in simple terms, primarily building intuition while avoiding complex mathematical formulation as much as possible.

I will start with a linear regression problem — which is relatively easy to understand — and build on that to get to logistic regression. Towards the end, I will cover additional topics such as cost function and maximum likelihood estimation. …
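The bridge from linear to logistic regression can be shown in a few lines: a linear score is passed through the logistic (sigmoid) function, which squeezes it into a probability between 0 and 1. The weight and inputs below are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical fitted linear model: score = w*x + b
w, b = 1.5, -0.5
x = np.array([-2.0, 0.0, 2.0])

p = sigmoid(w * x + b)  # probability of the positive class for each x
```

The linear part is exactly the regression we already understand; the sigmoid is the only new ingredient, turning an unbounded score into a valid probability.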

As a data scientist, if you are asked to find the average income of customers, how would you do that? Having ALL customer data is of course “good to have”, but in reality, it rarely exists and is not feasible to collect.

Instead, you get a small sample, take measurements on it and make predictions about the whole population. But how confident are you that your sample statistics represent the population?

Statistical distributions play an important role in measuring such uncertainties and giving you that confidence. Simply speaking, a probability distribution is a function that describes the likelihood of a specific outcome (value) of a variable. Average income is a continuous variable, meaning it can take any value — $20k/yr or $80.9k/yr or anything in between and beyond. …
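The sample-to-population step described above can be sketched as follows: take a sample, compute its mean, and attach a confidence interval using the standard error. The incomes here are simulated, and the 95% interval uses the normal approximation.

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical sample of 200 customer incomes ($/yr)
incomes = rng.normal(loc=50_000, scale=15_000, size=200)

mean = incomes.mean()
sem = incomes.std(ddof=1) / np.sqrt(len(incomes))  # standard error of the mean
ci = (mean - 1.96 * sem, mean + 1.96 * sem)        # ~95% confidence interval
```

The interval quantifies exactly the question posed above: how confident we can be that the sample mean represents the population mean.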

You are not alone if you have had a hard time understanding what exactly regularization is and how it works. It can be a very confusing term, and I’m attempting to clear up some of that confusion in this article.

In this article I’ll do three things: (a) define the problem that we want to tackle with regularization; then (b) examine how exactly regularization helps; and finally (c) explain how regularization works in action.

Data scientists take great care during the modeling process to make sure their models work well and are neither under- nor overfit.

Let’s say you want to predict house prices based on some features. You start with one feature, floor area, and you build your first regression model. …
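To preview where this is heading, here is a toy sketch of L2 (ridge) regularization: adding a penalty term to least squares shrinks the coefficients compared to the plain fit. The data and penalty strength are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                 # 30 houses, 5 hypothetical features
y = X[:, 0] + rng.normal(scale=0.5, size=30)  # price driven mainly by feature 0

lam = 10.0                 # regularization strength (lambda)
I = np.eye(X.shape[1])

# closed-form solutions: plain least squares vs. ridge
ols   = np.linalg.solve(X.T @ X,           X.T @ y)
ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

# the penalized solution has a smaller coefficient norm
shrunk = np.linalg.norm(ridge) < np.linalg.norm(ols)
```

Shrinking the coefficients is how the penalty tames an overfit model, which is exactly the problem the floor-area example builds up to.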

I wrote about cluster analysis in the previous article (Clustering: concepts, tools and algorithms), where I had a short discussion on data normalization. I touched upon how data normalization impacts clustering, and unsupervised algorithms generally.

But I felt I missed the opportunity to go into more detail. Of course, that wasn’t the focus of that article, so today I want to pick up on it.

Let’s first define exactly what normalization is.

Let’s say we have a dataset containing two variables: time traveled and distance covered. Time is measured in hours (e.g. 5, 10, 25 hours) and distance in miles (e.g. 500, 800, 1200 miles). …
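One common form of normalization, min-max scaling, can be shown directly on the two variables from the example: each is rescaled to the [0, 1] range so that hours and miles become comparable.

```python
import numpy as np

time_hours     = np.array([5.0, 10.0, 25.0])
distance_miles = np.array([500.0, 800.0, 1200.0])

def min_max(x):
    """Rescale values to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

time_norm = min_max(time_hours)      # [0.0, 0.25, 1.0]
dist_norm = min_max(distance_miles)  # [0.0, ~0.43, 1.0]
```

After scaling, neither variable dominates a distance computation just because of its units — which is why normalization matters so much for clustering.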

Let’s do this experiment — take a look at the following figures and see if you can identify which figure has clustered data, *A* or *B*?