Ethan Kunin
May 6, 2021

How to Differentiate Between Categorical and Numeric Variables

https://junilearning.com/blog/math-practice/intro-to-variable-expressions/

When presented with a dataset, a typical first step is to explore the variables you will be working with. In a data frame, these are the columns. The input columns are often called features, predictors, or explanatory variables, and the column we want to predict is the target or response. In their rawest form, these are simply independent and dependent variables. In other words, the characteristics that determine our outcome, and the outcome itself.

Unfortunately, interpreting these characteristics may not be easy. If the source who generated the data is kind, they have left a description of the variables. Otherwise, it is up to us to decipher them using domain knowledge and exploratory data analysis (EDA).

The first step in interpreting our variables is determining whether they are categorical or numeric. These are sometimes referred to as qualitative and quantitative variables. Think of categorical variables as types or labels, for example hair color or college major. Numeric variables, on the other hand, take the form of quantities and describe an amount, or how much of something there is.

Numeric variables are either discrete or continuous. A continuous variable can take any value within a range. For example, a person’s height can be measured to any level of precision, so between 5’11” and 6’0” there are infinitely many possible values. Discrete variables take on countable values, typically integers. When buying apples from a grocery store, there is no choice to purchase 4.5 apples, only 4 or 5.

Categorical variables, as their name suggests, represent categories or types. They answer questions like country of origin, type of tennis racket, or favorite restaurant. Don’t assume they can’t be represented by numbers, because data sets sometimes encode categories as numbers (this is common for binary variables). There are two types of categorical variables, nominal and ordinal. Nominal variables do not have a discernible order. For example, we do not have a hierarchical system that organizes country of origin in a meaningful fashion. Ordinal variables have an intrinsic ranking. For a tennis player, there are leagues which grow in level of competition from USTA to ITF to Futures to ATP. They are categorical but do have an order and could be represented by numbers. Think of ordinal variables like discrete numeric variables that describe a type rather than an amount, with the caveat that the gaps between ranks are not necessarily equal.
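To make the nominal versus ordinal distinction concrete, here is a minimal sketch using pandas categorical types. The players data frame and its values are made up for illustration, and the league ranking simply follows the tennis example above.

```python
import pandas as pd

# Hypothetical sample data; the level order follows the tennis example above.
levels = ["USTA", "ITF", "Futures", "ATP"]
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "league": ["ITF", "ATP", "USTA", "Futures"],
    "country": ["Spain", "Japan", "Brazil", "Spain"],
})

# Ordinal: declare the ranking explicitly so comparisons and sorting respect it.
players["league"] = pd.Categorical(players["league"], categories=levels, ordered=True)

# Nominal: categories with no order attached.
players["country"] = players["country"].astype("category")

# Integer position of each league within the declared ranking.
league_codes = players["league"].cat.codes.tolist()

# Ordered categories support comparisons; unordered ones do not.
above_itf = players.loc[players["league"] > "ITF", "player"].tolist()

print(league_codes)
print(above_itf)
```

Declaring the order up front keeps the ranking in the data itself instead of relying on alphabetical order, which would rank these leagues incorrectly.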

Furthermore, data can be described by its cardinality. Cardinality refers to how many unique values a variable can assume. For country of origin, cardinality would be the total number of countries in the world. Cardinality can help determine whether a variable is discrete or continuous: a number of unique values close to the number of rows hints at a continuous variable, while a small number hints at a discrete or categorical one.
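As a quick illustration, cardinality can be checked in pandas with nunique. The data frame below is a made-up toy example, not the data set used later in this post.

```python
import pandas as pd

# Hypothetical toy data: one label column and one measured column.
df = pd.DataFrame({
    "country": ["US", "US", "FR", "JP", "FR", "US"],
    "height_in": [70.1, 65.4, 68.9, 71.3, 66.2, 69.8],
})

# Number of unique values in each column.
cardinality = df.nunique()

# Share of rows that hold a unique value, between 0 and 1.
unique_ratio = cardinality / len(df)

print(cardinality)
print(unique_ratio)
```

Here country repeats across rows while every height is different, which matches the intuition that labels repeat and measurements rarely do.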

Now, let’s talk about how we can differentiate variable types using code. First, we can explore this graphically using the seaborn library. Using the Socrata API, I have imported data on Micromobility Vehicle Trips for the city of Austin. Think of electric scooters and bicycles.

I created a column called rev_per_ride which describes the total cost per rider for each scooter trip. Let’s ask some questions to determine what type of variable this is. First, is it categorical or numeric? Since it describes an amount, not a label, it is numeric. Second, is it continuous or discrete? Since it could hypothetically take on any value within its range, it is continuous.

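import matplotlib.pyplot as plt
import seaborn as sns
# dfc is the data frame of trips imported from the Socrata API
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(data=dfc, x='rev_per_ride', kde=True, ax=ax)
ax.set_xlabel('Revenue per ride');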

The histogram is consistent with a continuous variable: revenue per ride is not confined to a handful of isolated values. The smooth shape of the KDE line points the same way, though a KDE is smooth by construction, so treat it as supporting evidence rather than proof.

print(f"Total Values: {len(dfc['rev_per_ride'])}")
print(f"Unique Values: {dfc['rev_per_ride'].nunique()}")
print(f"% Unique Values: {dfc['rev_per_ride'].nunique()/len(dfc['rev_per_ride'])*100:.4}%")

The number of unique values further supports that it is a continuous variable. If it were discrete, the percentage of unique values would typically be much lower.

Next, we look at day of the week. The values are encoded 0–6, each number representing a different day of the week. Even though they are represented by numbers, we know the values are labels, which suggests it is a categorical variable. The day describes a type. Since there is no natural hierarchy for days of the week, we can consider it nominal.

fig, axes = plt.subplots(figsize=(15,6), ncols=2)
sns.histplot(data=dfc, x='day_of_week', ax=axes[0])
sns.regplot(data=dfc, x='day_of_week', y='rev_per_ride', ax=axes[1])
axes[0].set_xlabel('Day of the week')
axes[1].set_ylabel('Revenue per ride')
axes[1].set_xlabel('Day of the week');

The histogram on the left has gaps between values, which suggests the variable is not continuous. Common sense also says a day of the week cannot be represented by a fraction, because days are not divisible. The plot on the right is a scatterplot with a line of best fit through the values. Since the line is flat, there does not appear to be a linear relationship with revenue per ride. This is consistent with a nominal categorical variable. If it were ordinal, we might expect revenue to trend up or down with the day, but here there is no such trend. Keep in mind that a flat line alone does not prove a variable is nominal. It only fails to show an ordering effect.

print(f"Total Values: {len(dfc['day_of_week'])}")
print(f"Unique Values: {dfc['day_of_week'].nunique()}")
print(f"% Unique Values: {dfc['day_of_week'].nunique()/len(dfc['day_of_week'])*100:.4}%")

Additionally, there are very few unique values, which supports that it is a discrete, categorical variable.

In conclusion, it can be challenging to differentiate variables. First, ask yourself if the variable is measuring an amount or a type. Next, if it is an amount, ask yourself if it is divisible or can only take on whole numbers. If the variable measures a type, ask whether the types can be ranked or have no inherent order. This simple thought process can help us greatly in determining variable types.
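The questions above can be roughly sketched as a helper function. This is only a heuristic under stated assumptions: the function name guess_variable_type and the max_categories threshold are hypothetical choices of mine, and the result is a first guess to confirm against the data description, not a verdict.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype


def guess_variable_type(series, max_categories=10):
    """Return a rough first guess at a column's variable type.

    max_categories is an arbitrary, hypothetical cutoff for "few unique values".
    """
    s = series.dropna()
    # Columns already stored as pandas categoricals carry their own ordering flag.
    if isinstance(s.dtype, pd.CategoricalDtype):
        return "ordinal" if s.cat.ordered else "nominal"
    # Text and other non-numeric columns are treated as labels.
    if not is_numeric_dtype(s):
        return "nominal"
    # Few unique numbers may be encoded categories (like days 0-6) or small counts.
    if s.nunique() <= max_categories:
        return "check: encoded category or small count"
    # Many unique whole numbers look like counts; many unique decimals look measured.
    return "discrete" if is_integer_dtype(s) else "continuous"


# Usage on made-up columns.
print(guess_variable_type(pd.Series(["red", "blond", "brown"])))
print(guess_variable_type(pd.Series(range(7)).repeat(5)))
print(guess_variable_type(pd.Series([i * 0.37 for i in range(100)])))
```

The "check" branch is deliberate: a column such as day of the week encoded 0–6 cannot be told apart from a small count by its values alone, so domain knowledge still has the final say.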

Next steps may be to one-hot encode certain categorical variables and possibly normalize continuous variables. Some machine learning models require these transformations before they can use the data.
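As a minimal sketch of what those next steps could look like, the snippet below one-hot encodes a day column with pandas get_dummies and applies min-max scaling to a revenue column. The values are made up, and min-max scaling is just one common way to normalize. It is an assumption on my part, not a requirement.

```python
import pandas as pd

# Hypothetical toy data using the column names from this post.
df = pd.DataFrame({
    "day_of_week": [0, 1, 6, 1],
    "rev_per_ride": [2.0, 4.0, 6.0, 8.0],
})

# One-hot encode: one indicator column per day that appears in the data.
encoded = pd.get_dummies(df, columns=["day_of_week"], prefix="day")

# Min-max normalization: rescale the continuous column to the range [0, 1].
col = encoded["rev_per_ride"]
encoded["rev_per_ride"] = (col - col.min()) / (col.max() - col.min())

print(encoded)
```

Note that get_dummies only creates columns for days present in the data, so a day that never occurs in a sample would be missing from the encoded result.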

Thanks for reading and drop a cheers if you enjoyed the content!
