Too many categories: how to deal with categorical features of high cardinality

Anna Pershukova
7 min read · Nov 10, 2020


Photo by Paul Hanaoka on Unsplash

One of the most common ways to prepare categorical variables for use in machine learning models is one-hot encoding. But like every method, it has its limitations. Have you ever tried to one-hot encode a categorical variable with lots of different values when your dataset is of humble size? Whether it was data about an employee's previous workplace, a patient's medical diagnosis, or a customer's zip code, you know what I'm talking about. One-hot encoding, in this case, creates a large number of sparse features. It increases your chances of overfitting the model and getting modest results. Luckily, there are more tools in the data science toolbox that can improve this transformation and lead to much better outcomes.
Let’s say you’re trying to build a predictive model based on a patient’s medical diagnosis. The dataset you’re working with contains 500+ different conditions. What can you do to use the signals hidden in this data? In this post, we’ll explore 2 general approaches and 4 specific solutions.

Approach 1: Use domain knowledge

Domain knowledge means understanding the real-life processes that generate and influence your data. It helps you focus your effort in the directions that have the most potential to be informative and to help you build a successful model. You can acquire this understanding by getting familiar with research in your domain and by talking to experts and to the engineers involved in data collection.

Keeping in mind what you're trying to predict, you want to extract from the data those aspects that you assume contain valuable signals. Usually, these types of solutions involve mapping your data to heuristics or to some external data source.

Domain knowledge allows you to focus on the most promising directions. Photo by Panagiotis Nikoloutsopoulos on Unsplash

Solution 1: Decrease the number of groups and apply one-hot encoding

You don’t need to reject one-hot encoding entirely. Think of meaningful ways to map your data to several larger categories, then one-hot encode the newly mapped categories. This translates your input into a lower-dimensional space. We want the new categories to:

  1. have a reasonable number of classes compared to your data size, and
  2. represent aspects that are likely to influence your target.

In our example, we can map each medical diagnosis to whether or not patients usually experience severe symptoms:
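For illustration, here is a minimal pandas sketch of this idea. The mapping, the condition names, and the severity assignments are made up for the example, not medical guidance:

```python
import pandas as pd

# Hypothetical mapping from individual diagnoses to a coarser severity group;
# the assignments here are illustrative only
severity_group = {
    "asthma": "usually_severe",
    "type 2 diabetes": "usually_mild",
    "migraine": "usually_severe",
    # ... the full mapping would cover all 500+ conditions
}

df = pd.DataFrame({"diagnosis": ["asthma", "migraine", "type 2 diabetes"]})

# Map to the smaller set of categories, then one-hot encode the result
df["severity_group"] = df["diagnosis"].map(severity_group).fillna("unknown")
encoded = pd.get_dummies(df["severity_group"], prefix="severity")
```

The `fillna("unknown")` step keeps diagnoses that are missing from the mapping in a single catch-all category instead of producing missing values.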

When will these features be useful? Suppose you’re trying to predict whether a patient will show up to their next health checkup on time. The fact that a patient is likely to experience symptoms that affect their everyday life can shape how important they consider the checkup, and that fact can be extracted from the medical diagnosis.

Solution 2: Enrich data with numerical values

Another way to use domain knowledge is quantitative mapping. You can designate each category in your data with a relevant score or statistic.

For example, the mean Google Trends score of a disease over the last 5 years can represent its level of public awareness. Higher public awareness can strengthen a patient’s urge to go for a checkup.

Comparison of Google Trends search terms for asthma and type 2 diabetes.
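A minimal sketch of such a mapping, assuming a hypothetical awareness-score lookup; the numbers below are placeholders, not actual Google Trends values:

```python
import pandas as pd

# Hypothetical public-awareness scores (e.g., mean Google Trends interest over 5 years);
# the numbers are placeholders, not real data
awareness_score = {
    "asthma": 62.0,
    "type 2 diabetes": 48.0,
    "migraine": 55.0,
}

df = pd.DataFrame({"diagnosis": ["asthma", "type 2 diabetes", "migraine"]})
df["public_awareness"] = df["diagnosis"].map(awareness_score)
```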

What to keep in mind when using domain knowledge?

  • Your model is only as good as your domain expertise. There are endless ways to transform your data, and you need to decide what’s relevant for your model. If you are new to the domain, try to get access to experts. They can become your cheat code for getting the most out of your data.
  • Data quality is essential. To map your data to new values you might use third-party or open data sources, or manual labeling. You’ll need to build a process that matches the new values to your data, so make sure the result makes sense. It’s really disappointing to get garbage out of a promising variable just because of a bad mapping.
  • Inevitably, you will lose some information by grouping categories together, so choose how to group wisely.

Approach 2: Learn from the output variable

There is another strategy that relies less on people’s opinions and more on patterns hidden in the data itself, although using it still requires some subjective decisions. It also requires output labels, so it’s problematic for completely unsupervised tasks.

Learning from the target variable lets you rely more on patterns already present in your data and decreases the level of subjectivity. Photo by John Schnobrich on Unsplash

Solution 3: Calculate simple aggregated value per group

Do you think your categorical variable contains meaningful information for predicting the target? Look at it from a different angle: target values will probably be similar within a category. So, for every category, you can compute an aggregated target value.

  • Building a classifier? Calculate the ratio of positive labels in each group.
  • Building a regression model? Calculate the median target value per group (or any other statistic of your choice, such as the mode or a percentile).

In our example, for every disease category we can calculate the ratio of patients who, in the past, showed up for a checkup within the period recommended by their doctor.
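Here is a minimal pandas sketch of this aggregation; the diagnosis and showed_up columns and the toy records standing in for the training split are hypothetical:

```python
import pandas as pd

# Toy stand-in for the training split; column names are hypothetical
train_df = pd.DataFrame({
    "diagnosis": ["asthma", "asthma", "migraine", "diabetes"],
    "showed_up": [1, 0, 1, 1],
})

# Ratio of positive labels per diagnosis, computed on the training data only
show_up_rate = train_df.groupby("diagnosis")["showed_up"].mean()
overall_rate = train_df["showed_up"].mean()  # fallback for categories unseen in training

train_df["diagnosis_show_up_rate"] = train_df["diagnosis"].map(show_up_rate)
# At scoring time: test_df["diagnosis"].map(show_up_rate).fillna(overall_rate)
```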

Solution 4: Calculate normalized aggregated value per group

In some cases, you might benefit from comparing the aggregated statistic of a group to the overall value in the dataset. Common examples of this approach are the weight of evidence score and Perlich aggregations.

Weight of evidence represents categories that have a similar proportion of positives with similar values. In addition, categories with scores close to zero are more typical of your data than those with scores further from zero (which can be useful for fraud detection). Positive scores highlight categories where the chance of a positive target is higher than in the overall dataset; negative scores highlight categories with lower chances.

Weight of evidence calculation
Calculation example for asthma category
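In the sign convention used here, the weight of evidence of a category is ln(the category’s share of all positive examples / the category’s share of all negative examples). Below is a minimal pandas sketch of that calculation; the column names and the smoothing constant are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def weight_of_evidence(train: pd.DataFrame, category_col: str, target_col: str) -> pd.Series:
    """WoE per category: ln(category's share of positives / category's share of negatives)."""
    positives = train[train[target_col] == 1][category_col].value_counts()
    negatives = train[train[target_col] == 0][category_col].value_counts()
    categories = train[category_col].unique()
    eps = 0.5  # smoothing so empty groups don't break the division or the log
    pos_share = (positives.reindex(categories, fill_value=0) + eps) / (positives.sum() + eps)
    neg_share = (negatives.reindex(categories, fill_value=0) + eps) / (negatives.sum() + eps)
    return np.log(pos_share / neg_share)

# Usage on the training split only, then map onto any split:
# woe = weight_of_evidence(train_df, "diagnosis", "showed_up")
# train_df["diagnosis_woe"] = train_df["diagnosis"].map(woe)
```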

Perlich distribution-based aggregations go beyond the simple cases and allow you to transform data with multiple categories per entity. It is a generic methodology that supports a set of possible aggregations, each emphasizing different aspects of the data. The idea is to measure the distance between the distribution of categories within an entity (a patient, for example) and some reference distribution.

There are 3 steps in the methodology:

  1. Calculate case vectors for every entity (e.g., the distribution of diseases for a specific patient),
  2. Calculate a distribution (reference) vector,
  3. Calculate the distance between the two vectors (any distance measure appropriate for your problem, e.g., cosine similarity or Euclidean distance).

For example, we have a dataset that allows multiple diseases per patient.

At the preprocessing step we’ll one-hot encode values.

Case vector for the first patient:

Then we calculate the category distribution vector for a positive outcome:

Distribution vectors can be calculated for every target category, or for the overall dataset.

And, finally, the distance between the above two vectors:

Euclidean distance between the distribution of diseases for patient Anna and the distribution vector for the positive target
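The following is a minimal sketch of those three steps, using made-up records and hypothetical column names (patient, diagnosis, showed_up) and Euclidean distance as the distance measure:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (patient, diagnosis) pair,
# with the target duplicated across a patient's rows
records = pd.DataFrame({
    "patient":   ["Anna", "Anna", "Boris", "Boris", "Clara"],
    "diagnosis": ["asthma", "migraine", "asthma", "diabetes", "migraine"],
    "showed_up": [1, 1, 0, 0, 1],
})

# Step 1: case vectors — the distribution of diagnoses within each patient
counts = pd.crosstab(records["patient"], records["diagnosis"])
case_vectors = counts.div(counts.sum(axis=1), axis=0)

# Step 2: reference vector — the distribution of diagnoses among positive-target rows
positive_rows = records[records["showed_up"] == 1]
reference = positive_rows["diagnosis"].value_counts(normalize=True)
reference = reference.reindex(case_vectors.columns, fill_value=0.0)

# Step 3: distance between each case vector and the reference vector
distances = case_vectors.apply(
    lambda row: np.linalg.norm(row.values - reference.values), axis=1
)
print(distances)  # one Euclidean distance per patient, usable as a model feature
```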

For a more detailed explanation, you can check the original paper.

What to keep in mind when learning from the output variable?

  • Beware of data leakage! Make sure you compute the new features only on the training data. Why? The test set is meant to simulate and evaluate how your model will perform in the real world. In the real world, you don’t have the target variable; you’re trying to predict it. So by using test data for this transformation you risk artificially improving your test results, which can lead to very disappointing performance once your model is deployed. Act as if you have no knowledge of the outcomes in the test set.
  • Take care of uncommon values. For rare categories, you might not have any examples, or not enough of them, in the training data. When your model encounters its first patient with a rare disease, it won’t know what to do, so be cautious and think about this in advance. For example, you can map every value that appears in less than 1% of the training set into an “other” category and calculate the transformation for these values together, as in the sketch below. Any new value you encounter in the future should also be mapped to “other” and treated accordingly.
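Here is a minimal sketch of that rare-value grouping; the function names, column name, and the 1% threshold are assumptions for illustration:

```python
import pandas as pd

def frequent_categories(train_col: pd.Series, threshold: float = 0.01) -> set:
    """Categories that cover at least `threshold` of the training rows."""
    shares = train_col.value_counts(normalize=True)
    return set(shares[shares >= threshold].index)

def map_to_known(col: pd.Series, known: set) -> pd.Series:
    """Replace rare or previously unseen categories with 'other'."""
    return col.where(col.isin(known), other="other")

# Fit the list of known categories on the training data only, then reuse it everywhere:
# known = frequent_categories(train_df["diagnosis"])
# train_df["diagnosis_grouped"] = map_to_known(train_df["diagnosis"], known)
# test_df["diagnosis_grouped"] = map_to_known(test_df["diagnosis"], known)
```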

I hope this post inspires you to handle data with too many categories creatively and to improve your model’s predictive capabilities. Or maybe you’ve used other methods to deal with high-cardinality features? I’ll be happy to learn about your experience in the comments!



Anna Pershukova

Data professional building intelligent data products. Data scientist @Medisafe.