How to create a heatmap (Updated!)

 

A heatmap is basically a table that has colors in place of numbers. Colors correspond to the level of the measurement. Each column can be a different metric like above. It’s useful for finding highs and lows and sometimes, patterns.

From Nathan Yau | Visualize This

One of the problems with a large amount of data is finding the right way to visualize it and give the reader a simple but general view of all the information.

To visualize trends within large data sets, it is useful to consider a heat map that uses color instead of a table of numbers.

As with everything in life, there ain’t no such thing as a free lunch, and that holds here too: we lose accuracy because we replace numbers with a range of colors, but in exchange we gain a broad view of the trends.

The colors used in the heat map come from a color spectrum and are assigned according to each value's distance from the statistical mean. Intuitively, darker colors mean one thing and lighter colors another (in the example in this post, a darker blue means a higher value within its column), which makes it quick to spot patterns and maximum and minimum values.
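The "distance from the mean" idea can be sketched with base R's scale(), which centers each column on its mean and divides by its standard deviation. As far as I can tell this is the same rescaling that heatmap() applies when scale = "column" is set. The toy matrix m uses three rows and two columns from the table further down; the names m and z are only for this illustration.

```r
# Toy matrix: three teams, two metrics on very different scales
m <- matrix(c(455, 399, 405,      # goals scored
              2.11, 1.82, 1.87),  # points per match
            ncol = 2,
            dimnames = list(c("Argentina", "Uruguay", "Brasil"),
                            c("Goals.scored", "Points_1")))

# scale() centers every column on its mean and divides by its standard deviation
z <- scale(m)

# After scaling, each column has mean 0 and standard deviation 1,
# so both metrics can share one color range
round(colMeans(z), 10)
apply(z, 2, sd)
```

Values above the column mean become positive and values below it become negative, which is what lets one palette serve columns measured in very different units.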

Updates (Sat 11/24/2018)

After comments from u/ELKronos and u/prv on how to improve this example, I added what the data looks like before and after the transformations.

Idea

Let’s use a heatmap to visualize the stats for the America Soccer Cup (Copa América) since the beginning of time (well, actually since 1916).

Data

To see what we are getting in exchange, let’s take a look at the table with the stats for the America Soccer Cup.

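| Team | Titles | Match | Points | Matches Played | Wins | Drawn | Losses | Goals scored | Goals against | Difference of Goals | Points | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 14 | 41 | 398 | 189 | 120 | 38 | 31 | 455 | 173 | +282 | 2,11 | 70,19% |
| Uruguay | 15 | 43 | 358 | 197 | 108 | 34 | 55 | 399 | 218 | +181 | 1,82 | 60,58% |
| Brasil | 8 | 35 | 332 | 178 | 99 | 35 | 44 | 405 | 200 | +205 | 1,87 | 62,17% |
| Paraguay | 2 | 36 | 225 | 168 | 62 | 39 | 67 | 253 | 293 | -40 | 1,34 | 44,64% |
| Chile | 2 | 38 | 222 | 177 | 64 | 30 | 83 | 281 | 304 | -23 | 1,25 | 41,81% |
| Perú | 2 | 31 | 197 | 148 | 54 | 35 | 59 | 213 | 232 | -19 | 1,33 | 44,37% |
| Colombia | 1 | 21 | 150 | 113 | 42 | 24 | 47 | 131 | 184 | -53 | 1,33 | 44,25% |
| Bolivia | 1 | 26 | 86 | 112 | 20 | 26 | 66 | 104 | 279 | -175 | 0,77 | 25,60% |
| México | 0 | 10 | 70 | 48 | 19 | 13 | 16 | 66 | 62 | +4 | 1,46 | 48,61% |
| Ecuador | 0 | 27 | 70 | 118 | 16 | 22 | 80 | 127 | 311 | -184 | 0,59 | 19,77% |
| Venezuela | 0 | 17 | 34 | 62 | 7 | 13 | 42 | 47 | 171 | -124 | 0,55 | 18,28% |
| Costa Rica | 0 | 5 | 18 | 17 | 5 | 3 | 9 | 17 | 31 | -14 | 1,06 | 35,29% |
| Estados Unidos | 0 | 4 | 17 | 18 | 5 | 2 | 11 | 18 | 29 | -11 | 0,94 | 31,48% |
| Honduras | 0 | 1 | 10 | 6 | 3 | 1 | 2 | 7 | 5 | +2 | 1,67 | 55,55% |
| Panamá | 0 | 1 | 3 | 3 | 1 | 0 | 2 | 4 | 10 | -6 | 1,00 | 33,33% |
| Japón | 0 | 1 | 1 | 3 | 0 | 1 | 2 | 3 | 8 | -5 | 0,33 | 11,11% |
| Jamaica | 0 | 2 | 0 | 6 | 0 | 0 | 6 | 0 | 9 | -9 | 0,00 | 0,00% |
| Haití | 0 | 1 | 0 | 3 | 0 | 0 | 3 | 1 | 12 | -11 | 0,00 | 0,00% |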

As you can see, it is extremely hard to draw any conclusion quickly from a table like this.

Visualization

This is the visualization of the America Soccer Cup data. It makes it very simple to determine which teams have been the best across the different tournaments, even though we lose accuracy because the exact numbers for each metric are no longer shown.

Some ideas we can draw from this visualization:

  • Argentina and Uruguay are the best teams across all the tournaments.
  • Argentina is the team with the most goals scored and the best goal difference.
  • Argentina, Brazil and Uruguay are the teams with the best performance.
  • There are three groups of countries with similar trajectories:
    • Argentina and Uruguay
    • Brazil, Peru, Chile, Paraguay, Bolivia and Colombia
    • The rest of the teams, with low performance, from Bolivia to Mexico

Technical implementation

To make it easier to implement any heatmap, I am going to split the code into sections and give a short explanation of each part. If you want to see all the code and the dataset used in this example, check my GitHub account.

1. Setup libraries

We will use two libraries: readr, to read the CSV file (the dataset), and RColorBrewer, for its color palettes.

library(readr)
library(RColorBrewer)

2. Get the data

The dataset is in my GitHub account because I prefer my examples to work out of the box (if you copy, paste and execute the example, the code should work).

A second benefit is that, no matter what happens to the original dataset used in my example, I still have a copy in my account.

# get data 
url_soccer <- 'https://raw.githubusercontent.com/frm1789/soccer_ea/master/AmericaCupData.csv'
df_soccer <- read_csv(url(url_soccer))
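If the download is not available, the shape of the file can be imitated with a couple of rows typed inline. This is only a sketch: it keeps four of the columns, uses base R's read.csv instead of readr, and assumes the column name Points_1 that shows up later in this post. The name df_demo is made up for the illustration.

```r
# Two rows copied from the table above; only four of the columns are kept
csv_text <- 'Team,Titles,Points_1,Performance
Argentina,14,"2,11","70,19%"
Uruguay,15,"1,82","60,58%"'

df_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Titles is parsed as integer; the columns with decimal commas stay character
sapply(df_demo, class)
```

This shows why the later steps have to convert Points_1 and Performance by hand: a value such as "2,11" is not recognized as a number with the default settings.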

3. Order by

Of all the data we have, the most relevant figure is the number of titles a team has won. Everything else (goals, scoring power, matches won…) is secondary to that.

# Order rows by number of titles, ascending (fewest titles first)
df_soccer <- df_soccer[order(df_soccer$Titles, decreasing = FALSE), ]
# Convert the tibble returned by read_csv into a plain data.frame
df_soccer <- data.frame(df_soccer)
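A small sketch of what the ordering does, using three teams from the table (df_demo is a made-up name). The ascending order looks odd at first, but as far as I understand heatmap() draws the first row of the matrix at the bottom of the plot when no reordering is requested, so the teams with the most titles end up at the top.

```r
df_demo <- data.frame(Team   = c("Brasil", "Chile", "Uruguay"),
                      Titles = c(8, 2, 15),
                      stringsAsFactors = FALSE)

# Ascending order: the team with the most titles ends up in the last row
df_demo <- df_demo[order(df_demo$Titles, decreasing = FALSE), ]
df_demo$Team
```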

4. Transformations

One main point to consider: the heatmap function requires a numeric matrix. For that reason we will delete the columns we don’t need and convert the rest to numeric columns.

What does the data look like before the transformation?

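|  | Team | Titles | Match | Points | Matches.Played | Wins | Drawn | Losses | Goals.scored | Goals.against | Difference.of.Goals | Points_1 | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | México | 0 | 10 | 70 | 48 | 19 | 13 | 16 | 66 | 62 | 4 | 1,46 | 48,61% |
| 2 | Ecuador | 0 | 27 | 70 | 118 | 16 | 22 | 80 | 127 | 311 | -184 | 0,59 | 19,77% |
| 3 | Venezuela | 0 | 17 | 34 | 62 | 7 | 13 | 42 | 47 | 171 | -124 | 0,55 | 18,28% |
| 4 | Costa Rica | 0 | 5 | 18 | 17 | 5 | 3 | 9 | 17 | 31 | -14 | 1,06 | 35,29% |
| 5 | Estados Unidos | 0 | 4 | 17 | 18 | 5 | 2 | 11 | 18 | 29 | -11 | 0,94 | 31,48% |
| 6 | Honduras | 0 | 1 | 10 | 6 | 3 | 1 | 2 | 7 | 5 | 2 | 1,67 | 55,55% |
| 7 | Panamá | 0 | 1 | 3 | 3 | 1 | 0 | 2 | 4 | 10 | -6 | 1,00 | 33,33% |
| 8 | Japón | 0 | 1 | 1 | 3 | 0 | 1 | 2 | 3 | 8 | -5 | 0,33 | 11,11% |
| 9 | Jamaica | 0 | 2 | 0 | 6 | 0 | 0 | 6 | 0 | 9 | -9 | 0,00 | 0,00% |
| 10 | Haití | 0 | 1 | 0 | 3 | 0 | 0 | 3 | 1 | 12 | -11 | 0,00 | 0,00% |
| 11 | Colombia | 1 | 21 | 150 | 113 | 42 | 24 | 47 | 131 | 184 | -53 | 1,33 | 44,25% |
| 12 | Bolivia | 1 | 26 | 86 | 112 | 20 | 26 | 66 | 104 | 279 | -175 | 0,77 | 25,60% |
| 13 | Paraguay | 2 | 36 | 225 | 168 | 62 | 39 | 67 | 253 | 293 | -40 | 1,34 | 44,64% |
| 14 | Chile | 2 | 38 | 222 | 177 | 64 | 30 | 83 | 281 | 304 | -23 | 1,25 | 41,81% |
| 15 | Perú | 2 | 31 | 197 | 148 | 54 | 35 | 59 | 213 | 232 | -19 | 1,33 | 44,37% |
| 16 | Brasil | 8 | 35 | 332 | 178 | 99 | 35 | 44 | 405 | 200 | 205 | 1,87 | 62,17% |
| 17 | Argentina | 14 | 41 | 398 | 189 | 120 | 38 | 31 | 455 | 173 | 282 | 2,11 | 70,19% |
| 18 | Uruguay | 15 | 43 | 358 | 197 | 108 | 34 | 55 | 399 | 218 | 181 | 1,82 | 60,58% |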

Validations before changes

All the other columns in the dataset are already numeric or integer, except Points_1 and Performance.

sapply(df_soccer, class)
(...) 
# Points_1
# "character" 
# Performance 
# "character" 

Code for changes

# heatmap requires a numeric matrix, so we move the team names into row.names
# and then drop the "Team" column
row.names(df_soccer) <- df_soccer$Team
df_soccer <- df_soccer[, -1]

# Convert column "Points_1" to numeric (decimal comma -> decimal point)
options(digits = 2)  # only changes how numbers are printed, not the stored values
df_soccer$Points_1 <- sub(',', '.', df_soccer$Points_1)
df_soccer$Points_1 <- as.double(df_soccer$Points_1)

# Convert column "Performance" to numeric: drop the trailing "%",
# replace the decimal comma, then take the natural logarithm
df_soccer$Performance <- substr(df_soccer$Performance, 1, nchar(df_soccer$Performance) - 1)
df_soccer$Performance <- sub(',', '.', df_soccer$Performance)
df_soccer$Performance <- as.double(df_soccer$Performance)
df_soccer$Performance <- log(df_soccer$Performance)  # log(0) is -Inf for teams with 0,00%

# Data frame to matrix
america_matrix <- data.matrix(df_soccer)
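The two conversions above follow the same pattern, so they can be folded into one small helper. The function to_number below is hypothetical (it is not part of the original code), and log1p is shown only as one possible alternative to log: teams with a performance of 0,00% produce -Inf under log, and non-finite values may not be colored as expected in the heatmap.

```r
# Hypothetical helper: "70,19%" -> 70.19 and "2,11" -> 2.11
to_number <- function(x) {
  x <- sub("%", "", x, fixed = TRUE)   # drop the percent sign, if any
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
}

perf <- to_number(c("70,19%", "33,33%", "0,00%"))

log(perf)    # the last value is -Inf, because log(0) is -Inf in R
log1p(perf)  # log(1 + x) keeps the zero finite: log1p(0) is 0
```

Whether log1p, no logarithm at all, or dropping those rows is the better choice depends on what the column is meant to show; the sketch only makes the effect of the zero visible.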

What does the data look like after the transformation?

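|  | Titles | Match | Points | Matches.Played | Wins | Drawn | Losses | Goals.scored | Goals.against | Difference.of.Goals | Points_1 | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| México | 0 | 10 | 70 | 48 | 19 | 13 | 16 | 66 | 62 | 4 | 1.46 | 3.88382927105736 |
| Ecuador | 0 | 27 | 70 | 118 | 16 | 22 | 80 | 127 | 311 | -184 | 0.59 | 2.98416563718253 |
| Venezuela | 0 | 17 | 34 | 62 | 7 | 13 | 42 | 47 | 171 | -124 | 0.55 | 2.905807566026 |
| Costa Rica | 0 | 5 | 18 | 17 | 5 | 3 | 9 | 17 | 31 | -14 | 1.06 | 3.56359963768718 |
| Estados Unidos | 0 | 4 | 17 | 18 | 5 | 2 | 11 | 18 | 29 | -11 | 0.94 | 3.4493524235492 |
| Honduras | 0 | 1 | 10 | 6 | 3 | 1 | 2 | 7 | 5 | 2 | 1.67 | 4.01728351608564 |
| Panamá | 0 | 1 | 3 | 3 | 1 | 0 | 2 | 4 | 10 | -6 | 1 | 3.50645789231965 |
| Japón | 0 | 1 | 1 | 3 | 0 | 1 | 2 | 3 | 8 | -5 | 0.33 | 2.40784560365154 |
| Jamaica | 0 | 2 | 0 | 6 | 0 | 0 | 6 | 0 | 9 | -9 | 0 | -Inf |
| Haití | 0 | 1 | 0 | 3 | 0 | 0 | 3 | 1 | 12 | -11 | 0 | -Inf |
| Colombia | 1 | 21 | 150 | 113 | 42 | 24 | 47 | 131 | 184 | -53 | 1.33 | 3.78985537145394 |
| Bolivia | 1 | 26 | 86 | 112 | 20 | 26 | 66 | 104 | 279 | -175 | 0.77 | 3.24259235148552 |
| Paraguay | 2 | 36 | 225 | 168 | 62 | 39 | 67 | 253 | 293 | -40 | 1.34 | 3.79863031807306 |
| Chile | 2 | 38 | 222 | 177 | 64 | 30 | 83 | 281 | 304 | -23 | 1.25 | 3.73313554536847 |
| Perú | 2 | 31 | 197 | 148 | 54 | 35 | 59 | 213 | 232 | -19 | 1.33 | 3.79256356539082 |
| Brasil | 8 | 35 | 332 | 178 | 99 | 35 | 44 | 405 | 200 | 205 | 1.87 | 4.12987256828125 |
| Argentina | 14 | 41 | 398 | 189 | 120 | 38 | 31 | 455 | 173 | 282 | 2.11 | 4.25120585074233 |
| Uruguay | 15 | 43 | 358 | 197 | 108 | 34 | 55 | 399 | 218 | 181 | 1.82 | 4.10396480559909 |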

Validations after changes

We can see that all the variables in our data frame are now either integer or, after the transformations, numeric.

sapply(df_soccer, class)
(...) 
# Points_1
# "numeric" 
# Performance 
# "numeric" 
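Instead of reading the classes one by one, the same validation can be condensed into two checks. This is a minimal sketch on a made-up three-row data frame (df_demo) with values taken from the tables above; on the real data the same two lines would be applied to df_soccer.

```r
df_demo <- data.frame(Points_1    = c(2.11, 1.00, 0.00),
                      Performance = log(c(70.19, 33.33, 0)))

# TRUE when every column is numeric (integer columns count as numeric too)
all(sapply(df_demo, is.numeric))

# Number of non-finite entries per column; the log of 0 shows up here
colSums(!is.finite(as.matrix(df_demo)))
```

The second check is the one that catches the -Inf values, which pass the class test because -Inf is still of class "numeric".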

5. Creating a heatmap

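We use the heatmap function almost out of the box: we turn off the row and column dendrograms (Rowv = NA, Colv = NA) so the rows keep our ordering, scale the values by column, and set the margins and the colors.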

# Create the heatmap: no dendrograms, values scaled by column, "Blues" palette
america_heatmap <- heatmap(america_matrix,
                           Rowv = NA, Colv = NA,
                           col = brewer.pal(9, "Blues"),
                           scale = "column",
                           margins = c(2, 6))
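The same call can be tried on a small matrix before running it on the full dataset. The sketch below is not the original code: it uses three teams and three columns from the table, replaces brewer.pal with base R's colorRampPalette (so it runs without RColorBrewer and gives more than nine shades), and draws to a null PDF device so it also works without a display. The names m, blues and hm and the two hex colors are choices made for this illustration.

```r
# Toy matrix with three teams and three metrics
m <- matrix(c(14, 15, 8,
              455, 399, 405,
              173, 218, 200),
            ncol = 3,
            dimnames = list(c("Argentina", "Uruguay", "Brasil"),
                            c("Titles", "Goals.scored", "Goals.against")))

# colorRampPalette() interpolates between the given colors;
# here a light and a dark blue picked for this sketch, 25 steps
blues <- colorRampPalette(c("#F7FBFF", "#08306B"))(25)

pdf(NULL)  # null device: nothing is written to disk
hm <- heatmap(m, Rowv = NA, Colv = NA, col = blues,
              scale = "column", margins = c(8, 6))
dev.off()

# With Rowv = NA and Colv = NA the original row and column order is kept
hm$rowInd
hm$colInd
```

A larger bottom margin than in the original call is used here only so that the column labels have room; the value is a guess and may need adjusting for other label lengths.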