How to create a heatmap (Updated!)

A heatmap is basically a table that has colors in place of numbers. Colors correspond to the level of the measurement. Each column can be a different metric like above. It’s useful for finding highs and lows and sometimes, patterns.

From Nathan Yau | Visualize This

One of the problems with a large quantity of data is finding the right way to visualize it and to offer the reader a simple but general view of all the information.

To visualize trends within large sets of data, it is useful to consider a heat map with colors instead of a table with numbers.

And as with everything in life, there ain’t no such thing as a free lunch, and that holds here too: we lose accuracy because we replace numbers with a range of colors, but in exchange we gain a broad view of the trends.

The colors used within the heat map belong to a spectrum, and the color of each cell is based on the distance of its value from the statistical mean. Intuitively, darker colors mean one thing and lighter colors mean another, which makes it quick to evaluate patterns and to spot maximum and minimum values.
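As a rough illustration of that idea, the sketch below standardizes one column by its mean and standard deviation and then assigns each value to one of five shades. The numbers are the goals scored by five teams in the table further down. The palette and the number of bins are assumptions chosen for illustration, not what any particular plotting function does internally.

```r
# Goals scored by five teams (values from the table in this post)
goals <- c(Argentina = 455, Uruguay = 399, Brasil = 405, Bolivia = 104, Jamaica = 0)

# Distance from the mean, measured in standard deviations (z-score)
z <- (goals - mean(goals)) / sd(goals)

# Five shades from light to dark (an arbitrary palette for illustration)
shades <- colorRampPalette(c("white", "darkblue"))(5)

# Split the z-score range into five equal-width bins and pick one shade per bin
bin <- cut(z, breaks = 5, labels = FALSE)
team_colors <- setNames(shades[bin], names(goals))

print(round(z, 2))
print(team_colors)
```

The team with the most goals lands in the darkest bin and the team with none lands in the lightest one, which is the kind of quick reading a heat map is meant to allow.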

Updates (Sat 11/24/2018)

After some comments made by u/ELKronos and u/prv about how to improve this example, I added what the data looks like before and after the transformations.

Idea

Let’s use a heatmap to visualize the stats for the America Soccer Cup since the beginning of time (well, actually since 1916).

Data

To see what we gain in exchange, let’s take a look at the table with the stats for the America Soccer Cup.

Team	Titles	Match	Points	Matches Played	Wins	Drawn	Losses	Goals scored	Goals against	Difference of Goals	Points	Performance
Argentina	14	41	398	189	120	38	31	455	173	+282	2,11	70,19%
Uruguay	15	43	358	197	108	34	55	399	218	+181	1,82	60,58%
Brasil	8	35	332	178	99	35	44	405	200	+205	1,87	62,17%
Paraguay	2	36	225	168	62	39	67	253	293	-40	1,34	44,64%
Chile	2	38	222	177	64	30	83	281	304	-23	1,25	41,81%
Perú	2	31	197	148	54	35	59	213	232	-19	1,33	44,37%
Colombia	1	21	150	113	42	24	47	131	184	-53	1,33	44,25%
Bolivia	1	26	86	112	20	26	66	104	279	-175	0,77	25,60%
México	0	10	70	48	19	13	16	66	62	+4	1,46	48,61%
Ecuador	0	27	70	118	16	22	80	127	311	-184	0,59	19,77%
Venezuela	0	17	34	62	7	13	42	47	171	-124	0,55	18,28%
Costa Rica	0	5	18	17	5	3	9	17	31	-14	1,06	35,29%
Estados Unidos	0	4	17	18	5	2	11	18	29	-11	0,94	31,48%
Honduras	0	1	10	6	3	1	2	7	5	+2	1,67	55,55%
Panamá	0	1	3	3	1	0	2	4	10	-6	1,00	33,33%
Japón	0	1	1	3	0	1	2	3	8	-5	0,33	11,11%
Jamaica	0	2	0	6	0	0	6	0	9	-9	0,00	0,00%
Haití	0	1	0	3	0	0	3	1	12	-11	0,00	0,00%

As you can see, it is extremely hard to draw any conclusion from this table at a glance.

Visualization

This is the visualization of the data about the America Soccer Cup. It makes it very simple to determine which teams have been the best across the different tournaments, even though we lose accuracy because the numbers for each event are no longer shown.

Some conclusions that we can draw after checking this visualization:

Argentina and Uruguay are the best teams across all the tournaments.
Argentina is the team with the most goal power and the best goal difference.
Argentina, Brazil and Uruguay are the teams with the best performance.
There are three groups of countries with similar trajectories:
- Argentina and Uruguay
- Brazil, Peru, Chile, Paraguay, Bolivia and Colombia
- The rest of the teams, with low performance, from Bolivia to Mexico

Technical implementation

To make the implementation easy to reuse for any heatmap, I am going to split the code into sections and give a short explanation of each part. However, if you want to see all the code and the dataset used in this example, check my GitHub account.

1. Setup libraries

We will use two libraries, readr to read a csv file – the dataset – and RColorBrewer, to use the palettes of colors.

library(readr)
library(RColorBrewer)

2. Get the data

The dataset is in my Github account because I prefer that my examples work out-of-the-box (if you copy, paste and execute the example, the code should work).

A second benefit is that, no matter what happens to the original dataset used in my example, I have a copy of it in my account.

# get data 
url_soccer <- 'https://raw.githubusercontent.com/frm1789/soccer_ea/master/AmericaCupData.csv'
df_soccer <- read_csv(url(url_soccer))

3. Order by

Of all the data that we have, the most relevant is the number of titles that a team has. Everything else (goals, power of goals, won matches…) is subordinate to that.

# Order data by titles (ascending)
df_soccer <- df_soccer[order(df_soccer$Titles, decreasing = FALSE),]
df_soccer <- data.frame(df_soccer)
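
To make the effect of order() easier to see, here is a minimal sketch on a three-row data frame. The data frame teams is a stand-in built by hand from the table above, not the real dataset.

```r
# A small data frame with three teams from the table above
teams <- data.frame(
  Team   = c("Argentina", "Bolivia", "Uruguay"),
  Titles = c(14, 1, 15),
  stringsAsFactors = FALSE
)

# order() returns the row positions sorted by Titles, smallest first
ascending <- teams[order(teams$Titles, decreasing = FALSE), ]
print(ascending$Team)

# Flip the flag to put the team with the most titles first
descending <- teams[order(teams$Titles, decreasing = TRUE), ]
print(descending$Team)
```

Base R's heatmap draws the first row of the matrix at the bottom of the plot, so the ascending order leaves the teams with the most titles at the top.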

4. Transformations

One main point to consider: the function heatmap requires a numerical matrix. For that reason we will delete the columns that we don’t need and transform the rest into numeric columns.

What does the data look like before the transformation?

	Team	Titles	Match	Points	Matches.Played	Wins	Drawn	Losses	Goals.scored	Goals.against	Difference.of.Goals	Points_1	Performance
1	México	0	10	70	48	19	13	16	66	62	4	1,46	48,61%
2	Ecuador	0	27	70	118	16	22	80	127	311	-184	0,59	19,77%
3	Venezuela	0	17	34	62	7	13	42	47	171	-124	0,55	18,28%
4	Costa Rica	0	5	18	17	5	3	9	17	31	-14	1,06	35,29%
5	Estados Unidos	0	4	17	18	5	2	11	18	29	-11	0,94	31,48%
6	Honduras	0	1	10	6	3	1	2	7	5	2	1,67	55,55%
7	Panamá	0	1	3	3	1	0	2	4	10	-6	1,00	33,33%
8	Japón	0	1	1	3	0	1	2	3	8	-5	0,33	11,11%
9	Jamaica	0	2	0	6	0	0	6	0	9	-9	0,00	0,00%
10	Haití	0	1	0	3	0	0	3	1	12	-11	0,00	0,00%
11	Colombia	1	21	150	113	42	24	47	131	184	-53	1,33	44,25%
12	Bolivia	1	26	86	112	20	26	66	104	279	-175	0,77	25,60%
13	Paraguay	2	36	225	168	62	39	67	253	293	-40	1,34	44,64%
14	Chile	2	38	222	177	64	30	83	281	304	-23	1,25	41,81%
15	Perú	2	31	197	148	54	35	59	213	232	-19	1,33	44,37%
16	Brasil	8	35	332	178	99	35	44	405	200	205	1,87	62,17%
17	Argentina	14	41	398	189	120	38	31	455	173	282	2,11	70,19%
18	Uruguay	15	43	358	197	108	34	55	399	218	181	1,82	60,58%

Validations before changes

Apart from Team, all the data in the dataset is numeric or integer, except Points_1 and Performance.

sapply(df_soccer, class)
(...) 
# Points_1
# "character" 
# Performance 
# "character"
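
Instead of reading the whole sapply output, a short filter can list only the columns that still need work. This is a sketch on a two-row stand-in called sample_df, built by hand from the table above; the same expression should work on df_soccer.

```r
# Two-row stand-in for df_soccer (values copied from the table above)
sample_df <- data.frame(
  Team        = c("Argentina", "Uruguay"),
  Titles      = c(14, 15),
  Points_1    = c("2,11", "1,82"),
  Performance = c("70,19%", "60,58%"),
  stringsAsFactors = FALSE
)

# Names of the columns that are not numeric yet
not_numeric <- names(sample_df)[!sapply(sample_df, is.numeric)]
print(not_numeric)
```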

Code for changes

# heatmap requires a numerical matrix, so we move the team names to row.names
# and after that we delete the column "Team"
row.names(df_soccer) <- df_soccer$Team
df_soccer <- df_soccer[,-1]

# transformation to numeric for column "Points_1"
options(digits=2)
df_soccer$Points_1 <- sub(',', '.', df_soccer$Points_1)
df_soccer$Points_1 <- as.double(df_soccer$Points_1)

# transformation to numeric for column "Performance"
# drop the trailing "%" sign
df_soccer$Performance <- substr(df_soccer$Performance, 1, nchar(df_soccer$Performance) - 1)
df_soccer$Performance <- sub(',', '.', df_soccer$Performance)
df_soccer$Performance <- as.double(df_soccer$Performance)
# note: log(0) is -Inf, so teams with a performance of 0,00% end up as -Inf
df_soccer$Performance <- log(df_soccer$Performance)

# Dataframe to matrix
america_matrix <- data.matrix(df_soccer)
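
The two conversions above follow the same pattern, so they could be folded into one small function. The helper to_number below is hypothetical (it is not part of the original code or of any library) and only sketches the idea on a few values from the table.

```r
# Hypothetical helper: turn strings such as "2,11" or "70,19%" into numbers
to_number <- function(x) {
  x <- sub("%$", "", x)                 # drop a trailing percent sign, if any
  x <- sub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  as.numeric(x)
}

points_1    <- to_number(c("2,11", "1,82", "0,00"))
performance <- to_number(c("70,19%", "60,58%", "0,00%"))

print(points_1)
print(performance)
```

With a helper like this, each column needs a single assignment, for example df_soccer$Points_1 <- to_number(df_soccer$Points_1).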

What does the data look like after the transformation?

	Titles	Match	Points	Matches.Played	Wins	Drawn	Losses	Goals.scored	Goals.against	Difference.of.Goals	Points_1	Performance
México	0	10	70	48	19	13	16	66	62	4	1.46	3.88382927105736
Ecuador	0	27	70	118	16	22	80	127	311	-184	0.59	2.98416563718253
Venezuela	0	17	34	62	7	13	42	47	171	-124	0.55	2.905807566026
Costa Rica	0	5	18	17	5	3	9	17	31	-14	1.06	3.56359963768718
Estados Unidos	0	4	17	18	5	2	11	18	29	-11	0.94	3.4493524235492
Honduras	0	1	10	6	3	1	2	7	5	2	1.67	4.01728351608564
Panamá	0	1	3	3	1	0	2	4	10	-6	1	3.50645789231965
Japón	0	1	1	3	0	1	2	3	8	-5	0.33	2.40784560365154
Jamaica	0	2	0	6	0	0	6	0	9	-9	0	-Inf
Haití	0	1	0	3	0	0	3	1	12	-11	0	-Inf
Colombia	1	21	150	113	42	24	47	131	184	-53	1.33	3.78985537145394
Bolivia	1	26	86	112	20	26	66	104	279	-175	0.77	3.24259235148552
Paraguay	2	36	225	168	62	39	67	253	293	-40	1.34	3.79863031807306
Chile	2	38	222	177	64	30	83	281	304	-23	1.25	3.73313554536847
Perú	2	31	197	148	54	35	59	213	232	-19	1.33	3.79256356539082
Brasil	8	35	332	178	99	35	44	405	200	205	1.87	4.12987256828125
Argentina	14	41	398	189	120	38	31	455	173	282	2.11	4.25120585074233
Uruguay	15	43	358	197	108	34	55	399	218	181	1.82	4.10396480559909
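
The -Inf entries for Jamaica and Haití come from taking the logarithm of a performance of 0. A column that contains -Inf also has a mean of -Inf, so scaling that column is not meaningful. The sketch below reproduces the effect on three values from the table and shows one possible workaround, log1p. That workaround is my assumption, not what the code in this post does.

```r
# Performance percentages for three teams (values from the table above)
performance <- c(Argentina = 70.19, Bolivia = 25.60, Jamaica = 0)

# log(0) is -Inf, which is where the -Inf entries in the table come from
logged <- log(performance)
print(logged)

# Assumption: one possible workaround is log1p(x) = log(1 + x), which maps 0 to 0
logged_safe <- log1p(performance)
print(logged_safe)

# Check that every value is now finite
print(all(is.finite(logged_safe)))
```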

Validations after changes

We can see that all the variables in our data frame are now integer or, after the transformations, numeric.

sapply(df_soccer, class)
(...) 
# Points_1
# "numeric" 
# Performance 
# "numeric"

5. Creating a heatmap

We are using the function heatmap almost out of the box: we turn off the dendrograms (Rowv=NA, Colv=NA), scale by column, and set the margins and the colors.

# Creation of heatmap
america_heatmap <- heatmap(america_matrix, Rowv=NA, 
                           Colv=NA, col = brewer.pal(9, "Blues"), scale="column", 
                           margins=c(2,6))
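
To try the call without downloading anything, here is a minimal sketch on a hand-made matrix with two columns from the table. The matrix small_matrix, the base R palette and the output file name are assumptions for illustration. The scale() call only approximates what scale="column" does inside heatmap.

```r
# Small numeric matrix with values taken from the table above
small_matrix <- matrix(
  c(0,  1,  8,  14,     # Titles
    0, 20, 99, 120),    # Wins
  ncol = 2,
  dimnames = list(c("Jamaica", "Bolivia", "Brasil", "Argentina"),
                  c("Titles", "Wins"))
)

# Roughly what scale = "column" does: center and scale each column separately
scaled <- scale(small_matrix)
print(round(scaled, 2))

# Draw to a file so the sketch also runs without a screen device
# (the file name is arbitrary)
pdf(file.path(tempdir(), "small_heatmap.pdf"))
result <- heatmap(small_matrix, Rowv = NA, Colv = NA,
                  col = colorRampPalette(c("white", "darkblue"))(9),
                  scale = "column", margins = c(5, 8))
dev.off()

# With Rowv = NA and Colv = NA the rows and columns keep their original order
print(result$rowInd)
print(result$colInd)
```

Because each column is scaled on its own, a column with large numbers such as Wins does not wash out a column with small numbers such as Titles.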
