5 Basic questions and answers about high dimensional data

This post explains what high dimensional data is, describes the main challenges of visualizing it, and gives examples of plots suited to this kind of data.

If you have already mastered this area and want to explore some papers, you can find a few of them in this summary.

So, let’s start:

1. What is high dimensional data?

Dimensionality in statistics refers to how many attributes a dataset has.

High dimensional means that the number of dimensions is staggeringly high, so high that calculations become extremely difficult. With high dimensional data, the number of features can exceed the number of observations.

For example, healthcare data is notorious for having a vast number of variables (e.g., blood pressure, weight, cholesterol level). In an ideal world, this data could be represented in a spreadsheet, with one column representing each dimension. In practice, this is difficult to do, in part because many variables are interrelated (like weight and blood pressure).

A second example is microarrays, which measure gene expression. A microarray dataset can contain tens to hundreds of samples, and each sample can contain tens of thousands of genes. [1]
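To make the "more features than observations" situation concrete, here is a minimal sketch in R. The matrix is simulated with random numbers, and the sizes (50 samples, 10000 genes) are assumptions chosen only to mimic the shape of a microarray dataset; they are not real measurements.

```r
# Simulated expression matrix: rows are samples, columns are genes.
# Values and sizes are made up for illustration only.
set.seed(1)
n_samples <- 50
n_genes <- 10000
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples, ncol = n_genes)

dim(expr)                     # observations x features
wide <- ncol(expr) > nrow(expr)
wide                          # more features than observations

# The rank of the data matrix can never exceed the number of samples,
# which is one reason many standard calculations become difficult here.
data_rank <- qr(expr)$rank
data_rank <= n_samples
```

With only 50 rows, the 10000 columns cannot all carry independent information, which is the core difficulty the definition above points at.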

2. What are the problems with visualizing high dimensional data?

Since our brains perceive the world in three dimensions, we cannot fully process and understand a visualization that involves data with many more dimensions.

3. Why do we want to visualize the data?

Because the ultimate objective of data analysis is to detect patterns, solve problems and understand the information that we are studying. Moreover, given how the human brain processes information, using graphs to visualize large amounts of complex data is more natural than checking spreadsheets or reports. So, we want to make it easier for our brain to obtain value from our data.

4. What is dimensionality reduction?

Dimensionality reduction means simplifying the data, either numerically or visually, so that it is easier to understand while the integrity of the data is maintained. For example, you can use a tool like multidimensional scaling to identify similarities in the data and combine related data into groups. You could also use clustering to group items together.
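As a rough illustration of both ideas, the sketch below uses base R only: `cmdscale()` for classical multidimensional scaling and `kmeans()` for clustering. The choice of the iris dataset, of two output dimensions, and of three clusters are all assumptions made for the example.

```r
# Four numeric measurements per flower, centred and scaled
x <- scale(iris[, 1:4])

# Classical multidimensional scaling: 4 dimensions -> 2 coordinates
# that try to preserve the pairwise distances between flowers
mds <- cmdscale(dist(x), k = 2)

# k-means clustering; 3 clusters is an assumption for this illustration
set.seed(123)
clusters <- kmeans(x, centers = 3, nstart = 20)$cluster

# Plot the 2-D coordinates, coloured by cluster
plot(mds, col = clusters, pch = 19, xlab = "MDS 1", ylab = "MDS 2")
```

Points that end up close together in the plot are flowers with similar measurements, so the two-dimensional picture stands in for the original four dimensions.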

5. How to visualize data with more than two dimensions?

Four different plot types have proven to be the most powerful:

5.1 The Trellis Chart

The Trellis Chart is a layout of smaller charts in a grid with consistent scales. Trellis Charts are useful for finding structure and patterns in complex data. The grid layout looks similar to a garden trellis, hence the name Trellis Chart. [3] With a trellis chart, you can not only analyze the metrics within each chart without a query (a selection), but also look at the bigger picture and compare each chart with the rest of the group at the same time. This way, you can quickly identify irregular behavior among the variables. [4]

library(ggplot2)
library(viridis)  # provides scale_color_viridis()

# One small scatter plot per car class, all panels sharing the same scales
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~ class, nrow = 2) +
  scale_color_viridis(discrete = TRUE) +
  labs(
    title = "Facets for Fuel economy data for 38 popular models of car",
    caption = "code author: R for Data Science by Garrett Grolemund, Hadley Wickham\nmodifications: thinkingondata.com") +
  theme_minimal() +
  # Hide the axis titles and shrink the plot title
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 12))

5.2 Parallel Coordinates Plot

With this graph you can see how each variable contributes to the data as a whole, and you can also detect trends in the data.

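library(dplyr)    # provides the %>% pipe
library(ggplot2)
library(GGally)   # provides ggparcoord()
library(viridis)  # provides scale_color_viridis()

data <- iris

# One line per flower across the four numeric columns, coloured by species
data %>%
  ggparcoord(
    columns = 1:4, groupColumn = 5, order = "anyClass",
    alphaLines = 1
  ) +
  labs(
    title = "Parallel Coordinate Plot for the Iris Data",
    caption = "code author: ggobi.github.io/ggally/rd.html\nmodifications: thinkingondata.com") +
  scale_color_viridis(discrete = TRUE) +
  theme_minimal() +
  # Hide the axis titles and shrink the plot title
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 12))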

5.3 Mosaic Plots

This is a graphical method for visualizing data from two or more qualitative variables. As with bar charts and spine plots, the area of the tiles, also known as the bin size, is proportional to the number of observations within that category.

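library(dplyr)
library(ggplot2)
library(viridis)  # provides scale_fill_viridis()

# using diamonds dataset for illustration
# Count each cut/clarity combination, then compute the total per cut
# and the share of each clarity within its cut
df <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(cut.count = sum(count),
         prop = count / sum(count)) %>%
  ungroup()

# Tile width reflects the size of each cut; tile height reflects clarity share
ggplot(df,
       aes(x = cut, y = prop, width = cut.count, fill = clarity)) +
  geom_bar(stat = "identity", position = "fill", colour = "white") +
  facet_grid(~cut, scales = "free_x", space = "free_x") +
  # The tiles use the fill aesthetic, so the viridis palette goes on fill
  scale_fill_viridis(discrete = TRUE) +
  labs(
    title = "Mosaic for Diamond dataset",
    caption = "code author: stackoverflow.com/users/8449629/z-lin\nmodifications: thinkingondata.com") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 12))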

5.4 Projection Pursuit and Grand Tour

The grand tour and projection pursuit are two methods for exploring multivariate data.

Projection pursuit is a method that finds the most informative or interesting projection, where "interesting" is usually defined operationally as statistical non-normality.

Grand Tour is another method of visualization that could be described as an attempt to look at the data ‘from all possible angles’. A Grand Tour is a video sequence in which each frame shows the result of a single projection of the data, with the sequence as a whole including all possible projection planes. Nevertheless, the Grand Tour replaces the quality of projection pursuit with quantity: a grand tour in high dimensional space is long and mostly uninformative. [5]
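The contrast between the two methods can be sketched in a few lines of base R. This is only a crude random-search illustration, not an optimised projection pursuit algorithm: the helper names `random_plane` and `non_normality`, the kurtosis-based index, the iris data and the number of random planes (500) are all assumptions made for the example.

```r
set.seed(42)
x <- scale(as.matrix(iris[, 1:4]))   # centre and scale the numeric columns

# Hypothetical helper: a random 2-D projection plane in p dimensions,
# returned as a p x 2 matrix with orthonormal columns
random_plane <- function(p) {
  qr.Q(qr(matrix(rnorm(p * 2), nrow = p, ncol = 2)))
}

# Hypothetical "interestingness" index: absolute excess kurtosis summed
# over both projected axes (near zero for normal-looking projections)
non_normality <- function(proj) {
  sum(apply(proj, 2, function(v) {
    z <- (v - mean(v)) / sd(v)
    abs(mean(z^4) - 3)
  }))
}

# Grand tour idea: look at many projection planes, one frame per plane
planes <- replicate(500, random_plane(ncol(x)), simplify = FALSE)
scores <- vapply(planes, function(a) non_normality(x %*% a), numeric(1))

# Projection pursuit idea: keep the frame with the highest index value
best <- planes[[which.max(scores)]]
best_proj <- x %*% best
plot(best_proj, col = as.integer(iris$Species), pch = 19,
     xlab = "Projection 1", ylab = "Projection 2")
```

Most of the 500 frames score low and would be skipped by projection pursuit, which mirrors the point above that a grand tour trades quality for quantity.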

Conclusion

This post presents a short summary of high dimensional data and the best-known plots for addressing the problems inherent in visualizing this kind of data.

References