Quick EDA – Landslides dataset from NASA (Part I)

Context

Landslides are among the most pervasive hazards in the world, causing injuries and fatalities in almost every country. There are several triggers, but one of the main ones is intense and prolonged rainfall over saturated soil on vulnerable slopes.

The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impacts, or location. The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources. The GLC has been compiled since 2007 at NASA Goddard Space Flight Center. This is a unique data set with the ID tag “GLC” in the landslide editor.

Idea

We are working with a small dataset of around 11,000 records. In this first part, we will carry out a basic exploratory analysis, step by step, to decide which variables we want to focus on. There are different ways to implement an exploratory analysis.

Using basic functions

# Number of rows and columns
dim(df)

# Type and a preview of every column, one per line, so that it is easy to take in
str(df)

# First 10 rows
head(df, 10)

# For a data frame, summary() is applied to each column and the results are collated into a table
summary(df)

Using new libraries

I am talking about skimr and DataExplorer. They are relatively new and give excellent results for the time and effort invested, leaving room for more specific searches once we have an initial idea of the dataset.

Skimr

skimr is designed to provide summary statistics about variables. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors.
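As a quick illustration of those two base R functions, here is a minimal sketch on a made-up vector (the values are illustrative and not taken from the landslide data):

```r
# Toy numeric vector with one missing value (illustrative values only)
x <- c(0, 0, 1, 3, 12, NA)

# summary() reports min, quartiles, mean, max and the number of NAs
x_summary <- summary(x)
print(x_summary)

# fivenum() returns Tukey's five-number summary; NAs are dropped by default
x_fivenum <- fivenum(x)
print(x_fivenum)
```

skim() reports the same kind of quantiles, plus missing counts and a small histogram, for every column at once.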

DataExplorer

DataExplorer is designed to provide a graphical view of a data frame. There are 3 main goals for DataExplorer:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. Data Reporting

Exploratory analysis using Skimr

Code:

library(skimr)
skim(df)

Results:

# Skim summary statistics
# n obs: 11033 
# n variables: 31 
# 
# ── Variable type:character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n min max empty n_unique
# admin_division_name 1637 9396 11033 3 36 0 887
# country_code 1564 9469 11033 2 2 0 139
# country_name 1562 9471 11033 4 32 0 141
# created_date 0 11033 11033 8 22 0 420
# event_date 0 11033 11033 22 22 0 6550
# event_description 862 10171 11033 3 1003 0 9401
# event_import_source 1563 9470 11033 3 80 0 3
# event_time 6021 5012 11033 4 7 0 25
# event_title 0 11033 11033 3 150 0 10546
# gazeteer_closest_point 1563 9470 11033 2 45 0 4389
# landslide_category 1 11032 11033 5 19 0 14
# landslide_setting 69 10964 11033 4 16 0 14
# landslide_size 9 11024 11033 5 12 0 6
# landslide_trigger 23 11010 11033 4 23 0 18
# last_edited_date 0 11033 11033 22 22 0 1
# location_accuracy 2 11031 11033 3 7 0 9
# location_description 102 10931 11033 3 412 0 10432
# notes 10716 317 11033 13 484 0 265
# photo_link 9537 1496 11033 28 292 0 1469
# source_link 846 10187 11033 23 386 0 8294
# source_name 0 11033 11033 2 172 0 3918
# storm_name 10456 577 11033 1 41 0 217
# submitted_date 9 11024 11033 3 22 0 3787
# 
# ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# admin_division_population 1562 9471 11033 157760.05 829734.54 0 1963 7365 34021 1.3e+07 ▇▁▁▁▁▁▁▁
# event_id 0 11033 11033 5598.95 3249.23 1 2785 5563 8435 11221 ▇▇▇▇▇▇▇▇
# fatality_count 1385 9648 11033 3.22 59.89 0 0 0 1 5000 ▇▁▁▁▁▁▁▁
# injury_count 5674 5359 11033 0.75 8.46 0 0 0 0 374 ▇▁▁▁▁▁▁▁
# 
# ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# event_import_id 1562 9471 11033 4798.56 2789.13 -111.17 2386.5 4773 7189.5 9669 ▇▇▇▇▇▇▇▇
# gazeteer_distance 1562 9471 11033 11.87 15.6 3e-05 2.36 6.25 15.82 215.45 ▇▁▁▁▁▁▁▁
# latitude 0 11033 11033 25.88 20.42 -46.77 13.92 30.53 40.87 72.63 ▁▁▁▃▅▇▅▁
# longitude 0 11033 11033 2.52 100.91 -179.98 -107.87 19.69 93.95 179.99 ▁▇▅▁▂▆▇▁

Let’s analyze this result. The data below is the same; I prefer to use tables to make the analysis and visualization easier.

Results

n obs: 11033 – n variables: 31

Variable type: character

variable                missing  complete      n  min   max  empty  n_unique
admin_division_name        1637      9396  11033    3    36      0       887
country_code               1564      9469  11033    2     2      0       139
country_name               1562      9471  11033    4    32      0       141
created_date                  0     11033  11033    8    22      0       420
event_date                    0     11033  11033   22    22      0      6550
event_description           862     10171  11033    3  1003      0      9401
event_import_source        1563      9470  11033    3    80      0         3
event_time                 6021      5012  11033    4     7      0        25
event_title                   0     11033  11033    3   150      0     10546
gazeteer_closest_point     1563      9470  11033    2    45      0      4389
landslide_category            1     11032  11033    5    19      0        14
landslide_setting            69     10964  11033    4    16      0        14
landslide_size                9     11024  11033    5    12      0         6
landslide_trigger            23     11010  11033    4    23      0        18
last_edited_date              0     11033  11033   22    22      0         1
location_accuracy             2     11031  11033    3     7      0         9
location_description        102     10931  11033    3   412      0     10432
notes                     10716       317  11033   13   484      0       265
photo_link                 9537      1496  11033   28   292      0      1469
source_link                 846     10187  11033   23   386      0      8294
source_name                   0     11033  11033    2   172      0      3918
storm_name                10456       577  11033    1    41      0       217
submitted_date                9     11024  11033    3    22      0      3787

Variable type: integer

variable                   missing  complete      n       mean         sd  p0   p25   p50    p75     p100  hist
admin_division_population     1562      9471  11033  157760.05  829734.54   0  1963  7365  34021  1.3e+07  ▇▁▁▁▁▁▁▁
event_id                         0     11033  11033    5598.95    3249.23   1  2785  5563   8435    11221  ▇▇▇▇▇▇▇▇
fatality_count                1385      9648  11033       3.22      59.89   0     0     0      1     5000  ▇▁▁▁▁▁▁▁
injury_count                  5674      5359  11033       0.75       8.46   0     0     0      0      374  ▇▁▁▁▁▁▁▁

Variable type: numeric

variable           missing  complete      n     mean       sd       p0      p25    p50     p75    p100  hist
event_import_id       1562      9471  11033  4798.56  2789.13  -111.17   2386.5   4773  7189.5    9669  ▇▇▇▇▇▇▇▇
gazeteer_distance     1562      9471  11033    11.87     15.6    3e-05     2.36   6.25   15.82  215.45  ▇▁▁▁▁▁▁▁
latitude                 0     11033  11033    25.88    20.42   -46.77    13.92  30.53   40.87   72.63  ▁▁▁▃▅▇▅▁
longitude                0     11033  11033     2.52   100.91  -179.98  -107.87  19.69   93.95  179.99  ▁▇▅▁▂▆▇▁

Possible points to investigate

In order to make the conclusion as clear as I can, I will enumerate the possible points to analyze:

1) Event date

First, “event_date” has no missing values. We can group the events by year to see which years have the most landslides.

variable             missing  complete      n  min   max  empty  n_unique
event_date                 0     11033  11033   22    22      0      6550
event_description        862     10171  11033    3  1003      0      9401
event_import_source     1563      9470  11033    3    80      0         3
event_time              6021      5012  11033    4     7      0        25
event_title                0     11033  11033    3   150      0     10546
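Grouping by year could look like the sketch below. It uses a small made-up data frame called toy in place of df, and it assumes that “event_date” is stored as text in month/day/year form followed by a time (the 22-character width reported by skim is compatible with that, but the format is an assumption and should be checked against the real column first).

```r
# Toy stand-in for df; the dates and their format are assumptions
toy <- data.frame(
  event_date = c("08/01/2008 12:00:00 AM", "01/02/2009 02:00:00 AM",
                 "01/19/2009 12:00:00 AM", "07/28/2010 12:00:00 AM"),
  stringsAsFactors = FALSE
)

# Keep only the date part (first 10 characters) and parse it
parsed <- as.Date(substr(toy$event_date, 1, 10), format = "%m/%d/%Y")

# Extract the year and count the events per year
toy$year <- as.integer(format(parsed, "%Y"))
events_per_year <- table(toy$year)
print(events_per_year)
```

If the real column uses a different layout, only the format string passed to as.Date() needs to change.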

2) Country

First, “country_code” is available for most records (9,469 of 11,033). We can create maps showing the geolocation of the landslides, since “latitude” and “longitude” have no missing values.

variable             missing  complete      n  min  max  empty  n_unique
admin_division_name     1637      9396  11033    3   36      0       887
country_code            1564      9469  11033    2    2      0       139
country_name            1562      9471  11033    4   32      0       141
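As a rough first step before a proper map, the events can be counted per country and plotted by coordinates with base R graphics. This is only a sketch on made-up rows (the data frame toy and its values are hypothetical); a real map would add country borders with a mapping package.

```r
# Toy stand-in for df; countries and coordinates are illustrative only
toy <- data.frame(
  country_name = c("Nepal", "Nepal", "Brazil", NA),
  latitude     = c(27.7, 28.2, -22.9, 14.6),
  longitude    = c(85.3, 83.9, -43.2, 121.0),
  stringsAsFactors = FALSE
)

# Events per country, most affected first (table() drops NA by default)
events_per_country <- sort(table(toy$country_name), decreasing = TRUE)
print(events_per_country)

# Crude "map": one point per event on a longitude/latitude grid
plot(toy$longitude, toy$latitude,
     xlim = c(-180, 180), ylim = c(-90, 90),
     xlab = "Longitude", ylab = "Latitude",
     pch = 20, main = "Landslide locations (toy data)")
```

Note that the event without a country name still appears in the plot, because the coordinates do not depend on “country_name”.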

3) Injuries / Fatalities

The variables “injury_count” and “fatality_count” let us determine which events had the most impact. We need to check carefully whether the data we have is enough to build an analysis without bias: “injury_count” is missing for 5,674 of the 11,033 records (about 51%) and “fatality_count” for 1,385 (about 13%).

variable        missing  complete      n  mean     sd  p0  p25  p50  p75  p100  hist
fatality_count     1385      9648  11033  3.22  59.89   0    0    0    1  5000  ▇▁▁▁▁▁▁▁
injury_count       5674      5359  11033  0.75   8.46   0    0    0    0   374  ▇▁▁▁▁▁▁▁
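One way to check how much of each variable is actually reported is to compute the share of missing values per column. The sketch below uses a made-up data frame (toy, with illustrative counts) rather than the real df:

```r
# Toy stand-in for df; the counts are illustrative only
toy <- data.frame(
  fatality_count = c(0L, 2L, NA, 11L, 0L),
  injury_count   = c(NA, 1L, NA, NA, 0L)
)

# Share of missing values per column (is.na() gives a logical matrix)
missing_share <- colMeans(is.na(toy))
print(round(missing_share, 2))

# Totals over the reported values only
reported_totals <- colSums(toy, na.rm = TRUE)
print(reported_totals)
```

A high missing share means that totals and averages describe only the reported events, which is where the bias could come from.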

4) Landslides: size / category / trigger

  • The variable “landslide_size” lets us see the size of each event and group the events by size. We can also check which landslide size is the most frequent.
  • The variable “landslide_trigger” lets us see which triggers are the most common.

variable            missing  complete      n  min  max  empty  n_unique
landslide_category        1     11032  11033    5   19      0        14
landslide_setting        69     10964  11033    4   16      0        14
landslide_size            9     11024  11033    5   12      0         6
landslide_trigger        23     11010  11033    4   23      0        18
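Both questions come down to frequency tables. A minimal sketch with base R, on a made-up data frame (toy; the category labels are hypothetical and may not match the ones in the catalog):

```r
# Toy stand-in for df; the labels are illustrative only
toy <- data.frame(
  landslide_size    = c("small", "medium", "medium", "large", NA, "medium"),
  landslide_trigger = c("downpour", "rain", "downpour", "downpour",
                        "earthquake", NA),
  stringsAsFactors = FALSE
)

# Frequency of each size, most frequent first; NA is kept as its own level
size_counts <- sort(table(toy$landslide_size, useNA = "ifany"),
                    decreasing = TRUE)
print(size_counts)

# Same idea for the triggers
trigger_counts <- sort(table(toy$landslide_trigger, useNA = "ifany"),
                       decreasing = TRUE)
print(trigger_counts)
```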

5) Relationship between variables

Just a few interesting relationships to explore:

  • Injuries / Fatalities per country.
  • Country of occurrence and size of the landslide.
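These two relationships could be explored as in the sketch below, which sums casualties per country and cross-tabulates country against landslide size. It runs on a made-up data frame (toy, with hypothetical values) and uses only base R:

```r
# Toy stand-in for df; all values are illustrative only
toy <- data.frame(
  country_name   = c("Nepal", "Nepal", "Brazil", "Brazil", "India"),
  landslide_size = c("medium", "large", "small", "medium", "medium"),
  fatality_count = c(3, NA, 0, 5, 1),
  injury_count   = c(1, 2, NA, NA, 0),
  stringsAsFactors = FALSE
)

# Fatalities and injuries per country, ignoring missing counts
fatalities_by_country <- tapply(toy$fatality_count, toy$country_name,
                                sum, na.rm = TRUE)
injuries_by_country <- tapply(toy$injury_count, toy$country_name,
                              sum, na.rm = TRUE)
print(fatalities_by_country)
print(injuries_by_country)

# Number of events for each combination of country and landslide size
size_by_country <- table(toy$country_name, toy$landslide_size)
print(size_by_country)
```

With na.rm = TRUE a country whose counts are all missing gets a total of 0, so it is worth reading these totals next to the missing counts reported by skim.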

Conclusions

Using just one library, we already have a better idea of our dataset. Let’s try DataExplorer in “Landslides Dataset from NASA (Part II)” to see what other paths we can find to explore.

Sources

[1] Kirschbaum, D. B., Adler, R., Hong, Y., Hill, S., & Lerner-Lam, A. (2010). A global landslide catalog for hazard applications: method, results, and limitations. Natural Hazards, 52(3), 561–575. doi:10.1007/s11069-009-9401-4.

[2] Kirschbaum, D. B., Stanley, T., & Zhou, Y. (In press, 2015). Spatial and temporal analysis of a global landslide catalog. Geomorphology. doi:10.1016/j.geomorph.2015.03.016.