Quick EDA – Landslides dataset from NASA (Part I)

Context

Landslides are one of the most pervasive hazards in the world, causing injuries and fatalities in almost every country. There are several triggers, but one of the main ones is intense and prolonged rainfall over saturated soil on vulnerable slopes.

The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impacts, or location. The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources. The GLC has been compiled since 2007 at NASA Goddard Space Flight Center. This is a unique data set with the ID tag “GLC” in the landslide editor.

Idea

The dataset is small, around 11,000 records. In this first part, we will work through a basic exploratory analysis step by step to decide which variables to focus on. There are different ways to carry out an exploratory analysis.

Using basic functions

# Returns the number of rows and columns.
dim(df)

# Displays the type and a preview of every column.
str(df)

# Displays the first 10 rows.
head(df, 10)

# For a data frame, applies summary() to each column and collates the results into a table.
summary(df)

Using new libraries

I am referring to skimr and DataExplorer. They are relatively new and give excellent results for the effort involved, leaving room for more specific searches once we have an initial idea of the dataset.

Skimr

skimr is designed to provide summary statistics about variables. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors.
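As a point of comparison, here is a minimal sketch of the two base R functions mentioned above. The vector x is made up for illustration; it only stands in for a numeric column such as fatality_count.

```r
# Toy numeric vector (hypothetical values, not taken from the dataset).
x <- c(0, 0, 1, 2, 3, 10, NA)

# summary() reports the minimum, quartiles, mean, maximum and the NA count.
summary(x)

# fivenum() returns Tukey's five-number summary:
# minimum, lower hinge, median, upper hinge, maximum.
# Missing values are dropped by default (na.rm = TRUE).
five <- fivenum(x)
five
```

Neither function reports the number of unique values or a histogram, which is where skimr adds the most.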

DataExplorer

DataExplorer is designed to provide a graphical view of a data frame. It has three main goals: exploratory data analysis, feature engineering, and data reporting.

Exploratory analysis using Skimr

Code:

library(skimr)
skim(df)

Results:

# Skim summary statistics
# n obs: 11033 
# n variables: 31 
# 
# ── Variable type:character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n min max empty n_unique
# admin_division_name 1637 9396 11033 3 36 0 887
# country_code 1564 9469 11033 2 2 0 139
# country_name 1562 9471 11033 4 32 0 141
# created_date 0 11033 11033 8 22 0 420
# event_date 0 11033 11033 22 22 0 6550
# event_description 862 10171 11033 3 1003 0 9401
# event_import_source 1563 9470 11033 3 80 0 3
# event_time 6021 5012 11033 4 7 0 25
# event_title 0 11033 11033 3 150 0 10546
# gazeteer_closest_point 1563 9470 11033 2 45 0 4389
# landslide_category 1 11032 11033 5 19 0 14
# landslide_setting 69 10964 11033 4 16 0 14
# landslide_size 9 11024 11033 5 12 0 6
# landslide_trigger 23 11010 11033 4 23 0 18
# last_edited_date 0 11033 11033 22 22 0 1
# location_accuracy 2 11031 11033 3 7 0 9
# location_description 102 10931 11033 3 412 0 10432
# notes 10716 317 11033 13 484 0 265
# photo_link 9537 1496 11033 28 292 0 1469
# source_link 846 10187 11033 23 386 0 8294
# source_name 0 11033 11033 2 172 0 3918
# storm_name 10456 577 11033 1 41 0 217
# submitted_date 9 11024 11033 3 22 0 3787
# 
# ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# admin_division_population 1562 9471 11033 157760.05 829734.54 0 1963 7365 34021 1.3e+07 ▇▁▁▁▁▁▁▁
# event_id 0 11033 11033 5598.95 3249.23 1 2785 5563 8435 11221 ▇▇▇▇▇▇▇▇
# fatality_count 1385 9648 11033 3.22 59.89 0 0 0 1 5000 ▇▁▁▁▁▁▁▁
# injury_count 5674 5359 11033 0.75 8.46 0 0 0 0 374 ▇▁▁▁▁▁▁▁
# 
# ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# event_import_id 1562 9471 11033 4798.56 2789.13 -111.17 2386.5 4773 7189.5 9669 ▇▇▇▇▇▇▇▇
# gazeteer_distance 1562 9471 11033 11.87 15.6 3e-05 2.36 6.25 15.82 215.45 ▇▁▁▁▁▁▁▁
# latitude 0 11033 11033 25.88 20.42 -46.77 13.92 30.53 40.87 72.63 ▁▁▁▃▅▇▅▁
# longitude 0 11033 11033 2.52 100.91 -179.98 -107.87 19.69 93.95 179.99 ▁▇▅▁▂▆▇▁

Let’s analyze this result. The tables below show the same data; I prefer tables because they make analysis and visualization easier.

Results

n obs: 11033 – n variables: 31

Variable type: character

variable	missing	complete	n	min	max	n_unique
admin_division_name	1637	9396	11033	3	36	887
country_code	1564	9469	11033	2	2	139
country_name	1562	9471	11033	4	32	141
created_date	0	11033	11033	8	22	420
event_date	0	11033	11033	22	22	6550
event_description	862	10171	11033	3	1003	9401
event_import_source	1563	9470	11033	3	80	3
event_time	6021	5012	11033	4	7	25
event_title	0	11033	11033	3	150	10546
gazeteer_closest_point	1563	9470	11033	2	45	4389
landslide_category	1	11032	11033	5	19	14
landslide_setting	69	10964	11033	4	16	14
landslide_size	9	11024	11033	5	12	6
landslide_trigger	23	11010	11033	4	23	18
last_edited_date	0	11033	11033	22	22	1
location_accuracy	2	11031	11033	3	7	9
location_description	102	10931	11033	3	412	10432
notes	10716	317	11033	13	484	265
photo_link	9537	1496	11033	28	292	1469
source_link	846	10187	11033	23	386	8294
source_name	0	11033	11033	2	172	3918
storm_name	10456	577	11033	1	41	217
submitted_date	9	11024	11033	3	22	3787

Variable type: integer

variable	missing	complete	n	mean	sd	p0	p25	p50	p75	p100	hist
admin_division_population	1562	9471	11033	157760.05	829734.54	0	1963	7365	34021	1.3e+07	▇▁▁▁▁▁▁▁
event_id	0	11033	11033	5598.95	3249.23	1	2785	5563	8435	11221	▇▇▇▇▇▇▇▇
fatality_count	1385	9648	11033	3.22	59.89	0	0	0	1	5000	▇▁▁▁▁▁▁▁
injury_count	5674	5359	11033	0.75	8.46	0	0	0	0	374	▇▁▁▁▁▁▁▁

Variable type: numeric

variable	missing	complete	n	mean	sd	p0	p25	p50	p75	p100	hist
event_import_id	1562	9471	11033	4798.56	2789.13	-111.17	2386.5	4773	7189.5	9669	▇▇▇▇▇▇▇▇
gazeteer_distance	1562	9471	11033	11.87	15.6	3E-05	2.36	6.25	15.82	215.45	▇▁▁▁▁▁▁▁
latitude	0	11033	11033	25.88	20.42	-46.77	13.92	30.53	40.87	72.63	▁▁▁▃▅▇▅▁
longitude	0	11033	11033	2.52	100.91	-179.98	-107.87	19.69	93.95	179.99	▁▇▅▁▂▆▇▁

Possible points to investigate

To keep the conclusions as clear as possible, I will list the points worth analyzing:

1) Event date

The variable “event_date” has no missing values, so we can group the events by year to see which years had the most landslides.

variable	missing	complete	n	min	max	n_unique
event_date	0	11033	11033	22	22	6550
event_description	862	10171	11033	3	1003	9401
event_import_source	1563	9470	11033	3	80	3
event_time	6021	5012	11033	4	7	25
event_title	0	11033	11033	3	150	10546
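Grouping by year could look like the sketch below. It runs on a toy data frame (df_dates, with made-up dates) and assumes that the 22-character event_date strings follow a "MM/DD/YYYY HH:MM:SS AM" layout; that layout is an assumption and should be checked against the real column before reuse.

```r
# Toy data: event_date strings in the assumed "MM/DD/YYYY HH:MM:SS AM" layout.
# The dates are hypothetical, not taken from the catalog.
df_dates <- data.frame(
  event_date = c("08/01/2008 12:00:00 AM",
                 "01/02/2009 02:25:00 AM",
                 "11/20/2009 12:00:00 AM"),
  stringsAsFactors = FALSE
)

# as.Date() ignores anything after the part matched by the format,
# so the time of day is simply dropped.
event_day <- as.Date(df_dates$event_date, format = "%m/%d/%Y")

# Extract the year and count the events in each one.
event_year <- format(event_day, "%Y")
events_per_year <- table(event_year)
events_per_year
```

Parsing only the date part avoids the locale-dependent AM/PM marker, which is not needed for a yearly count.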

2) Country

The variable “country_code” is available for about 86% of the records (9,469 of 11,033). Since latitude and longitude have no missing values, we can also create maps showing the location of the landslides.

variable	missing	complete	n	min	max	n_unique
admin_division_name	1637	9396	11033	3	36	887
country_code	1564	9469	11033	2	2	139
country_name	1562	9471	11033	4	32	141
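Before drawing any map, a count per country already shows where the records concentrate. The sketch below uses a toy data frame (df_geo) with made-up rows; only the column name country_name comes from the dataset.

```r
# Toy data with one missing country (hypothetical rows).
df_geo <- data.frame(
  country_name = c("China", "India", "India", NA, "Nepal", "India"),
  stringsAsFactors = FALSE
)

# Share of records without a country.
missing_country_share <- mean(is.na(df_geo$country_name))

# Events per country, most affected first.
# table() leaves out NA values unless useNA is set.
events_per_country <- sort(table(df_geo$country_name), decreasing = TRUE)
head(events_per_country, 3)
```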

3) Injuries / Fatalities

The variables “injury_count” and “fatality_count” let us determine which events had the most impact. We need to check carefully whether the data is sufficient for an unbiased analysis: injury_count is missing in 5,674 records (about 51%) and fatality_count in 1,385 (about 13%).

variable	missing	complete	n	mean	sd	p0	p25	p50	p75	p100	hist
fatality_count	1385	9648	11033	3.22	59.89	0	0	0	1	5000	▇▁▁▁▁▁▁▁
injury_count	5674	5359	11033	0.75	8.46	0	0	0	0	374	▇▁▁▁▁▁▁▁
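One way to run this check is to compute the share of missing values per column and then rank the events with a known fatality count. This is a minimal sketch on a toy data frame (df_impact) with made-up values; only the column names come from the dataset.

```r
# Toy data with gaps in both impact columns (hypothetical values).
df_impact <- data.frame(
  event_id = 1:5,
  fatality_count = c(0L, 2L, NA, 11L, 0L),
  injury_count = c(NA, 1L, NA, NA, 0L)
)

# Share of missing values in each impact column.
impact_cols <- c("fatality_count", "injury_count")
missing_by_column <- colMeans(is.na(df_impact[, impact_cols]))
missing_by_column

# Keep the events with a known fatality count and sort them, deadliest first.
known <- df_impact[!is.na(df_impact$fatality_count), ]
deadliest <- known[order(known$fatality_count, decreasing = TRUE), ]
head(deadliest, 3)
```

Dropping the rows with a missing count is itself a choice that can bias the ranking, so the missing shares are worth reporting next to any result.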

4) Landslides: size / category / trigger

The variable “landslide_size” lets us see the size of each event, group events by size, and check which size is the most frequent.
The variable “landslide_trigger” lets us see which triggers are the most common.

variable	missing	complete	n	min	max	n_unique
landslide_category	1	11032	11033	5	19	14
landslide_setting	69	10964	11033	4	16	14
landslide_size	9	11024	11033	5	12	6
landslide_trigger	23	11010	11033	4	23	18
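Both questions reduce to frequency tables. The sketch below uses a toy data frame (df_type); the category labels in it are placeholders chosen for illustration, not the actual levels found in the catalog.

```r
# Toy data; the size and trigger labels are hypothetical placeholders.
df_type <- data.frame(
  landslide_size = c("small", "medium", "medium", "large", "medium", NA),
  landslide_trigger = c("rain", "downpour", "rain", "rain", NA, "rain"),
  stringsAsFactors = FALSE
)

# Count each size, keeping a separate entry for missing values if there are any.
size_counts <- table(df_type$landslide_size, useNA = "ifany")
most_frequent_size <- names(which.max(size_counts))
most_frequent_size

# Share of each trigger among the records where the trigger is known.
trigger_share <- sort(prop.table(table(df_type$landslide_trigger)),
                      decreasing = TRUE)
trigger_share
```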

5) Relationship between variables

A few relationships that look interesting to explore:

Injuries / Fatalities per country.
Country of occurrence and size of the landslide.
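Both relationships can be sketched with base R alone. The data frame df_rel below is a toy example with made-up rows and placeholder size labels; only the column names come from the dataset.

```r
# Toy data (hypothetical rows and size labels).
df_rel <- data.frame(
  country_name = c("India", "India", "Nepal", "China", "Nepal"),
  landslide_size = c("small", "large", "medium", "medium", "medium"),
  fatality_count = c(1L, 20L, NA, 3L, 4L),
  injury_count = c(0L, NA, 2L, 1L, NA),
  stringsAsFactors = FALSE
)

# Total fatalities and injuries per country, ignoring missing counts.
impact_by_country <- aggregate(
  df_rel[, c("fatality_count", "injury_count")],
  by = list(country_name = df_rel$country_name),
  FUN = sum, na.rm = TRUE
)
impact_by_country

# Cross-tabulation of country against landslide size.
size_by_country <- table(df_rel$country_name, df_rel$landslide_size)
size_by_country
```

With na.rm = TRUE a country whose counts are all missing would show a total of 0, so the totals should be read together with the number of missing values.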

Conclusions

Using just one library, we already have a better idea of our dataset. Let’s try DataExplorer in “Landslides Dataset from NASA (Part II)” to see what other paths we can explore.

Sources

[1] Kirschbaum, D. B., Adler, R., Hong, Y., Hill, S., & Lerner-Lam, A. (2010). A global landslide catalog for hazard applications: method, results, and limitations. Natural Hazards, 52(3), 561–575. doi:10.1007/s11069-009-9401-4.
[2] Kirschbaum, D. B., Stanley, T., & Zhou, Y. (2015, in press). Spatial and temporal analysis of a global landslide catalog. Geomorphology. doi:10.1016/j.geomorph.2015.03.016.
