A Dive Into Some Global Flooding Data

I always like to keep a lookout for interesting open data sets. One great resource for such things is Jeremy Singer-Vine’s Data is Plural weekly newsletter, which brings together a collection of “useful, curious datasets” for us all to enjoy and wrangle with.

One that cropped up last week was The Dartmouth Flood Observatory’s Global Archive of Large Flood Events. It contains data on 4,500+ floods dating back to 1985, is updated often, and is available in Excel, XML, HTML, and geospatial formats 👏. The variables include each flood’s location, timespan, severity, main cause, and estimated impact, so there’s plenty to explore…


Examining the Data

I downloaded the data in geospatial form from the link above and read it into R as a nice tidy simple features object with the sf package like so:

library(tidyverse)
library(sf)
library(lubridate)
library(hrbrthemes)
library(dygraphs)
library(leaflet)
library(leaflet.extras)
library(widgetframe)

floods_sf <- st_read("floods/FloodArchive_region.shp", stringsAsFactors = FALSE, quiet = TRUE)

# get a tibble version sans geometry 
floods <- as_tibble(floods_sf) %>% select(-geometry)

glimpse(floods)
## Observations: 4,520
## Variables: 14
## $ ID         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ GLIDENUMBE <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ COUNTRY    <chr> "Algeria", "Brazil", "Phillipines", "Indonesia", "M...
## $ OTHERCOUNT <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ LONG       <dbl> 5.230257, -45.348871, 122.974280, 124.606276, 32.34...
## $ LAT        <dbl> 35.814242, -18.711052, 10.020719, 1.014892, -25.869...
## $ AREA       <dbl> 92615.668, 678498.821, 12846.028, 16542.125, 20082....
## $ BEGAN      <date> 1985-01-01, 1985-01-15, 1985-01-20, 1985-02-04, 19...
## $ ENDED      <date> 1985-01-05, 1985-02-02, 1985-01-21, 1985-02-18, 19...
## $ VALIDATION <chr> "News", "News", "News", "News", "News", "News", "Ne...
## $ DEAD       <dbl> 26, 229, 43, 21, 19, 2, 4, 10, 7, 2, 4, 0, 0, 0, 32...
## $ DISPLACED  <dbl> 3000, 80000, 444, 300, 0, 35000, 200, 2000, 2250, 1...
## $ MAINCAUSE  <chr> "Heavy rain", "Heavy rain", "Torrential rain", "Tor...
## $ SEVERITY   <dbl> 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, ...

Now, the eagle-eyed amongst us may have spotted that we already have a typo in the first few rows of data: Phillipines should be Philippines. Ergo, it looks like we have a 4,520-row, manually entered dataset, so we must stay vigilant and on the lookout for errors. But we wouldn’t want it any other way 😅.
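A quick way to scan for other suspect spellings is to count the distinct country names — entries that crop up only once or twice are worth a second look (a minimal check, not an exhaustive audit):

# count distinct country names; rare spellings are often typos
floods %>% 
  count(COUNTRY, sort = TRUE) %>% 
  tail(10)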

Because we have the start and end date of each flood, one thing that may be of interest is the average duration of a flood. Let us then compute the duration of each flood:

floods <- floods %>% 
  mutate(DURATION = as.numeric(ENDED - BEGAN))

mean(floods$DURATION, na.rm = TRUE)
## [1] -136.209

An average flood time of -136.209 days. Something fishy is going on here.

We must have computed some negative durations, which should not be possible if we’re going by most people’s perception of time, so let’s find out where the error is:

filter(floods, DURATION < 0) %>% glimpse()
## Observations: 1
## Variables: 15
## $ ID         <dbl> 4352
## $ GLIDENUMBE <chr> NA
## $ COUNTRY    <chr> "Kenya"
## $ OTHERCOUNT <chr> "Tanzania"
## $ LONG       <dbl> 38.34419
## $ LAT        <dbl> -4.714506
## $ AREA       <dbl> 117235.2
## $ BEGAN      <date> 2016-04-13
## $ ENDED      <date> 0206-04-22
## $ VALIDATION <chr> "News"
## $ DEAD       <dbl> 0
## $ DISPLACED  <dbl> 0
## $ MAINCAUSE  <chr> "Tropical Storm Fantala"
## $ SEVERITY   <dbl> 1
## $ DURATION   <dbl> -661081

And there it is! The Great Kenyan Flood of April 2016 - April 206, lasting a whole -661,081 days. Assuming the end date should be 2016-04-22, we can fix it using case_when like so:

# probably quicker ways to do this but I like making a case for case_when (h/t hrbrmstr)
floods <- floods %>% 
  mutate(ENDED = case_when(
           ID == 4352 ~ as_date("2016-04-22"),
           TRUE ~ ENDED),
         DURATION = as.numeric(ENDED - BEGAN))

# now lets try our average duration again
mean(floods$DURATION, na.rm = TRUE)
## [1] 10.27676

That looks more reasonable! Around 10 days.


Make some charts

It might also be interesting to see if this differs across the regions of the world. To do this, we can code a new region variable based on the COUNTRY variable we already have, using the countrycode package. Typos will cause NAs here, but it’s a sacrifice I’m willing to make, as fixing every typo in the data is outwith the scope of this blogpost.

# first I'll run a function on the country labels to get rid of any strange encoding
floods$COUNTRY <- sapply(floods$COUNTRY, function(row) iconv(row, "latin1", "ASCII", sub=""))

# then run the countrycode function
floods$REGION <- countrycode::countrycode(floods$COUNTRY, "country.name", "region")

region_sums <- floods %>%
  filter(!is.na(REGION), !is.na(DURATION)) %>% 
  group_by(REGION) %>% 
  summarise(count = n(), deaths = sum(DEAD), displaced = sum(DISPLACED), av_duration = mean(DURATION)) %>% 
  gather(stat, value, count:av_duration)
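The styled regional charts aren’t reproduced in this post, but a quick faceted bar chart along these lines gives the gist of the comparison (a minimal sketch, using default styling):

# one panel per summary stat, with free x scales as the units differ wildly
region_sums %>% 
  ggplot(aes(REGION, value)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ stat, scales = "free_x") +
  labs(x = NULL, y = NULL, title = "Floods by World Region, 1985-2017")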


Time for some time-series

Aggregating the data by year, we can get a look at global trends…

flood_sums <- as_tibble(floods) %>%
  filter(!is.na(BEGAN), !is.na(DEAD)) %>% 
  mutate(year = make_date(year(BEGAN), 01, 01)) %>% 
  group_by(year) %>% 
  summarise(floods = n(), deaths = sum(DEAD), displaced = sum(DISPLACED))

# convert to xts time-series objects to allow plotting with dygraphs
ts_floods <- xts::xts(flood_sums$floods, flood_sums$year)
ts_deaths <- xts::xts(flood_sums$deaths, flood_sums$year)
ts_displaced <- xts::xts(flood_sums$displaced, flood_sums$year)
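The interactive charts themselves aren’t reproduced here, but each one boils down to something like this minimal sketch (shown for the deaths series):

# an interactive time-series with a range selector for zooming
dygraph(ts_deaths, main = "Flood Deaths per Year") %>% 
  dyRangeSelector()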




There looks to have been a downward trend in all 3 metrics from around 2008, but it will be interesting to see whether recent events cause another spike…


Mapping

Now for the real fun. You may have noticed that we’ve been provided with point data (long, lat coordinates) for each flood. This is good because we can use it to map kernel density estimates and display the results as a heatmap. The idea is to highlight the areas of the world with the highest density of recorded floods, based on the past three decades of data.

Playing around with themes, layering, colours and alpha levels can really make a map more impactful and relevant to the subject matter.

world <- map_data("world") %>% 
  filter(region != "Antarctica")

ggplot() +
  stat_density2d(data = floods, aes(LONG, LAT, fill = ..level..), alpha = .5, geom = "polygon") +
  geom_map(data = world, map = world, aes(long, lat, map_id=region), fill="#f5f5f550", colour = "white", size = 0.1) +
  ggalt::coord_proj("+proj=robin") +
  scale_fill_gradientn("kernel density estimate ", colours = RColorBrewer::brewer.pal(7, "Blues")) +
  theme_ipsum_rc(axis = FALSE, base_family = "Iosevka", caption_size = 12) +
  labs(x = NULL, y = NULL, title = "Where Floods Happen", subtitle = "1985-2017", 
       caption = "Culture of Insight / @PaulCampbell91 / Source: Dartmouth Flood Observatory") +
  guides(fill=guide_colourbar(label.position = "top", title.vjust = 0)) +
  theme(legend.position = c(0.3, -0.035), legend.direction = "horizontal", legend.key.width = unit(2, "line"),
        legend.key.height = unit(0.7, "lines"), axis.text = element_blank(), panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.9))

(click for the full screen version)

This confirms that the South-East Asia region has suffered the most from flooding over the last ~30 years, as we saw in the initial charts.

The Dartmouth Flood Observatory has also provided spatial polygons representing each flood, and we’d be fools not to try to do this admirable work some viz justice. Plotting them all on a map at once leads to severe overplotting, so I’ll animate the time dimension and take a look at flood areas over time.

world2 <- rnaturalearth::ne_countries(returnclass = "sf") %>% 
  filter(!name %in% c("Fr. S. Antarctic Lands", "Antarctica"))

flood_map <-  floods_sf %>% 
  mutate(YEAR = year(BEGAN)) %>% 
  ggplot() +
  geom_sf(data = world2, fill = "lightgreen", alpha = .6) +
  geom_sf(aes(frame = YEAR), fill = "lightblue") +
  coord_sf(crs = "+proj=robin", datum = NA) +
  theme_ipsum_rc(axis = FALSE) +
  labs(x = NULL, y = NULL, title = "Global Floods",
       caption = "Culture of Insight / @PaulCampbell91 / Source: Dartmouth Flood Observatory") +
  theme(legend.position = c(0.9, 1.01), legend.direction = "horizontal", legend.key.width = unit(2, "line"), 
        legend.key.height = unit(0.7, "lines"), axis.text = element_blank(), panel.grid.minor = element_blank(), 
        panel.grid.major = element_blank(), panel.background = element_rect(fill = "lightblue"))

animation::ani.options(interval = 1)
gganimate::gganimate(flood_map, ani.width =  1000, ani.height = 570, title_frame = TRUE)



Finally, using the LONG and LAT coordinates again, we can easily create an interactive leaflet map with tooltip details for every flood in the dataset. Click on the clusters to zoom into an area and have an explore.

click the full-screen button to get a more immersive experience!
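The map code isn’t shown in full here, but a minimal sketch looks something like the below — the popup contents are just an illustration of what you could include, and addFullscreenControl() from leaflet.extras supplies the full-screen button:

# cluster the 4,500+ points and attach a simple popup to each flood
leaflet(floods) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircleMarkers(
    lng = ~LONG, lat = ~LAT,
    popup = ~paste0(COUNTRY, ": ", BEGAN, " to ", ENDED),
    clusterOptions = markerClusterOptions()
  ) %>% 
  addFullscreenControl()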


Thanks to the Dartmouth Flood Observatory for compiling and maintaining this excellent dataset, full of rich geographic information. They really made it easy to make some interesting maps and learn a bit more about how flooding has affected the world in recent times.

Let me know if you have any other ideas on how best to visualise this information. It’s always good to see alternative perspectives on the same data. Drop a note in the comment section or catch me on Twitter.

Cheers!


To keep things relatively concise, not all code is shown in this blogpost. You can check the full R Markdown file here.