TL;DR

If you’re looking for a tool to scrape all the posts in a facebook page/group with a link, and have the data presented to you in a searchable, filterable table, then check out the shiny app I made for this purpose by clicking on the image below (very niche market, I know).

[image: election book (click through to the shiny app)]

If, however, that’s not why you’re here and you’d like to look at some interesting ways of visualising social media data (or any kind of events-over-time data), please read on.


Some Thoughts, Observations & Concerns

The amount of data being generated by the social media giants is now unimaginably vast. The potential for abuse of this data is therefore quite concerning - see here - but I’m not here to bore you with my musings on the state of cyber surveillance, Trump, Brexit, etc. We won’t be doing anything sinister like that today. Instead of leveraging this data to manipulate others, let’s see if we can use some of our own facebook data for a more reflective (existential) analysis, and see what we learn.

As mentioned above, for this example I’m going to be using the data from a private group I share with some friends to post links to music we think is worth listening to, but this can be applied to any facebook page. Particularly useful if you run a public page with lots of activity and engagements that you want to try and make sense of, or if you’re just a raging narcissist and want to know at what time of the day to change your profile picture to yield the most likes.

We’re going to scrape the data with the ever-so-useful Rfacebook package built by Pablo Barbera (nice one Pablo). Then we’ll use ggplot2 for our visualisations with a bit of added interactivity with plotly to round things off.

On y va!


Scraping The Data

Load em up:

library(Rfacebook)
library(tidyverse)
library(forcats)
library(lubridate)
library(hrbrthemes)
library(ggalt)
library(ggbeeswarm)
library(plotly)

Now, to access the facebook API you need to head over to facebook’s developer site, create your own ‘app’ (not as painful as it sounds), then save the token it gives you as a variable for ease of use later.

Once that’s done, save your group_id as a variable (normally a 15-digit number that comes after the /groups/ part of the URL) and we’re good to go.

side-note: you can try this with public facebook pages/groups also. Just paste in the page or group ID.

token <- 'XXXXXXXXXXXXXXXXXXXXX'

group_id <- 'XXXXXXXXXXXXXXXXXX'

Now let’s build a function that will do all the hard graft for us and output a tidy dataframe with only the data that we’re interested in. My use case has a niche element: I want to scrape the metadata from any link within each post to get the name of that link. In most cases this will be the title of a youtube video, which will most likely be a song title. This gives me a fast way of knowing what songs have been posted in the group without having to follow every URL.

It does require an extra bit of leg-work in the function, as Rfacebook’s getGroup function doesn’t return link titles, only URLs. But if you don’t need this info then you can skip it and your life is a lot easier.

The main mutations we’ll be making to the dataset are all date/time related. To explore various relationships between group posts and time, it’s useful to aggregate up to broader time categories. Facebook gives us the date/time of each post to the exact second. We’ll aggregate this datetime to minute, hour, weekday and month. As base R has no dedicated time-of-day class, for our Minute and Hour variables we will create a datetime but set the date to the same day for every post. This squashes all times into a single day and allows for better post/time analysis.
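To make that concrete, here’s a minimal sketch of the squashing idea using lubridate’s helpers and a made-up timestamp (the real version lives inside the scraping function below):

example_time <- as.POSIXct("2016-11-04 22:37:45", tz = "GMT")

# Minute: keep the time of day but pin the date to 2017-01-01
make_datetime(2017, 01, 01, hour(example_time), minute(example_time), 0, tz = "GMT")
#> [1] "2017-01-01 22:37:00 GMT"

# Hour: same again, rounded down to the hour
make_datetime(2017, 01, 01, hour(example_time), 0, 0, tz = "GMT")
#> [1] "2017-01-01 22:00:00 GMT"

# Day and Month
weekdays(example_time)
#> [1] "Friday"
make_date(year(example_time), month(example_time), 01)
#> [1] "2016-11-01"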

Here’s the full scraping function. The limit variable will dictate how many posts facebook will return data for. Without this you get a fairly pathetic 25 posts, but one of my motivations for doing this was to get instant access to historical posts going back to the start of the group in late 2015. It’s a very tedious process of scrolling and scrolling and scrolling on facebook to get to where you want to be, so set the limit high and the function will keep scraping until there are no posts left to scrape.

group_scrape <- function(token, group_id, limit) {
  
  # function we'll use to convert the date output from facebook to an R datetime (POSIXct)
  format.facebook.date <- function(datestring) {
    as.POSIXct(datestring, format = "%Y-%m-%dT%H:%M:%S+0000", tz = "GMT")
  }
  
  # handy Rfacebook function that will return most of what we want in a tidy datatable
  data_main <-  getGroup(group_id, token, feed = TRUE, n = limit)
  
  # custom API call to get the name of the link in any post that has one, returned as a list of lists
  link_names <- callAPI(paste0("https://graph.facebook.com/v2.9/", group_id, 
                                   "?fields=feed.limit(", limit, "){name}"), token)
  # pull the data out of the lists and into a tidy data frame of the same length as data_main
  # posts with no link will be NAs
  link_names <- bind_rows(lapply(link_names$feed$data, as.data.frame))
  
  # levels for our Days factor variable we're about to create
  days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
  
  # merge the two datasets on the id post variable
  final <- merge(data_main, link_names, by = "id")
  
  # remove posts that don't contain a link with complete.cases on the link name column
  final <- final[complete.cases(final[,12]),] %>%
    mutate(created_time = format.facebook.date(created_time),
           Date = as.Date(created_time), # created_time is already POSIXct, so no format string is needed
           Minute = make_datetime(2017, 01, 01, hour(created_time), minute(created_time), 0, tz = "GMT"),
           Hour = make_datetime(2017, 01, 01, hour(created_time), 0, 0, tz = "GMT"),
           Day = factor(weekdays(created_time), levels = days),
           Month = make_date(year(created_time), month(created_time), 01),
           Link = paste0("<a href='",link,"' target='_blank'>","open link...","</a>")) %>%
    select(Date, Month, `Posted By` = from_name, Track = name, Day, Hour, Minute, Likes = likes_count,
           Comments = comments_count, Link)
  
  # gets rid of any weird stuff (emojis etc) in the link names that can cause a data.table to fail
  final$Track <- sapply(final$Track, function(row) iconv(row, "latin1", "ASCII", sub=""))
  
  return(final)
}

Now let’s throw our token, group_id and limit into the function and see what we get…

tunes <- group_scrape(token = token, group_id = group_id, limit = 1500)
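Before plotting, it’s worth a quick look at the structure to check the function has returned what we expect (just a sanity-check sketch; the contents will obviously depend on your own group):

glimpse(tunes)
# one row per post containing a link, with the ten columns selected in the function:
# Date, Month, Posted By, Track, Day, Hour, Minute, Likes, Comments, Link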

Time to visualise!

I’m going to explore various relationships in the data with some different chart types, then refine things down to one (maybe two) charts that give me the information most useful to me, making every pixel count (can you hear me, Edward Tufte?).

All charts use hrbrmstr’s glorious ipsum_rc theme from the hrbrthemes package, because Roboto Condensed is life.

Let’s get the ball rolling with a stacked area chart of posts per month split by group member…

To save myself and my friends any embarrassment over things like pitiful post:likes ratios, I’ve anonymised the names of the group members.
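If you want to do the same, one possible way is forcats’ fct_anon, sketched below (note it also shuffles the level order, so the MemberX numbers carry no meaning):

tunes <- tunes %>%
  mutate(`Posted By` = fct_anon(factor(`Posted By`), prefix = "Member"))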

per_month <- count(tunes, `Posted By`, Month) %>%
  mutate(`Posted By` = `Posted By` %>% fct_rev())

ggplot(per_month, aes(Month, n, group = `Posted By`, fill = `Posted By`)) +
  geom_area(alpha = 0.7) +
  theme_ipsum_rc(caption_size = 12) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  scale_fill_brewer(palette = "Dark2") +
  labs(y = "", x = "", title = "Facebook Group", subtitle = "Posts per Month",
       caption = "After a miserable April, the troops have shown some bouncebackability and are now in a rich vein of form...") +
  theme(legend.direction = "horizontal", legend.position=c(0.8, 1.05))

Now we’ll try looking at posts per day of the week, then per hour, to see if there are any peaks or troughs.

per_day <- count(tunes, Day) %>%
  mutate(Day = Day %>% fct_rev())

ggplot(per_day, aes(n, Day)) +
  geom_lollipop(point.colour = "SteelBlue", point.size = 4, horizontal = TRUE) +
  theme_ipsum_rc(grid = "X", caption_size = 12) +
  labs(y = "", x = "", title = "Facebook Group", subtitle = "Posts per Day",
       caption = "Putting in work on Wednesdays and taking it easy on the weekend.")

per_hour <- count(tunes, Hour)

ggplot(per_hour, aes(Hour, n)) +
  geom_bar(stat = "identity", fill = "SteelBlue") +
  theme_ipsum_rc(caption_size = 12) +
  scale_x_datetime(date_labels = "%H:%M", date_breaks = "2 hours") +
  labs(y = "", x = "", title = "Facebook Group", subtitle = "Posts per Hour")

Now let’s use the perennially sought-after facebook Like as a measurement of success, along with the number of comments on each post. You may have noticed from the first chart that Member4 joined the party late. Did his addition have any effect on the average engagement each month?

engagements <- tunes %>%
  group_by(Month) %>%
  summarise(posts = n(), Likes = mean(Likes), Comments = mean(Comments)) %>%
  gather(Metric, Average, Likes:Comments)

ggplot(engagements, aes(Month, Average, colour = Metric)) +
  geom_vline(xintercept = as.numeric(min(filter(tunes, `Posted By` == "Member4")$Date)), linetype = 2) +
  annotate(geom = "text", x = min(filter(tunes, `Posted By` == "Member4")$Date),
           y = 3.25, label = "Enter Member4", hjust = -.05, family = "Roboto Condensed") +
  geom_line(size = 1) +
  theme_ipsum_rc(caption_size = 12) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  scale_colour_brewer(palette = "Dark2") +
  labs(y = "", x = "", title = "Facebook Group", subtitle = "Average Engagements per Post", 
       caption = "Bear in mind the max number of likes is 3 in this case (unless you have that friend that likes their own posts)")

With the number of people in the group increasing by 33%, we can see a bit of a hike in average comments per post, which then returns to a similar level. Sadly, the average number of likes has fallen since his arrival. quantity != quality

Focusing in on member performance, using the count of posts per month by group member, let’s see how consistent each member is by looking at the min and max number of posts.

minmax <- per_month %>%
  group_by(`Posted By`) %>%
  summarise(min = min(n), max = max(n))

ggplot(minmax, aes(y = `Posted By`, x = min, xend = max)) +
  geom_dumbbell(size=2, color="#e3e2e1", 
                colour_x = "LightBlue", colour_xend = "SteelBlue",
                dot_guide=TRUE, dot_guide_size=0.25) +
  labs(y = "", x = "", title = "Facebook Group", subtitle = "Min/Max Posts by Month",
       caption = "Not exactly consistent performers") +
  theme_ipsum_rc(grid = "X", caption_size = 12)


Swarm Those Posts

That covers a lot of the things I would be interested to explore in this dataset, but let’s see if we can build a chart that incorporates a lot of them into one graphic.

If you’ve seen my last blog post, you’ll know that I have a bit of a thing for colourful dot-density charts/maps. Using ggbeeswarm’s geom_quasirandom, we can represent each post as a dot and show the density of posts over time. The quasirandom plotting offsets points within categories to reduce overplotting.

I’m also going to use the Plotly ggplotly wrapper here to add a bit of interactivity to the plot. There’s a lot of debate in the Data Vis world about interactivity and its benefits (or lack thereof), but in this instance it will allow us to add tooltip information for each post, as well as zooming functionality to focus on a specific period of time, which is great for digging into areas of high density.

ggplotly(
  tunes %>% mutate(`Posted By` = `Posted By` %>% fct_rev()) %>%
  ggplot(aes(Date, `Posted By`, colour = `Posted By`, alpha = Likes, label = Comments)) +
  geom_quasirandom() +
  scale_color_brewer(palette = "Dark2") +
  theme_ipsum(grid = "X") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(y = NULL, x = NULL, title = "Post Density Over Time by Group Member") +
  theme(legend.position = "none", plot.margin=unit(c(2,0,1,0),"cm")),
  tooltip = c("x", "alpha", "label"))

I’ve set the alpha argument to Likes which makes it easy for us to identify posts that have performed the best.

The pièce de résistance of this chart would be having the name of the link (song title) in the tooltip, but as this blog post will be read by around 5-10 (million) people, I couldn’t possibly reveal the back catalog of music that we have meticulously curated to an audience that vast.

I do have it included in my personal version however, and it’s a lot of fun identifying posts with the most likes via the alpha level, zooming into clustered areas, and using the tooltip to see what each song is - give it a try yourself!
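If you want the track names in your own tooltips, here’s a sketch of one way to do it: map Track to a spare aesthetic in aes() and add it to ggplotly’s tooltip vector (ggplot2 may warn about an unknown aesthetic, but plotly will pick it up, just as it does with the Date text in the chart below):

ggplotly(
  tunes %>% mutate(`Posted By` = `Posted By` %>% fct_rev()) %>%
  ggplot(aes(Date, `Posted By`, colour = `Posted By`, alpha = Likes,
             label = Comments, text = Track)) +  # extra 'text' aesthetic carries the song title
  geom_quasirandom() +
  scale_color_brewer(palette = "Dark2") +
  theme_ipsum(grid = "X") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(y = NULL, x = NULL, title = "Post Density Over Time by Group Member") +
  theme(legend.position = "none", plot.margin = unit(c(2,0,1,0), "cm")),
  tooltip = c("x", "alpha", "label", "text"))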

Finally, let’s create the same chart but with a more micro timescale, squashing all posts into a 1 week period.

ggplotly(
  ggplot(tunes, aes(Day, Minute, colour = `Posted By`, text = Date, alpha = Likes, label = Comments)) +
  geom_quasirandom() +
  scale_color_brewer(palette = "Dark2") +
  scale_alpha(guide = FALSE) +
  theme_ipsum(grid = "Y") +
  scale_y_datetime(date_labels = "%H:%M", date_breaks = "4 hours") +
  labs(y = NULL, x = NULL, title = "Posts per Day of Week / Time") +
  theme(plot.margin=unit(c(2,0,1,0),"cm")),
  tooltip = c("text", "alpha", "label"))

Try interacting with the chart:

  1. Filter group members by clicking on their label in the legend
  2. Zoom by clicking + dragging
  3. All the other stuff in the plotly toolbar including…
  4. Download plot as a png (for all those wanting to print + frame it on their wall)

That is all for this post. If you have any thoughts/questions or any other ideas of interesting ways to visualise this kind of data, get in touch below or via my twitter.

And if you’re interested in bringing your data to life with bespoke visualisations like the above - get in touch with the Culture of Insight team here. We’d love to hear from you!

Adieu.