Scraping social data from Facebook

Author: Loic Quertenmont

First Publication: 15/12/2016

Reading Time: 6 minutes

Categories: Data mining

Tags: b2c, facebook, mining, python, scrapping, socialnetwork

Nowadays, social networks can be considered a major source of data. This is particularly true for business-to-customer companies, which must take into account customer feedback on their products. In this blog post, we will show how to retrieve information from Facebook using the Facebook Graph API.

As a showcase, we will retrieve the activity over the last 6 months on the Facebook page of the Belgian railway company (SNCB / NMBS). To do so, we will use the Facebook SDK package for Python, which provides handy functions to interact with the Graph API.

The first thing we need is an access token to connect to the Graph API. A Facebook token lets us grant specific permissions to a given application. For instance, we can allow it to see our list of friends, our email address, etc. Facebook is actually quite strict about the protection of user data. For what we will do today, we don't need any of these permissions: we are not trying to get information about a particular user, but want to access a public Facebook page and extract as much public information from it as possible.

The token can be obtained via OAuth2 authentication, but to keep this blog post simple, we will request a token using the Facebook Graph API Explorer interface. On this page, you need to press the "get token" button and choose "get user token". A popup window opens where you can choose which permissions to associate with this token. We don't need any permission for what we plan to do, so there is no need to check any of the boxes. Be aware that the user token is only valid for about 2 hours, so you may have to request a new one from time to time.

Now that we have a token, we can query the Facebook API using the Python SDK. The code below shows how to go through all posts on the Facebook page of the SNCB (Belgian railway). This is done via the "get_connections" function of the SDK, which takes as arguments a Facebook object ID (here the ID of the SNCB page on Facebook) and the type of content we want to grab (here the posts on the page). As the number of posts on a page can be quite large, the Facebook Graph API only returns the first posts. It also returns a pointer (via a "paging" object) to the next batch of posts, so if we want to process all posts of a page, we need to process them batch by batch until no further batch is available. See the Graph API documentation for more details.

import facebook
import requests

# Set the token we received from the Facebook Graph API Explorer (the one below is outdated)
access_token = 'EAACEdEose0cBAPWTcaMppNFceRnsORWCFSiaaQD8Gr7UZArgl7xZBuucoUz96g3QmmP2tZCgR2DAlLl4sxmnmlabArULdZBGsqM7KcUHzlLZBJvRH6FWnVBfeYt7bAW5fZAWZCZALQnE0BhRQxraAUKW7ec0H6cwzL7GwkKGXZA435gZDZD'
public_page = 'sncb'

graph = facebook.GraphAPI(access_token)
# Resolve the page name to a page object, then request its posts
pageFB = graph.get_object(public_page)
posts = graph.get_connections(pageFB['id'], connection_name='posts', summary='true')

while True:
    for p in posts['data']:
        # We can process a Facebook post (for this example, we just print it)
        print(p)

    if 'paging' not in posts or 'next' not in posts['paging']:
        # We have processed all posts, exit the while loop
        break
    # There are more posts to grab, so fetch the next batch
    posts = requests.get(posts['paging']['next']).json()
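
The batch-by-batch traversal can also be isolated in a small generator, which keeps the processing code free of paging details. The following is only a minimal sketch: the names `iter_items` and `fetch_page` are hypothetical (they are not part of the Facebook SDK), and the demonstration uses two fake pages instead of real HTTP calls.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_items(first_page: Dict, fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every item from a chain of Graph-API-style pages.

    `first_page` is a response dict with a 'data' list and an optional
    'paging' -> 'next' URL; `fetch_page` (hypothetical) maps that URL to
    the next response dict.
    """
    page: Optional[Dict] = first_page
    while page is not None:
        for item in page.get('data', []):
            yield item
        next_url = page.get('paging', {}).get('next')
        # Stop when the response no longer advertises a following batch
        page = fetch_page(next_url) if next_url else None


# Offline demonstration with fake pages (made-up data, no network access)
fake_pages = {
    'url-2': {'data': [{'id': 'c'}]},  # last page: no 'paging' key
}
first = {'data': [{'id': 'a'}, {'id': 'b'}], 'paging': {'next': 'url-2'}}
all_ids = [post['id'] for post in iter_items(first, fake_pages.__getitem__)]
print(all_ids)
```

Against the real API, `fetch_page` could be something like `lambda url: requests.get(url).json()`, which mirrors the last line of the loop above.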

As the printed output shows, a single post contains an impressive amount of information. Note that it is stored as a Python dictionary, which makes it particularly convenient to access specific fields. Here is what I get for the first post:

{
   story_tags : {'0': [{'type': 'page', 'id': '484217188294962', 'name': 'SNCB', 'offset': 0, 'length': 4}]},
   from : {'id': '484217188294962', 'name': 'SNCB', 'category': 'Transportation Service', 'category_list': [{'id': '152367304818850', 'name': 'Transportation Service'}, {'id': '2258', 'name': 'Travel Company'}]},
   link : "https://www.facebook.com/SNCB/photos/a.844798168903527.1073741830.484217188294962/1157476480969026/?type=3",
   is_expired : False,
   updated_time : "2016-12-15T07:53:27+0000",
   actions : [{'name': 'Comment', 'link': 'https://www.facebook.com/484217188294962/posts/1157476480969026'}, {'name': 'Like', 'link': 'https://www.facebook.com/484217188294962/posts/1157476480969026'}],
   icon : "https://www.facebook.com/images/icons/photo.gif",
   is_hidden : False,
   message : "Les accompagnateurs de train Anneleen et Bart sont fiancés ! <3\nLeurs regards se sont croisés pour la première fois dans le centre de formation où leur carrière a débuté en 2011. Ce n’est qu’un an plus tard qu’a jailli l’étincelle, lors de leur premier rendez-vous. La gare de Louvain a récemment été le théâtre de la demande en mariage.\nNous leur souhaitons beaucoup de bonheur ensemble !",
   object_id : 1157476480969026,
   shares : {'count': 75},
   likes : {'paging': "... TRUNCATED ..."},
   privacy : {'friends': '', 'deny': '', 'allow': '', 'description': '', 'value': ''},
   created_time : "2016-12-14T10:40:21+0000",
   type : "photo",
   name : "Timeline Photos",
   id : "484217188294962_1157476480969026",
   status_type : "added_photos",
   story : "SNCB feeling in love.",
   picture : "https://scontent.xx.fbcdn.net/v/t1.0-0/p130x130/15492137_1157476480969026_4126379523377925813_n.jpg?oh=a99414cf126f04e12de4a006a435fc56&oe=58BA1CD4",
   comments : {'paging': ".... TRUNCATED ..." },
}

Among other things, we have the post message, its creation time, the name of the author, the type of post, a story description of the post, a link to the post picture, and pointers to all the likes and comments (including who liked or commented on this post). That's a gigantic source of information for analytics. With this, we can identify who likes posts on this page (and is therefore interested in the Belgian railway company), we can analyze the comments made on a post and possibly flag messages from dissatisfied customers (and take action to improve the situation), we can do user segmentation based on the user profiles, etc.
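
Since each post is a plain dictionary, pulling out the fields of interest takes only a few lines. Here is a minimal sketch, assuming the field names seen in the example post above; the helper name `summarize_post` is hypothetical, and the defaults reflect the assumption that some fields may be absent from a given post.

```python
from datetime import datetime


def summarize_post(post):
    """Flatten the fields of interest from one post dictionary."""
    created = datetime.strptime(post['created_time'], '%Y-%m-%dT%H:%M:%S%z')
    return {
        'id': post['id'],
        'author': post.get('from', {}).get('name'),
        'type': post.get('type'),
        'created': created,
        # A post may lack a message or a share count, so use defaults
        'message': post.get('message', ''),
        'shares': post.get('shares', {}).get('count', 0),
    }


# Abridged copy of the example post shown above
sample_post = {
    'id': '484217188294962_1157476480969026',
    'from': {'id': '484217188294962', 'name': 'SNCB'},
    'type': 'photo',
    'created_time': '2016-12-14T10:40:21+0000',
    'shares': {'count': 75},
}
summary = summarize_post(sample_post)
print(summary['author'], summary['type'], summary['shares'], summary['created'].date())
```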

As a simple benchmark for this post, we will analyze the ~150 posts published on the SNCB page over the last 6 months. We can first look at post popularity in terms of likes, shares, and comments:

[Figure: likestrend]
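
A ranking of this kind could be computed along the following lines. This is only a sketch under stated assumptions: the per-post counts are assumed to be already extracted into flat `likes`, `shares`, and `comments` fields, popularity is taken as their plain sum, "6 months" is approximated as 183 days, and all records below are made up.

```python
from datetime import datetime, timedelta, timezone


def top_posts(posts, now, days=183, n=15):
    """Return the n most popular posts created within the last `days` days."""
    cutoff = now - timedelta(days=days)
    recent = [p for p in posts if p['created'] >= cutoff]
    # Popularity here is simply likes + shares + comments (an assumption)
    return sorted(
        recent,
        key=lambda p: p['likes'] + p['shares'] + p['comments'],
        reverse=True,
    )[:n]


# Made-up records, for illustration only
now = datetime(2016, 12, 15, tzinfo=timezone.utc)
posts = [
    {'name': 'photo post', 'created': datetime(2016, 12, 14, tzinfo=timezone.utc),
     'likes': 90, 'shares': 7, 'comments': 4},
    {'name': 'old post', 'created': datetime(2016, 1, 10, tzinfo=timezone.utc),
     'likes': 500, 'shares': 10, 'comments': 3},
    {'name': 'news post', 'created': datetime(2016, 10, 1, tzinfo=timezone.utc),
     'likes': 12, 'shares': 4, 'comments': 2},
]
ranking = [p['name'] for p in top_posts(posts, now)]
print(ranking)
```

The old post is dropped despite its high like count because it falls outside the time window.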

For the 15 most popular posts, we show the post name. For most of them, the post name is "Timeline Photos", which is a name generated by Facebook when a new picture is uploaded. We can therefore already see that picture posts are more popular than text or news posts. This is precious information for the communication strategy of the company. We can continue by analyzing the text of the posts in order to identify the topics of interest on the SNCB page. To do so, I used the simple entity extraction of the NLTK Python library and plotted the results as a word cloud (word size depends on frequency):
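
The word sizes in such a cloud come from frequency counts. The sketch below illustrates only that counting step, using a plain `Counter` as a stand-in: it is not NLTK's entity extraction, the stop-word list is a tiny hypothetical one, and the messages are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; a real one would be longer)
STOP_WORDS = {'the', 'a', 'at', 'of', 'in', 'and', 'to', 'is', 'our', 'new'}


def word_frequencies(messages):
    """Count lowercase words across messages, skipping stop words."""
    counts = Counter()
    for text in messages:
        # Match runs of letters only (accented letters included)
        words = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up messages, for illustration only
messages = [
    'New exhibition at the station',
    'The station of Leuven is open',
    'Mobility and the station',
]
freq = word_frequencies(messages)
print(freq.most_common(1))
```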

As expected, mobility, railway stations, and special exhibitions are the main centers of interest. We don't learn anything very striking here, but this is just an example. It would be more interesting to analyze the text of the user reactions to SNCB posts (but we will not do it here).

We can continue our simple page analytics by finding the most active post likers for this page. As most Facebook users actually use their real name as their Facebook username, this analysis is particularly interesting: it allows you to identify (most of the time) the real people who are advocates of your products or brand, and eventually those who have issues with your products. That is again very useful information, as it gives you a chance to engage with them, solve the problem they may have, and eventually prevent churn. Below are two figures showing the biggest fans of the SNCB page, as a distribution of the number of likes per user and as a word cloud.
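
A likes-per-user count of this kind could be obtained roughly as follows. This is a minimal sketch: the input shape (a mapping from post ID to the list of users who liked it, each with a `name` key) is an assumption, the function name `likes_per_user` is hypothetical, and the users are made up. Counting by name would merge homonyms, so a real analysis would rather key on the user ID.

```python
from collections import Counter


def likes_per_user(likes_by_post):
    """Count how many posts each user liked.

    `likes_by_post` maps a post id to the list of users who liked it,
    each user being a dict with a 'name' key (an assumed shape).
    """
    counts = Counter()
    for likers in likes_by_post.values():
        # A set ensures a user is counted at most once per post
        counts.update({user['name'] for user in likers})
    return counts


# Made-up users, for illustration only
likes_by_post = {
    'post-1': [{'name': 'Alice'}, {'name': 'Bob'}],
    'post-2': [{'name': 'Alice'}],
    'post-3': [{'name': 'Alice'}, {'name': 'Carol'}, {'name': 'Bob'}],
}
fans = likes_per_user(likes_by_post)
print(fans.most_common(2))
```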

We can perform many more analyses based on Facebook data, but I will stop here for this blog post. Another one with more complex (and interesting) analytics will be released soon.

Have you already faced similar issues?

Feel free to contact us, we'd love to talk to you.

Don't forget to like and share if you found this helpful!
