
In a previous blog post, we saw how to mine information from static web pages.  In this post, I'll explain how to do the same on dynamically generated (i.e. JavaScript-driven) web pages.  As a showcase, I will show you how to find the best land investment you can make in Belgium today... We all know many classified-advertising websites for houses and land for sale.  These sites contain a large amount of interesting information about land investment.  Unfortunately, most of them rely heavily on JavaScript to generate the web page dynamically from client information (e.g. language, type of web browser, screen size, geographical position, cached data, etc.).  If we try to mine these sites using the techniques detailed in the previous post, all we get is a small part of the JavaScript code that is executed when the page is opened in a real browser.  This makes data harvesting particularly complicated.
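
To make the problem concrete, here is a minimal sketch using only the Python standard library.  The page source below is invented for illustration (it is not taken from any real site): the results container is empty in the downloaded HTML, and the ad text only exists inside a script that a browser would execute.  A static parser therefore finds no visible text at all.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP download would return it:
# the results container is empty and a script is expected to fill it in.
RAW_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML =
        "<p class='ad'>Land of 1200 m2 for 95000 EUR</p>";
  </script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0      # > 0 while we are inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static scraper would see, without running any script."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The ad is present in the script source but not in the visible text.
print(repr(visible_text(RAW_HTML)))
```

The ad only becomes part of the page once the script has run, which is exactly what a real browser does and what a static download does not.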

Fortunately, there are useful Python libraries that solve this problem.  My favorite is Selenium, which can open a real web browser (e.g. Firefox or Chrome) and automate user behavior such as clicking on a link, filling in a form, pressing a button, etc.  Many Selenium tutorials can be found on Guru99.  All we have to do is inspect the page (as explained in the previous post) to identify the names of the elements of interest, and then tell Selenium the sequence of actions we want it to perform on the page.  Our work is made even simpler by browser plugins like Selenium IDE, which can be integrated into your Firefox browser to record (and later export) all the actions you perform on a web page.  This makes it very quick to automate repetitive behavior on the web.

Below is a small example of Selenium's capabilities.  In this demo, we simply open a Firefox web browser on the Google home page and search for the term "selenium".

import os,time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
 
#initialize the web browser
os.environ["PATH"] += ":/PathTo/geckodriver" #needed for recent firefox versions
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.implicitly_wait(10) # timeout for page to load
 
#go to google.com
driver.get("http://www.google.com")
 
#find the searchbox and put "selenium" text in it
driver.find_element_by_id("lst-ib").send_keys("selenium")
 
#wait a little bit for the text to be sent
time.sleep(1)
 
#press the search button  (this will move us to the results page)
driver.find_element_by_id("lst-ib").send_keys(Keys.RETURN)
 
#convert the results page to a beautifulSoup object and print it
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.body.get_text(" "))
 
#wait 1min before closing the window (this is just for the demo)
time.sleep(60)
driver.quit()

So, that is it for the technical aspect of this blog post; we can now move on to the logic of today's showcase.  Again, I am not going to give the specific code I used, in order to spare the servers of the classified-advertising website I used.  However, I can describe the logic of the algorithm and the results I got.  To collect data about the land market in Belgium, here are the typical actions we want to perform:

  1. Search for land for building opportunities in a Belgian city (based on a zip code)
    1. Find the search form on the page
    2. Select "land for building" as type of good
    3. Fill the zip code in the search box
    4. Press the search button
  2. Wait for the results page to load
  3. Parse the result pages
    1. Convert the page to a BeautifulSoup object (as in the previous blog post) and iterate over all the elements of interest we want to gather
    2. Search for a "next page" link and click the link if it exists
    3. Go to step 3.1. and iterate until all results pages have been downloaded
  4. Go to step 1. and iterate with another zip code
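
The steps above can be sketched as a small driver-agnostic loop.  This is only an illustration under stated assumptions, not the code used for this post: `harvest`, `run_search`, `parse_page` and `go_to_next_page` are hypothetical names, and the three hooks would have to be written for the target site (for instance by wrapping Selenium calls like the ones in the demo above, with `parse_page` building a BeautifulSoup object from `driver.page_source`).  A tiny in-memory `FakeSite` stands in for the browser so that the loop can be dry-run without touching any server.

```python
def harvest(zip_codes, run_search, parse_page, go_to_next_page):
    """Collect the records of every results page for every zip code.

    The three callables are hypothetical hooks for the target site:
      run_search(zip_code) -> fill and submit the search form (steps 1 and 2)
      parse_page()         -> list of records on the current page (step 3.1)
      go_to_next_page()    -> True if a "next page" link was found and clicked (step 3.2)
    """
    records = []
    for zip_code in zip_codes:              # step 4: iterate over zip codes
        run_search(zip_code)
        while True:                         # step 3.3: iterate over result pages
            records.extend(parse_page())
            if not go_to_next_page():
                break
    return records


class FakeSite:
    """In-memory stand-in for the browser, used only to dry-run the loop."""

    def __init__(self, pages_by_zip):
        self.pages_by_zip = pages_by_zip
        self.pages, self.index = [[]], 0

    def run_search(self, zip_code):
        # An unknown zip code yields a single empty results page.
        self.pages, self.index = self.pages_by_zip.get(zip_code, [[]]), 0

    def parse_page(self):
        return list(self.pages[self.index])

    def go_to_next_page(self):
        if self.index + 1 < len(self.pages):
            self.index += 1
            return True
        return False


site = FakeSite({"1000": [["ad A", "ad B"], ["ad C"]], "4000": [["ad D"]]})
ads = harvest(["1000", "4000", "9999"],
              site.run_search, site.parse_page, site.go_to_next_page)
print(ads)
```

Keeping the loop separate from the site-specific hooks means the pagination logic can be checked offline before pointing it at a real website.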

As you can see, the logic is rather simple and, thanks to Selenium IDE, all these actions can be recorded in a couple of minutes.  Then the only thing left to do is to integrate all this in a Python loop over the zip codes of the cities we want to analyze.  I processed all posts in all Belgian cities during the month of October and collected, for each one, the description of the good, the surface of the land, the selling price, the zip code and the town.  It took less than two hours to collect all the data, corresponding to approximately 10,000 posts.  You can find below some figures made from these data.

[Figure: count]

The number of posts collected per Belgian city zone (white indicates that no post was found).

[Figure: surf]

Average surface (in m²) of the building lands being sold in each city zone.

[Figure: price]

Average price (in euro) of the building lands being sold in each city zone.

[Figure: priceNorm]

Average price per surface (€/m²) of the building lands being sold in each city zone.

The last figure can be compared to the official figure published by the Belgian government in 2014.  It can be seen that, even though we used data from a single harvest in October 2016, the two figures are very similar and we reproduce all the trends observed in the official figure: high prices in Brussels and on the Belgian coast.  The price scale is also quite comparable.  We can also notice that the lands being sold are much larger in Wallonia than in Flanders, but that the price/m² is also much lower.
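
The four per-zone quantities shown in the figures (number of posts, average surface, average price, average price per m²) can be computed with a few lines of standard-library Python.  This is a sketch under assumptions: the record layout `(zip code, surface, price)` and the name `zone_summary` are mine, and I average the per-post price/m² ratios, which is only one possible definition (dividing the total price by the total surface would give a different number).  The three sample records are offers that appear in the results list further down.

```python
from collections import defaultdict
from statistics import mean

# Sample records: (zip code, surface in m², price in €).
posts = [
    ("1860", 1614, 215000),
    ("1860", 1386, 225000),
    ("5350", 3050, 57000),
]


def zone_summary(posts):
    """Group the posts by zip code and compute the per-zone quantities."""
    by_zone = defaultdict(list)
    for zip_code, surface, price in posts:
        by_zone[zip_code].append((surface, price))

    summary = {}
    for zip_code, rows in by_zone.items():
        summary[zip_code] = {
            "count": len(rows),
            "surface": mean(s for s, _ in rows),
            "price": mean(p for _, p in rows),
            # Assumption: average of the per-post ratios, not a ratio of sums.
            "price_per_m2": mean(p / s for s, p in rows),
        }
    return summary


for zip_code, stats in sorted(zone_summary(posts).items()):
    print(zip_code, stats)
```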

Now that we have meaningful data, we can start looking for the best investment.  To do so, we will look for lands whose price/m² is significantly lower than the average for their town.  To cope with the limited statistics we have, we will only consider towns for which we have at least 5 offers (in order to have a reasonable error on the average).  Below is the list of the 25 best investments you can make according to the average price in the town.  In the top 10, nine goods are located in Flanders, which certainly means something...

Land of  1614 m² to sell at 215000 € (133.21 €/m²) in 1860 meise (average for the town is 352.58+- 66.56 €/m²)
Land of  1445 m² to sell at 125000 € ( 86.51 €/m²) in 3950 bocholt (average for the town is 152.01+- 20.83 €/m²)
Land of 15950 m² to sell at 595000 € ( 37.30 €/m²) in 2550 kontich (average for the town is 524.20+-155.83 €/m²)
Land of  3326 m² to sell at 144000 € ( 43.30 €/m²) in 3320 hoegaarden (average for the town is 232.71+- 61.52 €/m²)
Land of  2013 m² to sell at 107000 € ( 53.15 €/m²) in 3560 lummen (average for the town is 173.30+- 39.51 €/m²)
Land of  2487 m² to sell at 175000 € ( 70.37 €/m²) in 3520 zonhoven (average for the town is 179.41+- 36.29 €/m²)
Land of  1810 m² to sell at 135000 € ( 74.59 €/m²) in 3990 peer (average for the town is 177.42+- 34.83 €/m²)
Land of  2680 m² to sell at  90000 € ( 33.58 €/m²) in 3970 bourg-leopold (average for the town is 167.06+- 46.53 €/m²)
Land of  1386 m² to sell at 225000 € (162.34 €/m²) in 1860 meise (average for the town is 352.58+- 66.56 €/m²)
Land of  3050 m² to sell at  57000 € ( 18.69 €/m²) in 5350 ohey (average for the town is 45.65+- 9.70 €/m²)
Land of 10491 m² to sell at  32000 € (  3.05 €/m²) in 6640 vaux-sur-sure (average for the town is 42.92+- 14.34 €/m²)
Land of 14201 m² to sell at  90000 € (  6.34 €/m²) in 6860 leglise (average for the town is 56.48+- 18.47 €/m²)
Land of  7000 m² to sell at 165000 € ( 23.57 €/m²) in 1370 jodoigne-souveraine (average for the town is 85.28+- 23.38 €/m²)
Land of  2156 m² to sell at 312000 € (144.71 €/m²) in 2870 breendonk (average for the town is 269.68+- 47.35 €/m²)
Land of  1521 m² to sell at 128200 € ( 84.29 €/m²) in 3670 meeuwen-gruitrode (average for the town is 228.11+- 54.78 €/m²)
Land of  6858 m² to sell at 399000 € ( 58.18 €/m²) in 2310 rijkevorsel (average for the town is 250.75+- 74.73 €/m²)
Land of  6000 m² to sell at 280000 € ( 46.67 €/m²) in 1570 gammerages (average for the town is 221.27+- 68.20 €/m²)
Land of  1858 m² to sell at 275000 € (148.01 €/m²) in 1780 wemmel (average for the town is 423.63+-110.28 €/m²)
Land of 12370 m² to sell at 149000 € ( 12.05 €/m²) in 6470 sivry-rance (average for the town is 37.70+- 10.32 €/m²)
Land of  1371 m² to sell at 125000 € ( 91.17 €/m²) in 3990 peer (average for the town is 177.42+- 34.83 €/m²)
Land of 12000 m² to sell at 100000 € (  8.33 €/m²) in 5377 somme-leuze (average for the town is 46.75+- 15.63 €/m²)
Land of 13860 m² to sell at  35000 € (  2.53 €/m²) in 4190 ferrieres (average for the town is 45.60+- 17.78 €/m²)
Land of   770 m² to sell at  30000 € ( 38.96 €/m²) in 2235 hulshout (average for the town is 227.01+- 78.15 €/m²)
Land of  7700 m² to sell at  84000 € ( 10.91 €/m²) in 6670 gouvy (average for the town is 39.97+- 12.15 €/m²)
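
The ordering of the list above appears consistent with ranking each offer by how far its price/m² lies below the town average, in units of the quoted uncertainty: (352.58 − 133.21) / 66.56 ≈ 3.30 for the first line, (152.01 − 86.51) / 20.83 ≈ 3.14 for the second, and so on downwards.  Here is a minimal sketch of such a ranking.  The name `rank_bargains` and the record layout are hypothetical, and using the standard error of the mean as the uncertainty is an assumption on my side, since the way the "+-" value was obtained is not stated here.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

MIN_OFFERS = 5  # towns with fewer offers are ignored, as stated in the text


def rank_bargains(posts, min_offers=MIN_OFFERS):
    """Rank offers by (town average - offer price/m²) / uncertainty.

    posts: iterable of (zip_code, surface_m2, price_eur).
    Returns tuples (score, zip_code, surface, price, unit_price, avg, err),
    best bargain first.
    """
    by_town = defaultdict(list)
    for zip_code, surface, price in posts:
        by_town[zip_code].append((surface, price, price / surface))

    ranked = []
    for zip_code, rows in by_town.items():
        if len(rows) < min_offers:
            continue
        unit_prices = [unit for _, _, unit in rows]
        avg = mean(unit_prices)
        # Assumption: the uncertainty is the standard error of the mean.
        err = stdev(unit_prices) / sqrt(len(unit_prices))
        if err == 0:
            continue  # all offers identical: no meaningful ranking
        for surface, price, unit in rows:
            score = (avg - unit) / err
            ranked.append((score, zip_code, surface, price, unit, avg, err))

    ranked.sort(reverse=True)
    return ranked


# Invented sample: town "A" has five offers, town "B" only four (ignored).
sample = [("A", 1000, 100000)] * 4 + [("A", 1000, 50000)] \
       + [("B", 500, 10000)] * 3 + [("B", 500, 1000)]

for score, zip_code, surface, price, unit, avg, err in rank_bargains(sample)[:3]:
    print(f"Land of {surface} m² at {price} € ({unit:.2f} €/m²) in {zip_code} "
          f"(average {avg:.2f}+-{err:.2f} €/m², score {score:.2f})")
```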

Have you already faced similar issues?  Feel free to contact us, we'd love to talk to you…
