A few weeks ago I was given the task of reading values from an e-commerce website.
The idea was simple: given a link, the application should parse the HTML content, extract the specific value, and store it.
I decided to use a crawler instead, and started looking for open-source Java solutions that I could get running quickly.
I finally came across crawler4j, which proved to be simple yet very efficient right away!
So, below I show the implementation that fits my needs: simply store all available links within a given domain, filtering out the extensions that are not of interest (e.g. images, videos, stylesheets).
For the implementation, I used one class (ProductCrawler) to define the crawler’s behaviour, and another (CrawlerControl) for… you guessed it.
The code is well commented (good practice!), so there is not much need for further explanation.
This web crawler is a producer of product links (it was developed for an e-commerce site). It writes the links to a global singleton, pl.
A further improvement could be to check whether the current page actually contains the target content before adding it to the list.
However, this check could be CPU-intensive, so the current implementation adds every link, except those filtered out by extension, to the product list to be evaluated afterwards (or in parallel) outside the crawler implementation.
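Roughly, the ProductCrawler looks like this. Keep in mind this is a sketch, not the exact original code: the target domain, the extension filter, and the ProductLinks class standing in for the pl singleton are placeholders of my own.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class ProductCrawler extends WebCrawler {

    // Extensions we are not interested in (images, videos, stylesheets, etc.)
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpg|jpeg|png|svg|ico|mp3|mp4|avi|pdf|zip|gz))$");

    // Only follow links inside the target domain (placeholder URL).
    private static final String DOMAIN = "https://www.example-shop.com/";

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Skip filtered extensions and anything outside the domain.
        return !FILTERS.matcher(href).matches() && href.startsWith(DOMAIN);
    }

    @Override
    public void visit(Page page) {
        // Every visited page is treated as a potential product page:
        // its URL goes to the shared product-link holder ("pl" in the text)
        // to be evaluated later, outside the crawler.
        ProductLinks.getInstance().add(page.getWebURL().getURL());
    }
}

// Minimal stand-in for the global singleton "pl" mentioned above.
class ProductLinks {
    private static final ProductLinks INSTANCE = new ProductLinks();
    private final Set<String> links = ConcurrentHashMap.newKeySet();

    private ProductLinks() { }

    public static ProductLinks getInstance() { return INSTANCE; }

    public void add(String url) { links.add(url); }

    public Set<String> getLinks() { return links; }
}
```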
The crawler controller is simple:
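Something along these lines, following the standard crawler4j setup (the storage folder, seed URL and number of threads are placeholder values):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerControl {

    public void start() throws Exception {
        CrawlConfig config = new CrawlConfig();
        // Folder where crawler4j stores its intermediate crawl data (placeholder path).
        config.setCrawlStorageFolder("/tmp/crawler4j");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed URL: the domain whose product links we want to collect (placeholder).
        controller.addSeed("https://www.example-shop.com/");

        // Blocking call: runs the crawl with a handful of concurrent crawler threads.
        controller.start(ProductCrawler.class, 4);
    }
}
```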
Then, in the main file, it is invoked as simply as this:
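For example, assuming the CrawlerControl and ProductLinks classes sketched above:

```java
public class Main {
    public static void main(String[] args) throws Exception {
        // Kick off the crawl; this blocks until the crawler threads finish.
        new CrawlerControl().start();

        // Afterwards, the collected links can be processed (here just printed).
        ProductLinks.getInstance().getLinks().forEach(System.out::println);
    }
}
```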
Obviously, this is the shortest path to getting your crawler running, since you can simply ignore the whole theory behind it.
Some properties can dramatically change your results, especially in the long run, if you need to keep your crawler running indefinitely and updating content that has already been tracked. For instance, there are different algorithms for revisit policy, parallelism, politeness and so on, but there is no general solution that fits all cases. Every requirement should be studied carefully for optimal efficiency.
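As a rough illustration, crawler4j's CrawlConfig exposes some of these knobs; the values below are arbitrary examples, not recommendations:

```java
CrawlConfig config = new CrawlConfig();
config.setPolitenessDelay(1000);    // wait 1 s between requests to the same host
config.setMaxDepthOfCrawling(-1);   // no depth limit
config.setMaxPagesToFetch(-1);      // no limit on the number of fetched pages
config.setResumableCrawling(true);  // persist the frontier so the crawl can be resumed later
```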
The code above is available on my GitHub repository.