This post provides a simple recipe for extracting the contents of a web page from a given URL.
The methods are static because they are basic utilities, which also makes them easy to reuse.
The method below receives the URL of the desired page, opens a connection, downloads the page contents and decodes the HTML special characters (entities) they contain.
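Here I’m using a StringBuilder to accumulate the content, because concatenating Strings in a loop creates a new String on every iteration and can cause performance issues on large pages.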
StringEscapeUtils.unescapeHtml4(...) is a ready-to-use method from Apache Commons that decodes HTML special characters, for example turning &amp;amp; into &amp;.
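A minimal sketch of what such a download method could look like follows. The class name UrlDownloadSketch, the method name download and the unescaper parameter are assumptions made for illustration, not the original listing. To keep the sketch free of third-party dependencies, the decoding step is passed in as a function, and the page is assumed to be UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

class UrlDownloadSketch {

    // Hypothetical helper: downloads the resource at urlStr and decodes
    // HTML entities with the supplied function.
    static String download(String urlStr, UnaryOperator<String> unescaper) throws IOException {
        // Throws IllegalArgumentException or MalformedURLException for invalid URLs
        URLConnection connection = URI.create(urlStr).toURL().openConnection();

        StringBuilder content = new StringBuilder();
        // Assumption: the page is UTF-8 encoded
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                content.append(line).append('\n');
            }
        }
        // Decode the entities once, on the complete text
        return unescaper.apply(content.toString());
    }
}
```

With Apache Commons Text on the classpath, the call would presumably look like download("http://example.com", StringEscapeUtils::unescapeHtml4).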
I wrap the method above in getURLContent(String urlStr), which additionally prepends the http:// protocol when the URL doesn’t provide one.
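The protocol check could be sketched roughly as follows. The class UrlProtocolSketch and the helper withHttpIfMissing are hypothetical names, and the rule used here (treat anything that starts with a scheme followed by :// as already having a protocol) is my assumption about the intended behaviour. getURLContent would call such a helper before delegating to the download method.

```java
import java.util.regex.Pattern;

class UrlProtocolSketch {

    // Matches a leading scheme such as "http://", "https://" or "file://"
    private static final Pattern HAS_SCHEME =
            Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*://");

    // Hypothetical helper: prepends "http://" only when no protocol is present
    static String withHttpIfMissing(String urlStr) {
        String trimmed = urlStr.trim();
        return HAS_SCHEME.matcher(trimmed).find() ? trimmed : "http://" + trimmed;
    }
}
```

Checking for any scheme, rather than only for http://, avoids turning an https:// address into http://https://.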
Voilà! The URL content can now be easily extracted into a String using getURLContent(String urlStr).
Remember to handle the exceptions in the way that best fits your business requirements!
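As one possible approach, and only as an illustration, the sketch below turns every failure into an empty Optional and logs the cause. The names UrlErrorHandlingSketch and fetchOrEmpty are hypothetical, the content is assumed to be UTF-8, and whether swallowing the error like this is acceptable depends entirely on your application.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class UrlErrorHandlingSketch {

    // Hypothetical helper: returns the content, or Optional.empty() on any failure
    static Optional<String> fetchOrEmpty(String urlStr) {
        try (InputStream in = URI.create(urlStr).toURL().openStream()) {
            // Assumption: the content is UTF-8 encoded
            return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IllegalArgumentException | MalformedURLException e) {
            // The string is not a valid absolute URL
            System.err.println("Invalid URL: " + urlStr);
            return Optional.empty();
        } catch (IOException e) {
            // The connection failed or the resource could not be read
            System.err.println("Could not read " + urlStr + ": " + e.getMessage());
            return Optional.empty();
        }
    }
}
```

Separating the invalid-URL case from the I/O case makes it easy to react differently to each, for example by rejecting bad input immediately but retrying a failed connection.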