Downloading Webpages and HTML

In this tutorial we write a simple Java program that connects to a web server and downloads the webpages and documents you are interested in. With the help of an example program, you will see how to use Java's URL and URLConnection classes from the java.net package to download content from Internet servers. Many applications make use of client-side network programming in Java: web spiders and crawlers, robots that check and verify the links on a website, web browsers, search engines, tools that download a complete website, or a simple tool that downloads webpages and parses them for further data analysis. The tutorial assumes basic familiarity with Java, object-oriented programming and the Internet, and requires a Java SDK to compile and run the program.

Uniform Resource Locator (URL)
A Uniform Resource Locator (URL) is like an address: it allows any page on the World Wide Web to be uniquely identified, and it is the standard identifier for a document on the Internet. Most of us have come across the term while browsing websites. The following is an example of a URL that addresses the home page of the Javacoding.net website.

URL of javacoding.net home page: http://www.javacoding.net:80/index.php

In the example, http is the protocol identifier, www.javacoding.net is the hostname, 80 is the port number and index.php is the filename, that is, the pathname of the file on the machine. Optionally, a URL can end with a reference to a named anchor, which identifies a specific location within the resource. For example, clicking on the URL http://www.hostname.com:port/filename.html#section1 takes you to the location marked section1 in the document filename.html. Specifying a port number is also optional. If no port is specified, the default port for the protocol is used; for http the default port is 80.
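The optional parts described above can be inspected with the URL class that the next section introduces. The following is a minimal sketch, not part of the original program: the class name UrlParts and the helper effectivePort are made up for illustration, and the URL is the placeholder from the paragraph above with the port left out.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {
    // Hypothetical helper: the port a connection would use, i.e. the explicit
    // port if the URL names one, otherwise the default port of the protocol.
    static int effectivePort(URL url) {
        int port = url.getPort();              // -1 when the URL names no port
        return port != -1 ? port : url.getDefaultPort();
    }

    public static void main(String[] args) throws MalformedURLException {
        // No port given, but a named anchor (#section1) is present.
        URL url = new URL("http://www.hostname.com/filename.html#section1");
        System.out.println("port      = " + url.getPort());         // -1
        System.out.println("default   = " + url.getDefaultPort());  // 80
        System.out.println("effective = " + effectivePort(url));    // 80
        System.out.println("filename  = " + url.getFile());         // /filename.html
        System.out.println("ref       = " + url.getRef());          // section1
    }
}
```

Note that getPort() reports -1 rather than 80 when the port is omitted, so code that needs the real port number has to fall back to getDefaultPort() as the helper does.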

The java.net.URL Class
Java encapsulates the concept of a URL in the class URL, which belongs to the java.net package. A java.net.URL object represents a URL string that follows this pattern:

URL string: protocol://host:port/filepath#ref

The following code shows an example use of the URL class:

try {
    // Parse the URL string; no network connection is made here
    URL javacodingURL = new URL("http://www.javacoding.net:80/index.php");
    System.out.println("protocol = " + javacodingURL.getProtocol());
    System.out.println("host = " + javacodingURL.getHost());
    System.out.println("filename = " + javacodingURL.getFile());
    System.out.println("port = " + javacodingURL.getPort());
} catch (MalformedURLException e) {
    // Malformed URL
    System.out.println("Error in given URL");
    return;
}

Note that the URL class deals with the details of the URL without actually opening a connection to it. Creating a URL object starts no network communication; the constructor only parses its string argument. A MalformedURLException is thrown if the string names no protocol, names an unknown protocol, or cannot be parsed.
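To make the parsing behaviour concrete, here is a small sketch under the same assumptions as the snippet above. The class name UrlCheck and the helper isWellFormed are hypothetical, and host.invalid is a placeholder host that is not expected to exist.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCheck {
    // Hypothetical helper: true if the URL constructor accepts the string.
    // Only parsing happens here; the host is never contacted.
    static boolean isWellFormed(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://www.javacoding.net:80/index.php", // accepted
            "http://host.invalid/",                   // accepted: the host is not looked up
            "www.javacoding.net/index.php",           // rejected: no protocol
            "foo://www.javacoding.net/"               // rejected: unknown protocol
        };
        for (String spec : candidates) {
            System.out.println(spec + " -> " + isWellFormed(spec));
        }
    }
}
```

A string that passes this check is only syntactically acceptable. Whether the host exists and the document can be fetched is not known until a connection is opened.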

The java.net.URLConnection Class
After creating the URL object, the next step is to open a connection, which is done with the URLConnection class. The URLConnection class is abstract and therefore cannot be instantiated directly. The way to get a URLConnection object is to invoke the openConnection() method on a URL object, which returns an instance of a subclass of URLConnection.
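Which subclass comes back depends on the protocol of the URL. The sketch below, with the made-up class name ConnectionKind and helper isHttp, shows that an http URL yields an HttpURLConnection. Calling openConnection() alone does not contact the server, so the sketch runs without network access.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

class ConnectionKind {
    // Hypothetical helper: true if the connection object is an HTTP connection.
    static boolean isHttp(URLConnection connection) {
        return connection instanceof HttpURLConnection;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.javacoding.net:80/index.php");
        // Creates the connection object; the server is not contacted yet.
        URLConnection connection = url.openConnection();
        System.out.println("concrete class = " + connection.getClass().getName());

        if (isHttp(connection)) {
            // The HTTP subclass exposes protocol-specific settings.
            HttpURLConnection http = (HttpURLConnection) connection;
            System.out.println("request method = " + http.getRequestMethod()); // GET
        }
    }
}
```

The actual network exchange starts later, for example when getInputStream() is called on the connection.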

The following sample code shows how this works in practice.

// After creating the URL object, open the connection
try {
    URLConnection connection = javacodingURL.openConnection();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(connection.getInputStream()))) {
        String line;
        while ((line = br.readLine()) != null)
            System.out.println(line);
    }
} catch (UnknownHostException e) {
    System.out.println("Unknown Host");
    return;
} catch (IOException e) {
    System.out.println("Error in opening URLConnection");
    return;
}

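After opening the connection, we can read from it through the InputStream returned by getInputStream(). Writing is also possible through the OutputStream returned by getOutputStream(), provided output has been enabled with setDoOutput(true).

The while loop reads and displays the HTML file one line at a time.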

The purpose of this tutorial was to write a simple Java application that uses URL objects to download data from web servers. The code written for this tutorial reads the document a URL points to and prints its contents to standard output with System.out.println(). We could just as easily write the contents to a file by opening another stream, such as a FileOutputStream.
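The file-writing variant mentioned above could look roughly like the following sketch. It is not part of the original program: the class name SaveToFile, the method download and the command-line arguments are assumptions made for illustration. It copies raw bytes rather than lines of text, so the document is stored exactly as the server sent it.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

class SaveToFile {
    // Hypothetical helper: copies the document at 'source' into 'targetFile'
    // and returns the number of bytes written.
    static long download(URL source, String targetFile) throws IOException {
        try (InputStream in = source.openConnection().getInputStream();
             OutputStream out = new FileOutputStream(targetFile)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java SaveToFile <url> <file>");
            return;
        }
        long bytes = download(new URL(args[0]), args[1]);
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

It could be run, for example, as: java SaveToFile http://www.javacoding.net:80/index.php index.html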

The complete program listing is:

import java.io.*;
import java.net.*;

public class DownloadHTMLFile {
    public static void main(String[] args) {
        URL javacodingURL = null;
        try {
            // Parse the URL string; no network connection is made here
            javacodingURL = new URL("http://www.javacoding.net:80/index.php");
            System.out.println("protocol = " + javacodingURL.getProtocol());
            System.out.println("host = " + javacodingURL.getHost());
            System.out.println("filename = " + javacodingURL.getFile());
            System.out.println("port = " + javacodingURL.getPort());
        } catch (MalformedURLException e) {
            // Malformed URL
            System.out.println("Error in given URL");
            return;
        }

        try {
            URLConnection connection = javacodingURL.openConnection();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()))) {
                String line;
                // Read and print the document one line at a time
                while ((line = br.readLine()) != null)
                    System.out.println(line);
            }
        } catch (UnknownHostException e) {
            System.out.println("Unknown Host");
            return;
        } catch (IOException e) {
            System.out.println("Error in opening URLConnection, Reading or Writing");
            return;
        }
    }
}