A Balanced Introduction to Computer Science and Programming

David Reed
Creighton University

Copyright © 2004 by Prentice Hall



Chapter 3: The Internet and the Web


It shouldn't be too much of a surprise that the Internet has evolved into a force strong enough to reflect the greatest hopes and fears of those who use it. After all, it was designed to withstand nuclear war, not just the puny huffs and puffs of politicians and religious fanatics.

Denise Caruso

The 'Net is a waste of time, and that's exactly what's right about it.

William Gibson


Looking at the proliferation of personal web pages on the net, it looks like very soon everyone on earth will have 15 Megabytes of fame.

M.G. Siriam


The Internet is a vast network of computers, connecting people from across the world. Through the Internet, a person in Indiana can communicate freely with a person in India, scientists at different universities can share computing power, and researchers can access libraries' worth of information regardless of physical location. The World Wide Web provides a simple, intuitive interface for users to share and access information, and has even become an essential medium for advertising and commerce.

Chapter 1 provided an overview of the Internet and the Web from a user's perspective. It emphasized the differences between the Internet and the Web, noting that the Internet is hardware (computers, cable, wires, …) while the Web is software (documents, images, sound clips, …) that is stored and accessed over the Internet. This chapter provides more details as to how the Internet and Web developed and how they work.


History of the Internet

[Photo: J.C.R. Licklider]
The Internet traces its roots back to the early 1960's. While a professor at the Massachusetts Institute of Technology, J.C.R. Licklider (1915-1990) published a series of articles describing a "Galactic Network" of computers that would allow people to share and access information worldwide. When Licklider became head of the computer research program at the U.S. Department of Defense's Advanced Research Projects Agency (ARPA) in 1962, he pushed for the realization of his vision. With ARPA funding, researchers at various institutions worked to design the technology that would allow computers to communicate effectively over long distances.

[Photo: Larry Roberts]
Larry Roberts (1937-) headed the ARPANet project, designing the network that would eventually evolve into the Internet of today. The ARPANet became operational in 1969, connecting four computers at the University of California at Los Angeles (UCLA), the University of California at Santa Barbara (UCSB), the Stanford Research Institute (SRI), and the University of Utah. The network utilized dedicated cables and allowed researchers to transfer information at a rate of 56 Kbits/sec. Interestingly enough, this is roughly the same speed obtainable using a modem and standard phone lines today; at the time, however, it represented a dramatic increase over the 160 bits/sec transfer rate then obtainable over phone lines.

The original intent of the ARPANet was to connect military installations and universities that participated in government projects. By 1981, more than 200 computers were connected to the ARPANet, allowing researchers to share information and computer resources. Driven by the popularity of network applications such as electronic mail, newsgroups, and remote computer logins, the number of computers on the network exceeded 10,000 by 1987. Looking toward future growth, the National Science Foundation (NSF) funded high-speed transmission lines that would form the backbone of the expanding network.

The term "Internet" was coined in recognition of the similarities between the computer network and the interstate highway system. The backbone connections were analogous to interstate highways, providing fast communications between major destinations. Connected to the backbone were slower transmission lines with more limited capacities, serving secondary destinations much as state highways do. Additional levels were required before reaching individual computers, similar to the city and neighborhood roads required to reach individual houses.

Control of the Internet was transferred to the private sector in the early 1990's. The physical components of the Internet are now managed by commercial firms such as MCI WorldCom, which built the very high-speed Backbone Network System (vBNS) in 1995 to replace the existing backbone connections. The workings of the Internet are overseen by a non-profit organization, the Internet Society, whose committees rely largely on volunteers to design the technologies and protocols that define the Internet.

The table below documents the growth of the Internet over the past two decades, as estimated by the Internet Software Consortium. These numbers indicate exponential growth, with the size of the Internet doubling every one to two years (the short calculation following the table makes this concrete). An interesting consequence of exponential growth is that, at any point in time, roughly half of the computers on the Internet were added within the previous one to two years. Of course, exponential growth cannot continue indefinitely. To date, however, technological advances have accommodated the increasing demands, and no immediate end to this growth is foreseen.


Year      Computers on the Internet
2002      162,128,493
2000       93,047,785
1998       36,739,000
1996       12,881,000
1994        3,212,000
1992          992,000
1990          313,000
1988           56,000
1986            5,089
1984            1,024
1982              235
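
To make the doubling claim concrete, here is a minimal Python sketch (Python is used for illustration only; it is not part of this text's curriculum) that computes the growth factor and the implied doubling time between consecutive rows of the table above.

    import math

    # Host counts from the table above (Internet Software Consortium).
    hosts = {1982: 235, 1984: 1024, 1986: 5089, 1988: 56000,
             1990: 313000, 1992: 992000, 1994: 3212000,
             1996: 12881000, 1998: 36739000, 2000: 93047785,
             2002: 162128493}

    years = sorted(hosts)
    for y0, y1 in zip(years, years[1:]):
        factor = hosts[y1] / hosts[y0]
        # doubling time = interval length / log2(growth over the interval)
        doubling = (y1 - y0) / math.log2(factor)
        print(f"{y0}-{y1}: grew {factor:.1f}x, doubled every {doubling:.1f} years")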


How the Internet works

[Photo: Paul Baran]
The design of the ARPANet, and hence of the Internet today, was strongly influenced by the ideas of Paul Baran (1926-), a researcher at the Rand Corporation in the 1960's. Baran's first design idea to be adopted for the ARPANet was that of a distributed network. Recall that funding for the ARPANet was provided by the U.S. Department of Defense, which had very specific requirements in mind. At the height of the Cold War, the military wanted a national communications network that would be resistant to attack: a design that allowed communication to continue even if parts of the network were damaged or destroyed, whether by enemy action or by ordinary machine failures. Clearly, a centralized network that relied on a small number of master computers to coordinate transmissions would not suffice. For example, the U.S. telephone network of the time relied on central hubs, or switches, that routed service to entire regions; if a hub failed for some reason, entire cities or regions could lose service. Baran proposed a different architecture for a computer network, one in which control was distributed across a large number of machines. His design utilized a lattice structure, with each computer connected to several others. If a neighboring computer or connection failed, communications could be routed around that portion of the network along an alternate path.

[Figure: network architectures]
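
The fault tolerance of Baran's lattice design can be demonstrated with a short sketch. The Python code below models a hypothetical four-machine network (the machine names and connections are invented for illustration) and finds a route between two machines using breadth-first search; when a machine fails, the same search simply discovers an alternate path.

    from collections import deque

    # A hypothetical lattice: each machine is connected to several
    # neighbors, so no single machine acts as a central hub.
    network = {"A": ["B", "C"], "B": ["A", "D"],
               "C": ["A", "D"], "D": ["B", "C"]}

    def find_route(source, dest, failed=()):
        """Breadth-first search for a path, skipping failed machines."""
        queue = deque([[source]])
        visited = {source}
        while queue:
            path = queue.popleft()
            if path[-1] == dest:
                return path
            for neighbor in network[path[-1]]:
                if neighbor not in visited and neighbor not in failed:
                    visited.add(neighbor)
                    queue.append(path + [neighbor])
        return None  # no surviving route exists

    print(find_route("A", "D"))                # ['A', 'B', 'D']
    print(find_route("A", "D", failed={"B"}))  # reroutes: ['A', 'C', 'D']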

The other idea proposed by Baran (among others) that was central to the ARPANet architecture was packet-switching. In a packet-switching network, each message to be sent over the network is first broken into small pieces known as packets, and those packets are transmitted independently to their final destination. There are three main advantages to transmitting messages this way. First, it tends to make more effective use of the connections. Data communications tend to occur in short bursts, so if large messages could monopolize a connection, many smaller messages might be forced to wait. As a real-life example of this effect, think of a line at a busy pay phone: if the person currently on the phone has a long conversation, then everyone else in line, even those who only need to make short calls, must wait. If calls were limited in length, say to three minutes each, then each person in line would be guaranteed a turn in a reasonable amount of time. Similarly, limiting the size of the pieces transmitted over the network allows many users to share the connections fairly. The second advantage is related to the distributed nature of the network: since there are numerous paths a message might take to its destination, parts of the message can be sent along different routes, say if a portion of the network fails or becomes overloaded during transmission. The third advantage is improved reliability. If a message is broken into packets that are transmitted independently, there is a high likelihood that at least part of the message will arrive at its destination, even allowing for some failures within the network. A recipient that receives only part of the message can then request that the sender retransmit the missing packets.

[Figure: packet-switching]
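
As a rough sketch of the first step of packet-switching, the following code breaks a message into small, numbered packets. The 10-character packet size is artificially small for readability; real networks use far larger packets.

    # A minimal sketch of packetizing: break a message into small,
    # numbered pieces that can be sent (and routed) independently.
    def packetize(message, size=10):
        pieces = [message[i:i+size] for i in range(0, len(message), size)]
        total = len(pieces)
        # label each packet with its sequence, e.g. (2, 5, "...") = 2 of 5
        return [(num + 1, total, data) for num, data in enumerate(pieces)]

    for packet in packetize("Packet-switching shares the network fairly."):
        print(packet)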

While the terms distributed and packet-switching describe the architecture of the Internet, they do not address how computers connected to the Internet are able to communicate effectively. After all, people from around the world speak different languages and have different customs. If the average American can't speak or understand Russian (and vice versa), how can we expect a computer in Nebraska to communicate with a computer in Moscow? The solution to this problem is to agree upon protocols: sets of rules that describe how communication is to take place. As a real-world example, consider the postal system (jokingly referred to as snail-mail by some in the electronic community). If every state or country used its own system for labeling mail, sending a letter would be an onerous if not impossible task. Fortunately, protocols have been established for uniquely specifying addresses, including zip codes or country codes, that allow letters to be sent easily across the country or around the world. Similarly, protocols were established for the Internet that define how computers are to be addressed and the form in which messages must be labeled for delivery.

Similar to the manner in which houses are assigned addresses to uniquely identify them, computers on the Internet are assigned unique addresses known as IP addresses. An IP address is a number, usually written as a dotted sequence such as 147.134.2.20. When a new computer is to be connected to the Internet, it must be assigned an IP address through a local organization or Internet Service Provider (ISP). Once the computer has its IP address and is physically connected to the network, it can send and receive messages and access other Internet services. The manner in which messages are sent and received over the Internet is determined by a pair of protocols called the Transmission Control Protocol (TCP) and Internet Protocol (IP). TCP is concerned with the way that messages are broken down into packets, then reassembled by the recipient. IP is concerned with labeling the packets for delivery and controlling the path that they take to their destination. The combination of these two protocols, written as TCP/IP, is often referred to as the language of the Internet. Any computer that is able to "speak" the language defined by TCP/IP will be able to communicate with any other computer on the Internet.

When a person wants to send a message over the Internet, software using the rules spelled out by TCP will break that message into packets (no bigger than 1,500 characters each) and label the packets as to their sequence (e.g., packet 2 of 5). Software following rules spelled out by IP will label those packets with routing information, including the IP addresses of the source and destination computers. Once labeled, the packets are sent independently over the Internet. Special purpose machines called routers receive the packets, access the routing information, and pass them on towards their destination. The routers utilize various information sources, including statistics on traffic patterns, to determine the best direction for each packet to follow. As an analogy, consider driving a car to a familiar destination. You most likely have a standard route that you take, but may adjust that route if you see heavy traffic ahead or know of road closings due to construction. In a similar way, routers are able to adjust to congestion or machine failures and send each individual packet in the best direction at that time. When the packets arrive at their destination, possibly out of order due to the various routes taken, TCP software running on the recipient's computer reassembles the packets to obtain the original message.

[Figure: TCP/IP protocols]
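
The receiving side can be sketched as well. Continuing the toy example above, the code below plays the role of TCP on the recipient's computer: it sorts packets that arrived out of order and rejoins them, flagging the need for retransmission if any packet is missing. (Real TCP is far more involved; this only illustrates the reassembly idea.)

    # A minimal sketch of TCP-style reassembly: order the received
    # packets by sequence number and rejoin the data.
    def reassemble(packets):
        packets = sorted(packets, key=lambda p: p[0])  # order by seq number
        total = packets[0][1]
        if len(packets) != total:
            raise ValueError("missing packets -- request retransmission")
        return "".join(data for _, _, data in packets)

    # packets may arrive out of order due to the routes taken
    received = [(2, 3, "e quick br"), (3, 3, "own fox"), (1, 3, "Th")]
    print(reassemble(received))  # "The quick brown fox"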

From a user's perspective, remembering the digits that make up IP addresses can be tedious and error-prone; mistyping a single digit might identify a computer halfway around the world. Fortunately, the Internet allows individual machines to be assigned names that can be used in place of IP addresses. For example, the computer with IP address 147.134.2.20 can be referred to by the name bluejay.creighton.edu. Such names, commonly referred to as domain names, are hierarchical in nature to make them easier to remember. The leftmost part of the name specifies the name of the machine, with subsequent parts specifying the organization (and possibly sub-organizations) to which that computer belongs. The rightmost part of the domain name is known as the top-level domain, which identifies the type of organization involved. For example, the computer bluejay.creighton.edu is named bluejay and belongs to Creighton University, an educational institution. Similarly, www.sales.acme.com is a computer named www, belonging to the sales department of a fictional Acme Corporation, a commercial business.
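
Because a domain name's hierarchy is encoded by its dots, a few lines of code suffice to pick it apart. This sketch simply splits the name and labels the pieces.

    # A minimal sketch: pick apart the hierarchy of a domain name.
    def describe(domain):
        machine, *organization, top_level = domain.split(".")
        print(f"machine:          {machine}")
        print(f"organization(s):  {'.'.join(organization)}")
        print(f"top-level domain: {top_level}")

    describe("bluejay.creighton.edu")
    describe("www.sales.acme.com")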

Examples of common top-level domains are listed in the table below. In addition, countries have their own top-level domain, such as ca (Canada), uk (United Kingdom), br (Brazil) and in (India).


edu    U.S. educational institutions
com    commercial organizations
org    non-profit organizations
mil    U.S. military
gov    U.S. government
net    network providers & businesses

While domain names are easier for users to remember, any communication over the Internet requires the IP addresses of the source and destination computers. Mappings between domain names and IP addresses are stored on special-purpose computers called domain name servers (DNS). When a message is to be sent to a destination such as bluejay.creighton.edu, a request is first sent to a domain name server to map the domain name to an IP address. The domain name server looks up the domain name in a table and sends the corresponding IP address (here, 147.134.2.20) back to the sender's computer so that the message can be sent to the correct destination. If a particular domain name server does not have the requested domain name stored locally, it forwards the request to other domain name servers on the Internet until the correct mapping is found.
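
In practice, this name-to-address lookup is built into standard networking libraries. For example, Python's socket module will query a domain name server on the program's behalf. A minimal sketch (it requires a live Internet connection; www.example.com is used here because the 2004-era bluejay.creighton.edu may no longer resolve):

    import socket

    # Ask the domain name system to map a name to an IP address.
    print(socket.gethostbyname("www.example.com"))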


History of the Web

[Photo: Tim Berners-Lee]
While the Internet was popular in the 1980's among universities and government organizations, its mainstream popularity is attributable to the development of the World Wide Web in the early 1990's. The Web, a multimedia environment in which documents can be seamlessly linked over the Internet, was the brainchild of Tim Berners-Lee (1955-). In the 1980's, Berners-Lee was a researcher at the European Laboratory for Particle Physics (CERN). Since CERN researchers were located all across Europe and utilized different types of computers and software, sharing information was difficult. Berners-Lee envisioned a system in which researchers could share information regardless of their location and the type of computer they used. In 1989, he proposed the basic idea for the Web: documents stored on local machines but linked together for easy access.

The idea of linking documents so that they can be accessed easily and in flexible ways was not new with Berners-Lee. Hypertext, documents whose text and media are cross-linked to one another, has been around for centuries in the form of books with alternate story lines, e.g., "If the knight defeats the dragon, continue on page 37. If not, continue on page 44." In 1945, Presidential science advisor Vannevar Bush outlined ideas for a machine that would store textual and graphical information in such a way that any piece of information could be arbitrarily linked to any other piece. Small-scale hypertext systems were developed for computers starting in the 1960's, culminating in the popular HyperCard system that shipped with Apple Macintosh computers in the late 1980's. Berners-Lee's innovation was in combining the key ideas of hypertext with the distributed nature of the Internet. His design for the Web relied on two types of software running on computers across the Internet. A Web server is a computer that stores documents and "serves" them to other computers that want access. A Web browser is a piece of software that runs on an individual's computer and allows the user to request and view documents stored on servers. A person running a Web browser can access and jump between documents, regardless of the location of the servers storing them.

[Photo: Marc Andreessen]
In 1990, Berners-Lee produced working prototypes of a Web server and browser. His browser was limited by today's standards, being text-based with only limited support for images and other media. This early version of the Web attracted a small but enthusiastic following when Berners-Lee made it available over the Internet in 1991. The Web might have remained an obscure research tool if not for the development of graphical Web browsers. In 1993, Marc Andreessen and Eric Bina at the University of Illinois' National Center for Supercomputing Applications (NCSA) wrote the first Web browser designed with the everyday user in mind, utilizing buttons as navigational aids and integrating images and media within pages. They called their browser Mosaic, and the response to its release was overwhelming. As more and more people learned how easy it was to store and access information using the Web, the number of Web servers on the Internet grew: from 50 in 1992 to 3,000 in 1994.

In 1994, Andreessen left NCSA to found the Netscape Communications Corporation, which marketed an extension of the Mosaic browser called Netscape Navigator. Originally, Netscape charged a small fee for its browser, although students and educators were exempt from the cost. When Microsoft introduced its Internet Explorer browser in 1995, it was released free of charge, and Netscape was eventually forced to follow suit. The late 1990's were a dynamic time for the Web, as Netscape and Microsoft battled for market share, relying on advertising and supporting services to pay for the development of free software. Despite Netscape's early dominance (75% of the browser market in 1996), Microsoft's Internet Explorer eventually prevailed, and in 1999 Netscape was bought by AOL for $10 billion in stock. While Internet Explorer and Netscape Navigator remain the most popular browsers on the market, other browsers exist, including text-based browsers, browsers for vision-impaired users, and browsers for newer devices such as cell phones and Personal Digital Assistants (PDAs).

The table below documents the growth of the World Wide Web over the past decade, as estimated by the Netcraft Web Server Survey. It is interesting to note the dramatic increases in the size of the Web following each advance in browser technology: Mosaic (1993), Netscape Navigator (1994), and Internet Explorer (1995). According to the latest estimates, roughly 1 out of every 5 computers on the Internet (20.4%) acts as a Web server; a quick check of this figure follows the table. Of course, each Web server may store a large number of pages, so the size of the Web in terms of pages is even more impressive. In 2002, the Google search engine (google.com) claimed to have more than 3 billion Web pages indexed, and other sources estimated as many as 5 billion pages and growing.


Year      Computers on the Internet    Web Servers on the Internet
2002      162,128,493                  33,082,657
2000       93,047,785                  18,169,498
1998       36,739,000                   4,279,000
1996       12,881,000                     300,000
1994        3,212,000                       3,000
1992          992,000                          50
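
The "roughly 1 out of every 5" figure quoted above follows directly from the 2002 row of this table, as a quick calculation confirms.

    # Quick check of the "roughly 1 out of every 5" claim using
    # the 2002 row of the table above.
    web_servers = 33_082_657
    internet_hosts = 162_128_493
    print(f"{web_servers / internet_hosts:.1%}")  # about 20.4%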

The future development of the Web is now guided by a non-profit organization, the World Wide Web Consortium (W3C), which produces new standards and oversees the design of new Web-based technologies. Like the Internet Society, the W3C relies mainly on volunteer labor from technically qualified and interested individuals.


How the Web works

As was the case with Internet communications, the World Wide Web relies on protocols to ensure that Web pages are accessible and viewable on any computer. As we saw in Chapters 1 and 2, the content of Web pages is defined using HTML, the HyperText Markup Language. Tags placed within a text document give its content special meaning, and part of a Web browser's job is to read those tags and format the page accordingly. For example, when a browser encounters text enclosed in <b></b> tags, it interprets those tags as specifying bold text and displays the enclosed characters in a darker font.

[Figure: Web pages]
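
To make the browser's job concrete, here is a toy sketch that "interprets" a single tag: it finds text enclosed in <b></b> and displays it in a darker font using terminal escape codes, a crude stand-in for what a real browser does graphically.

    import re

    # A toy sketch of tag interpretation: render <b>...</b> as bold
    # terminal text (real browsers do far more, of course).
    def render(html):
        BOLD, RESET = "\033[1m", "\033[0m"
        return re.sub(r"<b>(.*?)</b>", BOLD + r"\1" + RESET, html)

    print(render("HTML uses tags to make text <b>bold</b>."))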

HTML is an evolving standard, with new features proposed and adopted as technology and user needs change. The current standard for HTML, as defined by the World Wide Web Consortium, is known as XHTML 1.0, a name that recognizes its connections to the more general markup language XML. Web browsers work because they all understand and follow the HTML standard. While subtle differences may occur between browsers, all Web browsers understand the same basic set of tags and display the resulting text similarly. Thus, an author may place an HTML document on a Web server and be assured that it will be viewable by users regardless of their machines or browser software.

To a person "surfing" the Web, the process of locating, accessing, and displaying Web pages is transparent. When the person requests a particular page, either by entering its location into the Address box of the browser or by clicking a link in an existing page, the new page appears in the browser window as if by magic. In reality, complex communications take place between the computer running the browser and the appropriate Web server. When the person requests the page, the browser must first identify the Web server where that page is stored. Recall from Chapter 1 that the Web address, or URL, for a page includes the name of the server as well as the document name. Once the server name has been extracted from the URL, the browser sends a message to that server over the Internet to request the page (following the steps described above for Internet communications). The Web server receives the request, locates the page within its directories, and sends the text of that page back in a message. When that message is received, the browser interprets the HTML formatting information embedded in the page and displays it appropriately in the browser window. The protocol that determines how the messages between the browser and server are formatted is known as the HyperText Transfer Protocol (HTTP).

[Figure: Web pages]
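
The browser/server exchange described above can be reproduced in a few lines of code. This minimal sketch opens a connection to a real Web server (www.example.com, chosen as a stable public site), sends a bare-bones HTTP request, and prints the start of the reply, which contains the HTTP response headers followed by the page's HTML.

    import socket

    # A minimal sketch of HTTP: connect to a Web server on port 80,
    # request a page, and print the start of the response.
    with socket.create_connection(("www.example.com", 80)) as sock:
        sock.sendall(b"GET / HTTP/1.1\r\n"
                     b"Host: www.example.com\r\n"
                     b"Connection: close\r\n\r\n")
        reply = sock.recv(4096)
    print(reply.decode(errors="replace")[:200])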

It is interesting to note that accessing a single page might involve several rounds of communication between the browser and server. If the Web page contains embedded elements, such as images or sound clips, the browser will not recognize this until it begins displaying the page. When the browser interprets an HTML tag that specifies an embedded element, it sends a separate request to the server for that item. Thus, a page containing 10 images requires 11 interactions between the browser and server: one for the page itself and one for each of the 10 images. To avoid redundant and excessive downloading, most browsers utilize a technique called caching. When a page or image is first downloaded, it is stored in a temporary directory on the user's computer. The next time that page or image is requested, the browser first checks whether it has a copy stored locally in the cache, and if so, whether that copy is up to date (by contacting the server and asking how recently the page was changed). If an up-to-date copy is stored locally, the browser can display this copy instead of taking the time to download the original. Note that it is still necessary for the browser to contact the Web server, since the document on the server might have changed since it was last cached; however, caching still saves the time and effort of downloading a redundant copy.
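
Browsers implement the up-to-date check with a conditional request: the timestamp of the cached copy is sent along with the request, and a server that honors it replies "304 Not Modified" when nothing has changed, sparing the download. A sketch using Python's standard library (whether a given server honors the header can vary):

    import urllib.request
    import urllib.error

    # A minimal sketch of cache validation: ask for the page only if
    # it changed after the given date; a 304 reply means the cached
    # copy is still up to date.
    request = urllib.request.Request(
        "http://www.example.com/",
        headers={"If-Modified-Since": "Mon, 01 Jan 2024 00:00:00 GMT"})
    try:
        with urllib.request.urlopen(request) as response:
            print("Changed -- downloaded", len(response.read()), "bytes")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("Not modified -- use the cached copy")
        else:
            raise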



Review Questions

Answers to be emailed to your instructor by Thursday at 9 a.m.

  1. How is the Internet of today related to the ARPANet of the 1960's and 1970's?
  2. The Internet is defined by a common set of protocols for transmitting and receiving information called TCP/IP. What do TCP and IP stand for, and what is the role of each of these protocols?
  3. What is an IP address? What steps are involved in mapping a computer's domain name, e.g., bluejay.creighton.edu, to its IP address?
  4. Paul Baran proposed two ground-breaking design ideas for the structure and behavior of the ARPANet. Describe these design ideas.
  5. Which has grown at a faster rate, the Internet or the Web? Justify your answer.
  6. What is hypertext? How is the term hypertext relevant to the Web?
  7. What does HTTP stand for, and what is its role in the workings of the Web?
  8. How does caching improve the performance of your Web browser? Does caching reduce the number of interactions that must take place between the browser and the Web server?
  9. What did you find most interesting in reading this chapter?
  10. What did you find most confusing in reading this chapter?



References

Berners-Lee, Tim, Mark Fischetti (contributor), and Michael L. Dertouzos. "Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web", HarperBusiness, 2000.

Bush, Vannevar. "As We May Think." Atlantic Monthly, July 1945.
    http://www.theatlantic.com/unbound/flashbks/computer/bushf.htm

Comer, Douglas E. "The Internet Book: Everything you need to know about computer networking and how the Internet works, 3rd Edition", Prentice Hall, 2000.

Deitel, H. M., P. J. Deitel, and T. R. Nieto. "Internet and World Wide Web: How to Program", Prentice Hall, 2000.

Griffin, Scott. "Internet Pioneers", December 2000.
    http://www.ibiblio.org/pioneers/index.html

Hafner, Katie and Matthew Lyon. "Where Wizards Stay Up Late: The Origins of the Internet", Touchstone Books, 1998.

Leiner, Barry M., Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, and Stephen Wolff. "A Brief History of the Internet, version 3.31", August 2000.
    http://www.isoc.org/internet/history/brief.shtml