How Semantic Web Works

The World Wide Web is an interesting paradox -- it's made with computers but for people. The sites you visit every day use natural language, images and page layout to present information in a way that's easy for you to understand. Even though they are central to creating and maintaining the Web, the computers themselves really can't make sense of all this information. They can't read, see relationships or make decisions like you can.

The Semantic Web proposes to help computers "read" and use the Web. The big idea is pretty simple -- metadata added to Web pages can make the existing World Wide Web machine readable. This won't bestow artificial intelligence or make computers self-aware, but it will give machines tools to find, exchange and, to a limited extent, interpret information. It's an extension of, not a replacement for, the World Wide Web.

That probably sounds a little abstract, and it is. While some sites are already using Semantic Web concepts, a lot of the necessary tools are still in development. In this article, we'll bring the concepts and tools behind the Semantic Web down to earth by applying them to a galaxy far, far away.

Why Semantic Web?

Suppose you want to buy a "Star Wars Trilogy" boxed set online, and you have some basic criteria for your purchase. First, you want widescreen, not full-screen, DVDs, and you want the set that has the extra disc of bonus materials. Second, you want the lowest available price, but you'd prefer to buy a new set, not a used one. Finally, you don't want to pay too much for shipping and handling, but you also don't want to wait too long for delivery.

At this point in the evolution of the Web, your best bet would be to look at different retailers' web pages, comparing prices and shipping times and rates. You could also look for a site that will compare prices and shipping options from several retailers all at once. Either way, you have to do most of the virtual legwork, then make your buying decision and place your order yourself.

With the Semantic Web, you'd have another option. You could enter your preferences into a computerized agent, which would search the Web, find the best option for you, and place your order. The agent could then open personal finance software on your computer and record the amount you spent, and it could mark the date your DVDs should arrive on your calendar. Your agent would also learn your habits and preferences, so if you had a bad experience buying from one particular site it would know not to use that site again.

The agent would do this not by looking at pictures and reading descriptions like a person does, but by searching through metadata that clearly identify and define what the agent needs to know. Metadata are simply machine-readable data that describe other data. In the Semantic Web, metadata are invisible as people read the page, but they're clearly visible to computers. Metadata can also allow more complex, focused Web searches with more accurate results. To paraphrase Tim Berners-Lee, inventor of the World Wide Web, these tools will let the Web -- currently similar to a giant book -- become a giant database.
We'll look at the tools that can make documents machine readable next.

Marking Up: XML and RDF

Let's say you want to make this sentence readable to a computer:
Anakin Skywalker is Luke Skywalker's father.
It's easy for you to figure out what this sentence means -- Anakin and Luke Skywalker are both people, and there is a relationship between them. You know that a father is a type of parent, and that the sentence also means that Luke is Anakin's son. But a computer can't figure any of that out without help. To allow a computer to understand what this sentence means, you'd need to add machine-readable information that describes who Anakin and Luke are and what their relationship is. This starts with two tools -- eXtensible Markup Language (XML) and Resource Description Framework (RDF).
XML is a markup language like Hypertext Markup Language (HTML), which you're probably somewhat familiar with from surfing the Web. HTML governs the appearance of the information you look at on the Web. XML complements (but does not replace) HTML by adding tags that describe data. These tags are invisible to the people who read the document but visible to computers. Tags are already in use on the Web, and existing bots, like the bots that collect data for search engines, can read them.
RDF does exactly what its name indicates -- using XML tags, it provides a framework to describe resources. In RDF terms, pretty much everything in the world is a resource. This framework pairs the resource (any noun, like Anakin Skywalker or the "Star Wars" trilogy) with a specific item or location on the Web so the computer knows exactly what the resource is. Clearly identifying resources keeps the computer from doing things like confusing Anakin Skywalker with Sebastian Shaw or Hayden Christensen, or the original trilogy with the One-Man "Star Wars" Trilogy.
To do this, RDF uses triples written as XML tags to express this information as a graph. These triples consist of a subject, property and object, which are like the subject, verb and direct object of a sentence. (Some sources call these the subject, predicate and object.) RDF already exists on the Web -- for example, it's part of RSS feed creation.
[Figure: An RDF triple has a subject (Anakin Skywalker), an object (Luke Skywalker) and a property that unites the two.]
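To make the idea of a triple concrete, here is a minimal sketch in Python. It is not how RDF is actually stored or written; it only assumes that a triple can be modeled as a three-part tuple. The property names (isFatherOf, isSiblingOf), the extra triples about Leia Organa and the match helper are all made up for illustration.

```python
# A triple store reduced to its bare minimum: a set of
# (subject, property, object) tuples. The names are plain strings for
# readability; real RDF would identify each part with a URI.
triples = {
    ("Anakin Skywalker", "isFatherOf", "Luke Skywalker"),
    ("Anakin Skywalker", "isFatherOf", "Leia Organa"),
    ("Luke Skywalker", "isSiblingOf", "Leia Organa"),
}


def match(store, subject=None, prop=None, obj=None):
    """Return, in sorted order, the triples that agree with every argument that is not None."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (prop is None or t[1] == prop)
        and (obj is None or t[2] == obj)
    )


# Who are Anakin's children, according to the store?
children = [o for (_, _, o) in match(triples, subject="Anakin Skywalker", prop="isFatherOf")]
print(children)
```

Because every statement has the same three-part shape, one small function can answer many different questions just by leaving different parts blank.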
So far in this example, the computer knows that there are two objects in this sentence and that there is a relationship between them. But it doesn't know what the objects are or how they relate to one another. We'll look at the tool for adding this layer of meaning next.

Knowing What's What: URIs

Even with the framework that XML and RDF provide, a computer still needs a very direct, specific way of understanding who or what these resources are. To do this, RDF uses uniform resource identifiers (URIs) to direct the computer to a document or object that represents the resource. You're already familiar with the most common form of URI -- the uniform resource locator (URL), which typically begins with http://. A URI can point to anything on the Web and may also point to objects that are not part of the Web, like appliances in computerized homes. Mailto, ftp and telnet addresses are some other examples of URIs.
For our example, we'll use the characters' pages at the official Star Wars site as their URIs.
[Figure: A URI gives a computer a specific point of reference for each item in the triple -- there's no need for interpretation or potential for misunderstanding.]
Now the computer knows what the subject and object are -- Anakin Skywalker is the entity represented by the first URI, and Luke Skywalker is the entity represented by the second. But you'll notice that the middle URI in our triple -- the one for the property -- doesn't point to the Star Wars site. Instead, it points to a make-believe document on the HowStuffWorks server. If that page really existed, it would be our XML namespace.
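As a rough sketch of what a triple looks like once every part is a URI, the Python below uses the standard library's urlparse to pull each identifier apart. The example.org addresses are placeholders invented for this sketch; they stand in for the official Star Wars pages and the make-believe vocabulary document, whose real addresses aren't reproduced here.

```python
from urllib.parse import urlparse

# Hypothetical URIs: example.org stands in for the official Star Wars site
# and for the make-believe vocabulary document described in the text.
triple = (
    "http://example.org/starwars/characters/anakin-skywalker",
    "http://example.org/vocabulary/isFatherOf",
    "http://example.org/starwars/characters/luke-skywalker",
)

for role, uri in zip(("subject", "property", "object"), triple):
    parts = urlparse(uri)
    print(f"{role:8} scheme={parts.scheme} host={parts.netloc} path={parts.path}")

# URIs are not limited to http; other schemes identify other kinds of resources.
others = ("mailto:luke@example.org", "ftp://example.org/files", "telnet://example.org")
schemes = [urlparse(u).scheme for u in others]
print(schemes)
```

The point is that each identifier is an exact string, so a program can compare two of them character by character without guessing whether they mean the same thing.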

Unlike HTML, which uses standard tags like <b> for bold and <u> for underline, XML doesn't have standard tags. This is useful -- it lets developers create unique tags for specific purposes. But it means that a browser doesn't automatically know what the tags mean. An XML namespace is basically a document that tells applications the meaning of all the tags in another document. The creator of an XML document declares the namespace at the beginning of the document with a line of code. In our example, our namespace declaration would look like this:

<rdf:RDF xmlns:hsw=>

That line of code says to the computer, "Any tags you see that begin with 'hsw' use the vocabulary found in this document. You can look up any tag beginning with 'hsw' here." That way, people can create the XML tags they need for a document without conflicting with other XML documents on the Web.
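Here is a small sketch of how an application might resolve that prefix, using Python's built-in xml.etree.ElementTree parser. The hsw namespace address and the character addresses are placeholders under example.org, and the isFatherOf tag is an assumed name; only the rdf namespace address is the real one published by the W3C.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # the W3C's RDF namespace
HSW_NS = "http://example.org/hsw-vocabulary#"           # placeholder, not the article's URI

document = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:hsw="{HSW_NS}">
  <rdf:Description rdf:about="http://example.org/characters/anakin-skywalker">
    <hsw:isFatherOf rdf:resource="http://example.org/characters/luke-skywalker"/>
  </rdf:Description>
</rdf:RDF>
""".strip()

root = ET.fromstring(document)
description = root[0]   # the rdf:Description element (the subject)
prop = description[0]   # the hsw:isFatherOf element (the property)

# The parser swaps each prefix for the full namespace URI in braces.
print(prop.tag)

subject = description.get(f"{{{RDF_NS}}}about")
obj = prop.get(f"{{{RDF_NS}}}resource")
print(subject, obj)
```

Notice that the short prefix disappears after parsing. What the application actually works with is the full namespace address plus the tag name, which is why two documents can both use a tag called isFatherOf without colliding.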
XML and RDF are the "official language" of the Semantic Web, but by themselves they're not enough to make the entire Web accessible to a computer. We'll look at some of the other layers next.

Another obstacle for the Semantic Web is that computers don't have the kind of vocabulary that people do. You've used language your whole life, so it's probably easy for you to see connections between different words and concepts and to infer meanings based on contexts. Unfortunately, someone can't just give a computer a dictionary, an almanac and a set of encyclopedias and let the computer learn all this on its own. In order to understand what words mean and what the relationships between words are, the computer has to have documents that describe all the words and logic to make the necessary connections.
In the Semantic Web, this comes from schemata and ontologies. These are two related tools for helping a computer understand human vocabulary. An ontology is simply a vocabulary that describes objects and how they relate to one another. A schema is a method for organizing information. As with RDF tags, access to schemata and ontologies is included in documents as metadata, and a document's creator must declare which ontologies are referenced at the beginning of the document.
Schema and ontology tools used on the Semantic Web include:
  • RDF Vocabulary Description Language schema (RDFS) - RDFS adds classes, subclasses and properties to resources, creating a basic language framework. For example, the resource Dagobah belongs to the class planet, which could itself be a subclass of a broader class such as celestial body. A property of Dagobah could be swampy.
  • Simple Knowledge Organization System (SKOS) - SKOS classifies resources in terms of broader or narrower, allows designation of preferred and alternate labels and can let people quickly port thesauri and glossaries to the Web. For example, in a Star Wars glossary, a narrower term for Sith Lord could be Darth Sidious and a broader term could be villain. Similarly, alternate labels for Han Solo might be nerf herder and laser brain.
  • Web Ontology Language (OWL) - OWL, the most complex layer, formalizes ontologies, describes relationships between classes and uses logic to make deductions. It can also construct new classes based on existing information. OWL is available in three levels of complexity -- Lite, Description Logic (DL) and Full.
[Figure: An example of a very small number of the resources and connections that might be found in a Star Wars ontology. You can figure these out on your own from watching the movies and surfing the Web, but a computer must have a clear outline to make sense of it.]
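To show how a vocabulary plus a little logic lets a computer reach conclusions nobody wrote down, here is a toy sketch in Python. It is not RDFS or OWL; it simply imitates three kinds of rules those languages support (subclasses, sub-properties and inverse properties). Every class name, property name and fact in it is an assumption chosen for illustration.

```python
# Toy ontology; every class and property name here is an illustrative assumption.
SUBCLASS_OF = {"SithLord": "Villain", "Villain": "Character", "Jedi": "Character"}
SUBPROPERTY_OF = {"isFatherOf": "isParentOf"}
INVERSE_OF = {"isParentOf": "isChildOf"}

facts = {
    ("DarthSidious", "type", "SithLord"),
    ("AnakinSkywalker", "isFatherOf", "LukeSkywalker"),
}


def infer(facts):
    """Apply the three rules repeatedly until no new fact appears."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p == "type" and o in SUBCLASS_OF:
                # Members of a subclass also belong to its superclass.
                new.add((s, "type", SUBCLASS_OF[o]))
            if p in SUBPROPERTY_OF:
                # A father is a kind of parent.
                new.add((s, SUBPROPERTY_OF[p], o))
            if p in INVERSE_OF:
                # If A is the parent of B, then B is the child of A.
                new.add((o, INVERSE_OF[p], s))
        if new <= known:
            return known
        known |= new


closure = infer(facts)
print(("LukeSkywalker", "isChildOf", "AnakinSkywalker") in closure)
print(sorted(o for s, p, o in closure if s == "DarthSidious" and p == "type"))
```

Starting from two stated facts, the program works out that Luke is Anakin's child and that Darth Sidious is a villain and a character, which is the kind of connection you make without thinking but a computer needs spelled out as rules.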
The trouble with ontologies is that they are very difficult to create, implement and maintain. Depending on their scope, they can be enormous, defining a wide range of concepts and relationships. Some developers prefer to focus more on logic and rules than on ontologies because of these difficulties. Disagreements regarding the roles these rules should play may be one potential pitfall for the Semantic Web.
Next, we'll tie it all together by looking at our original example -- those "Star Wars Trilogy" DVDs.

Tying It All Together

Security and Proof: As with any Web document, the Semantic Web requires security measures to protect data and transactions. Included in W3C's recommendations for the Semantic Web are digital signatures, encryption, proofs and trust. Proofs and trust relate to the logic of the Semantic Web and applications' abilities to verify that data is correct and consistent through all of the Web's layers.
In our original example, we talked about buying "Star Wars" DVDs online. Here's how the Semantic Web could make the whole process easier:
  • Each site would have text and pictures (for people to read) and metadata (for computers to read) describing the DVDs available for purchase on that site.
  • The metadata, using RDF triples and XML tags, would make all the attributes of the DVDs (like condition and price) machine-readable.
  • When necessary, businesses would use ontologies to give the computer the vocabulary needed to describe all of these objects and their attributes. The shopping sites could all use the same ontologies, so all of the metadata would be in a common language.
  • Each site selling the DVDs would also use appropriate security and encryption measures to protect customers' information.
  • Computerized applications or agents would read all the metadata found at different sites. The applications could also compare information, verifying that the sources were accurate and trustworthy.
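The steps above can be sketched as a few lines of Python. This is only an illustration of the decision an agent might make once the metadata is already in hand; it does not fetch or parse anything. The Offer fields, the shop names, the prices and the rule of preferring new sets before comparing total cost are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """One retailer's machine-readable description of a boxed set (hypothetical fields)."""
    seller: str
    widescreen: bool
    bonus_disc: bool
    condition: str      # "new" or "used"
    price: float        # item price in dollars
    shipping: float     # shipping and handling in dollars
    delivery_days: int
    trusted: bool       # outcome of whatever verification the agent performs


def choose(offers, max_shipping, max_days):
    """Filter on the hard criteria, then prefer new sets and the lowest total cost."""
    eligible = [
        o for o in offers
        if o.widescreen and o.bonus_disc and o.trusted
        and o.shipping <= max_shipping and o.delivery_days <= max_days
    ]
    # Sort key: new before used, then cheapest total (price plus shipping).
    return min(eligible, key=lambda o: (o.condition != "new", o.price + o.shipping), default=None)


offers = [
    Offer("Shop A", True, True, "new", 45.00, 5.00, 4, True),
    Offer("Shop B", True, True, "used", 30.00, 4.00, 6, True),
    Offer("Shop C", False, True, "new", 35.00, 3.00, 3, True),  # full-screen, so rejected
    Offer("Shop D", True, True, "new", 42.00, 6.00, 5, True),
]
best = choose(offers, max_shipping=8.00, max_days=7)
print(best.seller if best else "no match")
```

The comparison itself is trivial. The hard part, and the part the Semantic Web is meant to solve, is getting every retailer to publish fields like these in a common vocabulary so the agent never has to read a product page the way a person does.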
Of course, the Web is enormous, and adding all this metadata to existing pages is a huge undertaking. We'll look at this and some of the other potential hurdles for the Semantic Web next.