As IT leaders we often see opportunities for innovation in the technology assets developed and managed within our organizations. Such is the case for the following proposition around corporate web site search, which may be relevant considering the recent introductions of Bing and WolframAlpha, and with the word “semantic” back in the news.
The magnitude of the Internet’s success is now matched by user frustration in sifting through endless unstructured web sites and billions of web pages for immediately relevant information. Whether it’s a large company site or a small political Blog, users are presented with a rising sea of information and are increasingly challenged to get to a specific answer or fact.
The large general-purpose search engines have been of significant benefit for several years, in that they could find a million “needles” in the Internet “haystack” in less than a second. The problem is, eventually consumers will want to stop wading through all those needles and get a relevant answer back. The user will expect to interact with a system that can better understand the information and meaning contained in those millions of web pages. Some believe the days of keyword searches presenting millions of results back to a consumer are numbered.
So where does the Semantic Web come into play? First a few definitions:
The Semantic Web is an evolving set of concepts, technologies and solutions that attempt to make the world’s information (a) accessible using natural language, (b) interconnected across content owners and domains, and (c) usable by software agents, thus permitting users to find and leverage information more effectively.
The Semantic Web seeks to enable software systems to reason, and make inferences about the information being analyzed.
From Wikipedia:
At its core, the Semantic Web comprises a set of design principles, collaborative working groups, and a variety of enabling technologies. Some elements of the Semantic Web are expressed as prospective future possibilities that are yet to be implemented or realized. Other elements of the semantic web are expressed in formal specifications. Some of these include Resource Description Framework (RDF), a variety of data interchange formats (e.g. RDF/XML, N3, Turtle, N-Triples), and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL), all of which are intended to provide a formal description of concepts, terms, and relationships within a given knowledge domain.
Finally, ReadWriteWeb has a helpful Guide to Semantic Technologies here.
The Semantic Web has been called a web of data, a web of structured data, or a standard way to get to a web of data. Many describe it as a means to enable software systems to understand the meaning in a document or data set, rather than just the keywords that internet search has largely focused on to date.
An adult understands the nature of a car, the fact that it has an engine and might have two or four doors, some wheels, and could also go by the label “automobile”, “auto” or even “ride”. A keyword-based search engine does not understand any of this without some assistance. A semantic search application on the other hand, might understand the difference between a [cat/kitten/feline] and the Broadway play [Cats].
Those behind the Semantic Web vision describe a future where the majority of Internet content will be coded and categorized in the new formats necessary to support semantic search agents, as well as the efficient interchange of semantic data across sites. However, it may be many years before a material number of corporate content owners take on this challenge, especially considering the limited tools that currently exist to support this effort.
Where We Are Today
The amount of information available through the major search engines is astounding, and they are continually tweaking their algorithms to deliver more relevant results. However, a simple natural-language question about comic books illustrates the potential need for semantic search technology: “Which authors have written graphic novels?”

If you are sitting at home with a cup of coffee researching comic books, you might not mind navigating a few dozen links to (eventually) find the answer to the above question, and may even see value in a serendipitous search that leads you down a new path, or to a new insight.
However, it’s a different matter altogether when a customer is on your corporate web site, and gets presented with a result set like the above. Is this representative of your customer’s search experience? Your FAQ search? The Customer Service site? Product Forums?
What will your customers do if they get dozens, perhaps hundreds of irrelevant links back from a search within your site? Will they pick up the phone and dial your call center, costing you perhaps $5 to $25 a call? How much did you spend to create the content or transactional systems you hoped would handle these types of questions from your customers?
Organizations are willing to spend many thousands, up to tens of millions on their corporate web presence and transactional web portals, yet, in summing up the internal web site search market, Matthew Brown (Forrester Research, 9/2007) said:
“…many companies end up disappointed and frustrated with high-priced [search] products that fail to live up to expectations. Yet, it’s surprising how little effort these companies typically put into creating a compelling search experience…especially given the potential productivity gains effective search implies.”
The Size of the Opportunity
The 2004 Census reports over 5 million businesses in the US, with more than 1 million of those having 10 or more employees. Granted, not all of these firms have a formal web presence, but of the ones that do, the majority are probably using an out-of-the-box search capability, usually the rudimentary keyword search that was bundled with their server platform.
And yet the language and classifications (ontology) used by each of these businesses are complex, and different from those of the business down the street. Let’s take a look at the health insurance industry as one example. Health Plans would represent a significant opportunity for the use of “semantic overlay technology” because of:
- The large number of stakeholders, each with different needs and terminology used (members, employers, doctors, brokers, regulators)
- The volume of unstructured and semi-structured support content provided (hundreds or thousands of documents typically stored and served up to web site visitors)
- The challenge of creating complex documents that can be approved by insurance regulators and at the same time understandable by stakeholders
- The high cost of servicing these stakeholders through multiple call centers – most health plans of any size provide customer service in multiple languages
Large health plans typically spend millions each year on internal web sites and content, but may have topped out at only 2%-10% stakeholder usage of these portals. Each additional percentage point of customer inquiries answered successfully through relevant web site interactions, without a follow-up call to a call center, would represent real savings to that organization.
Typical areas where members interact with Health Plans, and which a semantic overlay may improve, include:
- Frequently Asked Questions
- Plan and Product Information
- Health Topics and Tools
- Conditions & Diseases
- Drug Information
- Claims FAQs
- Find a Physician
- Find a Facility
- Drug Formulary
- Coverage and Costs
- Disease Management
Traditional keyword search is particularly limiting for health plans because of the volume and complexity of terminology in use, the terminology unique to each stakeholder (member, doctor, etc.), and the quantity of support and product data. A basic plug-and-play keyword search system has no knowledge that the medical term “otolaryngology” means “ENT” or “ear, nose and throat” to a health plan member, nor of any of the thousands of other terminology and logic rules unique to this industry and its stakeholders.
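To make the terminology point concrete, here is a minimal Python sketch of synonym-aware query handling. The `SYNONYMS` table and the `canonical_terms` function are hypothetical names invented for illustration; a real health plan vocabulary would hold thousands of entries and would likely come from a curated terminology source rather than a hard-coded dictionary.

```python
import re
from typing import Dict, Set

# Hypothetical synonym table: each canonical concept maps to the phrases
# that different stakeholders might type. Illustrative entries only.
SYNONYMS: Dict[str, Set[str]] = {
    "otolaryngology": {"otolaryngology", "ent", "ear nose and throat"},
}


def canonical_terms(query: str) -> Set[str]:
    """Return the canonical concepts whose synonyms appear in the query."""
    # Lowercase the query and strip punctuation so that
    # "ear, nose and throat" matches "ear nose and throat".
    normalized = " ".join(re.findall(r"[a-z0-9]+", query.lower()))
    # Pad with spaces so phrases only match on whole-word boundaries
    # (for example, "ent" should not match inside "dentist").
    padded = f" {normalized} "
    found = set()
    for concept, phrases in SYNONYMS.items():
        if any(f" {phrase} " in padded for phrase in phrases):
            found.add(concept)
    return found


print(canonical_terms("Find an ENT doctor"))
print(canonical_terms("ear, nose and throat specialist"))
```

Both queries resolve to the same concept, which a search layer could then use to retrieve documents indexed under the medical term.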
And the terminology is different for each industry. The nature of a health care “claim” is markedly different from an auto insurance “claim”, with different language rules, etc. Shouldn’t our corporate web sites know this difference and act accordingly? A simple keyword search algorithm just doesn’t do the job here.
It’s easy to see how an organization might benefit from a way to add a semantic overlay and structure to their site content. Benefits might include increased sales, more satisfied customers and lower business support costs. These capabilities would be a real differentiator for the organizations that use them, and would bring part of the promise of the Semantic Web to a large content base in accelerated fashion.
Over the last few years a field of Semantic Web startups has launched in an attempt to become the next “go-to” search platform. These companies are using innovative methods to provide more relevant search results, and appear to fall into the following categories (note that I am using a charitable definition of “semantic” in some cases):
- Social networking and media storage sites, primarily using “tagging” (Delicious, Digg, Technorati, Twine)
- General knowledge search engines, leveraging available information to create their own semantic datastores, or developing proprietary datasets designed specifically for semantic search (Bing, DBpedia, Hakia, Metaweb, Powerset, True Knowledge, WolframAlpha). Note that Powerset was purchased by Microsoft in July 2008 and is now heavily integrated into, but not the primary driver of, Microsoft’s Bing search engine.
- Business intelligence solutions designed to enable large enterprises and government agencies to leverage their internal information assets, including relational databases and unstructured text, and identify linkages or insights across those datasets (Attensity, Autonomy, ClearForest, Cognition, Inxight)
- Vertical search engines such as travel (Tripit, UpTake) and people search (Spock, ZoomInfo)
- Link engines/widgets that scan editorial content, attempt to identify objects (people, companies, places) within that content, and suggest related stories (Inform, Lijit, Sphere)
- Semantic tagging services that allow an organization to add semantic information to their web site content, for use by external search engines and future semantic applications (Yahoo SearchMonkey, Semantify).
Yet with all this activity, there currently seem to be few offerings that enable businesses to easily add rich semantic features directly to their public-facing web properties, including:
- Product & Service Catalogs
- General Web Site Information
- Web Site FAQs / Help / Support Systems
- Unstructured HTML Content
Semantic Web capabilities may be a differentiator for businesses and content owners over the next few years, and may eventually be a must-have technology for those presenting information on the web, just as a navigation menu and simple search box are required today.
A compelling “semantic overlay offering” would:
- Offer a rich, user-friendly interface to product and service catalogs, thereby increasing sales
- Provide more relevant search and support resources for customers, increasing customer satisfaction and lowering human support costs
- Enable a company to retain control of their content, messaging and desired site experience (keep customers on the company site versus external search engine)
- Leverage and extend their existing web content investments
- Differentiate their company and product/service offerings from competitors
Components of a Conceptual Semantic Overlay:
So what would a semantic overlay process look like for corporate web content? Let’s continue with the simplistic car example from above for illustration purposes, and build it up layer by layer:
Here you see Objects, Object Instances and Synonyms. Synonyms are created for relevant Objects and Object Instances, and would be a critical part of the service, eventually expanding to include localization/translation capabilities.
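One way to picture the Objects, Object Instances and Synonyms layer is as a small data model plus a lookup index. The sketch below is an assumption-laden illustration in Python: the `OverlayObject` class, the `build_index` function and the sample entries are all hypothetical, with the car synonyms taken from the earlier example and the instance chosen only for demonstration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class OverlayObject:
    """A concept in the overlay, with its synonyms and named instances."""
    name: str
    synonyms: List[str] = field(default_factory=list)
    # Maps an instance name to the synonyms for that instance.
    instances: Dict[str, List[str]] = field(default_factory=dict)


def build_index(
    objects: List[OverlayObject],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map every known label to an (object name, instance name) pair.

    The instance name is None when the label refers to the object itself.
    """
    index: Dict[str, Tuple[str, Optional[str]]] = {}
    for obj in objects:
        for label in [obj.name, *obj.synonyms]:
            index[label.lower()] = (obj.name, None)
        for instance, instance_synonyms in obj.instances.items():
            for label in [instance, *instance_synonyms]:
                index[label.lower()] = (obj.name, instance)
    return index


# Illustrative sample data; a real overlay would be built by the content owner.
overlay = [
    OverlayObject("Car", synonyms=["automobile", "auto", "ride"]),
    OverlayObject("Engine", synonyms=["motor"]),
    OverlayObject("Manufacturer", instances={"Chrysler": []}),
]
index = build_index(overlay)
print(index["ride"])
print(index["chrysler"])
```

Keeping synonyms attached to both objects and instances means a later localization step could add translated labels to the same lists without changing the structure.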



Relationships are then built between Objects. With an understanding of relationships as well as relevant synonyms, the service would be able to iterate through and generate useful phrases to respond to the question being asked. In this example:
- Car(s) have Engine(s)
- Auto(s) have Motor(s)
- Automobile(s) have Color(s)
- Manufacturer(s) make Car(s)
- Car(s) have Designer(s)
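The phrase generation described above can be sketched as a cross product of synonyms over each relationship. This Python example is only a rough illustration: the `SYNONYMS` and `RELATIONSHIPS` tables and the `generate_phrases` function are hypothetical, and the pluralization simply appends an "s", which would not hold for real vocabulary.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical synonym lists, drawn from the car example.
SYNONYMS: Dict[str, List[str]] = {
    "Car": ["Car", "Auto", "Automobile"],
    "Engine": ["Engine", "Motor"],
}

# Relationships as (subject object, verb, target object).
RELATIONSHIPS: List[Tuple[str, str, str]] = [
    ("Car", "have", "Engine"),
    ("Manufacturer", "make", "Car"),
]


def generate_phrases(
    relationships: List[Tuple[str, str, str]],
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Expand each relationship into one phrase per synonym combination."""
    phrases = []
    for subject, verb, target in relationships:
        # Objects without registered synonyms fall back to their own name.
        subject_labels = synonyms.get(subject, [subject])
        target_labels = synonyms.get(target, [target])
        for s, t in product(subject_labels, target_labels):
            # Naive pluralization, good enough for this toy vocabulary.
            phrases.append(f"{s}s {verb} {t}s")
    return phrases


phrases = generate_phrases(RELATIONSHIPS, SYNONYMS)
for phrase in phrases:
    print(phrase)
```

Even with two relationships and a handful of synonyms the list grows quickly, which hints at why a service would generate such phrases automatically rather than ask content owners to write them.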


Here two statements (triples) can be analyzed, and by using simple logic methods, “Cherokee is a Chrysler” can be deduced, then added to the growing rule set. In this way new relationships are discovered in the data that were never explicitly stated by the content owners.
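As a rough illustration of this kind of deduction, the Python sketch below chains “is a” statements until nothing new can be derived. The two input triples are an assumption made for the example (“Cherokee is a Jeep” and “Jeep is a Chrysler”), and `infer_is_a` is a hypothetical helper; production systems would typically rely on an RDFS or OWL reasoner rather than hand-written loops.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def infer_is_a(triples: Iterable[Triple]) -> Set[Triple]:
    """Return the new "is a" triples implied by transitivity."""
    stated = set(triples)
    known = set(stated)
    changed = True
    # Keep applying the rule (A is a B, B is a C => A is a C)
    # until a full pass adds nothing new.
    while changed:
        changed = False
        for a, p1, b in list(known):
            for c, p2, d in list(known):
                if p1 == "is a" and p2 == "is a" and b == c and a != d:
                    candidate = (a, "is a", d)
                    if candidate not in known:
                        known.add(candidate)
                        changed = True
    # Report only what was deduced, not what was originally stated.
    return known - stated


# Assumed input statements for the example.
facts = [
    ("Cherokee", "is a", "Jeep"),
    ("Jeep", "is a", "Chrysler"),
]
inferred = infer_is_a(facts)
print(inferred)
```

The deduced triple could then be added back to the rule set, so later queries about Chrysler vehicles would also find the Cherokee.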

Although not commonly included in Semantic Web applications, Constraints and Data Validation capabilities for external data would be critical features in supporting business customers. Constraint types may include Mandatory, Uniqueness, Subset, Equality, Exclusion, Frequency, Allowed Values, etc.
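Three of the constraint types named above (Mandatory, Uniqueness and Allowed Values) can be sketched as simple checks over incoming records. Everything in this Python example is hypothetical: the field names `stock_id`, `model` and `color`, the `ALLOWED_COLORS` set and the `validate` function are invented to illustrate the idea against the car lot example.

```python
from typing import Dict, List

# Hypothetical rule settings for a car inventory feed.
MANDATORY_FIELDS = ("stock_id", "model")
ALLOWED_COLORS = {"tan", "red", "black"}


def validate(records: List[Dict[str, str]]) -> List[str]:
    """Check records against the constraints and return error messages."""
    errors = []
    seen_ids = set()
    for position, record in enumerate(records):
        # Mandatory: each required field must be present and non-empty.
        for name in MANDATORY_FIELDS:
            if not record.get(name):
                errors.append(f"record {position}: missing mandatory field '{name}'")
        # Uniqueness: a stock_id may appear only once in the feed.
        stock_id = record.get("stock_id")
        if stock_id:
            if stock_id in seen_ids:
                errors.append(f"record {position}: duplicate stock_id '{stock_id}'")
            seen_ids.add(stock_id)
        # Allowed Values: color is optional, but must be recognized if given.
        color = record.get("color")
        if color is not None and color not in ALLOWED_COLORS:
            errors.append(f"record {position}: color '{color}' is not allowed")
    return errors


sample = [
    {"stock_id": "A1", "model": "Liberty", "color": "tan"},
    {"stock_id": "A1", "model": "Cherokee", "color": "red"},
    {"stock_id": "A2", "color": "purple"},
]
for message in validate(sample):
    print(message)
```

Returning a list of messages rather than raising on the first problem lets a content owner see every issue in an upload at once.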

This is just the start of what might be possible with a semantic overlay service. Other options include multiple ranking algorithms, and of course social networking capabilities, such as subscribing to a “model”, suggesting modifications to a model, having one’s own annotated copy of a model, sharing a model with others, etc.
With a rich semantic overlay in place, three disparate freeform text queries to this fictional car lot database could potentially return the same, relevant answer to the consumer:
- Is a tan Jeep Liberty in stock?
- Is the Jeep model #7654 available in beige?
- Do you have a brown Jeep Liberty on the lot?
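A toy version of how these three queries could converge on one answer is sketched below in Python. It assumes, for illustration only, that model #7654 is the Liberty and that the lot treats tan, beige and brown as the same color; the lookup tables, `parse_query` and `in_stock` are hypothetical names, and real natural-language handling would be far more involved than token matching.

```python
import re
from typing import Optional, Tuple

# Hypothetical overlay data for the fictional car lot.
MAKES = {"jeep"}
MODEL_SYNONYMS = {"liberty": "liberty", "#7654": "liberty"}
COLOR_SYNONYMS = {"tan": "tan", "beige": "tan", "brown": "tan"}

# Inventory keyed by canonical (make, model, color).
INVENTORY = {("jeep", "liberty", "tan")}

Parsed = Tuple[Optional[str], Optional[str], Optional[str]]


def parse_query(query: str) -> Parsed:
    """Reduce a freeform question to canonical (make, model, color)."""
    # Keep an optional leading "#" so model numbers survive tokenization.
    tokens = re.findall(r"#?\w+", query.lower())
    make = next((t for t in tokens if t in MAKES), None)
    model = next((MODEL_SYNONYMS[t] for t in tokens if t in MODEL_SYNONYMS), None)
    color = next((COLOR_SYNONYMS[t] for t in tokens if t in COLOR_SYNONYMS), None)
    return (make, model, color)


def in_stock(query: str) -> bool:
    """Answer the question by looking up the canonical form."""
    return parse_query(query) in INVENTORY


queries = [
    "Is a tan Jeep Liberty in stock?",
    "Is the Jeep model #7654 available in beige?",
    "Do you have a brown Jeep Liberty on the lot?",
]
for q in queries:
    print(q, "->", parse_query(q), in_stock(q))
```

All three questions reduce to the same canonical lookup, so the answer depends on the inventory rather than on which words the customer happened to choose.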
How would a semantic overlay system work in practice? One can imagine that it would be offered as an external service (SaaS) rather than yet another system to be managed internally within an organization, and that it might include:
- Data Entry and Update Methods – to enable upload or identification of all “objects” in the structured content of an organization, whether database-driven or as part of an HTML content base. Methods might include file uploads, web crawlers, REST interfaces, manual entry options, etc.
- A Wizard-Based Semantic Overlay – to enable object definition, relationship modeling, constraints and validation as described above
- Overlay Persistence – All semantic overlay structure would be persistent; only structural changes would require user involvement
- Automated Updates – Triggered to match content changes by way of a schedule (new car inventory in this example)
Conclusion
Semantic Web technology is maturing, and should revolutionize information retrieval methods across public and private information stores over the next 5 years. Companies have historically focused their site investments on content, transactional applications and architecture (navigation, usability), with some initial success. It is likely that the next wave of innovation (and ROI) will need to come from significantly improving the search and interaction experience of stakeholders for these sites.
What are you doing to improve the search experience for your customers? Do you have any particular successes that you’d like to share?
——
Note #1: While I am aware of several existing semantic search services (as linked to in this post), I am no doubt missing many, perhaps dozens of other services, either public or in “stealth mode”. One promising service, which seems to have many of the traits described above, is Calais from Thomson Reuters.
Calais is a web service that uses natural language processing (NLP) technology to semantically tag text that is input to the service. The tags are delivered to the user who can then incorporate them into other applications – for search, news aggregation, blogs, catalogs, you name it.
Note #2: In all the hoopla about semantic search and the desire to talk to our search engines using natural language, there are of course alternative views out there. One such view is that of Stan Schroeder at Mashable, who in this post states that we can get very relevant answers today from the major search engines if we just figure out how to enter our keyword search terms correctly, rather than trying to talk to them…
Computers and the web will not adjust to the way people talk. People will adjust to the way computers talk.
I certainly find myself doing a lot of tweaking of search terms when using a general-purpose internet search engine. However, I believe the idea breaks down when it comes to corporate web sites. Our job as web property owners/managers is to make the user experience as fruitful as possible, rather than banging our customers over the head with arcane instructions on how to “correctly” interact with our corporate web sites to get the data they want.