Centralized Taxonomy Management and Synaptica Case Study by ProQuest

At last week's conferences we had the wonderful opportunity to show the attendees how one of our customers utilizes Synaptica, our Taxonomy and Metadata Management software.

Paula McCoy, Taxonomy Manager at ProQuest, joined me on stage for two sessions during Enterprise Search Summit and Taxonomy Bootcamp. ProQuest provides global access to one of the largest online content repositories in the world, and Paula is responsible for maintaining the controlled vocabularies that both editors and end users use.

The case study addresses the challenges ProQuest faced in managing multilingual controlled vocabularies using multiple Word documents and authority files maintained in an Oracle database. During her presentations, Paula describes how implementing a thesaurus management tool helped ProQuest simplify and standardize its business semantic management, creating a common language and connecting disparate information assets. She also covers handling large and varied vocabularies and authority files, linking new and existing editorial systems, enabling hierarchical views, and automating thesaurus management tasks.

Note: Unfortunately, unlike some of the other videos I posted from the conference, this video does not focus on the slides during the taping, so I have embedded the slides for you to follow along (maybe open a second window and click through).

The first session is 30 minutes. In the first 15 minutes I go through what Centralized Taxonomy Management tools are and why customers use them, and then Paula presents how she uses Synaptica daily to maintain the ProQuest taxonomies.

Centralized Taxonomy Management for Enterprise Information Systems

The second is a 45-minute full Case Study (you only hear me introduce Paula and make some comments about how we need to make Taxonomy exciting!)

Finding a Common Language: Bringing Complex and Disparate Vocabularies Together

Thanks again to Paula McCoy, who did an awesome job!

Enterprise Search Summit: Terry Janssen - “An Integrated Continuum of Search Technologies”

Terry Janssen of Lockheed Martin gave a very interesting presentation on Wednesday the 24th at the Enterprise Search Summit West in San Jose. During his talk he discussed some of the technologies that Lockheed employs as a systems integrator to improve search within their own and other enterprises.

A common theme in his talk was repeated in several other presentations and by vendors at the Summit and its sister conference, Taxonomy Bootcamp: taxonomies are the key to improving and refining enterprise search.

The Eight Habits of Successful Taxonomists

I think next year I am going to do a Top Ten reasons you should your Taxonomist as a late afternoon skit to get Taxonomy Bootcamp attendees jamming after those heavy lunches! Back-to-back sessions on the subject of taxonomy sometimes need a little excitement injected into the day, so I am plotting already...

This year one of the sessions I enjoyed the most was Gary Carlson's session on 'The Eight Habits of Successful Taxonomists', which he kindly let me videotape so we can share it with all of you who did not make it to Bootcamp this year.

Gary runs his own consulting company called Gary Carlson Consulting and has over 20 years of experience as a taxonomist, consultant, product manager, and information manager working for small to Fortune 100 companies. [Click full screen toggle from panel below for best quality]

So what are the eight habits of a successful taxonomist?

#1- Sets expectations
#2- Knows the Technology
#3- Pays Attention to Workflow
#4- Avoids Religious Wars "Leave dogma at the door!"
#5- Follows the $$ Money $$
#6- Is a Good Listener
#7- Does not use the word 'taxonomy' in polite company
#8- Is a good Juggler!

Some good practical advice that Gary shared with us in conclusion to his presentation:

1. Identify the business problem at the start of the project
2. Gathering requirements for a taxonomy is a huge process and can lead to many different areas of the organization
3. Align your projects with the business and preferably with generating revenue rather than efficiency
4. And most importantly? Have Fun!!!

You can download the complete slide deck here: http://www.garycarlsonconsulting.com/pdf/taxonomy-boot-camp-preso.pdf

What do you think are other habits that a successful Taxonomist might have? Please leave them in the comments!

Juggler Image | Flickr | Marco Fedele1089

Taxonomy and Resource Location: Finding the Who and the What Using Controlled Vocabularies

While I was at Enterprise Search Summit and Taxonomy Boot Camp, I heard some really interesting presentations. On Wednesday I sat in on a presentation by Ahren Lehnert titled "Taxonomy and Resource Location: Finding the Who and the What". It resonated with me because it addressed ways of using 'taxonomies' that we have been advocating and that I have been speaking to my customers about, and that might be new approaches for some.

Ahren’s presentation was about how it can be difficult for an organization to capture and retrieve knowledge and expertise held by resources, both internal and external. He discussed how combining taxonomy and search can help organizations with resource location.

He noted that one challenge is finding the right person with the right skills. In a large company this can be difficult because there are so many roles and such a variety in skill sets among employees. In small organizations the difficulty stems from having fewer roles, and therefore, fewer skill sets. He also made the point that job titles don’t always indicate what knowledge and skill sets are associated. An additional challenge can be the result of a merger or acquisition—each company could have unique titles for the same position, or the same title could be used to describe different positions. The information about these resources can be found in a variety of places—wikis, blogs, human resources applications, sales applications, project management applications, and content management systems. The key is to be able to surface content from all of these repositories through a single search.

Here is an example:
Employee X creates a user profile on the wiki and lists her job skills and interests. She also co-authors a report for a shoe manufacturer. She then attends a seminar on SharePoint and Taxonomy and blogs about it. HR already has her resume on file, which lists her former positions.

If I am working on a project with that same shoe manufacturer, I might be interested in talking with others who have expertise in that area. Employee X could be a great resource, but it is possible I don’t know that she exists, much less that she has experience working with that particular client. If I could search across all the repositories of information, and a controlled vocabulary were leveraged in that search, I should be able to find Employee X and contact her about the project. The knowledge is there, it just needs to be organized with a taxonomy and retrieved using search combined with the taxonomy.
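To make the Employee X scenario a little more concrete, here is a minimal Python sketch of a single search across several repositories that leverages a controlled vocabulary. Everything in it is an assumption for illustration: the `VOCABULARY` mapping, the `REPOSITORIES` records, and the `concepts` and `find_people` helpers are hypothetical and not taken from Ahren's presentation or from any product.

```python
# Hypothetical controlled vocabulary: variant (lowercase) -> preferred term.
VOCABULARY = {
    "footwear": "Footwear",
    "shoe": "Footwear",
    "shoes": "Footwear",
    "sharepoint": "SharePoint",
    "taxonomy": "Taxonomy",
    "taxonomies": "Taxonomy",
}

# Hypothetical records from separate repositories, each tied to a person.
REPOSITORIES = {
    "wiki": [("Employee X", "Profile: interests include taxonomies")],
    "reports": [("Employee X", "Market report for a shoe manufacturer")],
    "blog": [
        ("Employee X", "Notes from a SharePoint seminar"),
        ("Employee Y", "Quarterly sales recap"),
    ],
}


def concepts(text):
    """Map free text to the set of preferred terms it mentions."""
    words = (w.strip(".,:;!?").lower() for w in text.split())
    return {VOCABULARY[w] for w in words if w in VOCABULARY}


def find_people(query):
    """Return {person: [repositories]} for records sharing a concept with the query."""
    wanted = concepts(query)
    hits = {}
    for repo, records in REPOSITORIES.items():
        for person, text in records:
            if wanted & concepts(text):
                hits.setdefault(person, []).append(repo)
    return hits


print(find_people("footwear"))
print(find_people("taxonomy SharePoint"))
```

The point of the sketch is that the query "footwear" finds the report about a "shoe manufacturer" only because both words resolve to the same preferred term. A real deployment would sit on top of an enterprise search engine rather than a Python dictionary.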

When we work on client engagements, we work directly with the client to assess the various repositories of information and develop taxonomy strategies, which often include vast amounts of information about their employees. Using our Process Model for Developing & Deploying Taxonomies we can also build custom taxonomies that can be leveraged in the client’s enterprise search solutions. Often, however, a client will already have a taxonomy in place, so it just needs to be enhanced to meet current needs and expanded for expertise location.

In conclusion, controlled vocabularies can and should be used in various ways to help corporate users find information, from finding the right report with a quick search to finding the right person to validate an opportunity with a quick call or email. Ahren's presentation at the Enterprise Search Summit gave us all some good examples of how and why this should be done.

Image| created with http://wordle.net

Taxonomies for Human Vs Auto-Indexing by Heather Hedden

Day 1 at Taxonomy Bootcamp covered a lot of basic taxonomy principles such as planning and implementing a taxonomy, choosing taxonomy software and indexing principles. The talk by Heather covered the perennial issue of human vs. auto-indexing and whether it is possible to ascertain that one is better than the other. Ultimately, the choice of method depends on the purpose of the taxonomy and its use. It was emphasized that indexing is best used for search and retrieval.

Before the virtues and drawbacks of each indexing method were explored, Heather provided clarity on what indexing was and how it differed from tagging and categorisation. In a nutshell:

• Indexing is done by a trained indexer, preferably with subject matter knowledge and is largely used for browsing.
• Tagging can be done by anyone and is the applying of labels to documents. These tags can then be used by a database.
• Categorisation is the grouping and placing of information in buckets in a systematic manner.

The differences in human and auto-indexing were covered in 3 broad areas, namely contents/materials handled, methodology and technology. In terms of contents, human indexing would be at its best if the contents were in manageable numbers and included a variety of formats and subjects/topics. On the other hand, auto-indexing would work well for very large numbers of documents, textual documents (no images!) and single subject areas.

Technology-wise, indexers (humans) use fairly simple and straightforward indexing tools which are designed so that indexing can be carried out quickly and accurately. There is also the flexibility for indexers to input new terms when necessary. Training for indexers can be carried out with the use of indexing guidelines (both for development and quality checking). Auto-indexing is a little more complex, as it requires an entity extractor, and text has to be mined and analysed. Although auto-indexing is done by machines, there still has to be human intervention in the form of rules building as well as the provision of sample documents to ‘train’ the automated indexing.

Having covered the pros and cons of each, the next part of the talk focused on the differences in the terms. Terms indexed by humans and machines can be differentiated through their granularity, types of relationships, descriptions/notes and types of synonyms/variants. The main difference in the term relationships between human and auto-indexing is that in human indexing, there are both hierarchical and associative relationships. In human indexing, there can also be more notes which are visible to the end user and indexer.

Heather also touched on the differences in synonyms/variants between humanly indexed terms and auto-indexed terms. For example, in human indexing, abbreviations are allowed for common terms whereas in auto-indexing, the machine will not be able to understand the abbreviations.

She concluded with a short description of the additional tasks that an indexer would have to do in both human and auto-indexing. Both require human intervention; it's just that the tasks and extent of work are different. For human indexing, terms have to be checked and amended or added if terms are omitted or misused. In the case of auto-indexing, the work is more focused on the training documents and adjustments of the rules.
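As a rough illustration of what "rules building" for auto-indexing can look like, here is a minimal Python sketch of a rule-based indexer. It is an assumption-laden toy, not something shown in Heather's talk: the `RULES` table and the `auto_index` function are hypothetical, and real auto-indexing tools use far richer rules or trained models. It does show why abbreviations need attention: the machine only recognises an abbreviation if a human has listed it as a variant.

```python
import re

# Hypothetical indexing rules: preferred term -> variants that should trigger it.
RULES = {
    "Human Resources": ["human resources", "HR", "personnel department"],
    "Content Management Systems": ["content management system", "CMS"],
}


def auto_index(text):
    """Return the sorted preferred terms whose variants occur in the text."""
    assigned = set()
    for preferred, variants in RULES.items():
        for variant in variants:
            # All-caps variants are treated as abbreviations and matched
            # case-sensitively; everything else ignores case.
            flags = 0 if variant.isupper() else re.IGNORECASE
            if re.search(rf"\b{re.escape(variant)}\b", text, flags):
                assigned.add(preferred)
                break
    return sorted(assigned)


print(auto_index("HR moved the handbook into the new CMS."))
```

Adjusting the rules here simply means editing the variant lists, which is the kind of ongoing human work the talk described for auto-indexing.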

This was a very factual and descriptive presentation on both human and automated indexing. It was reiterated that no one method is better than the other and the choice of either one is simply dependent on the usage of the taxonomy. The use of the taxonomy should determine whether human or automated indexing should be done. Both will yield different results in terms of structure and terms created. Both will also require a different level of human intervention in rules building or policy development.

Heather’s website can be found at http://www.hedden-information.com

Peter Morville on Connecting Knowledge Management and Discovery: Search 3.0

This morning I attended the Taxonomy Bootcamp and KMWorld Joint Keynote by Peter Morville on "Connecting Knowledge Management and Discovery: Search 3.0". Peter delivered an engaging overview of many aspects that are key to successful Knowledge Management and Discovery. Some of the points that were covered included:

  • Good search and discovery being achieved through collaboration of people with different skills and an appreciation of Information Architecture focusing on business goals as well as user needs
  • For website design it is critically important from the point of view of findability to have multiple paths to information such as alphabetical indexes, search engines, topical schemes and site maps due to users looking for information for different reasons and having different approaches to finding that information
  • Information Architecture and website design is linked to a honeycomb of different qualities. A site needs to be useful, valuable, desirable, usable, findable, accessible and credible. These qualities are all interactive and interdependent
  • The relationship between search and Knowledge Management is very important. Good quality content will be used and found, which encourages maintenance of the quality of this content
  • When developing portals, Information Architects need to think about taxonomies and vocabularies. Content is more dynamic these days and we need to look at work done in both the collaboration and 2.0 space. A critical component of portals is Enterprise search. This needs federated search solutions that bridge the gaps between all repositories, including external websites and databases
  • Any architect (physical or digital) needs to have one foot in the past and one in the future. We need to learn lessons from the past, but at the same time we need to understand that systems will be used for years into the future and will become the legacy systems of the future.
  • One interesting concept that Peter talked about was the convergence of the disciplines of wayfinding (finding our way in the physical world) and information retrieval. Examples of this are Google World and GPS devices that help to converge mobile devices with location awareness. But just because we can do this, do we really want to?
  • People are becoming findable objects as well as other things. It will probably be about 30 years before the Internet of objects is fully realised via technologies such as RFID. This technology can help in many ways; the example given was that of Cisco allowing the tagging and locating of high value objects such as wheelchairs left in rooms in hospitals. These technologies will help with costs and customer service

A balance needs to be found with the web 2.0 movement, but we shouldn’t throw away ideas of Information Architecture and vocabulary development. In 10 years time we are still going to be using a search box. This means we will still need taxonomies to provide options for browsing navigation and filtering. Search and browsing will continue to work hand in hand.

The process of search is iterative and interactive, and over the course of a search a query can evolve. Search is also one of the most important ways in which we learn. We need to recognise it is a complex adaptive system. It is not just about the interface or the user. We need to know how to get systems to work together, remove outdated content and design interfaces that help users when they get stuck, for example by narrowing down results.

Three key questions when redeveloping a site are:

  • Can users find our website?
  • Can users find their way around our website?
  • Can they find information and their way around the site DESPITE the website?

Design Patterns used in website creation:

  • Best Bets – Opportunity for query disambiguation
  • Federated Search – Searching across multiple databases and locations. Users often don’t know which database to search in
  • Faceted Navigation - Bringing search and browsing together and leveraging taxonomies and vocabularies. Need to take a decision on whether to push navigation to users or show it in a more subtle way
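To show how faceted navigation brings search and browsing together, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `DOCUMENTS` list, its facet names (`type`, `topic`), and the `refine` and `facet_counts` helpers are hypothetical and were not part of Peter's keynote.

```python
from collections import Counter

# Hypothetical documents tagged with taxonomy terms under two facets.
DOCUMENTS = [
    {"title": "Q3 footwear report", "type": "Report", "topic": "Footwear"},
    {"title": "Footwear market blog", "type": "Blog post", "topic": "Footwear"},
    {"title": "SharePoint seminar notes", "type": "Blog post", "topic": "SharePoint"},
]


def refine(docs, **selected):
    """Keep documents whose facet values match every selected facet."""
    return [d for d in docs if all(d.get(f) == v for f, v in selected.items())]


def facet_counts(docs, facet):
    """Count how many documents carry each value of a facet."""
    return Counter(d[facet] for d in docs if facet in d)


results = refine(DOCUMENTS, topic="Footwear")
print([d["title"] for d in results])
print(facet_counts(results, "type"))
```

The counts are what a site would show next to each facet value, so the user can see how many results remain before narrowing further. Whether those counts are pushed to the user or shown more subtly is the design decision mentioned above.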

Ultimately we need to expand what we think about as search. For example Google Books dramatically expands what we think about the searchable internet. Other examples are the searching of video and podcasts through sites such as Everyzing.

There are lots of possible futures for search. User experience design helps to identify future concepts. Search is a wicked problem. The only way to move forward is by sharing and working together.

Peter has made his slides available from the keynote and blogs over at www.findability.org

Taxonomy Boot Camp Keynote: La taxonomie est morte! Vive la taxonomie

Today is the first day of Taxonomy Boot Camp, and Theresa Regli delivered the keynote address: "Taxonomies: Dying? Dead? Or Just Hitting Their Stride?"

Theresa began by asking how we, as taxonomists, remain relevant. There have been a lot of changes in the past few years, and a lot of paradigms that we need to let go of. Taxonomies are still relevant--enterprises still focus on and invest in taxonomies, but in different ways than in the past. There are some situations in which taxonomies aren't necessary, and we need to acknowledge that, but in most situations technology needs taxonomy to achieve best results.

Some of the signs that taxonomies aren't dead yet:

  • More people attending Taxonomy Boot Camp this year.
  • Taxonomy COP has around 1000 members
  • Taxonomy COP isn't limited to taxonomists and information architects; people from different backgrounds are joining, showing that taxonomists aren't as isolated from other groups as they once were.

Theresa reviewed what she called the "mullets" of taxonomy.

  • Bob Boiko thinks the enterprise taxonomy is outdated. One all-encompassing taxonomy is too much for a large organization and is unmanageable. Smaller, more specific taxonomies are needed.
  • Ron Daniel thinks that people who say they need a 3-level general business taxonomy need to go. The more general a taxonomy is, the less useful it is; targeted, focused taxonomies are more useful. You also can’t decide how a taxonomy should be structured until you understand the business problem.
  • Seth Earley thinks site maps need to go. He also thinks that manual tagging projects are obsolete and we should utilize the improved technology for autocategorization.
  • Theresa thinks that the idea of one classification that fits all needs to go. Taxonomists can't dictate how people might need to find information later; instead, we need to figure out how the user might need to find something later.
  • Theresa also thinks that definitive categorization is usually obsolete. She gave the example of breakfast. She said that to her, breakfast means bacon and eggs; to me it means Coke and chocolate.
  • Theresa's final mullet was bottom-up content analysis by humans. There is just too much content now to analyze. Content analysis software can give us a good starting point for a project because machines are better at finding the information. A human is better at figuring out the context of that information and how people will use it.

After talking about what taxonomists need to leave behind, Theresa focused on the new lifeblood of taxonomies:

  • Application integration
  • Creating smaller, more manageable taxonomies for the enterprise
  • Understanding the context of information
  • Metadata for dynamic navigation and filtered searches. Content has to be metadata rich in order to be found, and this is no longer specific to e-commerce.
  • Taxonomists acknowledging the importance of technology and working to understand that technology.
  • Creating standards and teaching auto-categorization tools to make contextual distinctions.

Theresa then talked about the key ideas behind taxonomy. She doesn't like to think of taxonomy as hierarchical—it’s more about categories. Because people approach information in different ways, pieces of information should be more fluid. Taxonomies are about enriching content with metadata so that people can find it however they need. She then commented on folksonomies, which can be useful in some cases, but as she pointed out, aren't really the right path in areas of science, legal, and compliance. When millions of dollars are at stake, letting the masses pick the category isn't a good idea because they don't always pick the correct one. Theresa presented Joseph Busch's three basic principles:

  • Metadata needs to be associated with content
  • Topics should be divided into a few discrete facets
  • Some facets are common to many applications and some need to be specific to each application.

Theresa concluded her talk by relating our work to Isaac Asimov's short story "The Last Question." In the story, people ask the computer Multivac, "Can the workings of the second law of thermodynamics (used in the story as the increase of the entropy of the universe) be reversed?" The computer is unable to answer, and the question is repeated several times over thousands of years. Each time the computer answers that it is unable to answer the question because all the data relationships for the information aren't available. Once all humans are gone, the computer is able to answer the question because it knows all the data relationships.

This relates to us because we are working to define relationships among data, and once we have completed that work (far, far in the future), taxonomies will be obsolete. As we make computers smarter and technology better, we are working towards the death of taxonomies.

Get Your Taxonomy Bootcamp Puzzles Now!

Today Taxonomy Bootcamp officially starts in San Jose, CA. In an effort to have 'fun' with controlled vocabularies, I created these domain-specific crossword puzzles. If you are at the conference, stop by and pick some up for your trip home on Friday!

These puzzles were created from taxonomies available for licensing via the Dow Jones Taxonomy Warehouse site at www.taxonomywarehouse.com. Each puzzle's theme covers a specific industry domain and the puzzle answers are based on standard taxonomy development relationships such as:

RT = Related Term
BT = Broader Term
EQ = Equivalent
USE = Use
UF = Used For
NT = Narrower Term
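
For readers who like to see these relationships in action, here is a minimal Python sketch of a thesaurus that stores them and keeps the reciprocal side in step (BT with NT, USE with UF, and RT and EQ with themselves). It is a toy under stated assumptions: the `Thesaurus` class, its methods, and the sample terms are hypothetical and do not reflect Synaptica or any other tool's data model.

```python
from collections import defaultdict

# Reciprocal pairs: recording one side implies the other.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "EQ": "EQ", "USE": "UF", "UF": "USE"}


class Thesaurus:
    """Tiny in-memory thesaurus, for illustration only."""

    def __init__(self):
        # term -> relationship code -> set of related terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, term, rel, other):
        """Record `term rel other` plus the reciprocal relationship."""
        if rel not in RECIPROCAL:
            raise ValueError(f"unknown relationship: {rel}")
        self.relations[term][rel].add(other)
        self.relations[other][RECIPROCAL[rel]].add(term)

    def preferred(self, term):
        """Follow a USE reference to the preferred term, if there is one."""
        targets = self.relations[term]["USE"]
        return next(iter(targets)) if targets else term

    def broader_chain(self, term):
        """Walk BT links upward (takes the first broader term alphabetically)."""
        chain, seen = [], {term}
        while self.relations[term]["BT"]:
            term = sorted(self.relations[term]["BT"])[0]
            if term in seen:  # guard against accidental cycles
                break
            seen.add(term)
            chain.append(term)
        return chain


thesaurus = Thesaurus()
thesaurus.add("Sneakers", "BT", "Footwear")
thesaurus.add("Footwear", "BT", "Apparel")
thesaurus.add("Trainers", "USE", "Sneakers")
thesaurus.add("Sneakers", "RT", "Running")

print(thesaurus.preferred("Trainers"))
print(thesaurus.broader_chain("Sneakers"))
```

Keeping the reciprocals automatic is the main design choice here: an editor records "Trainers USE Sneakers" once and the "Sneakers UF Trainers" side appears without a second entry.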

Download the Puzzles Now! [note: Adobe PDF reader required]

Stuck? Need some answers? View the puzzle answers here:

The Future of Search: A Keynote by Susan Feldman at Enterprise Search Summit West 2008

Sue Feldman has been a key analyst and researcher in the search space for a number of years. Her work at IDC as Vice President of Research, Search and Digital Marketplace is very highly regarded. (Sometimes I envy her job!) It's Wednesday morning in sunny San Jose, and Sue has just given the morning's keynote at the Enterprise Search Summit West.

Sue believes that we are seeing a convergence of tools in search, and thankfully the vendors are seeing a stronger market, which will motivate them to keep innovating. The future of search is not a platform based on transactions, as we have today. It will be a language-based foundation for a new platform - a knowledge platform that she predicts will gain equal place with transaction-based systems. The similarities with the evolution of the database platforms imply a parallel path.

We will continue to see development in categorization, text analytics and linguistic modules. This includes capabilities for identifying parts of speech; extracting entities, concepts, relationships, sentiment and geo-location; semantic understanding via dictionaries and taxonomies; support for multiple languages. One of the biggest problems Sue thinks we need to solve for in the search market is something taught to every library school student: selection. There is so much information, from so many sources - what can you trust? What are the valuable sources?

What are the market drivers? New business requirements. Updated business requirements. Connect the right information to the right people at the right time. We've heard that for years, and while it may be annoying, it's still valid! Determining the state of the business despite the data being in separate silos. Compliance - governments create and change those requirements regularly! Controlling costs - think information workers and call centers; the faster a service rep can find information and finish the call, the lower the per call cost and the more calls they can take, translating to happier, more loyal customers and the next driver: increased revenue. In keeping with "Web 2.0" a better understanding of customers and improved communications with them are a key driver.

eCommerce will require ever more sophisticated tools in the search digital media space, as will publishers as they continue to migrate online. The digital marketplace and government (DoD, NIH) have been early investors in this space - for ad matching, interaction improvements, rich media search, fraud and terrorist detection, access to information and more. The market, according to Sue, is realizing that they have tons of information NOT in their ERP and CRM systems. Transaction-based computing is no longer enough. User-centered computing requires re-thinking, new human-computer interaction models.

Sue believes we need to automate knowledge work, as we are no longer limited to working 40 hours a week in our offices - we work in bits, here and there, 24/7 on multiple devices in many formats. We need to have personalized interaction models - and even more granular than just at the individual level, at the person's role level - employee, volunteer, family, friend. The personalization needs to address the user, the device, and the context. It needs to be flexible and adaptable, ad hoc in real time. It needs to be secure and contiguous across user environments.

The challenges for search are:

  • How to unify access to kinds of information from a single, contextual user interface
  • Improving human-computer interaction models
  • Identifying what is good in interaction design for information access

Sue believes that she will not have a market to forecast in 10 years. By then search will be embedded in the platform and in the applications to provide interaction. Applications will use this search platform to personalize, filter and visualize. We will see task-specific applications in our work environments. In fact, some of these applications are already on the market. Search will be at the center of interactive computing, as search is now language-based, just as humans are.

Meaning-Based Computing: A Presentation by Autonomy at Enterprise Search Summit West

Like Daniela, I too am attending the 2008 KMWorld/Enterprise Search Summit West/Taxonomy Boot Camp meta-conference in San Jose. Shortly before lunch, we heard from Gary Szukalski, Vice President, Customer Relations, Autonomy. He spoke about Meaning-Based Computing. I've known Gary for a number of years and he did not disappoint - his message becomes more refined each time I see him speak.

Gary spoke of a "major paradigm shift" in the IT industry. For years, we (IT practitioners and vendors) have been forced, unnaturally, to aggregate, dumb-down, and structure the mess of unstructured data that makes up approximately 80% of an organization's information assets. Why have we done this? Because that's how computers work - they need structure. We are moving into a world where we can stop forcing structure onto data as computers will understand the semantics of what they are storing and indexing.

Now, he didn't say semantic web or semantic technologies. :) He talked about meaning - how do we teach our machines to disambiguate terms? He gave an Enron example - "shred" means destroy paper documents, but also refers to slicing vegetables in the Enron corpus. It also is a snowboarding reference. How does the machine know? This is where Autonomy is heading.
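
To give a feel for the "shred" problem, here is a deliberately simple Python sketch that picks a sense by counting cue words in the surrounding sentence. To be clear, this is not Autonomy's method, which was not described at that level of detail. The `SENSES` cue lists and the `disambiguate` function are hypothetical, and word overlap is only a stand-in for whatever a real meaning-based engine does.

```python
# Hypothetical cue words for three senses of "shred".
SENSES = {
    "destroy documents": {"paper", "documents", "files", "records", "audit"},
    "cooking": {"vegetables", "cabbage", "cheese", "recipe", "salad"},
    "snowboarding": {"snow", "slopes", "board", "mountain", "powder"},
}


def disambiguate(sentence):
    """Return the sense whose cue words overlap the sentence most, or None."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("Please shred the audit documents"))
print(disambiguate("Shred the cabbage for the salad"))
```

With no context at all the function returns None, which is the honest answer: the word alone does not say whether we are in the office, the kitchen or on the mountain.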

Why would we care? Gary spoke of the December 2007 amendments to the Federal Rules of Civil Procedure. In a nutshell, these amendments made all relevant electronic information admissible in a legal case. There are definite ROI measures to be had for using the right discovery tools to protect organizations from legal troubles. This brought to my mind the Sedona Principles as well - legal guidelines regarding the importance of metadata.

Pan-enterprise search is the new buzzword. Rather than aggregating - federating - sources together, a search tool should now be able to index ALL objects, regardless of file or storage type. Glad to hear a top ES vendor saying that finally!

Now, I was a big Verity customer/user at a prior employer. I gave them a great deal of feedback on their tools. One thing that always gnawed at me, born from my library roots, was that the definitions of the categories and topics that improved search relevance were locked in the tools. My organization defined them, but we couldn't share them easily - only the evidence of their existence, by means of better search results and faceted browsing. But the critical thing about "meaning" is that it be shared! In the "shred" example above, I understood fully its importance in the Enron context. But my first thought on hearing the word was cooking, while the woman next to me thought of snowboarding. How does an organization use the power of the tool to educate the users of the tool? Who is working on the UI part of this paradigm shift? And who is thinking about the UI in the context of information security? Secure search should provide access at the role, group, organization or public level. Is Autonomy using open standards to minimize efforts at integrating metadata pan-enterprise? For me pan-enterprise is not just behind the firewall; it extends onto the web in the form of corporate messaging and consumer feedback. Are any of the enterprise search vendors using open methods to allow this kind of integration? I'm interested in hearing, as I left the search world behind a couple of years ago, and have drifted towards the outer edges of the space.

This was one of the better presentations this morning, and I hope they post the slides somewhere soon!