Classifying Images Part 2: Basic Attributes

Last month I asked the question "What is the Hardest Content to Classify?" and promised additional posts on the subject, drawing on my 13 years of developing taxonomy and indexing solutions for still image libraries. In this post I continue those thoughts, focusing on the basic attributes of image classification.

In my opinion, images are the hardest content items to classify, but luckily, for sanity's sake, not all image classification is equally demanding.

The easiest elements of image classification relate to what I'm going to call image attribute metadata. For me, this area covers all the metadata about the image files themselves, rather than information describing what is depicted in images and what images are about.

Metadata in this area covers many things, and there are also layers to consider:

1. The original object
-- This could be a statue, an oil painting, a glass plate negative, a digital original, or a photographic print

2. The second-generation images
-- The archive image taken of the original object, plus any further images: cut-down image files, screen sizes, thumbnails, and images in different formats (JPEG, TIFF, etc.)

The first thing to think about is the need to create a full and useful metadata scheme, capturing everything you need to know to support what you need to do, whether that is archiving, search and retrieval, or both.

Then look at what data you already have or can obtain. Analyse the data for accuracy and completeness and use whatever you can. Look to the new generation of digital cameras to obtain metadata from them, and ask image creators to capture basic attribute data at the time of creation.
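If your cameras embed EXIF data, much of this basic attribute data can be harvested automatically. Here is a minimal sketch, assuming the Pillow library is available and using a placeholder file name, of reading those attributes from an image file:

```python
# Minimal sketch: harvesting basic attribute metadata (EXIF) from a digital
# camera image using Pillow. The file name below is a placeholder.
from PIL import Image
from PIL.ExifTags import TAGS

def read_basic_attributes(path):
    """Return a dict of basic attribute metadata for one image file."""
    with Image.open(path) as img:
        exif = img.getexif()  # raw EXIF tag-id -> value mapping
        attributes = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
        attributes["Format"] = img.format  # e.g. "JPEG" or "TIFF"
        attributes["Size"] = img.size      # (width, height) in pixels
    return attributes

if __name__ == "__main__":
    for name, value in read_basic_attributes("archive_scan_0001.jpg").items():
        print(f"{name}: {value}")
```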

You'll be interested in the following metadata types:

- Scanner types
- Image processing activities
- Creator names
- Creator dates
- Last modified names
- Last modified dates
- Image sizes and formats
- Creator roles - photographers, artists, sculptors
- Locations of original objects
- Locations at which second generation images were created
- Unique image ID numbers and batch numbers
- Secondary image codes that may come from various legacy systems
- Techniques used in the images - grain, blur, etc.
- Whether the images are part of a series and where they fit in that series
- The type of image - photographic print, glass plate negative, colour images, black and white images

This data gives you a lot of background on the original object and on the various second-generation images created during production. Much of it can be obtained freely or cheaply, and a lot of it will be quick and easy to capture and enter into your systems. It should also be objective and easy to check.
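To make the two layers concrete, here is one possible way to sketch such a scheme as simple data structures; the field names just mirror the attribute list above and are illustrative rather than any prescribed standard:

```python
# Illustrative sketch of a two-layer image attribute metadata scheme.
# Field names follow the attribute list above; they are examples, not a standard.
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple

@dataclass
class OriginalObject:
    object_id: str                           # unique id number
    object_type: str                         # e.g. "oil painting", "glass plate negative"
    creator_name: Optional[str] = None
    creator_role: Optional[str] = None       # photographer, artist, sculptor
    location: Optional[str] = None           # location of the original object

@dataclass
class DerivedImage:
    image_id: str                            # unique image id number
    batch_number: Optional[str] = None
    legacy_codes: List[str] = field(default_factory=list)
    file_format: str = "TIFF"                # JPEG, TIFF, etc.
    pixel_size: Optional[Tuple[int, int]] = None
    created_by: Optional[str] = None
    created_on: Optional[date] = None
    last_modified_by: Optional[str] = None
    last_modified_on: Optional[date] = None
    capture_location: Optional[str] = None   # where the second-generation image was made
    series: Optional[str] = None             # series name and position, if any

@dataclass
class ImageRecord:
    original: OriginalObject
    derivatives: List[DerivedImage] = field(default_factory=list)
```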

My next post will cover dealing with depicted content in images. Please feel free to leave comments or questions on the subject.


Synaptica Central: Dow Jones Video Library


Video might have killed the Radio Star, but in today's video-streaming world it is certainly helping distribute knowledge, and that is why we are publishing a video page to augment our blog postings.

Very often I talk to clients who need information to learn about key concepts, or who simply want to share a third-party view with their colleagues on specific topics around controlled vocabularies that I know someone on our team has presented or written about. It could be, for example, a white paper about Audience Centric Views, a video overview of taxonomy management tools and how to use them to collaborate on developing controlled vocabularies, or a real-life case study of an existing client using Synaptica. In the past I have kept these references in a .txt file on my desktop that I turn to when I need them, but since this blog is being used as a resource both for us internally here at Dow Jones and for the community, I figured it was a good time to start a Video Library of our Dow Jones public resources.

So, without any further ado, our Dow Jones Video Library has been published.

This is just the start of turning Synaptica Central into a must-visit resource for our community, so please watch this space for additional resource pages: recommended white papers, industry standards references, must-see videos, must-listen podcasts, and must-read books!

Have suggestions of things we should make sure we add to our resource pages? Please leave them in the comments or drop me a note at


In Developing a Custom Taxonomy Only Time Can Tell

OK Quick Monday Quiz: How Many Minutes Does It Take to Create a Category (aka term, node, leaf, etc)???

I suspect that anyone who has worked on developing a taxonomy has heard this question or a variation of it. It seems like we get it daily! Once a client decides they need or want a taxonomy, they need or want it immediately, so figuring out when it will be ready becomes the next question.

After almost 30 years of being involved in the development of controlled vocabularies, thesauri and taxonomies, I should be able to say it takes X minutes per term, but I am still forced to tell clients that it will depend on a number of things that are usually covered at the start of any engagement, like:

• What is the topic of the taxonomy?
• What is its intended purpose?
• What systems will you use to develop and maintain it?

Once we've answered all these questions, the next one is frequently whether they could just use a taxonomy that has already been developed. No matter what approach is ultimately chosen to create a taxonomy, it still takes time, and the ultimate answer is that it depends on what the client needs, how many terms there will be, how technical those terms are, and the taxonomy development tool being used.

Building a taxonomy for an area you are familiar with can be done fairly quickly, while building one in scientific, technical or medical areas may be much slower. Adding to the issue of the topic is the issue of the tool in which the taxonomy is being built: the more efficient the tool, the faster the development once terms have been decided upon and the research for them is complete.

Experience in developing taxonomies has given me some general metrics that can be used for pricing a taxonomy, but the reality is that the best answer is that it all depends on what is needed.
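For what it is worth, the back-of-envelope arithmetic behind "X minutes per term" is easy to sketch; the figures below are purely illustrative placeholders, not real metrics:

```python
# Purely illustrative roll-up of a per-term estimate into total effort.
# The minutes-per-term figures are placeholder assumptions, not real metrics.
def estimate_hours(term_count, minutes_per_term):
    """Convert a per-term estimate into total effort in hours."""
    return term_count * minutes_per_term / 60

# A familiar topic with an efficient tool vs. a technical topic that needs
# research for every term (hypothetical numbers):
print(estimate_hours(1000, 3))   # 50.0 hours
print(estimate_hours(1000, 15))  # 250.0 hours
```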

So – how long does it take?? – it takes as long as necessary!!


Taxonomies are a Commodity

For some reason or another (lots of travel, several hats at home and work) I've had trouble finalizing this post. Earlier today, though, I read Paul Miller's latest post on ZDNet. There seems to be some discussion about whether or not data is a commodity. I think there most definitely IS data that is a commodity.

Taxonomies are a valuable raw material in the management of information: a file that can be bought and sold and used to improve services. They can be generated by humans, machines, or, even better, humans working with machines. Many taxonomies are a dime a dozen, with little to differentiate between versions of the same data. Some are like Kopi Luwak coffee: rare and extremely valuable. The word "taxonomy" is itself suffering from a kind of genericide. Classical definitions still apply: taxonomies have become commoditized.

The complexity of the controlled vocabulary determines its value to a degree. A simple pick list should be easy and cheap to acquire: a list of countries, for example, or colors, seasons, months; you get the idea. What is the value of a list of industries? Or companies? Maintenance is the primary cost factor: frequent changes require frequent updates, but an authority file in and of itself is not that complex. A broad and deep poly-hierarchical taxonomy, one in which a term can have more than one parent term, I would expect to have more value, because managing those relationships takes more time. An ontology? Those aren't quite commodities yet, but they will get there. Why not yet? Because they still require a great deal of thought and effort.
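To make the poly-hierarchy point concrete, here is a tiny sketch of a term that sits under more than one broader term; the vocabulary and term names are invented for illustration:

```python
# Tiny sketch of a poly-hierarchical taxonomy: a term may have several
# broader (parent) terms. All terms below are invented examples.
from collections import defaultdict

broader = defaultdict(set)  # term -> set of broader (parent) terms

def add_broader(term, parent):
    broader[term].add(parent)

# A simple pick list needs no hierarchy at all:
countries = ["France", "Germany", "Japan"]

# In a poly-hierarchy, one term can live under both a chemical branch and a
# therapeutic branch at the same time:
add_broader("Aspirin", "Salicylates")
add_broader("Aspirin", "Analgesics")

print(broader["Aspirin"])  # both parents, e.g. {'Salicylates', 'Analgesics'}
```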

The source of the data will also help determine its value. Data from trusted sources - for whom integrity is paramount - should be valued higher. Is the data accurate? Is it maintained? Is it in a usable format? Does it have high availability? (Many quality vendors can be found at TaxonomyWarehouse.com.)

The uniqueness of the taxonomy will also drive its value. As with our coffee example above, a taxonomy as ubiquitous as Starbucks will not be as valuable as, say, a pharmaceutical research vocabulary. Given the, uh, processes needed to produce Kopi Luwak, it is rare and therefore fetches a higher price, as would our R&D taxonomy.

Information security concerns also impact value. Our pharmaceutical company, or a financial services provider, is not about to release its vocabulary into the wild. It is a significant intellectual asset that merits a substantial IT effort to protect.

I actually like the fact that taxonomies have become commoditized. Why? Competition drives improvement: in quality, in focus, in security and in usability. These are areas the semantic web community needs to focus on; in my experience, security and usability need attention NOW. Good fences make good neighbors, and when we've got good fences, we can make more links and learn to trust. Icing on the cake!


Synaptica Featured in New Report on Industry and Leading Vendors in the Semantic Web Space

The Synaptica team is in Denver this week doing strategic planning. There are a great number of really interesting problems in information management, and it's fun and rewarding to brainstorm ways of solving them. We're not the only ones scheming, and it's great to see the market itself growing.

David Provost this week published a global review of the industry and leading vendors in the Semantic Web space in which Synaptica from Dow Jones and Dow Jones Client Solutions were highlighted. Dow Jones is in a unique position as a software vendor, consulting services provider and deployer of semantic solutions, which made for a great conversation. I highly recommend you read the report (and not just because I was involved!).

Paul Miller has posted a review of the report on his ZDNet blog, "New report places Semantic Web 'On the Cusp' of something big." Paul adds some great commentary to his summary of the report, should you not be able to get through David's entire document at once.

Happy Reading!

Congratulations to Gabe Rivera on being named to Business Week's 25 Most Influential People on the Web

I am a huge fan of Techmeme, and therefore of Gabe Rivera, who just yesterday was named by Business Week as one of the 25 Most Influential People on the Web!

So I recorded this message earlier today:

For the month of September, I managed to convince our wonderful marketing department to run our new Dow Jones Synaptica Central blog as one of the sponsored posts on Techmeme, and we have certainly had success in driving traffic and interest in the taxonomy side of the Dow Jones Client Solutions business.

I was an early fan of Techmeme - for example, here is a video of me and Robert Scoble in February 2007 showing Clare Hart, Dow Jones EVP, Techmeme, and you can hear how excited I get.

Well, today is the last day of the month, and we decided to skip the month of October as a sponsored post, but we hope to be back soon!

Congratulations once again, Gabe, and thanks for letting us be a sponsored guest to get our Synaptica Central blog off the ground!

Centralized Taxonomy Management and Synaptica Case Study by ProQuest

At last week's conferences we had the wonderful opportunity to show attendees how one of our customers uses Synaptica, our taxonomy and metadata management software.

Paula McCoy, Taxonomy Manager at ProQuest, joined me on stage for two sessions during Enterprise Search Summit and Taxonomy Bootcamp. ProQuest provides global access to one of the largest online content repositories in the world, and Paula is responsible for maintaining the controlled vocabularies that both editors and end users rely on.

The case study addresses the challenges ProQuest faced in managing multilingual controlled vocabularies using multiple Word documents and authority files maintained in an Oracle database. During her presentations, Paula describes how implementing a thesaurus management tool helped ProQuest simplify and standardize its business semantic management: creating a common language, connecting disparate information assets, handling large and varied vocabularies and authority files, linking new and existing editorial systems, enabling hierarchical views, and automating thesaurus management tasks.

Note: Unfortunately, the video does not focus on the slides during the taping as some of my other videos I posted from the conference do, so I have embedded the slides for you to follow along (maybe open a second window and click through).

The first session is 30 minutes: I spend the first 15 minutes on what centralized taxonomy management is and why customers use these tools, and then Paula presents how she uses Synaptica daily to maintain the ProQuest taxonomies.

Centralized Taxonomy Management for Enterprise Information Systems

The second is a 45-minute full case study (you only hear me introduce Paula and make some comments about how we need to make taxonomy exciting!).

Finding a Common Language: Bringing Complex and Disparate Vocabularies Together

Thanks again to Paula McCoy, who did an awesome job!

Enterprise Search Summit: Terry Janssen - “An Integrated Continuum of Search Technologies”

Terry Janssen of Lockheed Martin gave a very interesting presentation on Wednesday the 24th at the Enterprise Search Summit West in San Jose. During his talk he discussed some of the technologies that Lockheed employs as a systems integrator to improve search within their own and other enterprises.

A common theme was revealed during his talk that was repeated in several other presentations and by vendors at the Summit and its sister conference, Taxonomy Bootcamp: namely, that taxonomies are the key to improving and refining enterprise search.

The Eight Habits of Successful Taxonomists

I think next year I am going to do a Top Ten Reasons You Should Your Taxonomist as a late-afternoon skit to get Taxonomy Bootcamp attendees jamming after those heavy lunches! Back-to-back sessions on the subject of taxonomy sometimes need a little excitement injected into the day, so I am plotting already...

This year, one of the sessions I enjoyed most was Gary Carlson's session on 'The Eight Habits of Successful Taxonomists', which he kindly let me videotape so we can share it with all of you who did not make it to Bootcamp this year.

Gary runs his own consulting company, Gary Carlson Consulting, and has over 20 years of experience as a taxonomist, consultant, product manager, and information manager working for companies from small firms to the Fortune 100. [Click the full-screen toggle on the panel below for best quality]

So what are the eight habits of a successful taxonomist?

#1 - Sets Expectations
#2 - Knows the Technology
#3 - Pays Attention to Workflow
#4 - Avoids Religious Wars ("Leave dogma at the door!")
#5 - Follows the $$ Money $$
#6 - Is a Good Listener
#7 - Does Not Use the Word 'Taxonomy' in Polite Company
#8 - Is a Good Juggler!

Some good practical advice that Gary shared with us at the conclusion of his presentation:

1. Identify the business problem at the start of the project
2. Gathering requirements for a taxonomy is a huge process and can lead you into many different areas of the organization
3. Align your projects with the business and preferably with generating revenue rather than efficiency
4. And most importantly? Have Fun!!!

You can download the complete slide deck here: http://www.garycarlsonconsulting.com/pdf/taxonomy-boot-camp-preso.pdf

What other habits do you think a successful taxonomist might have? Please leave them in the comments!


Taxonomy and Resource Location: Finding the Who and the What Using Controlled Vocabularies

While I was at Enterprise Search Summit and Taxonomy Boot Camp, I heard some really interesting presentations. On Wednesday I sat in on a presentation by Ahren Lehnert titled "Taxonomy and Resource Location: Finding the Who and the What". It was a really good presentation because it addressed some of the ways we have been advocating using taxonomies, approaches that might be new to some, and it resonated with me because it's something I have been speaking to my customers about.

Ahren's presentation was about how difficult it can be for an organization to capture and retrieve the knowledge and expertise held by its resources, both internal and external. He discussed how combining taxonomy and search can help organizations with resource location.

He noted that one challenge is finding the right person with the right skills. In a large company this can be difficult because there are so many roles and such a variety of skill sets among employees; in small organizations the difficulty stems from having fewer roles and, therefore, fewer skill sets. He also made the point that job titles don't always indicate the knowledge and skill sets associated with them. An additional challenge can result from a merger or acquisition: each company could have unique titles for the same position, or the same title could be used to describe different positions. The information about these resources can be found in a variety of places: wikis, blogs, human resources applications, sales applications, project management applications, and content management systems. The key is to be able to surface content from all of these repositories through a single search.

Here is an example:
Employee X creates a user profile on the wiki and lists her job skills and interests. She also co-authors a report for a shoe manufacturer. She then attends a seminar on SharePoint and taxonomy and blogs about it. HR already has her resume on file, which lists her former positions.

If I am working on a project with that same shoe manufacturer, I might be interested in talking with others who have expertise in that area. Employee X could be a great resource, but it is possible I don't know that she exists, much less that she has experience working with that particular client. If I could search across all the repositories of information, and a controlled vocabulary were leveraged in that search, I should be able to find Employee X and contact her about the project. The knowledge is there; it just needs to be organized with a taxonomy and retrieved using search combined with the taxonomy.
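As a rough illustration of that idea, here is a minimal sketch assuming each repository item has already been tagged with terms from a shared controlled vocabulary; the employees, repositories, and terms are invented for illustration:

```python
# Minimal sketch of expertise location: items from different repositories are
# tagged with terms from one shared controlled vocabulary, then searched together.
# All people, repositories, and terms below are invented examples.

items = [
    {"source": "wiki", "person": "Employee X", "tags": {"footwear", "sharepoint"}},
    {"source": "reports", "person": "Employee X", "tags": {"footwear", "manufacturing"}},
    {"source": "hr", "person": "Employee X", "tags": {"project management"}},
    {"source": "wiki", "person": "Employee Y", "tags": {"pharmaceuticals"}},
]

# The vocabulary maps the searcher's word to a preferred term, so a query
# for "shoes" still finds content tagged with the preferred term "footwear".
vocabulary = {"shoes": "footwear", "footwear": "footwear"}

def find_experts(query):
    term = vocabulary.get(query.lower(), query.lower())
    return {item["person"] for item in items if term in item["tags"]}

print(find_experts("shoes"))  # {'Employee X'}
```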

When we work on client engagements, we work directly with the client to assess the various repositories of information and develop taxonomy strategies that many times include vast amounts of information about their employees. We can also build custom taxonomies that can be leveraged in the client's enterprise search solutions. Often, however, a client will already have an existing taxonomy in place, so it just needs to be enhanced to meet current needs and expanded for expertise location.

In conclusion, controlled vocabularies can and should be used in various ways to help corporate users find information, from finding the right report with a quick search to finding the right person to validate an opportunity with a quick call or email. Ahren's presentation at the Enterprise Search Summit gave us all some good examples of how and why this should be done.

Image created with http://wordle.net