
Auto-classification: a bit of a stretch

January 4, 2012

Last post I kicked off a series on auto-classification, which has been increasingly top of mind for my clients of late. I want to tackle auto-classification from a few different angles:

  • What it is and what it isn’t – the very name auto-classification conjures up almost magical powers that can transform a gloppy, hulking mass of unstructured content into a highly structured, polished collection of tagged documents. As you might imagine, this is not entirely true.
  • How it works – not from a technical perspective, because this goes way beyond my knowledge. But I do know a bit about the people and process work these tools require to work properly, and the reality of it may surprise you.
  • Whether it works – I’m involved with a POC to test some of the auto-classification solutions out there against that most elusive of things: real client data. We’ve got an organization willing to share a chunk of their shared drive content as well as some vendors willing to use their tools to auto-classify that content. I won’t be identifying either the firm or the vendors here, but I will speak to auto-classification capabilities in general and what I saw working and not working during the POC.

For this post, I want to sort out the first point, because I come across lots of misconceptions about what auto-classification is and how it works (some of them my own).

What do you mean by auto?

I don’t know about you, but when I hear auto-classification, I kind of assume that the tool does all the work. Automatic transmissions, auto-renew clauses in wireless contracts, automatons—the basic idea is that something happens on its own, without human intervention.

So my assumption about auto-classification tools was that they classified content with little to no human intervention. Sure, I'd have to push the go button, and I figured I'd have to QC the output, but as for the rest of it, Watson or HAL or whatever was going to handle that for me automatically.

Turns out that the reality of auto-classification couldn’t be further from this utopian vision of nearly effortless content classification.

All the tools I’ve come across, whatever the actual algorithm(s) they use to classify, need to be trained in order to function effectively. That is, you provide the tool with examples of each kind of document you want it to classify, so it can learn what makes each kind of document what it is and how it differs from other kinds of documents.

Teach a machine to fish

For example, let’s say one of the documents you want your auto-classification tool to recognize is a contract. First thing you need to do is find examples of all the different flavors of contracts at your organization: vendor contracts, services contracts, big ones, small ones, ones you own, ones sent to you by third parties…enough of them that the auto-classification tool will be able to find any contract that might be in your repository with reasonable (e.g., >90%) certainty.
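To make the training idea concrete, here is a toy sketch in Python of classification by example. This is not any vendor's actual algorithm, just an illustration of the workflow: you feed the tool labeled sample documents, it builds a word-frequency profile per document type, and anything that doesn't match a profile well enough goes back to a human.

```python
import math
from collections import Counter

def tokens(text):
    # Lowercase word tokens; a real tool does far more normalization.
    return [w for w in text.lower().split() if w.isalpha()]

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

class TrainedClassifier:
    """Learns one word-frequency profile per document type from the
    labeled examples you feed it, then matches new text against them."""

    def __init__(self):
        self.profiles = {}

    def train(self, label, examples):
        # "Training" here is just accumulating word counts per label.
        counts = Counter()
        for doc in examples:
            counts.update(tokens(doc))
        self.profiles[label] = counts

    def classify(self, text, threshold=0.3):
        # Returns (label, score), or (None, score) when nothing is
        # similar enough -- the "send it to a human" case.
        doc = Counter(tokens(text))
        best, best_score = None, 0.0
        for label, profile in self.profiles.items():
            score = cosine(doc, profile)
            if score > best_score:
                best, best_score = label, score
        return (best, best_score) if best_score >= threshold else (None, best_score)
```

Note that the tool contributes nothing until `train` has been called with good examples for every document type, which is exactly where the human effort goes.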

You might be saying to yourself, sounds great, but I would want the tool to classify lots more than just contracts. I have hundreds, potentially thousands, of document types I would want to classify. That seems like it would take a long time and a lot of effort.

And you would be right. The amount of time and effort it will take you to train an auto-classification tool is substantial, and not just on the part of a few resources. Because you’re going to need lots of examples of document types from across the whole organization, folks from every department will need to participate and shoulder some of the burden.

Danger Will Robinson

Based on the amount of prep work it takes to get auto-classification tools up and running, I think it’s more useful to talk about machine classification rather than auto-classification. Here’s what I mean…

Think about a robot in a car parts factory. It arrives day one with all sorts of capabilities: it can rotate its claw, bend at the elbow and shoulder, open and close its pincers, etc. But if you put it in front of the parts that make up the muffler for a Honda Accord and turn it on, nothing happens, because before that robot can do anything productive, you need to teach it how to build the muffler for a Honda Accord.

You need to teach it, step by step and movement by movement, precisely what to do in order to build that muffler correctly. Once you do that, then you can put box after box of parts in front of it, and it will build you mufflers all day (and all night) long. But without the up-front efforts to train the robot, or with poor quality up-front efforts, you’ll get nothing.

Auto-classification tools are pretty much the same from what I can see, and in this respect, it seems better to think of them as machine classification, i.e., after teaching them what to do, they then execute on the rules you taught them to classify content. But what they don’t do is come up with the classification rules on their own in the first place.
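In code terms, the distinction might look like this minimal sketch (the rule names and document fields are invented for illustration): the machine faithfully executes human-authored rules, but the rules themselves had to be written up front.

```python
# "Machine classification": the machine only executes rules a human
# wrote ahead of time. Rules and fields here are hypothetical.
RULES = [
    # (human-authored predicate, classification it assigns)
    (lambda doc: "invoice" in doc["filename"].lower(), "Finance/Invoices"),
    (lambda doc: doc["department"] == "Legal", "Legal/Contracts"),
]

def machine_classify(doc):
    """Apply the taught rules in order; the machine never invents a rule."""
    for predicate, classification in RULES:
        if predicate(doc):
            return classification
    return "Review Queue"  # no rule matched -> back to a human
```

Delete the `RULES` list and the function classifies nothing, which is the whole point: the intelligence lives in the up-front rule writing, not in the execution.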

The final word

Okay, so much for what auto-classification is and isn’t. In the next post we’ll look in more depth at how the up-front training of these tools works, to give you a better idea of what it takes to be successful.

In the meantime, I’d love to hear from folks out there, especially those of you who have more experience in the trenches working with these tools: am I right about them, or have I missed something? Do you see the matter differently? Is there another way to frame the issues?

Whatever your thoughts, jump in, and let’s get the conversation started.

6 Comments
  1. January 4, 2012 10:35 am

    I have auto-classification experience with Documentum. Out of the box it won’t happen. And without a thorough business analysis, not only of how to classify records but also of how the work gets done with complete auto-attribution, it is next to impossible.

    So, first you need to get your act together with business analysis and charting the flow of information in the organization, especially from an operational standpoint. I.e., start at the COO’s office and work your way through the organization… then you can set up rules that say: if this document is an xyz, then cross-reference it to this classification tree abc, and so on.
    You also need two distinct folder trees for the folder structure. In Documentum it is easy to cross-link one object between different structures; in other products this may be possible, but in many it is not.

    So once your analysis is done, you need a rules engine and some code to scan through the existing documents and re-classify and cross-link them. For new documents you can use action-activated methods that auto-attribute and classify on check-in. You get the attributes from the context of who is creating what document for what purpose. If it is not mapped out in advance with analysis, it won’t work. All the buzz about tools for taxonomies, OCR, etc. is just that: there are places for them, but they won’t accomplish this without a lot more.

    I can do it in Documentum, but it is not easy to set up and can be expensive. Down the road the ROI will be amazing, but the sell to the organization is hard unless you are reducing headcount, as in SAP value propositions.

  2. January 4, 2012 10:36 am

    That last sentence in the last paragraph of the “Danger Will Robinson” section? – ” But what they don’t do is come up with the classification rules on their own in the first place.” From a process and technical perspective both, that’s what the auto-classification hokey-pokey is ALL about. Yup, you got it.

  3. January 9, 2012 2:46 am

    “Teach a machine to fish” is a clever way to describe probabilistic approaches to classification and what it means to train the system.

    What about other approaches with predefined taxonomies or rule-based taxonomies? Do you have experience good or bad with them?

    • January 10, 2012 9:04 am

      In my experience it becomes an auto-classify and auto-attribute exercise where you capture as much as possible from the context of the document creation process. If you are just ingesting a shared drive, the only things you can go by are the folder structure, owner, and file name, unless you can introspect the file. With some imaging apps like Captiva you can grab keywords like vendor, invoice, address, etc. from image files, and you can full-text index text files or even peek into MP3s and engineering diagrams.

      However, the best way to capture this information is from the user or app while the document is being created. This sounds easy, but we all know that users do not want to fill in a bunch of attributes while checking in a document. So if you analyze the operational process around the creation of content (and most people do not do this just for fun) and find out which kinds of documents are created for which purpose, it becomes much easier to map out a process and know how to classify it. That is why I like Documentum Taskspace combined with TBOs that detect and link files according to rules, as opposed to Webtop, where the user must drive the classification. You can channel the user experience rather than give them a huge power tool and force them to classify it.

      • February 2, 2012 7:55 am


        You’re spot on about mentioning capture, because auto-classification in this space is by far the most mature…but of course, my caveat about “auto” vs. “machine” is still the case: the documents being scanned are highly structured, and so the system can determine classification based on them. I think classification technology in other areas (e.g., for slop on shared drives) will get to that level, but it will take time, just like it did for OCR.

        Thanks for jumping in and taking the time to comment…
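The context-capture approach described in this thread (deriving attributes from folder structure, owner, and file name on ingest) might be sketched like this; the field names are invented for illustration:

```python
from pathlib import PurePosixPath

def attributes_from_context(path, owner):
    """Derive candidate attributes from a shared-drive path; folder
    structure, owner, and file name are often all you have on ingest.
    Field names here are hypothetical, not any product's schema."""
    p = PurePosixPath(path)
    parts = p.parts
    return {
        "owner": owner,
        # Top-level folder often maps to a department...
        "department": parts[1] if len(parts) > 2 else None,
        # ...and the immediate parent folder hints at the document type.
        "doc_type_hint": parts[-2] if len(parts) > 2 else None,
        "title": p.stem.replace("_", " "),
        "extension": p.suffix.lstrip("."),
    }
```

As the commenters note, such context-derived hints are only as good as the folder hygiene behind them; capturing attributes at creation time is far more reliable.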




