Auto-classification POC: the results show (part 1)

March 23, 2012

So the auto-classification POC I’ve been involved with is over, and we’re putting the finishing touches on the report out for the client and the vendors. And although a public report is not possible given the agreements between everyone involved in the effort, there are some more general takeaways I can share, which, despite their generality, will hopefully be valuable to folks out there interested in the auto-classification and text analytics (AC/TA) space.

The tools work

I’ll admit to going into the POC with a (very) healthy skepticism about whether these tools could do all (or even some) of what vendors claim they can. Much of this was based on (1) the natural tendency of the analyst in me to distrust the sales/marketing claims of software vendors and (2) horror stories from clients of how these tools had failed them, sometimes spectacularly.

And I’ll admit that I left the POC pleasantly surprised at how well the tools actually work on the ground. All of them basically did what they said they did (imagine that!). They differed in things like UI/UX, how precisely they leveraged AC/TA capabilities to deliver functionality, and the balance of out-of-the-box functionality, configuration, and custom development. But in terms of simply being able to do auto-classification or text analytics (or both), they worked.

But keep in mind that saying this is akin to saying that the Milwaukee, Makita, and Craftsman chop-saws at the Home Depot all work. Will they cut wood? Sure. Will you be able to build a custom cabinet with them if you buy one and take it home with you? Maybe—depends on how good a carpenter you are.

Same goes for all the AC/TA tools. They do what they say they do, but that doesn’t do you much good until you figure out how exactly you’re going to use them in the service of some business goal—and it turns out that therein lies the rub.

A means to an end

What I came to realize is that AC/TA tools do you little to no good in and of themselves. To get any tangible value out of them, you’ve got to have clear, articulated reasons that drive how you use them and the outcomes you aim for. In this, they’re no different from any other technology tool, but I think we often view AC/TA as an end in itself, maybe because the technology is so futuristic and geekily cool.

Based on working side-by-side with vendors and a client over a number of weeks for the POC, there seem to be a few broad categories of goals you might use AC/TA tools to strive for:

  • Storage reduction – to reclaim storage and save hard dollars, to make storage management easier (or both)
  • Risk mitigation – e-discovery, audit, disaster recovery/business continuity, etc.
  • Operational efficiency – improve findability, reduce rework, foster collaboration and information sharing

Once you determine which of these (or other) goals are important to you, you can begin to structure your use of AC/TA tools to have a hope of achieving them.

Auto-classification is much cheaper than manual classification

But before you get too lathered up about this, realize that saying this is like saying that an Aston Martin is cheaper than a Lamborghini. You can’t get either one of them for the price of a Honda, which is what you likely have budget for in the first place.

Before we dash the dreams of enterprise auto-classification, though, let’s run some back-of-the-napkin numbers to see how the two methods compare.

Let’s assume 40TB of content with an average of 250KB per document. Strictly, that works out to about 160M documents; I’ll round up to 200M to keep the math simple.

First, manual classification:

  • If it took on average 30 seconds to classify each document, you would need a little over 1.6M hours to classify all of them
  • If you had three shifts of 100 workers each and had the teams classifying seven days a week, you would need almost two years to complete the task
  • If you paid them US wages ($35/hour), it would cost somewhere north of $58M
  • If you paid them offshore wages ($5/hour), it would cost somewhere north of $8M
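
For anyone who wants to rerun the manual numbers with their own assumptions, here is a minimal Python sketch of the arithmetic above. The function and parameter names are made up for illustration, and it assumes eight-hour shifts (so three shifts cover the full 24 hours) with no allowance for breaks, turnover, or quality checks.

```python
def manual_classification(docs, seconds_per_doc, workers_per_shift,
                          shifts_per_day, hours_per_shift, hourly_rate):
    """Return (total_hours, calendar_days, labor_cost) for classifying by hand."""
    total_hours = docs * seconds_per_doc / 3600
    # Worker-hours available per calendar day, working seven days a week
    hours_per_day = workers_per_shift * shifts_per_day * hours_per_shift
    calendar_days = total_hours / hours_per_day
    labor_cost = total_hours * hourly_rate
    return total_hours, calendar_days, labor_cost


DOCS = 200_000_000  # the rounded document count used in this post

for label, rate in (("US", 35), ("offshore", 5)):
    hours, days, cost = manual_classification(DOCS, 30, 100, 3, 8, rate)
    print(f"{label}: {hours:,.0f} hours, {days / 365:.1f} years, ${cost / 1e6:.1f}M")
```

Change the seconds per document or the headcount and the totals move in a straight line, which is really the point: at this scale, every extra second per document costs hundreds of thousands of hours.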

So much for the Lamborghini. Now let’s look at the Aston Martin:

  • Cost of typical AC/TA software: $750K
  • Time to collect documents to build a single training set: 40 hours
  • Number of record types to train the tools on: 300
  • Total hours required to gather training sets: 12,000 (at an internal cost of about $1M if you use an $85/hour fully loaded FTE rate)
  • Total hours required to train the AC/TA software using the training sets: 360 (at an internal cost of $30K at the $85 rate)
  • Number of documents requiring manual classification: 20M (the 10% of the 200M that the tools miss at a 90% accuracy rate)
  • Hours to classify at 30 seconds per document: 166,667
  • Cost to classify onshore ($35/hour): just north of $5.8M
  • Cost to classify offshore ($5/hour): just north of $800K
  • Total hours to auto-classify: a little more than 179,000 (gathering training sets, training the software, and manually classifying the leftovers)
  • It would take three shifts of 20 folks working seven days a week a little over a year to complete the task
  • Total cost of AC/TA onshore: just north of $7.6M
  • Total cost of AC/TA offshore: just north of $2.6M
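
The same kind of sketch works for the auto-classification scenario. Again, this is only napkin math in Python: the function name, parameter names, and dictionary keys are hypothetical, the defaults are the assumptions listed above, and it treats 90% accuracy as meaning that exactly 10% of documents fall out for manual classification.

```python
def auto_classification(review_rate, docs=200_000_000, accuracy=0.90,
                        software_cost=750_000, record_types=300,
                        hours_per_training_set=40, training_hours=360,
                        internal_rate=85, seconds_per_doc=30):
    """Return hours and costs for the auto-classification scenario."""
    # Internal effort: gather one training set per record type, then train
    gather_hours = record_types * hours_per_training_set
    internal_cost = (gather_hours + training_hours) * internal_rate

    # Documents the tool does not classify, handled by hand at review_rate
    review_docs = docs * (1 - accuracy)
    review_hours = review_docs * seconds_per_doc / 3600
    review_cost = review_hours * review_rate

    return {
        "review_docs": review_docs,
        "total_hours": gather_hours + training_hours + review_hours,
        "total_cost": software_cost + internal_cost + review_cost,
    }


for label, rate in (("onshore", 35), ("offshore", 5)):
    result = auto_classification(rate)
    print(f"{label}: {result['total_hours']:,.0f} hours, "
          f"${result['total_cost'] / 1e6:.1f}M total")
```

Playing with the accuracy parameter is instructive: the manual cleanup of the documents the tool misses dominates the total, so a few points of accuracy either way swing the budget far more than the software price does.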

Again, is auto-classification cheaper than manual classification? Sure. But leave manual out of it, since no one in their right mind would do it.

Instead, ask yourself whether you could get $2.6M and 20 bodies round the clock (or even eight hours a day) to do auto-classification. In today’s tight budgeting environment, I think it would be a hard sell, all other things being equal.

Given all this, a key takeaway for me is that very few clients will decide to do auto-classification all by itself: it’s simply too time-consuming and expensive.

The final word

So what will people do? How will they leverage AC/TA tools in the real world, with Honda (or Kia) budgets? In the next post I’ll sketch out my thinking on how organizations can best leverage AC/TA tools in a targeted, incremental way to gain many of the benefits of auto-classification while minimizing the costs in time, money, and resources.

In the meantime, jump in and let everyone know what you think, share your own experiences with AC/TA, or just heckle—let’s get the conversation started.

3 Comments
  1. March 27, 2012 11:43 am

    Very interesting stuff Joe… thanks for sharing your POC insights…

  2. March 30, 2012 11:42 am

    Fully agree with the statement: “They do what they say they do, but that doesn’t do you much good until you figure out how exactly you’re going to use them in the service of some business goal—and it turns out that therein lies the rub.”

    However I don’t get the labor cost for the AC approach… oops, scratch that. It’s for the manual assignment of the failed 20M docs. How does one know they are failed, by the way? More manual labor?

    • Lane Severson
      April 3, 2012 8:47 am

      “How does one know they are failed, by the way? More manual labor?”

      Pitch, I believe that these are documents that have not met the threshold of certainty set up as “acceptable”. So when we set up the system, we decide that if there is a high level of certainty that it is a marketing doc (75-85%, for example), then it is auto-classified as “marketing”. If the system is less than 75% certain of the doc type, it is pushed for manual review.
