
Auto-classification – One Piece of the Puzzle

March 1, 2012

I’m still in the middle of the auto-classification POC I started back in December, although it’s almost done. Right now, we’ve finished the on-site work and are back at the ranch combing through data, crunching numbers, and trying to net out the results for ourselves, for the vendors, and for the client. Once that’s done, I’ll definitely share as much as I can with you all without getting on the wrong side of confidentiality with anyone, but in the meantime, I wanted to share two related aha moments I had around auto-classification in the last few weeks.

There’s more to it

The first aha moment was when I realized that auto-classification (AC) isn’t the only game in town. There are also analytics tools that accomplish many of the same things as AC but from a different angle, as well as day-forward efforts to maintain better information hygiene:

  • Auto-classification – Using training sets and iterative passes, teach a system how to recognize and classify record types
  • Analysis – Using powerful search and analytic tools to interrogate content to better understand it and perform simple actions
  • Manual Hygiene – Reengineering user and system behavior to enforce information consistency and classification at the point of creation/use

And while the last one has been the holy grail of information lifecycle management for some time, the second is something I haven’t heard most folks talking about in the AC space.

Basically, instead of training up a tool to recognize known categories in a set of data, you point a tool at a set of data and it tells you something about what’s there. This can be basic sys admin stuff like duplication rates, content aging, and file types, but can get more complex with the application of business rules using advanced search.
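To make the "basic sys admin stuff" concrete, here is a rough sketch of what such a profile could look like in Python. It is not how any particular analysis product works. The function name `profile_share`, the three-year staleness threshold, and the use of last-modified time as a stand-in for content age are all my own assumptions for illustration.

```python
import hashlib
import time
from collections import Counter
from pathlib import Path


def profile_share(root, stale_after_days=365 * 3, now=None):
    """Summarize duplication, file types, and aging for files under root.

    Hypothetical helper: the threshold and the output shape are assumptions.
    """
    now = time.time() if now is None else now
    cutoff = now - stale_after_days * 86400
    seen_hashes = set()
    by_type = Counter()
    total = duplicates = stale = 0

    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        total += 1
        by_type[path.suffix.lower() or "(none)"] += 1
        # Identical bytes count as duplicates regardless of file name.
        # Reading whole files into memory is fine for a sketch only.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            duplicates += 1
        else:
            seen_hashes.add(digest)
        # Last-modified time is a rough proxy for content age.
        if path.stat().st_mtime < cutoff:
            stale += 1

    return {
        "total": total,
        "duplicate_rate": duplicates / total if total else 0.0,
        "stale_rate": stale / total if total else 0.0,
        "by_type": dict(by_type),
    }
```

Calling `profile_share` on a folder returns the share of files that are byte-for-byte duplicates, the share untouched since the cutoff, and a count per file extension, which is roughly the first report you would want before deciding what to do next.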

For example, you could find every file where ###-##-#### appears to get an indication of how many Social Security numbers you have lying around your shared drives, or find where Social Security numbers appear in conjunction with one or more terms from a list of medical conditions to catch personal health information, and so on.
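The two searches just described can be sketched in a few lines of Python. Treat this as an illustration under assumptions: the term list is a made-up placeholder, `flag_text` is a hypothetical name, and a bare ###-##-#### pattern will also match things that are not Social Security numbers, so the results are only an indication.

```python
import re

# Matches the ###-##-#### shape; word boundaries avoid hits inside longer digit runs.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical, illustrative list; a real one would come from the business.
MEDICAL_TERMS = ("diabetes", "asthma", "hypertension")


def flag_text(text, medical_terms=MEDICAL_TERMS):
    """Report SSN-shaped strings and whether they co-occur with medical terms."""
    ssn_count = len(SSN_PATTERN.findall(text))
    lowered = text.lower()
    terms_found = sorted(t for t in medical_terms if t in lowered)
    return {
        "ssn_count": ssn_count,
        "possible_pii": ssn_count > 0,
        # PHI is flagged only when an SSN-shaped string and a term both appear.
        "possible_phi": ssn_count > 0 and bool(terms_found),
        "medical_terms": terms_found,
    }
```

Run over the extracted text of each file, the `possible_pii` and `possible_phi` flags give the kind of counts described above.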

What’s more, many of these analysis tools can also take rudimentary action based on the search results, e.g., move or delete files, add or change system metadata or document headers, create reports and dashboards, etc.

A team effort

The second aha moment was that the goal isn’t to decide which of these three methods to use, but rather to determine how to use all three in concert to speed up the process and increase quality.

With that in mind, I came up with the following figure to lay out what seems like one way to use these three approaches effectively.

Step 1 – Analysis. Begin by de-duping the content, then run file type analysis to identify file types that can be removed immediately according to policy (.exe, for example). Next, identify broad categories (PII, PHI, IP) and the initial content blocks that can be treated en masse (flagged, moved, deleted). Then move to narrower categories to act on, starting with keywords for active legal holds and departmental records categories (a set of keywords that reasonably identify finance, HR, or training documents if they appear). Finally, use content aging to find content that is beyond its useful life and either delete it or move it to cheaper storage.
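Here is one way the ordering in Step 1 could be expressed as code. It is a sketch, not a policy: every keyword, the seven-year age limit, the `FileRecord` fields, and the disposition strings are hypothetical placeholders, and it assumes the file hash, text, and age have already been extracted.

```python
import re
from dataclasses import dataclass

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# All values below are hypothetical placeholders, not real policy.
REMOVABLE_TYPES = {".exe"}
HOLD_KEYWORDS = {"acme litigation"}
DEPT_KEYWORDS = {"finance": {"invoice", "ledger"}, "hr": {"performance review"}}
MAX_AGE_DAYS = 7 * 365


@dataclass
class FileRecord:
    name: str
    sha256: str
    extension: str
    text: str
    age_days: int


def triage(records):
    """Return {name: disposition}, applying the Step 1 checks in order."""
    seen = set()
    result = {}
    for rec in records:
        body = rec.text.lower()
        # 1. De-dupe: a hash already seen means an identical copy exists.
        if rec.sha256 in seen:
            result[rec.name] = "delete: duplicate"
            continue
        seen.add(rec.sha256)
        # 2. File types removable by policy.
        if rec.extension.lower() in REMOVABLE_TYPES:
            result[rec.name] = "delete: file type"
        # 3. Broad categories (only an SSN-shaped PII check is shown here).
        elif SSN.search(rec.text):
            result[rec.name] = "flag: possible PII"
        # 4. Narrower categories: legal holds first, then departments.
        elif any(k in body for k in HOLD_KEYWORDS):
            result[rec.name] = "retain: legal hold"
        else:
            dept = next(
                (d for d, kws in DEPT_KEYWORDS.items() if any(k in body for k in kws)),
                None,
            )
            if dept:
                result[rec.name] = f"route: {dept}"
            # 5. Content aging comes last, so held content is never aged out.
            elif rec.age_days > MAX_AGE_DAYS:
                result[rec.name] = "archive or delete: aged out"
            else:
                result[rec.name] = "pass to step 2"
    return result
```

The first matching check wins, which mirrors the order of the paragraph above. Anything that falls through every check is what gets handed to the AC tool in the next step.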

Step 2 – Auto-classification. Based on the content left from step one, train the AC tool on the highest value, highest risk record types or document categories, use it to classify the content, and then act on the results (move, delete, retain).
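To give a feel for what "training" means here, below is a toy classifier in Python. To be clear, this is a stand-in I picked for illustration (a small multinomial naive Bayes with add-one smoothing), not the method used by any AC product or in the POC. The class name `TinyClassifier` and the labels in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


class TinyClassifier:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        """Add one labeled example from the training set."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the prior, then add smoothed per-word likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, you would call `train` once per labeled sample (say, a handful of documents labeled "invoice" and a handful labeled "hr_review"), then call `classify` on unlabeled text and route the document based on the label that comes back. The iterative part is reviewing the misses, adding them to the training set, and running again.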

Step 3 – Day-forward hygiene. Now that you’ve streamlined your content, put in place both automated/systematic processes and improved manual processes to keep that content under control (or at least less out of control).

As the figure indicates, the volume of content drops with each step, so you use the less resource-intensive tools up front on your full rock pile of junk, and then move to more and more resource-intensive tools as the rock pile shrinks.

And the value to the organization increases with each step, until eventually, with day-forward hygiene, the greatest benefits are reaped by not having to get into this whole cleanup business in the first place.

The final word

As I said at the top of the post, I’ll be writing in more detail about the POC itself once we get done with our analysis. But in the meantime, I hope that my overview of AC in the context of analysis and day-forward hygiene gets you thinking. As always, jump in and share your own experiences and thoughts…or just keep me honest with some good old-fashioned heckling. Let’s get the conversation started.
