SharePoint EDMS Architecture

Blog post co-authored by Mike Hacker, Scott Derby, Tim Benjamin and Julian Soh.

As organizations embark on using SharePoint for large EDMS (Enterprise Document Management System) projects, they may encounter SharePoint support limits that affect the overall architecture of the solution. This article describes one such support limit and shows how to build a SharePoint solution in which end users are unaware of the underlying technology and of the decisions made to keep the environment within the Microsoft-defined support limits.

The Issue

SharePoint has documented boundaries and limits that ensure good performance and recoverability. These boundaries and limits for on-premises deployments are described in detail in this TechNet Article. Online deployments are governed by the same maximums; however, the service offers varying limits depending on the service you subscribe to. This ensures that the online service can meet not only its performance requirements but also its SLAs for RPO (Recovery Point Objective) and RTO (Recovery Time Objective). Information on the limits for each service is available in the appropriate service descriptions (Office 365 Beta, BPOS-D). Note that the online service is constantly evolving and that these limits change rapidly.

Regardless of whether your deployment is on-premises or online, one of the support limits is the content database size. A content database is a dedicated SQL database that holds documents and information for a SharePoint 2010 web application. For a collaboration-style SharePoint web application, the support limit is 200GB per content database, whether the content is stored in a traditional SQL content database or using Remote Blob Storage (RBS).

EDMS solutions typically contain large amounts of information; therefore, it is important to understand the basic architecture of SharePoint and how to properly design a solution for managing terabytes of content.

The primary logical container within SharePoint is the web application. A single on-premises SharePoint 2010 web application can have up to 300 content databases, which provides a storage limit of approximately 60TB (300 x 200GB) per web application. In an online deployment, one does not manage a web application. Within a web application there are additional containers called site collections. Site collections provide a logical boundary for managing ownership and storage allocation within SharePoint. Within the site collections are the individual collaboration sites, which contain the lists and libraries used by the end users. An important concept is that a site collection cannot span multiple content databases. This means that the maximum size of any site collection is 200GB. For an EDMS solution to store and manage more than 200GB of content, multiple site collections are required.
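The arithmetic behind these limits can be sketched in a few lines of Python. This is only a planning aid under stated assumptions: the two constants are the figures quoted in this article, while the function names and the fill_ratio headroom parameter are hypothetical and not part of any SharePoint tooling.

```python
import math

# Limits quoted in this article for SharePoint 2010 collaboration workloads.
CONTENT_DB_LIMIT_GB = 200          # supported size of one content database
MAX_CONTENT_DBS_PER_WEB_APP = 300  # content databases per web application


def web_app_capacity_tb() -> float:
    """Upper bound on storage for one web application, taking 1 TB = 1000 GB."""
    return CONTENT_DB_LIMIT_GB * MAX_CONTENT_DBS_PER_WEB_APP / 1000


def min_site_collections(total_content_gb: float, fill_ratio: float = 1.0) -> int:
    """Fewest site collections needed to hold total_content_gb.

    A site collection cannot span content databases, so each one is capped
    at the content database limit. fill_ratio (an assumption of this sketch)
    reserves headroom, e.g. 0.8 plans each collection to 80% of the limit.
    """
    if not 0 < fill_ratio <= 1:
        raise ValueError("fill_ratio must be in (0, 1]")
    usable_gb = CONTENT_DB_LIMIT_GB * fill_ratio
    return math.ceil(total_content_gb / usable_gb)
```

For example, min_site_collections(5000) gives the minimum number of site collections for 5TB of content, and passing fill_ratio=0.8 shows how much that number grows once room for growth is reserved in each collection.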

The remainder of this article outlines possible solutions for enabling easier management and collaboration in an EDMS system that contains multiple site collections for storing documents and information, whether online or on-premises. As with any good SharePoint solution, the first step is writing your governance[1] document.

Governance and Information Architecture

Before diving directly into the “how” for the solution it is important to take a step back and understand the solution being built. Understanding only the technical capabilities of SharePoint will allow you to create solutions with large sets of data; however, that solution may not really meet the needs of your organization. Taking the time to think about the end results and writing a governance document will help ensure a successful SharePoint deployment and EDMS solution.

The governance overview for SharePoint 2010 describes governance as “the set of policies, roles, responsibilities, and processes that guide, direct and control how an organization’s business divisions and IT teams cooperate to achieve business goals.” The process of writing a governance document will help everyone better understand the purpose of SharePoint within the organization. A good governance document can also help protect the organization from security threats or noncompliance liability; it will also ensure the best return on investment.

For an EDMS solution, information architecture will play a very important role. The Information Architecture Institute defines information architecture as “the art and science of organizing and labeling web sites, intranets, online communities, and software to support findability and usability”. Within SharePoint, information architecture includes how the data within a web application (the sites, documents, lists and data) is organized and presented to the users. SharePoint 2010 provides many tools, such as managed metadata, enterprise search, and hierarchical metadata navigation, which can be useful when implementing the information architecture.

Information architecture is really more of an art than a science. It will take an investment of time and effort by many individuals to collect, organize and sort all of the documents and data that you will be storing in your SharePoint EDMS solution. Information architecture is a very important step that should not be skipped, regardless of the EDMS solution chosen, and it will provide a high return on investment.

Good information architecture within SharePoint will help reduce the costs for finding information and ultimately increase end user satisfaction. Diving into specifics about information architecture theory or processes is beyond the scope of this document. If you are interested in learning more about information architecture, check out the many books available on the subject.

The Solution

The supportability limits for SharePoint were documented to ensure that organizations get the best performance and reliability from their SharePoint implementation. No organization wants to be placed in the unenviable situation where their SharePoint solution is not performing well or their Recovery Time Objective (RTO) or Recovery Point Objective (RPO) is unattainable.

A well designed SharePoint user interface can be implemented so that the end user is abstracted away from the underlying site collection / site architecture required in order to stay within the SharePoint best practices and supportability of online solutions. Several key SharePoint design elements such as search, document IDs, metadata navigation, managed metadata service, and the content organizer can be used to meet this goal.

The following sections of this article describe these SharePoint technologies and explain in more detail how they might be used to accomplish this goal.

Site Taxonomy & Top Level Site Design

The Site Taxonomy and Top Level Site Design are the most critical components for a successful implementation. In coming up with a Site Taxonomy and Top Level Site Design, we keep the following design considerations in mind:

· Ease of Use. Visitors, contributors, and content stewards will find document management features to be easier to use, adaptable, easier to manage, fast to deploy, and highly capable by default. Document management features facilitate improved document creation and management.

· Enterprise Readiness. Regardless of the document management scenario, the solution must provide a high degree of performance at scale, rich feature depth, customizability, and extensibility. The document management features of SharePoint are infused with metadata and use metadata to drive document retrieval functionality.

· Broad Participation. Everyone in the organization has access to, and can benefit from, document management features. You can adjust capabilities so that everyone can see just what they need.

Content Organizer

The Content Organizer makes routing decisions by analyzing the metadata associated with each individual item. You can base routing decisions on content types and metadata. This enables the Content Organizer to make complex document management decisions based on rules defined by the content steward or librarian.

The Content Organizer requires the use of site content types. A content type is a reusable collection of metadata (columns), workflow, behavior, and other settings for a category of items or documents in a Microsoft SharePoint 2010 list or document library. Content types enable you to manage the settings for a category of information in a centralized, reusable way. When you create a new content type and add it to a site’s collection, it becomes what is known as a site content type. At this point, the content type is available to add to lists and document libraries but has not yet been added to any of them.

Documents flow through the content organizer according to these rules:

· Documents with the correct content type, metadata, and matching rules are automatically routed to the final library and folder.

· Documents that lack the amount of metadata required to match a rule or that are missing required metadata are sent to the Drop-Off Library so that the user can enter metadata.

· Users manually add site content types to target libraries.

· The first time that you add a rule for one of a site’s content types, that content type is automatically added to the Drop-Off Library.

· After a document has the appropriate amount of metadata and the specific metadata required to match a rule, it is automatically routed to the target library and folder.
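The routing behavior described in the list above can be illustrated with a small Python model. This is a simplified sketch, not the Content Organizer's actual implementation or API: the Rule and Document classes, the route function, the example rules, and the target paths are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

DROP_OFF = "Drop-Off Library"


@dataclass
class Rule:
    content_type: str                           # site content type the rule applies to
    required_fields: List[str]                  # metadata the rule needs in order to match
    condition: Callable[[Dict[str, str]], bool]
    target: str                                 # hypothetical target library path


@dataclass
class Document:
    name: str
    content_type: str
    metadata: Dict[str, str] = field(default_factory=dict)


def route(doc: Document, rules: List[Rule]) -> str:
    """Return the target of the first matching rule, else the Drop-Off Library."""
    for rule in rules:
        if rule.content_type != doc.content_type:
            continue
        # A rule can only match when all the metadata it needs is filled in.
        has_metadata = all(doc.metadata.get(f) for f in rule.required_fields)
        if has_metadata and rule.condition(doc.metadata):
            return rule.target
    # No rule matched, or metadata is missing: wait for the user in Drop-Off.
    return DROP_OFF


# Hypothetical rules: older contracts go to an archive site collection.
EXAMPLE_RULES = [
    Rule("Contract", ["Year"], lambda md: int(md["Year"]) < 2009, "/sites/archive/Contracts"),
    Rule("Contract", ["Year"], lambda md: int(md["Year"]) >= 2009, "/sites/legal/Contracts"),
]
```

Because each target can live in a different site collection, rules like these are what let a single upload location spread content across several content databases without the user having to choose one.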

Document IDs

A document ID is a unique identifier for a document or document set and a static URL that opens the document or document set associated with the document ID, regardless of the location of the document. Document IDs provide:

· A way to reference items such as documents and document sets in SharePoint Server 2010 that is less fragile than using URLs. URLs break if the location of the item changes. In place of the URL, the document ID feature creates a static URL for each content item with a document ID assigned to it.

· More flexible support for moving documents or document sets at different points in the document life cycle. For example, if you create a document on a My Site or Workspace page and then publish it on a team site, the document ID persists and travels with the document, circumventing the broken URL problem.

· A document ID generator that assigns a unique document ID to items. You can easily customize the ID by defining an ID prefix (using four or more characters), and if needed, you can use the document management API to customize the ID generator of the document ID providers.
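The idea of a stable identifier that survives moves can be shown with a minimal Python sketch. It assumes a simple in-memory lookup table; the DocumentIdRegistry class, its methods, and the prefix-plus-counter ID format are hypothetical and do not reproduce the ID format or API that SharePoint uses.

```python
import itertools
from typing import Dict


class DocumentIdRegistry:
    """Maps stable document IDs to the current location of each document."""

    def __init__(self, prefix: str):
        # The article describes a customizable prefix of four or more characters.
        if len(prefix) < 4:
            raise ValueError("prefix must have at least four characters")
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._locations: Dict[str, str] = {}

    def register(self, url: str) -> str:
        """Assign a new ID to the document currently stored at url."""
        doc_id = f"{self._prefix}-{next(self._counter)}"
        self._locations[doc_id] = url
        return doc_id

    def move(self, doc_id: str, new_url: str) -> None:
        """Record a new location; the ID itself never changes."""
        if doc_id not in self._locations:
            raise KeyError(doc_id)
        self._locations[doc_id] = new_url

    def resolve(self, doc_id: str) -> str:
        """Return the current URL for a document ID."""
        return self._locations[doc_id]
```

A link that stores only the ID keeps working after a document is moved from a team site collection to an archive site collection, because resolve always returns the latest location.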

Managed Metadata Service

SharePoint 2010 provides a new feature called the managed metadata service which holds a hierarchical collection of centrally managed terms that you can define and then use as attributes for items, including documents. A term is any word or phrase that you would use as an attribute for an item stored within SharePoint.

By having a centrally managed set of terms you will experience the following benefits:

· More consistent use of terminology – by requiring the use of pre-defined terms you can ensure that users are utilizing the proper enterprise terminology when selecting attribute values for items.

· Better search results – simple searches now return more relevant information when the items stored within SharePoint are using a consistent set of attribute values.

· Dynamic – Enables the organization to add or modify the centrally managed term sets and have all of the related items automatically updated. For example, let’s say you have a term set for department names and you have multiple documents using that term set to store a department attribute. If a name change occurs for a department, you only need to change it in the managed metadata service and all documents with an attribute of the original department name will be updated to the new name.
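The department rename example can be modeled in Python to show why central terms make such updates cheap. This sketch assumes that documents reference a term by a stable identifier rather than copying its label; the TermStore class and display_department function are hypothetical, and the sketch does not describe how SharePoint propagates changes internally.

```python
import uuid
from typing import Dict


class TermStore:
    """A central set of terms, each with a stable ID and a changeable label."""

    def __init__(self) -> None:
        self._labels: Dict[str, str] = {}

    def add_term(self, label: str) -> str:
        term_id = str(uuid.uuid4())
        self._labels[term_id] = label
        return term_id

    def rename(self, term_id: str, new_label: str) -> None:
        if term_id not in self._labels:
            raise KeyError(term_id)
        self._labels[term_id] = new_label

    def label(self, term_id: str) -> str:
        return self._labels[term_id]


def display_department(document: Dict[str, str], store: TermStore) -> str:
    """Look up the current label for a document's Department attribute."""
    return store.label(document["Department"])
```

If two documents both reference the same term ID, a single call to rename changes what display_department returns for both of them, with no per-document edit.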

Information Management Policies & Workflow

Microsoft SharePoint Server 2010 includes four information management policy features to help you manage your content: expiration, auditing, document labels, and document bar codes.

Document Retention

As part of your document management process, SharePoint Server 2010 helps retain information for fixed periods of time. At the end of the content’s life, the expiration policy feature can dispose of content in a consistent way that can be tracked and managed. For example, you can set content that is assigned a specified content type to expire on a specific date, within a certain amount of time after the document was created or last modified, or based on a workflow activity or some other event.

After the document expires, you can determine the actions that the policy control takes. For example, the policy can delete the document, or define a workflow task to have SharePoint Server 2010 route the document for permission to destroy it. In addition, the expiration policy feature provides the capability for you to build and use a custom workflow to be performed on the item after it reaches its expiration date.
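The expiration logic described above can be sketched in Python as follows. This is an illustrative model only: the RetentionPolicy and Item classes, the due_action function, and the action names are hypothetical, and it covers only the date-based case, not expiration triggered by a workflow activity or another event.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    content_type: str  # content type the policy is assigned to
    based_on: str      # "created" or "modified"
    retain_days: int   # retention period after that date
    action: str        # hypothetical action name, e.g. "delete"


@dataclass
class Item:
    name: str
    content_type: str
    created: date
    modified: date


def expiration_date(item: Item, policy: RetentionPolicy) -> date:
    """Date on which the item reaches the end of its retention period."""
    if policy.based_on not in ("created", "modified"):
        raise ValueError("based_on must be 'created' or 'modified'")
    start = item.created if policy.based_on == "created" else item.modified
    return start + timedelta(days=policy.retain_days)


def due_action(item: Item, policy: RetentionPolicy, today: date) -> Optional[str]:
    """Return the policy action if the item has expired, otherwise None."""
    if item.content_type != policy.content_type:
        return None  # the policy does not apply to this content type
    if today >= expiration_date(item, policy):
        return policy.action
    return None
```

A scheduled job could call due_action for each item and hand the result to whatever performs the deletion or starts the disposition workflow.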

Auditing

SharePoint Server 2010 provides an auditable system of record. As such, SharePoint Server 2010 automatically logs events and activities performed by SharePoint Server 2010, custom solutions, and users. This auditing feature is available for documents and for list items that are not part of a document, such as items in task lists, issues lists, discussion groups, and calendars. The auditing feature logs information for events such as the following:

· Each access to an item

· Each check-in and check-out of a document

· Any changes to permissions and settings

· The time a specified item was deleted and by whom

Document Labels

The document label feature and the bar code feature are designed to help you organize your documents for systematic storage and retrieval. You can use either feature to assign a unique label to a document, whether the document is a physical copy or an electronic file, which enables you to track it.

Document labels are text labels that you can have SharePoint generate automatically based on a content type’s metadata. For example, a law firm might want to attach a document label consisting of client name, subject matter, and date assigned to each document of a given content type.

You can print and affix document labels to a physical copy of the document, or insert them as images into a Microsoft Office 2010 document.

Document Barcodes

Document bar codes are similar to document labels, but instead of text, they are a generated unique ID. You can print and affix the bar code to a physical copy of the document, or insert it as an image into an Office 2010 document. You can also extend and customize the format of the bar code.

Metadata Navigation

Metadata navigation and filtering is an effective tool for navigating large lists of documents. The feature was designed to be the way to navigate the contents of large repositories in SharePoint Server 2010, which it accomplishes by:

· Enabling multiple pivots on data. After the content steward or librarian classifies documents by tagging them, users can find and retrieve those documents based on any metadata tagged to that document.

· Ensuring that visitors, contributors, and content stewards are never blocked from seeing useful results after using metadata navigation and filtering to run a query.

· Enabling content stewards to configure metadata navigation and filtering to perform well for the majority of libraries without having to explicitly create indices to support queries used to retrieve documents.

· Assisting content stewards to specify additional indices that they can use to enhance performance over a wider range of queries.

· Assisting users in refining queries to use compound indices to increase the relevance of results.

Search

Search will be the single most important part of any EDMS solution. Out of the box, SharePoint Search offers significant functionality to enable a high-quality search experience:

o Relevance based on extracted metadata. Document metadata is indexed along with document content. However, information workers do not always update metadata correctly. For example, they often re-purpose documents that were created by other people, and may not update the author property. Therefore, the original author’s name remains in the property sheet, and is consequently indexed. However, the search system can sometimes determine the author from a phrase in the document. For example, the search system could infer the author from a phrase in the document such as "By John Doe". In this case, SharePoint Server 2010 includes the original author, but also maintains a shadow value of "John Doe". Both values are then treated equally when a user searches for documents by specific authors.

o Configurable Search Interface. Provide the most efficient user interface possible for information workers to search and manipulate results in an organizationally relevant manner. The Enterprise Search Center for SharePoint Server 2010 and the FAST Search Center are highly customizable without requiring developer effort. For example, the following strategies might be considered:

o Create Search Scopes to allow users to quickly narrow searchable content and improve the relevancy of the result set.

o Use Out of the Box Web Parts to improve search results and allow users to filter and modify advanced queries to provide more meaningful and narrowed search results.

o Apply XSLT styles to provide branding and improve the user interface.

All of this can be done through the web browser or by using SharePoint Designer 2010. For customers using SharePoint Online this does not require approval by the SharePoint Online Team at Microsoft or High Level Design (HLD) documentation.

FAST Search Server 2010 for SharePoint enables you to further extend search across additional content sources, larger[2] content volume, and complex queries. For example, FAST can provide:

o Custom metadata extractors

Used in the content processing pipeline. This allows you to define how managed properties are populated from crawled content, which can then be used in deep refiners.

o Organization-specific vocabularies

Include these in the metadata refinement panel, which enables search in the terms and language of your business.

o 500 million item content volume with sub-second query response time

Scale to extremes with FAST Search Server 2010 for SharePoint while maintaining sub-second query response times.

o Advanced content processing with advanced linguistics

Extract and create metadata latent in documents to improve search results, sorting capabilities, and the refinement panel.

o Thumbnails and previews

Thumbnails and previews make the results of a search query visual, allowing users to recognize the right content quickly.

o Contextual search

Tailor different results and refinement options based on the profile of the user or audience.

FAST's ability to crawl multiple content databases and to organize the results by relevance and/or by the location where the content is stored plays a significant role here. For example, the volume of content that would normally reside in a single library may exceed the limits outlined at the beginning of this document. In that case, the limitation must be overcome through proper information architecture, so that the information is divided (organized) in a logical fashion. The search results, however, also need to take into account the close relationship between the information located in these different content databases and present it accordingly.

Conclusion

The design considerations outlined above allow the user to access a single location and upload, find, view and edit content regardless of the content database in which an item resides. This is done by implementing rules in the content organizer that place items in the appropriate content database. Such rules might be based on metadata, size or age, for example. Once these rules are created, navigation of the content is done through metadata and/or search. Throughout this entire process the user is unaware that they may have crossed content database boundaries.

Customers that are using SharePoint Online could, for example, set up site collections as follows to provide the appropriate amount of storage and room for growth.

· Store and manage older documents and records in extra-large site collections that are dedicated to archive purposes. These site collections should be designed to store and manage older and less active documents and should keep minimal versions to optimize document storage and performance. Document IDs should be assigned to all documents in them to ensure quick retrieval from a top level site and to support moving a document as necessary.

· Keep active collaborative documents in team site collections. These site collections are smaller and are designed with active collaboration in mind. The advantage here is that there is no limit on team site collections. In addition, we would enable content approval workflows to store final versions in the extra-large site collections for archive purposes, and use Information Management Policies to keep extraneous documents and older versions to a minimum.

· Federate with FAST Search Server 2010 on-premises in order to expand Federated Search to an extremely large number of heterogeneous systems. This provides context for information outside of SharePoint Online that could not otherwise be integrated, while still maintaining a single user interface.


[1] A good governance resource is the “SharePoint Deployment and Governance using COBIT 4.1 – A Practical Approach” by the Information Systems Audit and Control Association (ISACA). ISBN: 978-1-60420-117-8

[2] FAST Search can crawl in excess of 500 million individual content items.
