Information systems seminar
"Data warehouse project management and data modeling"
D. Barsukov
By "The data warehouse Lifecycle toolkit" of
R. Kimball, L. Reeves, M. Ross and W. Thornthwaite
Today, I am going to introduce some aspects of data warehouse project management. Besides I am going to introduce some activities of data modeling phase.
Chapter 1. Data warehouse project management
The first target of building information system was to automate working with information. For example: to find an invoice by invoice number, to find purchase of defined client etc. The best description of this technology is an OLTP name (On-Line Transactional Processing). It means that in any period of time we are dealing with some peace of information like an invoice, a client or something like that. We have been gathering information during this period of IS life. And now we have huge databases with valuable information. The next step of developing info systems should be using this information by analyzing it. So, the time of data warehousing and data mining is coming.
The overall lifecycle approach to data warehouse implementation is illustrated at the Picture 1. This diagram describes the sequence of high-level tasks required for effective design, development and deployment.

The lifecycle begins with project planning which addresses vision and scope of the data warehouse project as well as readiness assessment (see table 1) and risk management. Besides project planning dealing with resources, task assignments and schedule. It is dependent on the business requirements, as shown by the two-way arrow between these activities.
The parallel step of DW Lifecycle is business requirements definition. The project success is greatly increased by understanding of the business end-user and their requirements. Designers should find the key factors driving the business to effectively determine requirements and translate them into design considerations. Probably the best approach is an interviewing. But You can't just ask users what data they would like to see in the data warehouse. Instead, you have to talk to them about their jobs, their objectives and their challenges and try to figure out how they make decisions, both today and in the future. This step is base for the three parallel tasks focused on technology, data and end-user applications.
The one of three parallel tasks is data track containing "dimensional modeling", "physical design" and "data staging design and development". The first of them is dimensional modeling. The business requirements define the data needed and we develop dimensional model. This tasks start from constructing a matrix that represents key business processes and their dimensionality. Besides we identify the fact table grain, associated dimensions, attributes, hierarchical drill path and facts. The logical database design is completed with appropriate table structures and primary/foreign key relationship. This set of activities concludes with the development of the source-to-target data mapping.
The next process of data track is physical design. It defines the physical structures necessary to support the logical database design. Primary elements of this process include defining naming standards and setting up the database environment. Indexing and partitioning strategies are also determined.
The last step in the data track is data staging design and development. Data staging has three steps: extraction, transformation and loading. The extract process always deals with data quality since this significantly impacts data warehouse creditability. Besides You have to built two staging processes - one for initial population of data and another for the regular, incremental loads.
Now we will see on the next of the parallel processes: Technology track. And the first task of this track is technical architecture design. Data warehouse requires an integration of numerous technologies. You will need to consider three factors:
to establish the data warehouse technical architecture design. Technical architecture answers the question "How we will do?" and describes the flow of data from the source systems to the decision maker. It breaks into two big pieces: the back room and the front room (see picture 3). The flow is driven by the metadata that describes the locations and definitions of the sources and targets, data transformation and dependencies.

After technical architecture design You are ready for product selection and installation. You should select such components like hardware platform, database management system, data staging tool, data access tool and so forth. You have to map all objects of Your user application to tiers of n-tier model. And then You should install all products and properly configure them to ensure that every component of Your architecture is working correctly.
The last parallel process is an application track. The first task is end-user application specification. You have to define a set of standard end user applications since not all business users need ad hoc access to the data warehouse. Application specifications describe the report templates, user driven parameters and required calculations. These specifications ensure that the development team and business users have a common understanding of the application to be delivered.
The next task of this track is end user application development by following application specifications. It includes building of user interfaces, configuration of the tools metadata, and constructing the specified reports.
Those tracks end with deployment. Extensive planning is required to ensure that all puzzle pieces (technology, data and end-user application) fit together properly. Besides it includes business-user education. User support and communication or feedback strategies should be established before any user has access to data warehouse.
The very important task is maintenance and growth. You need to continue focusing on your business users by providing them with ongoing support and education. You also need to focus attention on the back room, ensuring that the processes and procedures are in place for effective ongoing operation of the data warehouse Performance should be measured over time and logged to support marketing of the data warehouse. And Your data warehouse is bound to evolve and grow. Unlike traditional systems development, demands for changes should be viewed as a sign of success, because it means that Your data warehouse is in use! You should set priorities on user demands and go back to the planning to build up new features according users requirements.
There is one more activity during data warehouse building. It is project management, which occur throughout the lifecycle. This activity focus on monitoring project status, issue tracking, and change control to preserve scope boundaries. Finally, project management includes the development of a comprehensive project communication plan that addresses both the business and information system organization. Ongoing communication is absolutely critical to managing expectations and managing expectations is absolutely critical to achieving your data warehouse goals.
Chapter 2. Data modeling
Every dimensional model is composed of one table with a multipart key called fact table and a set of smaller tables called dimension tables. The most useful fact tables also contain one or more facts (fields of data).

The key to understand the relationship between dimensional modeling and entity-relationship is that a single entity-relationship diagram breaks down into multiple fact table diagrams. The first step in converting an entity-relationship diagram is to separate the entity-relationship diagram into its discrete business processes and to model each one separately. The next step is select many-to-many tables and designate them as fact tables. The third step is to denormalize all the remaining tables into flat tables connected directly to the fact tables. These tables become the dimensions tables. The resulting master dimensional model of a data warehouse for a large enterprise will consist of somewhere between 10 to 25 very similar looking star schemes. Each scheme will have from 5 to 10 dimension tables and called data mart. Many dimensions will be shared from data mart to data mart. These shared dimensions are called conformed dimensions: i.e. dimensions that mean the same thing with every possible fact table to which them can be joined. Besides many fact tables can have data (facts) with the same units, measured in the same meaning. It looks like those tables have been done using the same template. Those facts of the fact tables called conformed facts. Conformed dimensions and conformed facts are the "bus" of the data warehouse (a bus is a common structure that everything connects to and derives power from). The resulting set of standards is called the Data Warehouse Bus Architecture. The new data marts will derive power from the bus, because the bus offers an array of standard dimensions and parallel facts.
Separating business to be modeled into business processes allow us to concentrate at the smaller set of data. At any point of time we are dealing with only one data mart that describe one business issue. So, it looks to be logical to start from the single-source data mart, because it minimizes the risk of failure. Most of the risks come from trying to build too big data mart. Besides single-source data mart will provide users with enough interesting data to keep them happy and quiet while the team keep working on the harder issue. At least they will be busy.
To decide which dimensional model to build, we begin with a top-down approach we call the Data Warehouse Bus Architecture matrix. The matrix forces us to name all the data marts we could possibly build (business processes) and to name all the dimensions implied by those data marts (data marts and dimensions could also come from the converting entity-relationship diagram). It is very useful to think of a data mart as a collection of related facts that needed to be used together. So, the rows of matrix are the data marts and the columns are the dimensions (see table 2). Columns and rows can be created independently, but the easiest way to create the list of data marts first. When the data marts and their dimensions have been identified the team is ready to start the detailed logical and physical design of individual tables.

The detailed logical design of a dimensional scheme is driven by four steps, which consist of four choices, made in order.
It is recommended to choose as simple data mart as it is possible. There are situations where You have to built data mart which is not easiest but it critical for business. But You have a list of critical data marts, You can to choose the easiest among them.
Declaring the grain is equivalent to saying what is an individual fact table record. For example it could be: Daily product sale in each store by clients. Generally, the fact table grain is chosen to be as low as it is possible. A very low granularity is far better for responding to unexpected new queries or for adding a new data element. The three most useful fact table grains are individual transaction, snapshot and line items from control document like invoices.
A good clear declaration of the grain often makes it easy to choose the dimensions for the particular fact table. Frequently, the grain is stated in terms of primary dimensions. Other dimensions can be tested against the grain to see if they make sense. Once a dimension is chosen, there may be a large number of descriptive attributes, which can be used to populate the dimension. Each dimension has its own granularity too, but it can't be lower than the fact table granularity. For example, if the fact table describes monthly sale,then the grain of time dimension cannot be days or weeks.
The last step is to add as many facts as possible within the context of the declared grain. An attribute can live in one and only one dimension, whereas a fact can be repeated in multiple fact tables. But be careful and don't add fact from the wrong grain. Besides every fact should have a default aggregation rule.
It is very important to communicate with end users as well as with other team members during data modelling. It is better to ensure that you are going to build up the right product by communication with user before you will spend thousands or even millions dollars on building it. And visualization is essential to making the communication work. The matrix developed by the design team in their internal meetings can be cleaned up and used as a presentation aid for meetings with other designers, administrators and end-users. The matrix is very useful as a high-level introduction to the design. It gives a view of what the data warehouse will become. Besides fact table diagram is used to describe each completed fact table. It names the fact table, clearly states its grain and shows all connected dimensions. Supporting information for the diagram includes the name of each dimension and a description of that dimension. Keep your description and diagrams together, don't use references, otherwise it is to complex to find out the right description by flipping back and forth between documents. This diagram also can be used for end-user training. Fact table detail and dimensional table detail diagram also used to show for is in the tables. Each attribute should be described even if it is not released in the current version. In this case, users can see that their requirements have been heard and understood.
Table 1: Data Warehouse Readiness Test
(project planning phase)|
Factor |
|
|
|
|
|
Strong Business Management Sponsor |
||||
|
|
Can take weeks for team to gain access |
Û |
Readily available to team |
|
|
|
Hope "you" get it done |
Û |
Active, vocal and visible supporter - willing to put own neck on line |
|
|
|
You can deliver this to 250 users next month, right? |
Û |
Realistic expectations |
|
|
|
"A data whatta?" |
Û |
Data warehouse savvy |
|
|
Compelling Business Motivation |
||||
|
|
"And your point is?" |
Û |
Survival dependent on data warehouse |
|
|
|
Ten different views of the solution |
Û |
Consistent view of the solution |
|
|
|
Tactical issue |
Û |
Strategic issue |
|
|
|
Cost savings opportunity |
Û |
Incremental revenue opportunity |
|
|
|
Unable to quantify the payback |
Û |
Huge payback |
|
|
IS / Business Partnership |
||||
|
|
Business consultants without IS knowledge |
Û |
Business and IS work hand-in-hand |
|
|
|
"We can't trust any numbers from our systems" |
Û |
Strong confidence in existing reporting environment |
|
|
|
It takes "years" to get a new ad hoc request turned around |
Û |
Quick IS response to ad hoc requests |
|
|
Current Analytic Culture |
||||
|
|
"Gut feel" decision making |
Û |
Decision making relies on facts and figures |
|
|
|
Users don't ask for data |
Û |
Business users demand access to data - "Just get me the data and I'll figure it out" |
|
|
|
Users don't look at current reports |
Û |
Current reports are consistently rekeyed into spreadsheets for analysis and historical trending |
|
|
|
Users have secretaries log on and print off email to read it |
Û |
Users are very computer literate |
|
|
Feasibility |
||||
|
|
Data warehouse would require purchase of all new technology |
Û |
Robust technical infrastructure in place |
|
|
|
Everyone and their uncle is committed to the year 2000 project |
Û |
Experienced resources available |
|
|
|
Reliable data won't be available. |
Û |
Quality data available |
|