Tuesday, August 20, 2019

Literature review about data warehouse

CHAPTER 2 LITERATURE REVIEW

2.1 INTRODUCTION

Chapter 2 reviews the literature on data warehouses, OLAP multidimensional databases (MDDB) and data mining. We review the concepts, characteristics, design and implementation approaches of each of these technologies in order to identify a suitable data warehouse framework. This framework will support the integration of OLAP MDDB and data mining models.

Section 2.2 discusses the fundamentals of data warehousing, including data warehouse models and data processing techniques such as the extract, transform and load (ETL) processes. A comparative study of the data warehouse models introduced by William Inmon (Inmon, 1999), Ralph Kimball (Kimball, 1996) and Matthias Nicola (Nicola, 2000) was carried out to identify a suitable model, design and characteristics. Section 2.3 introduces the OLAP model and architecture. It also discusses the concept of processing in OLAP-based MDDB, and MDDB schema design and implementation. Section 2.4 introduces data mining techniques, methods and processes for OLAP mining (OLAM), which is used to mine MDDB. Section 2.5 concludes the literature review, with pointers on our decision to propose a new data warehouse model. Since we propose to use a Microsoft ® product to implement the proposed model, we also present a product comparison to justify why the Microsoft ® product was selected.

2.2 DATA WAREHOUSE

According to William Inmon, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process (Inmon, 1999). A data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit.
It provides an effective integration of operational databases into an environment that enables strategic use of data (Zhou, Hull, King and Franchitti, 1995). The technologies involved include relational and MDDB management systems, client/server architecture, meta-data modelling and repositories, graphical user interfaces and much more (Hammer, Garcia-Molina, Labio, Widom, and Zhuge, 1995; Harinarayan, Rajaraman, and Ullman, 1996).

The emergence of cross-discipline domains such as knowledge management in finance, health and e-commerce has shown that vast amounts of data need to be analysed. The evolution of data in a data warehouse can provide multiple dataset dimensions to solve various problems. Critical decision making on such datasets therefore needs a suitable data warehouse model (Barquin and Edelstein, 1996).

The main proponents of the data warehouse are William Inmon (Inmon, 1999) and Ralph Kimball (Kimball, 1996), but they have different perspectives on data warehouse design and architecture. Inmon (1999) defined the data warehouse as a dependent data mart structure, while Kimball (1996) defined it as a bus-based data mart structure. Table 2.1 summarizes the differences in data warehouse structure between William Inmon and Ralph Kimball.

A data warehouse is a read-only data source in which end users are not allowed to change values or data elements. Inmon's (1999) data warehouse architecture strategy differs from Kimball's (1996). In Inmon's model, data marts are copies of data split off from the data warehouse and distributed as an interface between the data warehouse and end users. Kimball views the data warehouse as a union of data marts: the data warehouse is the collection of data marts combined into one central repository. Figure 2.1 illustrates the differences between Inmon's and Kimball's data warehouse architectures, adopted from (Mailvaganam, 2007).
Although Inmon and Kimball hold different design views of the data warehouse, they agree that a successful data warehouse implementation depends on effective collection of operational data and validation of the data marts. Database staging and ETL processes are indispensable components in both researchers' data warehouse designs. Both believed that a dependent data warehouse architecture is necessary to fulfil the requirements of enterprise end users in terms of precision, timing and data relevancy.

2.2.1 DATA WAREHOUSE ARCHITECTURE

Data warehouse architecture has a wide research scope and can be viewed from many perspectives. (Thilini and Hugh, 2005) and (Eckerson, 2003) provide some meaningful ways to view and analyse data warehouse architecture. Eckerson states that a successful data warehouse system depends on a database staging process which derives data from different integrated Online Transactional Processing (OLTP) systems. In this case, the ETL process plays a crucial role in making the database staging process workable. A survey of the factors that influence the selection of data warehouse architecture by (Thilini, 2005) identifies five data warehouse architectures in common use, as shown in Table 2.2.

Independent Data Marts

Independent data marts are also known as localized or small-scale data warehouses. They are mainly used by departments or divisions of a company to provide individual operational databases. This type of data mart is simple, yet it takes different forms derived from multiple design structures and from various inconsistent database designs, which complicates cross data mart analysis. Every organizational unit tends to build its own database, which operates as an independent data mart. (Thilini and Hugh, 2005), citing the work of (Winsberg, 1996) and (Hoss, 2002), note that this architecture is best used as an ad-hoc data warehouse and as a prototype before building a real data warehouse.
Data Mart Bus Architecture

Kimball (1996) pioneered the design and architecture of a data warehouse as a union of data marts, known as the bus architecture or virtual data warehouse. The bus architecture allows data marts to be located not only on one server but also on different servers. This allows the data warehouse to function in a more virtual mode, combining all data marts and processing them as one data warehouse.

Hub-and-spoke architecture

Inmon (1999) developed the hub-and-spoke architecture. The hub is the central server that takes care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub and spoke mainly focuses on building a scalable and maintainable infrastructure for the data warehouse.

Centralized Data Warehouse Architecture

The centralized data warehouse architecture is built on the hub-and-spoke architecture but without the dependent data mart component. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. It has only one data model, which is consistent and complete across all data sources. According to (Inmon, 1999) and (Kimball, 1996), a central data warehouse should include database staging, also known as an operational data store, as an intermediate stage for the operational processing of data integration before the data is transformed into the data warehouse.

Federated Architecture

According to (Hackney, 2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging or operational data stores, and a combination of analytical applications and reporting systems. The federated concept focuses on an integrated framework to make the data warehouse more reliable. (Jindal, 2004) concludes that the federated data warehouse is a practical approach because it focuses on higher reliability and provides excellent value.

(Thilini and Hugh, 2005) conclude that the hub-and-spoke and centralized data warehouse architectures are similar.
The centralized architecture is faster and easier to implement because no dependent data marts are required, and it scored higher than hub and spoke where there was an urgent need for a relatively fast implementation.

In this work, it is very important to identify which data warehouse architecture is robust and scalable for building and deploying enterprise-wide systems. (Laney, 2000) states that the selection of an appropriate data warehouse architecture must incorporate the successful characteristics of various data warehouse models. Two data warehouse architectures have proved to be popular, as shown by (Thilini and Hugh, 2005), (Eckerson, 2003) and (Mailvaganam, 2007). The first is the hub-and-spoke architecture proposed by (Inmon, 1999), a data warehouse with dependent data marts. The second is the data mart bus architecture with dimensional data marts proposed by (Kimball, 1996). The new proposed model will use the hub-and-spoke data warehouse architecture, which can be used for MDDB modelling.

2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING

The data warehouse architecture process begins with the ETL process, which ensures that the data passes the quality threshold. According to Evin (2001), it is essential to have the right dataset. ETL is an important component of the data warehouse environment because it ensures that the datasets drawn from the various OLTP systems are cleansed before they enter the data warehouse. ETL is also responsible for running the scheduled tasks that extract data from OLTP systems. Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, Colby, Cole, McKenna, Mulagund, and Wilhite, 2001). The complete descriptions of the ETL processes are given in table 2.3.
A data warehouse database can be populated from a wide variety of data sources in different locations, so collecting all the different datasets and storing them in one central location is an extremely challenging task (Calvanese, Giacomo, Lenzerini, Nardi, and Rosati, 2001). However, ETL processes reduce the complexity of data population through a simplified process, as depicted in figure 2.2. The ETL process begins with data extraction from operational databases, where data cleansing and scrubbing are done to ensure that all data are validated. The data is then transformed to meet the data warehouse standards before it is loaded into the data warehouse.

(Zhou et al, 1995) state that during the data integration process in a data warehouse, ETL can assist in the import and export of operational data between heterogeneous data sources using an Object Linking and Embedding Database (OLE-DB) based architecture, in which the data are transformed so that all validated data populate the data warehouse.

The (Kimball, 1996) data warehouse architecture, depicted in figure 2.3, focuses on three important modules: the back room, the presentation server and the front room. ETL processes are implemented in the back room, where the data staging services gather the operational databases of all source systems and extract data from different file formats, systems and platforms. The second step is to run the transformation process, which removes all inconsistencies to ensure data integrity. Finally, the data is loaded into the data marts. The ETL processes are commonly executed from a job control via scheduled tasks. The presentation server is the data warehouse, where the data marts are stored and processed. Data is stored in a star schema consisting of dimension and fact tables. The data is then processed in the front room, where it is accessed by query services such as reporting tools, desktop tools, OLAP and data mining tools.
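The extract, cleanse, transform and load steps described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the source records, field names, validation rule and the `visits` table are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical operational (OLTP) records; field names are illustrative only.
SOURCE_ROWS = [
    {"patient_id": "1", "dept": " cardiology ", "visit_date": "2019-01-05", "cost": "120.50"},
    {"patient_id": "2", "dept": "Oncology", "visit_date": "2019-01-06", "cost": "n/a"},
    {"patient_id": "3", "dept": "CARDIOLOGY", "visit_date": "2019-01-07", "cost": "80"},
]


def extract(rows):
    """Extract: read raw records from the source system (here, a list)."""
    return list(rows)


def transform(rows):
    """Transform: cleanse and standardise; drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            cost = float(row["cost"])
        except ValueError:
            continue  # scrub records whose measure is not numeric
        dept = row["dept"].strip().title()  # one spelling per department
        clean.append((int(row["patient_id"]), dept, row["visit_date"], cost))
    return clean


def load(conn, rows):
    """Load: write the validated rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits "
        "(patient_id INTEGER, dept TEXT, visit_date TEXT, cost REAL)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, source):
    """Run the three steps in order, as a scheduled job would."""
    load(conn, transform(extract(source)))


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
run_etl(conn, SOURCE_ROWS)
loaded = conn.execute("SELECT dept, cost FROM visits ORDER BY patient_id").fetchall()
```

In this sketch the record with a non-numeric cost is rejected during transformation, so only validated rows reach the warehouse table. A production ETL job would normally log or quarantine rejected rows rather than silently dropping them.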
Although ETL processes have proved to be an essential component for ensuring data integrity in a data warehouse, complexity and scalability play an important role in deciding the type of data warehouse architecture. One way to achieve a scalable, non-complex solution is to adopt a hub-and-spoke architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency. A centralized data warehouse design also makes it possible to maintain full access control over the ETL processes.

ETL processing in a hub-and-spoke data warehouse architecture is recommended in (Inmon, 1999) and (Kimball, 1996). The hub is the data warehouse, populated after data has been processed from the operational databases into the staging database, and the spokes are the data marts that distribute the data. Sherman, R (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts. One-to-many interfaces are simpler to implement, cost effective in the long run and ensure consistent dimensions. The many-to-many approach, by comparison, is more complicated and costly.

2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS

Building a data warehouse is a challenging task, because a data warehouse project inherits unique characteristics that may influence the overall reliability and robustness of the data warehouse. These factors can be addressed during the analysis, design and implementation phases to ensure a successful data warehouse system. Section 2.2.3.1 focuses on the factors that contribute to data warehouse project failure. Section 2.2.3.2 discusses the success factors, including implementing the correct model to support a successful data warehouse project.

2.2.3.1 DATA WAREHOUSE FAILURE FACTORS

The studies of (Hayen, Rutashobya, and Vetter, 2007) show that implementing a data warehouse project is costly and risky, as a data warehouse project can cost over $1 million in the first year.
It is estimated that two-thirds of attempts to set up a data warehouse project will eventually fail. (Hayen et al, 2007), citing the work of (Briggs, 2002) and (Vassiliadis, 2004), identify three groups of factors behind data warehouse project failure: environment, project and technical factors, as shown in table 2.4.

Environment factors relate to organizational changes in business, politics, mergers and takeovers, and to a lack of top management support. They also include human error, corporate culture, the decision-making process and poor change management (Watson, 2004) (Hayen et al, 2007).

Poor technical knowledge of the requirements for data definitions and data quality across different organizational units may cause data warehouse failure. Insufficient knowledge of data integration, and a poor choice of data warehouse model or data warehouse analysis applications, may cause major failures.

In spite of heavy investment in hardware, software and people, poor project management may lead to data warehouse project failure. For example, assigning a project manager who lacks knowledge of and project experience in data warehousing may make it difficult to quantify the return on investment (ROI) and to meet the project triple constraint (cost, scope, time).

Data ownership and accessibility is another potential cause of data warehouse project failure. It is considered a sensitive issue within the organization: one must not share or acquire someone else's data, as this is seen as losing authority over the data (Vassiliadis, 2004). Departments are therefore restricted from declaring total ownership of clean, error-free data, since doing so might cause problems over data ownership rights.
2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS

(Hwang M.I., 2007) stresses that data warehouse implementation is an important area of research and industrial practice, but that only a few studies have assessed the critical success factors for data warehouse implementations. He surveyed six data warehouse studies (Watson and Haley, 1997; Chen et al., 2000; Wixom and Watson, 2001; Watson et al., 2001; Hwang and Cappel, 2002; Shin, 2003) on the success factors in a data warehouse project. He concluded his survey with a list of success factors that influence data warehouse implementation, as depicted in figure 2.8. He shows eight implementation factors which directly affect the six selected success variables.

The data warehouse success factors mentioned above provide an important guideline for implementing a successful data warehouse project. The studies of (Hwang M.I., 2007) show that an integrated selection of factors such as end user participation, top management support and the acquisition of quality source data, together with well-defined business needs, plays a crucial role in data warehouse implementation. Besides that, other factors highlighted by Hayen R.L. (2007), citing the work of Briggs (2002), Vassiliadis (2004) and Watson (2004), such as project, environment and technical knowledge, also influence data warehouse implementation.

Summary

In the new proposed model of this work, the hub-and-spoke architecture is used as the central repository service, as many scholars including Inmon, Kimball, Evin, Sherman and Nicola have adopted this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally or distributed across a local or wide area network, depending on business requirements.
In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important components that a data warehouse should have: ETL, a staging database or operational data store, data marts, MDDB, OLAP, and data mining end-user applications such as data query, reporting, analysis and statistical tools. However, this process may differ from organization to organization. Depending on the ETL setup, some data warehouses may overwrite old data with new data, while others may maintain a history and audit trail of all changes to the data.

2.3 ONLINE ANALYTICAL PROCESSING

The OLAP Council (1997) defines OLAP as a group of decision support systems that facilitate fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational datasets, mainly from a data warehouse, into MDDB, which allow optimal data retrieval and trend analysis.

According to Chaudhuri (1997), Burdick, D. et al. (2006) and Vassiliadis, P. (1999), OLAP is an important concept for strategic database analysis. OLAP has the ability to analyze large amounts of data to extract valuable information. The analysis may serve the business, education or medical sectors. The technologies of the data warehouse, OLAP and analysis tools support that ability.

OLAP enables the discovery of patterns and relationships contained in business activity by querying large volumes of data from multiple database source systems at one time (Nigel, P., 2008). Processing database information using OLAP requires an OLAP server to organize and transform the data and build the MDDB. The MDDB is then separated into cubes so that client OLAP tools can perform data analysis aimed at discovering new patterns and relationships between the cubes. Some popular OLAP server software programs include Oracle (C), IBM (C) and Microsoft (C).

Madeira (2003) supports the view that OLAP and the data warehouse are complementary technologies which blend together.
The data warehouse stores and manages data, while OLAP transforms data warehouse datasets into strategic information. OLAP functions range from basic navigation and browsing (often known as slice and dice) to calculations and to serious analysis such as time series and complex modelling. As decision makers implement more advanced OLAP capabilities, they move from basic data access to the creation of information and to the discovery of new knowledge.

2.3.4 OLAP ARCHITECTURE

In comparison to the data warehouse, which is usually based on relational technology, OLAP uses a multidimensional view to aggregate data and provide rapid access to strategic information for analysis. There are three types of OLAP architecture, distinguished by the method in which they store multidimensional data and perform analysis operations on that dataset (Nigel, P., 2008). The categories are multidimensional OLAP (MOLAP), relational OLAP (ROLAP) and hybrid OLAP (HOLAP).

In MOLAP, as depicted in Diagram 2.11, datasets are stored and summarized in a multidimensional cube. The MOLAP architecture can perform faster than ROLAP and HOLAP (C). MOLAP cubes are designed and built for rapid data retrieval, which makes slicing and dicing operations efficient. MOLAP can perform complex calculations, which are pre-generated when the cube is created. MOLAP processing is restricted to the initial cube that was created and is not bound to any additional replication of the cube.

In ROLAP, as depicted in Diagram 2.12, data and aggregations are stored in relational database tables to provide the OLAP slicing and dicing functionality. ROLAP is the slowest of the OLAP flavours. ROLAP relies on manipulating the data directly in the relational database to give the appearance of conventional OLAP slicing and dicing functionality. Basically, each slicing and dicing action is equivalent to adding a WHERE clause to the SQL statement. (C) ROLAP can manage large amounts of data and has no inherent limitation on data size.
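The statement that each ROLAP slicing or dicing action amounts to adding a WHERE clause can be sketched as follows. This is an illustrative Python example only: the `sales` table, its columns and the `rolap_total` helper are hypothetical, and an in-memory SQLite database stands in for the relational store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", 2018, "A", 10.0),
        ("North", 2019, "A", 20.0),
        ("South", 2019, "A", 5.0),
        ("South", 2019, "B", 7.0),
    ],
)

# Columns that may be used as dimensions (hypothetical, for this sketch).
DIMENSIONS = ("region", "year", "product")


def rolap_total(conn, **filters):
    """Sum the measure, adding one WHERE condition per fixed dimension."""
    unknown = set(filters) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    sql = "SELECT SUM(amount) FROM sales"
    if filters:
        # Column names come from the whitelist above; values are bound as parameters.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return conn.execute(sql, tuple(filters.values())).fetchone()[0]


grand_total = rolap_total(conn)                        # no WHERE clause
slice_2019 = rolap_total(conn, year=2019)              # slice: one dimension fixed
dice_south = rolap_total(conn, year=2019, region="South")  # dice: two dimensions fixed
```

Every call issues a fresh SQL query against the base table, which is consistent with the observation that ROLAP response time depends on the size of the underlying dataset.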
ROLAP can leverage the functionality inherent in a relational database. ROLAP is slow in performance because each ROLAP activity is essentially one SQL query, or multiple SQL queries, against the relational database. Query time depends on the number and complexity of the SQL statements executed, and can become a bottleneck if the underlying dataset is large. ROLAP depends on generating SQL statements to query the relational database, and SQL does not cater for all needs, so ROLAP technology is conventionally limited by what SQL functionality can offer. (C)

HOLAP, as depicted in Diagram 2.13, combines the technologies of MOLAP and ROLAP. Data are stored in ROLAP relational database tables and the aggregations are stored in a MOLAP cube. To acquire summary information, HOLAP leverages cube technology for faster performance, whereas to retrieve detailed information, HOLAP can drill down from the multidimensional cube into the underlying relational data. (C)

In OLAP architectures (MOLAP, ROLAP and HOLAP), the datasets are presented in a multidimensional format, which involves the creation of multidimensional blocks called data cubes (Harinarayan, 1996). A cube in an OLAP architecture may have three or more axes (dimensions). Each axis (dimension) represents a logical category of data. One axis may, for example, represent the geographic location of the data, while others may indicate a period of time or a specific school. Each of the categories, which will be described in the following section, can be broken down into successive levels, and it is possible to drill up or down between the levels.

Cabibo (1997) states that OLAP partitions are normally stored on an OLAP server, with the relational database frequently stored on a server separate from the OLAP server. The OLAP server must query across the network whenever it needs to access the relational tables to resolve a query.
The impact of querying across the network depends on the performance characteristics of the network itself. Even when the relational database is placed on the same server as the OLAP server, inter-process calls and the associated context switching are required to retrieve relational data. With an OLAP partition, calls to the relational database, whether local or over the network, do not occur during querying.

2.3.3 OLAP FUNCTIONALITY

OLAP functionality offers dynamic multidimensional analysis that supports end users with analytical activities, including calculations and modelling applied across dimensions, trend analysis over time periods, slicing subsets for on-screen viewing, and drilling to deeper levels of records (OLAP Council, 1997). OLAP is implemented in a multi-user client/server environment and provides reliably fast responses to queries, regardless of database size and complexity. OLAP helps end users synthesise enterprise information through comparative, customized viewing and through analysis of historical and present data in various what-if data model scenarios. This is achieved through the use of an OLAP server, as depicted in diagram 2.9.

OLAP functionality is provided by an OLAP server. The design and data structures of an OLAP server are optimized for fast information retrieval in any orientation and for flexible calculation and transformation of unprocessed data. The OLAP server may either physically store the processed multidimensional information to deliver consistent and fast response times to end users, or populate its data structures in real time from relational databases, or offer a choice of both. Essentially, OLAP presents information in cube form, which allows more complex analysis than a relational database.

OLAP analysis techniques employ slice-and-dice and drilling methods to break data down into units of information according to given parameters. A slice is a subset of the multidimensional array obtained by fixing a single value for one or more dimensions that are not in the subset.
The dice function, by contrast, applies the slice function on two or more dimensions of a multidimensional cube. The drilling function allows the end user to traverse between condensed (summary) data and the most detailed data unit, as depicted in Diagram 2.10.

2.3.5 MULTIDIMENSIONAL DATABASE SCHEMA

The base of every data warehouse system is a relational database built using a dimensional model. A dimensional model consists of fact and dimension tables, arranged as a star schema or a snowflake schema (Kimball, 1999). A schema is a collection of database objects, tables, views and indexes (Inmon, 1996). To understand dimensional data modelling, Table 2.10 defines some of the terms commonly used in this type of modelling:

In designing data models for a data warehouse, the most commonly used schema types are the star schema and the snowflake schema. In the star schema design, the fact table sits in the middle and is connected to the surrounding dimension tables like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of a pointer, or foreign key, to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. A tuple consists of a unit of data extracted from the cube for a range of members from one or more dimension tables. (C, http://msdn.microsoft.com/en-us/library/aa216769%28SQL.80%29.aspx). Each dimension table consists of columns that correspond to attributes of the dimension. Diagram 2.14 shows an example of a star schema for a Medical Informatics System.
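As a rough illustration of the star schema just described, the following Python sketch builds one fact table whose foreign keys point to two dimension tables, then answers a query by joining them. It does not reproduce Diagram 2.14: the table names, columns and sample rows are hypothetical, and SQLite is used only as a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member, columns are dimension attributes.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_department (dept_key INTEGER PRIMARY KEY, dept_name TEXT);

    -- Fact table: foreign keys give the coordinates, the rest are numeric measures.
    CREATE TABLE fact_visit (
        date_key INTEGER REFERENCES dim_date(date_key),
        dept_key INTEGER REFERENCES dim_department(dept_key),
        visit_count INTEGER,
        total_cost REAL
    );

    INSERT INTO dim_date VALUES (1, 2019, 1), (2, 2019, 2);
    INSERT INTO dim_department VALUES (10, 'Cardiology'), (20, 'Oncology');
    INSERT INTO fact_visit VALUES (1, 10, 3, 300.0), (2, 10, 2, 250.0), (2, 20, 4, 900.0);
    """
)

# Aggregate the measures per department for one year by joining the
# fact table to its dimensions through the foreign keys.
rows = conn.execute(
    """
    SELECT d.dept_name, SUM(f.visit_count), SUM(f.total_cost)
    FROM fact_visit AS f
    JOIN dim_department AS d ON d.dept_key = f.dept_key
    JOIN dim_date AS t ON t.date_key = f.date_key
    WHERE t.year = 2019
    GROUP BY d.dept_name
    ORDER BY d.dept_name
    """
).fetchall()
```

Grouping by a coarser attribute, for example `t.year` instead of `t.month`, corresponds to drilling up, while grouping by a finer attribute corresponds to drilling down.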
Star schemas do not explicitly provide support for attribute hierarchies, which makes them unsuitable for architectures such as MOLAP that require many dimension-table hierarchies for efficient drilling of datasets. Snowflake schemas provide a refinement of star schemas in which the dimensional hierarchy is explicitly represented by normalizing the dimension tables, as shown in Diagram 2.15. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables. (C)

Levene. M (2003) stresses that, in addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Whether to use a star or a snowflake schema mainly depends on business needs.

2.3.2 OLAP Evaluation

As OLAP technology takes a prominent place in the data warehouse industry, there should be a suitable assessment tool to evaluate it. E.F. Codd, who coined the term OLAP, also provided a set of criteria known as the Twelve Rules for assessing the ability of OLAP products, which include data manipulation, unlimited dimensions and aggregation levels, and flexible reporting, as shown in Table 2.8 (Codd, 1993):

Codd's twelve rules of OLAP provide an essential tool for verifying that the OLAP functions and OLAP models used are able to produce the desired result. Berson, A. (2001) stressed that a good OLAP system should also provide a complete set of database management tools as an integrated, centralized utility that allows database management staff to distribute databases within the enterprise.
The ability of OLAP to drill within the MDDB allows the user to drill down right to the source, or root, at the detail record level. This implies that an OLAP tool permits a smooth transition from the MDDB to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes. This is an important feature for preventing operational stability issues and usability problems as the size of the database increases.

2.3.1 OLTP and OLAP

The design of OLAP for a multidimensional cube is entirely different from the design of OLTP for a database. OLTP is implemented in a relational database to support daily processing in an organization. The main function of an OLTP system is to capture data into computers. OLTP allows effective data manipulation and storage of data for daily operations, resulting in a huge quantity of transactional data. Organisations build multiple OLTP systems to handle huge quantities of daily operational transaction data in a short period of time.

OLAP is designed for data access and analysis to support the strategic decision-making process of managerial users. OLAP technology focuses on aggregating datasets into a multidimensional view without hindering system performance. Han, J. (2001) describes OLTP systems as customer oriented and OLAP systems as market oriented. He summarized the major differences between OLTP and OLAP systems based on 17 key criteria, as shown in table 2.7.

It is complicated to merge OLAP and OLTP into one centralized database system. The dimensional data design model used in OLAP is much more effective for querying than the relational database queries used in an OLTP system. OLAP may use one central database as its data source, whereas OLTP uses different data sources from different database sites. The dimensional design of OLAP is not suitable for an OLTP system, mainly because of redundancy and the loss of referential integrity of the data.
Organizations therefore choose to have two separate information systems, one OLTP and one OLAP system (Poe, V., 1997). We can conclude that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers.

2.4 DATA MINING

Many data mining scholars (Fayyad, 1998; Freitas, 2002; Han, J. et. al., 1996; Frawley, 1992) have defined data mining as discovering hidden patterns in historical datasets by using pattern recognition, as it involves searching for specific, unknown information in a database. Chung, H. (1999) and Fayyad et al (1996) referred to data mining as a step of knowledge discovery in databases: the process of analyzing data, extracting knowledge from a large database, also known as a data warehouse (Han, J., 2000), and turning it into useful information. Freitas (2002) and Fayyad (1996) have recognized data mining as an advantageous tool for extracting knowledge from a da
