Sunday 25 November 2012

Important Data Warehousing Concept Questions Part 1

1.What are the potential problems and challenges of ETL?
The ETL task can face many problems due to a variety of reasons:
§  Unanticipated complexity
§  Conflicting business rules
§  Changing business requirements
§  Frequent changes in design
§  Flawed or inadequate design
§  Incorrect mapping
§  Longer loading times than designed
§  Dirtier data than expected
§  Insufficient development time
§  Problems due to lack of validation
§  Flawed testing process
§  Inadequate tools
§  Inferior developer skills
§  Infrastructure issues

2.Why is the ETL task often underestimated?
The ETL task is the most difficult and labor-intensive activity in the data warehouse project. It is frequently underestimated, leading to cost overruns and project delays. The scope of this task is often underestimated for the following common reasons:
§  Incorrect assumption about data quality
§  Unanticipated complexity of design
§  Unanticipated complexity of business logic
§  Incorrect estimation of programming effort
§  Unrealistic expectations about ETL tool capabilities
§  Higher-than-expected complexity in extracting and loading data from numerous sources
§  Additional validation effort required for larger-than-expected data volumes
§  Lack of guidance from subject matter experts
3.How are ETL tasks handled?
ETL tasks are handled by a variety of tools. Some of them, such as file transfer programs, are fairly simple. However, some tools, such as those used for data transformation, can be fairly complex and are able to perform the following functions:
§  Automate data extraction from multiple sources
§  Map sources to the target database
§  Transform and/or manipulate data
§  Load data into the data warehouse
Technologies used in the ETL process include ASCII, XML, HTML, DB Connect, IDoc, BAPI, and SOAP.
4.What are the common tasks performed by ETL tools?
Many tools, from a number of vendors, are available for executing ETL tasks. There are three primary functions that ETL tools are required to accomplish:
§  Read data from a source, such as a relational database table, flat file, or other source
§  Manipulate data (filter, modify, or enhance), based on specific rules
§  Write resulting data to the target database
There are many intermediate and post-loading steps, during and after the three primary ETL operations, including building reference data, performing validation, cleaning, checking integrity, building aggregates, staging, archiving, and purging.
Specific tasks that an ETL tool might be required to perform include the following:
§  Converting data
o    Changing the incoming source data into a unified format and definition
§  Deriving data
o    Applying mathematical formulae to fields in order to create brand-new fields
§  Filtering data
o    Screening out unwanted data from source data files before it is moved to the data warehouse
§  Integrating data
o    Merging files from different databases and platforms
§  Summarizing data
o    Combining detailed data into summary tables
§  Selecting data
o    Selecting and loading data based on triggers
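To make the read-manipulate-write pattern concrete, the following is a minimal sketch in Python. The orders.csv source file, its field names, and the sales_fact target table are hypothetical placeholders; this illustrates the pattern rather than any particular ETL tool.

```python
import csv
import sqlite3

def extract(path):
    """Read source rows from a flat file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Filter, convert, and derive fields based on simple rules."""
    for row in rows:
        if not row.get("order_id"):              # filtering: screen out unwanted rows
            continue
        row["quantity"] = int(row["quantity"])   # converting: unify types and formats
        row["unit_price"] = float(row["unit_price"])
        row["revenue"] = row["quantity"] * row["unit_price"]  # deriving: a new field
        yield row

def load(rows, conn):
    """Write the resulting rows to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(order_id TEXT, quantity INTEGER, unit_price REAL, revenue REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                     [(r["order_id"], r["quantity"], r["unit_price"], r["revenue"])
                      for r in rows])
    conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract("orders.csv")), conn)
```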
5.What are desired features of ETL tools?
As users clamor for complete solutions, vendors are forced to provide more functionality and reliability. Many database vendors now include several desirable ETL features and functions in their products:
§  Richness of features and functions—should include both transformation and cleansing features
§  Ability to handle greater complexity of mappings and transformations
§  Ability to read various file formats
§  Ability to read directly from the source
§  Back-end data management and processing, including metadata management
§  Ability to handle real-time, or near real-time, data for ever-tightening batch-processing schedules
§  Improved throughput and scalability for handling rising data volumes, even as batch-processing timeframes become narrower
§  Ability to handle greater numbers and varieties of data sources (XML, HTML, etc.)
§  Improved capability for capturing changes and updates
§  Ease of use for loading and/or updating
§  Metadata support
§  Ability to communicate with different relational databases
§  Improved administration
ETL tools should be selected carefully, based on the unique needs of each implementation. Selection of an inappropriate tool can cause many problems.
6.What types of ETL tools are available?
The functionality of the available ETL tools varies considerably and their prices can range from minimal (some are even free) to hundreds of thousands of dollars. At the lower end are simple data migration tools whose functionality is limited to extraction and loading. However, the more versatile tools are very sophisticated and can perform many tasks such as enabling transformations, handling a wide range of input formats and sources, and so forth. In general, the tools fit into three main categories:
§  Data transformation engines
§  Code generators
§  Data capture through replication
7.What are the ETL tool selection criteria?
The ETL tool to be used must, first and foremost, be appropriate for the environment in which it is to be used. It should support the basic functions required—transformation and cleansing. The desired and/or required ETL features for a particular implementation should be used as the criteria for selecting the tool. Additionally, business factors that come into play include cost, reliability, scalability, vendor reputation, and availability of resources with the skills to use the product effectively. In some situations, it will not make sense to purchase a third-party tool, which can be very expensive and difficult to use. If the data transformation is extremely complex, for example, it might be better to develop in-house custom programs for the ETL process.
8.Who are the leading providers of ETL tools?
The following is a list of well-known ETL tool vendors:
§  Informatica® (PowerCenter)
§  IBM (InfoSphere Information Server, DataStage)
§  Ab Initio
§  SAP Business Objects (Data Integrator/Data Services)
§  Microsoft (SQL Server Integration Services)
§  Oracle Data Integrator (ODI)
§  Oracle Warehouse Builder (OWB)
§  SAP (Sybase® ETL)
§  SAS (SAS Enterprise Data Integration Server)
§  Adeptia® (Adeptia ETL Suite)
§  DB Software Laboratory® (Advanced ETL Processor)
§  Others include Syncsort®, ETI®, iWay Software®, and Open Text®
The following vendors provide open-source ETL tools:
§  Pentaho® (Pentaho Data Integration)
§  Jaspersoft® (Jaspersoft ETL)
§  Talend® (Data Integration)
§  Clover® (CloverETL)
9.What are the leading ETL frameworks?
There are numerous proprietary ETL frameworks. Among the well-known vendors in this category are IBM, Informatica, Oracle, Ab Initio, SAP Business Objects, Information Builders, SAS, Microsoft, Pervasive®, Digital Fuel®, HiT Software®, Altova®, and Embarcadero®. The open-source framework vendors include Pentaho, Talend, Jitterbit, RapidMiner, Clover, Apatar, Enhydra, and Scriptella®. Freeware ETL frameworks are also provided by some vendors, including Jaspersoft.
10.What is the impact of parallel processing on ETL?
The implementation of parallel-processing technology, which involves simultaneous processing by multiple processors, has helped improve the performance of ETL processing, especially where huge data volumes are involved. There are three main methods for implementing this technology:
§  Data (splits a data file into smaller chunks that are processed in parallel)
§  Pipeline (runs several components simultaneously on the same data stream)
§  Component (runs multiple components simultaneously on different data streams within the same job)
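As a rough illustration of the first method (data parallelism), the sketch below splits the input into chunks and transforms each chunk in a separate worker process. The chunking scheme and the transformation are placeholders, not any vendor's implementation.

```python
from multiprocessing import Pool

def transform_chunk(chunk):
    # Placeholder transformation applied independently to each chunk
    return [row.upper() for row in chunk]

def split(rows, n_chunks):
    """Split the input into roughly equal chunks."""
    size = max(1, len(rows) // n_chunks)
    return [rows[i:i + size] for i in range(0, len(rows), size)]

if __name__ == "__main__":
    rows = [f"record-{i}" for i in range(1_000)]
    with Pool(processes=4) as pool:                       # four parallel workers
        results = pool.map(transform_chunk, split(rows, 4))
    transformed = [row for chunk in results for row in chunk]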
11.How can ETL performance be improved?
The slowest part of the ETL process is often the database load phase, where the lack of speed can be attributed to indexing, concurrency, and data integrity checking. Therefore, that is the area at which many performance improvement measures are targeted. Several techniques are employed to improve performance:
§  Partitioning
§  Indexing
§  Performing validation in the ETL layer before loading
§  Disabling triggers
§  Disabling integrity checking (during loading in the target)
§  Parallel bulk loading
§  Maximizing the use of parallelism
12.What is the ELT process?
In the conventional ETL process, based on business rules, data is transformed in the intermediate step and is then loaded into the final destination—the target database. A variation of the ETL process is the ELT process—extraction, loading, and transformation. In the ELT process, data is first extracted and then loaded directly into the data warehouse database. The transformations, if required, are applied in the third step—only after the data has been loaded into the database.
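A minimal sketch of the ELT flow using an embedded database follows: raw rows are loaded into a staging table first, and the transformation runs afterwards as SQL inside the target database. Table and column names are made up for illustration.

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    # Load: raw rows go straight into a staging table, untransformed
    conn.execute("CREATE TABLE stg_orders (order_id TEXT, quantity TEXT, unit_price TEXT)")
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                     [("1001", "3", "19.99"), ("1002", "1", "5.50")])

    # Transform: applied only after loading, using the database engine itself
    conn.execute("CREATE TABLE fact_orders AS "
                 "SELECT order_id, "
                 "       CAST(quantity AS INTEGER) * CAST(unit_price AS REAL) AS revenue "
                 "  FROM stg_orders")
```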
13.How much data storage is required?
The storage requirements are driven by the business requirements, such as the period of time encompassed by the data that has to be made available in the data warehouse. For example, storage required for three years of data will be quite different compared with storage needed for twenty years of historical data.
The individual storage requirements must be calculated for four different storage areas:
§  Staging area
o    Depends on a number of factors, such as the specific staging requirements, data volume to be loaded, the number of dimensions and facts to be loaded, etc.
§  Main data warehouse
o    The largest storage repository, into which data is loaded from the source systems and retained indefinitely
§  Data marts
o    Storage volume depends on the requirements of the individual data marts, which can vary considerably
§  MDDB (multidimensional database)
o    A smaller database required for storing OLAP data
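A simple back-of-envelope calculation shows how the retention period drives the estimate. All the volumes and row sizes below are assumptions for illustration only.

```python
# Assumed inputs; replace with figures from the actual business requirements
rows_per_day = 500_000        # fact rows loaded daily
avg_row_bytes = 200           # average fact row size, including index overhead
retention_years = 3           # period the business requires in the warehouse

fact_bytes = rows_per_day * avg_row_bytes * 365 * retention_years
staging_bytes = rows_per_day * avg_row_bytes * 7      # e.g. one week kept in staging

print(f"Main data warehouse: {fact_bytes / 1e9:.1f} GB")
print(f"Staging area:        {staging_bytes / 1e9:.1f} GB")
```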
14.What are the levels of data stored in data warehouses?
Data is stored in data warehouses for efficient retrieval at three levels:
§  Current detailed data
o    Contains transaction-level data
§  Lightly summarized data
o    Contains aggregated data, such as revenues by week or by subregion
§  Highly summarized data
o    Contains aggregated data, such as revenues by month or by region
Each data level targets a different type of user, ranging from senior executives to operational users. An inventory clerk might need to review the available inventory for a specific item, which requires the availability of detailed data; a Chief Financial Officer (CFO), on the other hand, will typically be more interested in reviewing the performance of a business unit or region, which would require the availability of highly summarized data.
15.What is current detailed data?
The bulk of the data stored in a data warehouse consists of current detailed data, at the lowest level of granularity (detail). It is organized by subject areas and can be stored as detailed or aggregated data. All data items at this level represent snapshots in time. The period for which current detailed data is maintained in a data warehouse varies by organization. Typically, data for two to five years is maintained. However, many organizations store data for very lengthy periods, perhaps as long as 20-25 years. The frequency of data refresh also varies according to business needs. Many organizations perform a daily data load, whereas some prefer to refresh their data in real time.
16.What is lightly summarized data?
Lightly summarized data refers to data that has been rolled up for particular dimensions or attributes. For example, the detailed sales data can be at the daily level, whereas the lightly summarized sales data can be at the monthly level. The requirement for light data summarization in a data warehouse is based on the fact that most users run queries that repeatedly access and analyze the same data elements at a summarized level. Therefore, by storing summarized data, there can be considerable improvement in the performance and storage requirements.
17.What is highly summarized data?
Highly summarized data refers to data that has been rolled up to an even higher level than lightly summarized data. Continuing the previous example, the monthly sales data could be rolled up further to the annual level. The source for highly summarized data can be lightly summarized data or current detailed data. The primary users of highly summarized data are senior executives and strategic users. Their needs are primarily limited to this level, though, if required, they can also access data at a lower level of detail through a drill-down process.
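Continuing the sales example, the sketch below holds the same data at all three levels: detailed daily rows, a light monthly roll-up, and a highly summarized annual roll-up. Table and column names are illustrative.

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    # Current detailed data: one row per transaction day, region, and amount
    conn.execute("CREATE TABLE sales_detail (sale_date TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", [
        ("2012-01-15", "West", 120.0),
        ("2012-01-20", "West", 80.0),
        ("2012-02-03", "East", 200.0),
    ])

    # Lightly summarized: detail rolled up to month and region
    conn.execute("CREATE TABLE sales_monthly AS "
                 "SELECT substr(sale_date, 1, 7) AS month, region, SUM(amount) AS amount "
                 "FROM sales_detail GROUP BY month, region")

    # Highly summarized: rolled up further, to year and region
    conn.execute("CREATE TABLE sales_annual AS "
                 "SELECT substr(month, 1, 4) AS year, region, SUM(amount) AS amount "
                 "FROM sales_monthly GROUP BY year, region")
```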
18.What is aggregated data?
Frequently accessed data, such as monthly or annual sales, can be aggregated or accumulated along predefined attributes. For example, car sales data can be aggregated by geography and model by adding the sales dollars for each model within a specific geography. Similarly, overall sales can be cumulated for a week, month, quarter, or year. The data to be stored in an aggregated format is determined by a number of factors, including the frequency and complexity of queries.
19.What are the benefits of aggregating data?
The objective of creating aggregates is to improve performance by reducing the amount of data to be read by a query. When a query is run against aggregated data, the response is faster: less data needs to be accessed, because it is already aggregated. Aggregates enable faster navigation as well as faster query run times. Although aggregates reduce the retrieval cost by reducing the amount of data to be retrieved, there is a cost associated with updating them. The reason is that aggregate rollup is required whenever new data is loaded. Therefore, dependent aggregates need to be recalculated whenever there are changes in the detailed data, master data, or hierarchies.
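That maintenance cost can be made concrete with a small sketch: whenever new detail rows arrive, the dependent aggregate is rebuilt so that queries against it remain correct. The table names are illustrative, and a production warehouse would normally refresh incrementally rather than rebuilding from scratch.

```python
import sqlite3

def refresh_sales_by_model(conn):
    """Rebuild the aggregate from the detailed data it depends on."""
    conn.execute("DELETE FROM agg_sales_by_model")
    conn.execute("INSERT INTO agg_sales_by_model "
                 "SELECT geography, model, SUM(amount) FROM car_sales "
                 "GROUP BY geography, model")

with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE car_sales (geography TEXT, model TEXT, amount REAL)")
    conn.execute("CREATE TABLE agg_sales_by_model (geography TEXT, model TEXT, amount REAL)")

    conn.execute("INSERT INTO car_sales VALUES ('West', 'Sedan', 25000)")
    refresh_sales_by_model(conn)          # initial build of the aggregate

    conn.execute("INSERT INTO car_sales VALUES ('West', 'Sedan', 27000)")
    refresh_sales_by_model(conn)          # rollup required again after new detail data
```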
20.What is the effect of granularity?
Granularity is an important data warehouse design issue because it affects the volume of data to be stored and the types of queries that can be executed. If data is highly granular, with a very high level of detail, the volume of data to be stored in the warehouse will be huge. A highly granular data warehouse contains very detailed data, which can include every captured transaction, such as individual sales orders and purchase requisitions. A less granular data warehouse contains more highly summarized data, such as total purchase orders issued for each month or total monthly sales by region.
If stored data is very granular, practically any type of query can be run against it. However, if the data is less granular, the types of queries that can be executed will be limited. Usually, senior executives and decision makers require less granular data, because they work with summarized and aggregated data, whereas operational staff requires more granular data. In recent years, however, this distinction has been blurred as the needs and requirements of these two types of users have started to overlap because of changes in the decision-making levels and the empowerment of lower level employees.
21.How can data stored across multiple data marts be accessed?
Sometimes, the data stored in a data mart might not contain all the dimensions or facts that are required by a query. For example, a query cannot be created in the bookings data mart if it requires billing data as well, because the bookings data mart will not contain any billing data.
The solution in such a case is to create a view that enables access to the two data marts (bookings and billing) without actually joining them physically. For example, in the SAP Business Warehouse (SAP BW) system, this feature is implemented as a MultiProvider. This entity combines data from multiple data marts; however, it does not actually contain any data. When a query is executed, the MultiProvider combines data from its sources and then provides the result to the requesting query.
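The sketch below is a rough relational analogue of that idea, not SAP BW's actual implementation: a view presents bookings and billing data together at query time without physically storing anything itself. Table and column names are invented for illustration.

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE bookings_mart (customer TEXT, booked_amount REAL)")
    conn.execute("CREATE TABLE billing_mart  (customer TEXT, billed_amount REAL)")
    conn.execute("INSERT INTO bookings_mart VALUES ('Acme', 1000)")
    conn.execute("INSERT INTO billing_mart  VALUES ('Acme', 800)")

    # The view holds no data; it assembles rows from both marts when queried
    conn.execute("""
        CREATE VIEW bookings_and_billing AS
        SELECT customer, booked_amount, NULL AS billed_amount FROM bookings_mart
        UNION ALL
        SELECT customer, NULL, billed_amount FROM billing_mart
    """)

    rows = conn.execute("SELECT customer, SUM(booked_amount), SUM(billed_amount) "
                        "FROM bookings_and_billing GROUP BY customer").fetchall()
```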
22.What is data archiving?
The data to be archived, and the frequency of its archival, depend on how the data is to be used. If the data warehouse is expected to support operational needs, the requirements could be met by retaining data for a two-year period. However, if the data is to be used for strategic purposes, the retention requirement will be for a considerably longer period, extending for 5-25 years. The archived data granularity might be the same as it is for the current detailed or aggregated data.
23.What are the steps involved in designing a database?
Database design is accomplished in five steps: (1) planning and analysis, (2) conceptual design, (3) logical design, (4) physical design, and (5) implementation. The data model is created in the conceptual design step. It focuses on the data that needs to be stored in the database and how the tables are designed. The functional model focuses on how the data is to be processed, and also on how the queries are to be designed for accessing the database tables.
24.What are data warehouse engine requirements?
Some of the engine's technical and business requirements, and factors that determine its selection, include the following:
§  Ability to support expected data volumes
§  Scalability
§  Types of queries
§  Flexibility
§  Reliability
§  Performance
§  Cost
§  Data loads
§  Data-loading time
§  Load balancing
§  Parallel processing
§  Query governor and optimizer
§  Replication features
§  Hardware and database combination
§  Ease of monitoring and administration
§  Portability
§  Ability to work with various access tools from different vendors
§  Application Programming Interfaces (APIs) for tools from leading vendors
§  Security
§  Metadata management
§  Extensibility
§  Vendor (reputation and number of database installations)
25.What are database selection criteria?
The following are widely used criteria for selecting databases:
§  Data storage (volume) requirements
§  Reliability
§  Performance
§  Scalability
§  Types of queries
§  Support for SQL
§  Metadata management
§  Distribution requirement
§  Ease of monitoring and administration
§  Proprietary or open source
§  Vendor support
26.Which are the most popular relational databases?
The following companies provide the leading relational database products:
§  Oracle
§  IBM
§  Microsoft
§  Teradata
§  Sybase (SAP)
There are three leading open-source products:
§  MySQL
§  PostgreSQL®
§  SQLite
27.Should an open-source database be considered?
There are advantages as well as disadvantages associated with using an open-source product. Among the critical items to consider are the ability to support the expected data volumes, reliability, and vendor support. As with a proprietary database, it should have the ability to meet the technical and business requirements, which were listed previously.
28.Should a relational or OLAP database be used?
A key data warehouse design decision concerns the database type to be selected: relational database or multidimensional (OLAP) database. The selection of the database type influences the choice of the data access tool, which can be a simple relational query tool, an OLAP tool that provides a multidimensional view of the data, or some other type of specialized decision support tool.
Each database type is characterized by strengths and limitations. Conventional relational databases support the specialized technical requirements of data warehouses, such as data extraction and replication, query optimization, and bit-mapped indexes. However, they provide limited support for data cleanup and transformation functions. The strengths of multidimensional databases include the benefits associated with OLAP, as well as fast querying and performance. Their primary drawback is that they are based on a proprietary database solution.
Conventional relational databases provide many of the features that characterize multidimensional databases. In many cases, either type of database can be used. However, for specialized or complex analysis requirements, multidimensional databases are often preferred.

29.Which data warehouse limitations became data mart drivers?
Data warehouses have provided many benefits that organizations have leveraged to their competitive advantage. Enterprises have been forced to seek alternative solutions for their information needs, however, because of some of the following data warehouse limitations:
§  Lag between need and implementation—it used to take a very long time, sometimes years, from the time a data warehouse was requested until it was actually rolled out to the users.
§  Huge size and scope of data warehouse projects encompassing the entire organization—such projects were very expensive, costing millions. They also spanned a very lengthy period and, in some cases, took years to implement.
§  For enterprise implementations, which spanned disparate computer systems and numerous departments, complexity and integration proved to be overwhelming for most IT departments.
§  Conflicting requirements, priorities, and the schedules of different departments and business units derailed many projects.
§  Politics ruled.
§  Cost overruns and delays were common.
All these problems became the drivers that have led to the development and widespread growth of data marts in the past couple of decades.
30.What are data mart characteristics?
The following characteristics define data marts:
§  Do not have to be based on the enterprise data model
§  Are typically limited to one or a few subject areas (such as sales and finance)
§  Contain only a small part of an organization's data; an enterprise can have many of them
§  Can be implemented as a small project and within months
§  Are easy to design, build, and test; are less expensive to implement and maintain
§  Have far fewer users than a data warehouse
§  Can have one or multiple sources—a data warehouse, another data mart, or an OLTP database
§  Have databases that are far smaller—typically only a few gigabytes
§  Require simpler hardware and supporting technical infrastructure
§  Can be implemented by staff with less experience and fewer technical skills
§  Are typically built by different project teams without a common design, process, tools, hardware, or software
§  Can be built independently, in a staggered manner, as needs evolve
§  Can make future integration, if needed later, difficult to achieve because of independent construction
31.What are the characteristic differences between data marts and data warehouses?
The following is a list of important data mart characteristics and how they compare with those of a conventional data warehouse:
§  Data marts aim to meet a department's needs, whereas a data warehouse is designed for the enterprise.
§  IT staff with limited or no data warehousing skills can implement a data mart, whereas highly skilled IT professionals are required to build an enterprise data warehouse (EDW).
§  A data warehouse spans the entire organization and covers most subjects, whereas a data mart covers only one or a few subject areas.
§  A data warehouse is designed using an enterprise data model and is constructed by a large central specialized professional team; data marts are constructed by different, small, project teams.
§  In contrast to data warehouses, data marts need not be based on the enterprise data model structure.
§  Data warehouses are integrated; data marts are not.
§  Data marts can be linked together, if required, but that task can be challenging.
§  Although a data warehouse is typically unique, an organization can have many data marts, built by different teams, without a common design.
§  The result of independent construction is that data mart integration, if required at a later stage, is difficult to achieve, whereas data warehouses are integrated from the beginning.
§  Data marts are easy to design and build, which makes them less expensive to implement and maintain; designing data warehouses is very complex and time consuming.
§  A data mart, although it can store gigabytes of data, usually contains only a small segment of an organization's data; most of the enterprise data is stored in an EDW.
§  A data warehouse needs to be planned and implemented as a huge project, which is not the case with data marts.
§  Uniformity exists within a data warehouse project; however, it is missing when data marts sprout because various implementation groups might use different processes, tools, hardware, and software.
32.What are the advantages of data marts?
Data marts are characterized by many advantages and benefits, which are listed below:
§  Simpler
§  Less risky
§  More focused
§  Can be implemented in months instead of years
§  Can be implemented in phases
§  Scalable
§  Flexible
§  Can obtain easy and fast approval
§  Not dependent on IT budget
§  More affordable; budget can be a few hundred thousand dollars instead of millions (required for an EDW)
§  Use low-cost hardware and software
§  Avoid political conflicts associated with data warehouse projects
§  Require fewer and less sophisticated resources
§  Can be linked to other data marts or data warehouses
§  Can improve performance by storing the data closer to the users
33.What are the disadvantages of data marts?
Although many benefits are associated with implementing data marts, there are also some disadvantages:
§  Data marts cannot support a key requirement of an EDW—the ability to analyze enterprise-wide data across business units.
§  Their development can be uncoordinated, which creates a hurdle when data marts are used as the building blocks for creating an EDW.
§  Their design is not as thorough as that of a data warehouse and, consequently, consideration for an ultimate upgrade to an EDW is inadequate or lacking.
§  Every data mart has its own narrow view.
§  The age-old problem with multiple legacy systems can also afflict data marts—a query might yield different answers depending on which system was accessed, when it was accessed, and how the query was structured and executed.
§  Redundant and inconsistent data can make "one version of the truth" a casualty.
§  Growth of data marts creates more redundant and inconsistent data, which has a cost associated with it and poses problems during upgrades.
§  Design flaws and the number of data extracts can restrict scalability; data marts might be unable to support massive data volumes associated with data warehouses.
§  Clandestine development and operations might be encouraged.
§  More work is required in reconciling terms, definitions, data, and business rules when data is migrated to an EDW.
§  They are designed and built by less experienced personnel, which can affect the quality of the product.
§  Multiple databases must be maintained, which can be inefficient and could require a greater breadth of technical skills.
§  The extraction process can be different for each data mart.
§  Activities such as extraction and processing are not centralized; activities can be duplicated and additional staff can be required for maintenance and support.
§  Tools, software, hardware, and processes can be different for each data mart.
§  Knowledge gained by one data mart group may not be shared with other groups.
§  They can be expensive in the long run.
However, despite all these potential issues, when a cost/benefit analysis is performed, the benefits to the business far outweigh the negatives, which is the reason data marts are so popular and widely implemented.
34.How are data marts loaded?
There are two primary ways in which data is usually loaded into a data mart:
§  Data is fed from an EDW to the data mart(s); any changes to the EDW are propagated to all associated data marts receiving its feeds.
§  Data is fed to the data mart(s) by direct extract(s) from the source system(s).
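As a sketch of the first pattern, with hypothetical table names, the sales data mart below is populated by selecting only its subject area from the EDW:

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE edw_fact (department TEXT, amount REAL)")
    conn.execute("CREATE TABLE sales_mart_fact (department TEXT, amount REAL)")
    conn.executemany("INSERT INTO edw_fact VALUES (?, ?)",
                     [("sales", 100.0), ("finance", 50.0)])

    # The mart receives only the EDW rows relevant to its subject area
    conn.execute("INSERT INTO sales_mart_fact "
                 "SELECT department, amount FROM edw_fact WHERE department = 'sales'")
```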
35.Which platforms and tools can be used by data marts?
Data warehouses and data marts have been implemented on many platforms including Unix and Windows. Implementation of such projects involves many technologies, tools, hardware, and software, which range from the very simple to the very sophisticated. Although some vendors dominate in specific areas, such as ETL or databases, and some have offerings across the range of products required for implementing a data mart or data warehouse, no vendor dominates in all areas. The data mart vendors, with some exceptions, are the same as the data warehouse vendors.
36.What are the common data warehouse/data mart implementation approaches?
There are two widely used approaches for building data marts:
§  Top-down
o    Build an EDW and then construct dependent data marts, which are its highly summarized subsets.
§  Bottom-up
o    Build independent data marts first; these can later be integrated, through conformed dimensions, to construct an EDW.
The other approaches used to build data warehouses include the hybrid, federated, and hub-and-spoke methods.
37.What is the top-down approach?
In this approach, recommended by Bill Inmon (Figure 6), the data warehouse is planned as the corporate central repository. In this architecture, a normalized data model is used, with the data being stored at the lowest level of detail or granularity. After the enterprise data warehouse is developed, dependent data marts are constructed and are fed by the data warehouse. Whereas a data warehouse is difficult to implement, data marts are relatively simple to construct because they are developed for specific departments or business processes.
Figure 6: Top-down architecture.
38.What are the characteristics of the top-down approach?
The characteristics defining the top-down approach are as follows:
§  Is methodology-based; addresses important aspects such as modeling and implementation
§  Provides an enterprise-wide view of the organization
§  Has a well-planned architecture
§  Is implemented based on well-documented requirements
§  Avoids integration issues that characterize data warehouses derived from data marts
§  Permits better control and quality, because the EDW drives the construction of dependent data marts
§  Can be slow to respond to business needs; long delivery cycles, high cost, project delays, cost overruns, and other issues associated with large projects are common
39.What are the drawbacks of the top-down approach?
The primary disadvantage of the top-down approach is that it requires a major project for implementation, because of the massive scope of work that is involved. Consequently, it has a high cost and risk associated with it. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the time involved, from the start of the project to the time that end users experience initial benefits, can be lengthy. Also, during implementation, this methodology can be inflexible and unresponsive to changing requirements.
40.What is the bottom-up approach?
In the bottom-up approach, independent data marts are built first. These data marts, usually designed for specific business processes or departments, can be combined subsequently to construct the enterprise data warehouse, as shown in Figure 7. Their integration is accomplished through what is known as a "data warehouse bus architecture," which involves joining the data marts together by conforming their dimensions. When the same dimension tables are used by different dimensional schemas, the tables are known as conformed dimensions. They enable drilling across from one schema to another and joining measures from different star schemas. Conformed dimensions can be used across any business area, because they allow queries to be executed across star schemas. They can be used to analyze facts from two or more data marts.
This approach is recommended by Ralph Kimball, the well-known data warehouse expert.
Figure 7: Bottom-up architecture.
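A minimal sketch of a conformed dimension follows, with made-up tables: two fact tables from different marts share the same dim_date table, so their measures can be analyzed together. Each fact is summarized separately and the results are then joined on the shared dimension attributes (a drill-across query).

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    # Two star schemas (sales and shipments) share the same conformed date dimension
    conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER)")
    conn.execute("CREATE TABLE fact_sales (date_key INTEGER, amount REAL)")
    conn.execute("CREATE TABLE fact_shipments (date_key INTEGER, units INTEGER)")
    conn.execute("INSERT INTO dim_date VALUES (20120101, 2012, 1)")
    conn.execute("INSERT INTO fact_sales VALUES (20120101, 500.0)")
    conn.execute("INSERT INTO fact_shipments VALUES (20120101, 12)")

    # Drill-across: summarize each fact separately, then join on the shared dimension
    rows = conn.execute("""
        SELECT s.year, s.month, s.sales, sh.units
        FROM (SELECT d.year, d.month, SUM(f.amount) AS sales
              FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
              GROUP BY d.year, d.month) AS s
        JOIN (SELECT d.year, d.month, SUM(f.units) AS units
              FROM fact_shipments f JOIN dim_date d ON f.date_key = d.date_key
              GROUP BY d.year, d.month) AS sh
          ON s.year = sh.year AND s.month = sh.month
    """).fetchall()
```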
41.What are the characteristics of the bottom-up approach?
The characteristics defining the bottom-up approach are as follows:
§  This approach can meet some enterprise-wide needs, because data marts can be combined seamlessly for reporting and analysis.
§  Seamless and transparent integration, although possible, is technologically challenging and performance can be poor, especially if many data marts need to be integrated.
§  The bottom-up approach can lead to the sprouting of data marts and data redundancy.
§  Lack of adherence to standards can cause major issues, especially integration problems, when an EDW is constructed from multiple data marts.
§  Fast implementation provides less time for analysis.
§  Implementation is not difficult.
§  Implementation can be accomplished in phases.
§  Implementation risk is lower.

42.What is the hybrid approach?
Both the top-down and bottom-up approaches have their advantages and disadvantages, leading to the hybrid approach, which tries to find a middle ground. It provides faster implementation, just as the bottom-up approach does. However, it also leverages the data and design consistency, as well as the integration benefit, of the top-down approach.
In the hybrid approach, the requirements are defined at the enterprise level, followed by definition of the data warehouse architecture. Next, the data content is standardized. Finally, a number of data marts are implemented in sequence. The enterprise data model is implemented in an iterative manner, with the heavy-duty infrastructure being implemented at a later stage, only when enterprise-wide reporting becomes a requirement.
43.What is the federated approach?
The federated approach tries to leverage existing deployed systems by integrating them when required because of changing business needs or conditions, such as mergers and acquisitions, cross-functional requirements, organizational changes, and other events. These systems can include decision support structures, OLTP systems, data warehouses, data marts, and applications. The benefit is that the existing infrastructure does not need to be discarded and the organization can be more responsive to changing business needs.
44.What is the independent approach?
In this approach, data marts are built randomly, without any enterprise planning or consideration of a common data model. This approach can lead to anarchy over a period of time.
45.Which approach should be used?
All the approaches are characterized by various pros and cons. Each approach meets the needs of many different types of organizations and users. The decision to select a particular approach is influenced by many factors, such as the relative importance of the various selection variables for the organization considering an implementation.
The selection variables can be diverse and can include project objectives, budget, timeline, availability of funding and resources, complexity of the environment, new or upgraded implementation, scalability, state of existing infrastructure, and so forth. For example, if the objective is to have something up and running quickly, or the budget is limited, the obvious choice is the data mart bottom-up approach. However, if strategic concerns and long-term considerations are driving the requirements, and the company is prepared to spend millions for a well-designed system, the top-down approach for implementing an EDW will be the appropriate choice.
46.What is the Inmon approach?
Bill Inmon is a data warehousing pioneer who provided the widely accepted definition of a data warehouse, which is based on four characteristics: subject-oriented, integrated, nonvolatile, and time-variant. His basic approach is to plan for an EDW using the top-down approach. In this architecture, the data warehouse can be used to feed any associated data marts. The Inmon methodology, as well as the Kimball methodology, has been widely used for implementing data warehouses. There are several other characteristics that define the Inmon approach:
§  "Big-bang" implementation
§  Use of well-known database development methodologies and tools
§  Data warehouse—part of the Corporate Information Factory (CIF), the broader corporate information environment
§  More dependent on IT professionals, which provides better technical solutions
§  Secondary role of business users
47.What is the Kimball approach?
Ralph Kimball is one of the pioneers of data warehousing. He is credited with dimensional modeling, or the Kimball methodology, which is one of the most widely used methodologies for implementing data warehouses. According to this approach, a data warehouse is the combination of various data marts, all of which are designed using the dimensional model. The Kimball approach includes other defining characteristics as well:
§  Conventional database development methodology not followed
§  Data mart built for each process
§  Multiple data marts to meet all the needs and to create a data warehouse
§  Data marts connected via data bus
48.What are the similarities between the Kimball and Inmon approaches?
The following are the defining similarities between the two approaches:
§  Time-stamped data
o    Kimball - Date dimension
o    Inmon - Time element
§  ETL process
o    Inmon - Data loaded into the data warehouse
o    Kimball - Data loaded into the data marts
§  Business requirements gathered first
§  Users can query by time period
49.What are the differences between the Kimball and Inmon approaches?
The following are the defining differences between the two approaches:
§  Complexity
o    Inmon - Complex
o    Kimball - Simple
§  Analysis
o    Inmon - Extensive analysis performed
o    Kimball - Limited analysis performed
§  Primary focus
o    Inmon - IT professionals because of nature of methodology, architecture, and development approach (spiral); leads to IT ownership
o    Kimball - Business users because of simpler development methodology
§  Development methodologies
o    Inmon - Top-down
o    Kimball - Bottom-up
§  Implementation
o    Inmon - Big-bang
o    Kimball - Incremental
§  Data-modeling approach
o    Inmon - Subject-oriented or data-driven
o    Kimball - Process-oriented; users can participate actively
§  Tools
o    Inmon - ERDs
o    Kimball - Dimensional modeling
§  Data warehouse architecture
o    Inmon - EDW serves the entire enterprise and feeds departmental data marts/databases
o    Kimball - Single-process data marts connected via a data bus and conformed dimensions
50.Which characteristics favor Kimball?
Each of the defining characteristics can favor either Inmon or Kimball, depending on the requirements and needs of the organization considering the data warehouse/data mart implementation. Therefore, although the answer is enterprise/project specific, some generalizations can be made regarding the characteristics that favor Kimball.
The Kimball methodology is favored when the needs are tactical rather than strategic, implementation is urgently required, budget and scope are limited, resources are limited and lack top-notch technical skills, integration across the enterprise is not required, and focus is on the analysis of business metrics.
51.Which characteristics favor Inmon?
The Inmon methodology is favored when the needs are strategic rather than tactical, implementation can be performed over a very long period, budget is very large and scope is enterprise-wide, resources are not significantly constrained and top-notch technical professionals are available, integration across the enterprise is required, and focus is on nonmetric data as well as data that can be used for varied needs across the enterprise.
52.Which approach is better?
Both the Kimball and Inmon approaches have been widely implemented over the past couple of decades. They have met the varied and specific needs of enterprises as well as departments. The approach to be selected should meet the users' needs and must not be based on a dogmatic belief. Depending on the specific requirements, either approach could be appropriate. For example, if the time frame is short and budget is limited, Kimball's approach is the obvious way to go. On the other hand, if the requirement is an EDW, covering the needs of the enterprise spread over many regions, and the organization is flush with cash, then Inmon's approach should be selected.
53.How should the approach be selected?
The following are some items that should be included in the list of variables to be considered when selecting the approach:
§  Scope (EDW or departmental)
§  Implementation schedule
§  Budget
§  Environment/infrastructure
§  Technology requirements
§  Availability of technical expertise
§  Culture
§  Operating cost
§  Data availability and quality
§  Expected growth
54.What is a dimension?
Wikipedia defines a dimension as "a data element that categorizes each item in a data set into non-overlapping regions" (http://en.wikipedia.org/wiki/Dimension_(data_warehouse)). A dimension can also be viewed as a structure, often composed of hierarchies, that is used to analyze business measures or metrics, which are also known as "facts." A dimension, examples of which are customer, region, and time, enables data warehouse data to be easily sliced and diced.
55.What are the characteristics of a dimension?
A dimension represents an attribute such as product, region, sales channel, or time. All data warehouses have one common dimension—time. A spreadsheet is the simplest example of a two-dimensional model. The spreadsheet row and column names are the "dimensions" and the numeric data in it are the "facts." A time dimension can include all months, quarters, and years, whereas a geography dimension can include all countries, regions, and cities. A dimension acts as an index for identifying values in a multidimensional array. If the number of dimensions used is increased, the level of detail that can be queried becomes greater.
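As a tiny illustration with made-up numbers, each fact below is indexed by two dimensions, region and quarter, and can therefore be sliced along either one:

```python
# Facts (sales amounts) indexed by two dimensions: region and quarter
sales = {
    ("West", "2012-Q1"): 120.0,
    ("West", "2012-Q2"): 150.0,
    ("East", "2012-Q1"): 90.0,
}

# Slice by the region dimension
west_total = sum(v for (region, quarter), v in sales.items() if region == "West")

# Slice by the time dimension
q1_total = sum(v for (region, quarter), v in sales.items() if quarter == "2012-Q1")
```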
56.What is a slowly changing dimension?
A dimension that changes over time is called a slowly changing dimension. For example, the price of a product might change, a region could be renamed, or the address of a customer might change. Such changes are implemented in different ways. For example, the old record can be replaced with a new record, which causes the history to be lost. The two other methods used to capture attribute changes include creating an additional table record with a new value and creating a new field in the dimension table (where the old dimension value is stored).
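One common way to implement the "additional table record" method (often called a Type 2 slowly changing dimension) is sketched below, with hypothetical columns: the old row is closed off and a new current row is inserted, so history is preserved.

```python
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute("""CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
        customer_id  TEXT,                  -- natural/business key
        address      TEXT,
        valid_from   TEXT,
        valid_to     TEXT,
        is_current   INTEGER)""")
    conn.execute("INSERT INTO dim_customer VALUES "
                 "(1, 'C100', 'Old Street 1', '2010-01-01', '9999-12-31', 1)")

    # The customer moves: close the old version and insert a new current row
    conn.execute("UPDATE dim_customer SET valid_to = '2012-11-25', is_current = 0 "
                 "WHERE customer_id = 'C100' AND is_current = 1")
    conn.execute("INSERT INTO dim_customer VALUES "
                 "(2, 'C100', 'New Avenue 9', '2012-11-25', '9999-12-31', 1)")
```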


