1.What are the potential
problems and challenges of ETL?
The ETL task can face many
problems due to a variety of reasons:
§ Unanticipated complexity
§ Conflicting business rules
§ Changing business
requirements
§ Frequent changes in design
§ Flawed or inadequate
design
§ Incorrect mapping
§ Longer loading times than
designed
§ Dirtier data than expected
§ Insufficient development
time
§ Problems due to lack of
validation
§ Flawed testing process
§ Inadequate tools
§ Inferior developer skills
§ Infrastructure issues
2.Why is the ETL task often underestimated?
The ETL task is the most
difficult and labor-intensive activity in the data warehouse project.
It is frequently underestimated, leading to cost
overruns and project delays. The scope of this task is often underestimated for
the following common reasons:
§ Incorrect assumption
about data quality
§ Unanticipated complexity
of design
§ Unanticipated complexity
of business logic
§ Incorrect estimation of
programming effort
§ Unrealistic expectations
about ETL tool capabilities
§ Higher-than-expected
complexity in extracting and loading data from numerous sources
§ Additional validation
effort required for larger-than-expected data volumes
§ Lack of guidance from
subject matter experts
3.How
are ETL tasks handled?
ETL tasks are handled by a
variety of tools. Some of them, such as file transfer programs, are fairly simple. Others, such as those used for data transformation, can be fairly complex and are able to
perform the following functions:
§ Automate data extraction from
multiple sources
§ Map sources to the target
database
§ Transform and/or
manipulate data
§ Load data into the data warehouse
Technologies and formats used in the ETL process include ASCII and XML files, HTML, DB Connect, IDocs, BAPIs, and SOAP.
4.What
are the common tasks performed by ETL tools?
Many tools, from a number
of vendors, are available for executing ETL tasks. There are three primary
functions that ETL tools are required to accomplish:
§ Read data from a source, such
as a relational database table, flat file, or other
§ Manipulate data (filter, modify, or
enhance), based on specific rules
§ Write resulting data to the target
database
There are many
intermediate and post-loading steps, during and after the three primary ETL
operations, including building reference data, performing validation, cleaning, checking integrity, building
aggregates, staging, archiving, and purging.
Specific tasks that an ETL tool might be required to perform include the following (a minimal sketch follows the list):
§ Converting data
o Changing the incoming
source data into a unified
format and definition
§ Deriving data
o Applying mathematical
formulae to fields in order to create brand-new fields
§ Filtering data
o Screening unwanted data from source data files before moving
it to the data warehouse
§ Integrating data
o Merging files from
different databases and platforms
§ Summarizing data
o Combining tables
§ Selecting data
o Selecting and loading data based on triggers
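To make these tasks concrete, the following is a minimal, illustrative Python sketch of the three primary functions together with a few of the specific tasks listed above (converting, deriving, and filtering). The source file sales.csv, its column names, and the SQLite target are hypothetical assumptions for the example, not a reference to any particular ETL tool.

# Minimal ETL sketch (illustrative only): extract rows from a hypothetical
# sales.csv file, apply converting/deriving/filtering rules, and load the
# result into a SQLite target table.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        # Filtering: screen unwanted records before they reach the warehouse
        if row["status"] != "shipped":
            continue
        # Converting: unify the incoming format (e.g., normalize region codes)
        row["region"] = row["region"].strip().upper()
        # Deriving: apply a formula to create a brand-new field
        row["net_amount"] = float(row["amount"]) - float(row["discount"])
        yield row

def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (order_id TEXT, region TEXT, net_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO fact_sales (order_id, region, net_amount) VALUES (?, ?, ?)",
        [(r["order_id"], r["region"], r["net_amount"]) for r in rows],
    )
    conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract("sales.csv")), conn)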
5.What
are desired features of ETL tools?
As users clamor for
complete solutions, vendors are forced to provide more functionality and
reliability. Many database vendors now include several desirable ETL features
and functions in their products:
§ Richness of features and
functions—should include both transformation and cleansing features
§ Ability to handle greater
complexity of mappings and transformations
§ Ability to read various
file formats
§ Ability to read directly
from the source
§ Back-end data management and
processing, including metadata management
§ Ability to handle
real-time, or near real-time, data for ever-tightening batch-processing schedules
§ Improved throughput and
scalability for handling rising data volumes, even as batch-processing timeframes become narrower
§ Ability to handle greater
numbers and varieties of data sources (XML, HTML, etc.)
§ Improved capability for
capturing changes and updates
§ Easy to use—for loading
and/or updating
§ Provide metadata support
§ Ability to communicate
with different relational databases
§ Improved administration
ETL tools should be
selected carefully, based on the unique needs of each implementation. Selection
of an inappropriate tool can cause many problems.
6.What types of ETL tools are available?
The functionality of the
available ETL tools varies considerably and their prices can range from minimal
(some are even free) to hundreds of thousands of dollars. At the lower end are
simple data migration tools
whose functionality is limited to extraction and loading. However, the more
versatile tools are very sophisticated and can perform many tasks such as
enabling transformations, handling a wide range of input formats and sources,
and so forth. In general, the tools fit into three main categories:
§ Data transformation
engines
§ Code generators
§ Data capture through
replication
The ETL tool to be used
must, first and foremost, be appropriate for the environment in which it is to
be used. It should support the basic functions required—transformation and
cleansing. The desired and/or required ETL features for a particular implementation
should be used as the criteria for selecting the tool. Additionally, business
factors that come into play include cost, reliability, scalability, vendor
reputation, and availability of resources with the skills to use the product
effectively. In some situations, it will not make sense to purchase a
third-party tool, which can be very expensive and difficult to use. If
the data transformation is
extremely complex, for example, it might be better to develop in-house custom
programs for the ETL process.
8.Who
are the leading providers of ETL tools?
The following is a list of
well-known ETL tool vendors:
§ Informatica® (PowerCenter)
§ IBM (InfoSphere
Information Server, DataStage)
§ Ab Initio
§ SAP Business Objects (Data Integrator/Data Services)
§ Microsoft (SQL Server
Integration Services)
§ Oracle Data Integrator (ODI)
§ Oracle Warehouse Builder
(OWB)
§ SAP (Sybase® ETL)
§ SAS (SAS Enterprise Data Integration Server)
§ Adeptia® (Adeptia
ETL Suite)
§ DB Software Laboratory® (Advanced
ETL Processor)
§ Others include Syncsort®,
ETI®, iWay Software®, and Open Text®
The following vendors
provide open-source ETL tools:
§ Pentaho® (Pentaho Data Integration)
§ Jaspersoft® (Jaspersoft
ETL)
§ Talend® (Data Integration)
§ Clover® (CloverETL)
9.What
are the leading ETL frameworks?
There are numerous
proprietary ETL frameworks. Among the well-known vendors in this category are
IBM, Informatica, Oracle, Ab Initio, SAP Business Objects, Information
Builders, SAS, Microsoft, Pervasive®, Digital Fuel®, HiT
Software®, Altova®, and Embarcadero®. The
open-source framework vendors include Pentaho, Talend, Jitterbit, RapidMiner,
Clover, Apatar, Enhydra, and Scriptella®. Freeware ETL frameworks
are also provided by some vendors, including Jaspersoft.
10.What
is the impact of parallel processing on ETL?
The implementation of parallel-processing technology, which involves simultaneous processing by two or more processors, has helped improve the performance of ETL processing, especially where huge data volumes are involved. There are three main methods for implementing this technology (a minimal sketch of the data method follows the list):
§ Data (splits a data file into smaller files or segments that can be processed simultaneously)
§ Pipeline (enables several components to run simultaneously on the same data stream)
§ Component (runs multiple processes simultaneously on different data streams in the same job)
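The following is a minimal sketch of the data method only, assuming a hypothetical input file big_extract.txt and a placeholder transformation; it illustrates splitting the data into chunks and processing them with separate worker processes, not any particular ETL tool's implementation.

# Illustrative sketch of the "data" parallel-processing method: a large input
# file is split into smaller chunks, and each chunk is transformed by a
# separate worker process.
from multiprocessing import Pool

def transform_chunk(lines):
    # Placeholder transformation applied independently to each chunk
    return [line.strip().upper() for line in lines]

def split(lines, n_chunks):
    size = max(1, len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    with open("big_extract.txt") as f:
        chunks = split(f.readlines(), n_chunks=4)
    with Pool(processes=4) as pool:
        results = pool.map(transform_chunk, chunks)  # chunks processed in parallel
    transformed = [line for chunk in results for line in chunk]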
11.How
can ETL performance be improved?
The slowest part of the ETL process is often the database load phase, during which the lack of speed can be attributed to indexing, concurrency, and data-integrity checking. Therefore, that is the area at which many performance-improvement measures are targeted. Several techniques are employed to improve performance (two of them are sketched in the code after the list):
§ Partitioning
§ Indexing
§ Performing validation in
the ETL layer before loading
§ Disabling triggers
§ Disabling integrity
checking (during loading in the target)
§ Parallel bulk loading
§ Maximizing the use of
parallelism
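The sketch below illustrates two of the techniques above, dropping (in effect disabling) an index for the duration of the load and inserting rows in bulk within a single transaction, using SQLite only for brevity. The table and index names are hypothetical.

# Hedged sketch of two load-phase optimizations: the index is dropped before
# the load and re-created afterwards, and the rows are bulk-inserted inside
# one transaction.
import sqlite3

def bulk_load(conn, rows):
    conn.execute("DROP INDEX IF EXISTS idx_fact_sales_region")  # avoid index maintenance during load
    with conn:  # single transaction around the bulk insert
        conn.executemany(
            "INSERT INTO fact_sales (order_id, region, net_amount) VALUES (?, ?, ?)",
            rows,
        )
    conn.execute("CREATE INDEX idx_fact_sales_region ON fact_sales (region)")  # rebuild the index once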
12.What
is the ELT process?
In the conventional ETL process, data is transformed in an intermediate step, based on business rules, and is then loaded into the final destination—the target database. A variation of the ETL process is the ELT process—extraction, loading, and transformation. In the ELT process, data is first extracted and then loaded directly into the data warehouse database. The transformations, if required, are applied in the third step—only after the data has been loaded into the database.
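As a rough illustration of the ELT sequence, the sketch below loads raw rows into a staging table first and then applies the transformation inside the database with SQL. The table names, sample rows, and the trivial cleanup rule are assumptions made for the example only.

# Illustrative ELT sketch: raw data is loaded into the warehouse first, and
# the transformation is applied inside the database engine as step three.
import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    # E + L: extract and load the raw data as-is into a staging table
    conn.execute("CREATE TABLE IF NOT EXISTS stg_sales (order_id TEXT, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO stg_sales VALUES (?, ?, ?)",
        [("1001", " west ", 250.0), ("1002", "east", 120.0)],
    )
    # T: transform in place, using SQL inside the target database
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id TEXT, region TEXT, amount REAL)")
    conn.execute(
        "INSERT INTO fact_sales SELECT order_id, UPPER(TRIM(region)), amount FROM stg_sales"
    )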
13.How much data storage is required?
The storage requirements
are driven by the business requirements, such as the period of time encompassed
by the data that has to be made available in the data warehouse. For example,
storage required for three years of data will be quite different compared with storage needed for twenty
years of historical data.
The individual storage
requirements must be calculated for four different storage areas:
§ Staging area
o Depends on a number of
factors, such as the specific staging requirements, data volume to be loaded, the
number of dimensions and facts to be loaded, etc.
§ Main data warehouse
o The largest storage
repository, into which data is loaded from the source systems and retained indefinitely
§ Data marts
o Storage volume depends on the requirements of the individual data marts, which can vary considerably
§ MDDB database
o The smaller
multidimensional database required for storing OLAP data
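A back-of-envelope sizing sketch in Python may help illustrate how retention drives the storage estimate; the daily row count, average row width, and overhead factor below are purely hypothetical assumptions.

# Rough sizing sketch for the main data warehouse area; all inputs are made up.
rows_per_day = 500_000
bytes_per_row = 400            # assumed average fact-row width including indexes
overhead_factor = 1.3          # assumed allowance for staging, aggregates, and working space

def storage_gb(years):
    raw = rows_per_day * 365 * years * bytes_per_row
    return raw * overhead_factor / 1024**3

print(f"3 years:  {storage_gb(3):,.0f} GB")   # roughly 265 GB under these assumptions
print(f"20 years: {storage_gb(20):,.0f} GB")  # roughly 1,770 GB under these assumptions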
Data is stored in data warehouses for efficient retrieval at three levels:
§ Current detailed data
o Contains transaction-level data
§ Lightly summarized data
o Contains aggregated data, such as revenues by week or by subregion
§ Highly summarized data
o Contains aggregated data, such as revenues by month or by region
Each data level targets a different
type of user, ranging from senior executives to operational users. An inventory
clerk might need to review the available inventory for a specific item, which
requires the availability of detailed data; a Chief Financial
Officer (CFO), on the other hand, will typically be more interested in reviewing
the performance of a business unit or region, which would require the
availability of highly summarized data.
15.What is current detailed data?
The bulk of the data stored in a data warehouse consists of
current detailed data, at the lowest level of
granularity (detail). It is organized by subject areas and can be stored as
detailed or aggregated data. All data items at this level
represent snapshots in time. The period for which current detailed data is maintained in a data warehouse varies by
organization. Typically, data for two to five years is maintained. However, many organizations store data for very lengthy periods, perhaps as long as 20 to 25 years. The frequency of data refresh also varies according to business needs. Many organizations perform a daily data load, whereas some prefer
to refresh their data in real time.
16.What
is lightly summarized data?
Lightly summarized data refers to data that has been rolled up
for particular dimensions or attributes. For example, the detailed sales data can be at the daily level,
whereas the lightly summarized sales data can be at the monthly level. The requirement for light data summarization in a data warehouse is based on the fact that most users run queries that repeatedly access and analyze the same data elements at a summarized level. Therefore, by storing summarized data, there can be a considerable improvement in query performance and in the use of processing resources.
17.What
is highly summarized data?
Highly summarized data refers to data that has been rolled up to
an even higher level than lightly summarized data. Extending the previous example, sales data can be summarized at a very high level, such as the annual level.
The source for highly summarized data can be lightly summarized data or current detailed data. The primary users of
highly summarized data are senior executives and strategic users. Their needs are
primarily limited to this level, though, if required, they can also access data at a lower level of detail
through a drill-down process.
18.What
is aggregated data?
Frequently accessed data, such as monthly or annual sales, can be aggregated or
accumulated along predefined attributes. For example, car sales data can be aggregated by
geography and model by adding the sales dollars for each model within a
specific geography. Similarly, overall sales can be accumulated for a week, month, quarter, or year. The data to be stored in an aggregated format is determined by a number of factors, including the frequency and complexity of queries.
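The following short pandas sketch illustrates the car-sales example: sales dollars aggregated by geography and model, and accumulated by month. The column names and figures are invented for illustration.

# Sketch of pre-computing aggregates along predefined attributes.
import pandas as pd

sales = pd.DataFrame({
    "geography": ["West", "West", "East", "East"],
    "model":     ["Sedan", "Sedan", "SUV", "SUV"],
    "month":     ["2024-01", "2024-02", "2024-01", "2024-01"],
    "sales_usd": [120000, 95000, 180000, 60000],
})

# Aggregate by geography and model (adds sales dollars per model within a geography)
by_geo_model = sales.groupby(["geography", "model"], as_index=False)["sales_usd"].sum()

# Accumulate overall sales by month
by_month = sales.groupby("month", as_index=False)["sales_usd"].sum()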
19.What
are the benefits of aggregating data?
The objective of creating
aggregates is to improve performance by reducing the amount of data to be read by a query.
When a query is run against aggregated data, the response is faster:
less data needs to be accessed, because it is already aggregated. Aggregates
enable faster navigation as well as faster query run times. Although aggregates
reduce the retrieval cost by reducing the amount of data to be retrieved, there is
a cost associated with updating them. The reason is that aggregate rollup is
required whenever new data is loaded. Therefore, dependent aggregates need to be
recalculated whenever there are changes in the detailed data, master data, or hierarchies.
20.What
is the effect of granularity?
Granularity is an
important data warehouse design issue because it affects the volume of data to be stored and the types
of queries that can be executed. If data is highly granular, with a very high level of detail, the volume
of data to be stored in the warehouse will be huge. A highly granular data warehouse contains very
detailed data, which can include every
captured transaction, such as individual sales orders and purchase
requisitions. A less granular data warehouse contains more highly summarized data, such as total purchase orders issued for each month or total
monthly sales by region.
If stored data is very granular,
practically any type of query can be run against it. However, if the data is less granular, the
types of queries that can be executed will be limited. Usually, senior
executives and decision makers require less granular data, because they work with summarized and aggregated data, whereas operational staff requires more granular data. In recent years, however, this distinction has been blurred as
the needs and requirements of these two types of users have started to overlap
because of changes in decision-making levels and the empowerment of lower-level employees.
21.How
can data stored across multiple data marts be accessed?
Sometimes, the data stored in a data mart might not contain all
the dimensions or facts that are required by a query. For example, a query
cannot be created in the bookings data mart if it requires billing data as well, because the bookings data mart will not contain any billing data.
The solution in such a
case is to create a view that will enable access to the two data marts (bookings and
billing), without actually joining them physically. For example, in the SAP
Business Warehouse (SAP BW) system, this feature is implemented as a
MultiProvider. This entity combines data from multiple data marts; however, it does not actually contain any data. When a query is executed, the MultiProvider combines data from its sources and then
provides data to the requesting query.
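The sketch below shows, in generic relational form, how such a view can combine two data marts without physically joining or copying their data; it is only loosely analogous to an SAP BW MultiProvider, and the table and column names are hypothetical.

# Hedged sketch: a view that unions a bookings mart and a billing mart at query time.
import sqlite3

with sqlite3.connect("marts.db") as conn:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS bookings (customer TEXT, month TEXT, booked_usd REAL);
        CREATE TABLE IF NOT EXISTS billing  (customer TEXT, month TEXT, billed_usd REAL);

        -- The view holds no data of its own; it combines the two marts only when queried
        CREATE VIEW IF NOT EXISTS v_bookings_billing AS
            SELECT customer, month, booked_usd, NULL AS billed_usd FROM bookings
            UNION ALL
            SELECT customer, month, NULL AS booked_usd, billed_usd FROM billing;
    """)
    rows = conn.execute(
        "SELECT customer, month, SUM(booked_usd), SUM(billed_usd) "
        "FROM v_bookings_billing GROUP BY customer, month"
    ).fetchall()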
The data to be archived, and the frequency of its archival, depend on how the data is to be used. If the data warehouse is expected to support operational needs, the requirements could be met by retaining data for a two-year period. However, if the data is to be used for strategic purposes, the retention requirement will be for a considerably longer period, extending from 5 to 25 years. The archived data granularity might be the
same as it is for the current detailed or aggregated data.
23.What
are the steps involved in designing a database?
Database design is accomplished
in five steps: (1) planning and analysis, (2) conceptual design, (3) logical
design, (4) physical design, and (5) implementation. The data model is created in the
conceptual design step. It focuses on the data that needs to be stored in the database and how the tables are
designed. The functional model focuses on how the data is to be processed, and also on how the queries are to be designed
for accessing the database tables.
24.What are data warehouse engine requirements?
Some of the engine's
technical and business requirements, and factors that determine its selection,
include the following:
§ Ability to support
expected data volumes
§ Scalability
§ Types of queries
§ Flexibility
§ Reliability
§ Performance
§ Cost
§ Data loads
§ Data-loading time
§ Load balancing
§ Parallel processing
§ Query governor and
optimizer
§ Replication features
§ Hardware and database
combination
§ Ease of monitoring and
administration
§ Portability
§ Ability to work with
various access tools from different vendors
§ Application Programming
Interfaces (APIs) for tools from leading vendors
§ Security
§ Metadata management
§ Extensibility
§ Vendor (reputation and
number of database installations)
25.What
are database selection criteria?
The following are widely
used criteria for selecting databases:
§ Data storage (volume) requirements
§ Reliability
§ Performance
§ Scalability
§ Types of queries
§ Support for SQL
§ Metadata management
§ Distribution requirement
§ Ease of monitoring and
administration
§ Proprietary or open source
§ Vendor support
26.Which
are the most popular relational databases?
The following companies
provide the leading relational database products:
§ Oracle
§ IBM
§ Microsoft
§ Teradata
§ Sybase (SAP)
There are three leading
open-source products:
§ MySQL™
§ PostgreSQL®
§ SQLite
27.Should
an open-source database be considered?
There are advantages as
well as disadvantages associated with using an open-source product. Among the
critical items to consider are the ability to support the expected data volumes, reliability, and vendor support. As with a proprietary
database, it should have the ability to meet the technical and business
requirements, which were listed previously.
A key data warehouse design decision
concerns the database type to be selected: relational database or
multidimensional (OLAP) database. The selection of the database type influences
the choice of the data access tool, which can be a simple relational query tool, an OLAP
tool that provides a multidimensional view of the data, or some other type of specialized decision support tool.
Each database type is
characterized by strengths and limitations. Conventional relational databases
support the specialized technical requirements of data warehouses such as data extraction and replication, query optimization, and bit-mapped
indexes. However, they provide limited support for data cleanup and transformation
functions. The strengths of multidimensional databases include the benefits
associated with OLAP, as well as fast querying and performance. Their primary
drawback is that they are based on a proprietary database solution.
Conventional relational
databases provide many of the features that characterize multidimensional
databases. In many cases, either type of database can be used. However, for
specialized or complex analysis requirements, multidimensional databases are
often preferred.
29.Which data warehouse limitations became data mart drivers?
Data warehouses have provided many benefits that organizations have
leveraged to their competitive advantage. Enterprises have been forced to seek
alternative solutions for their information needs, however, because of some of
the following data warehouse limitations:
§ Lag between need and
implementation—it used to take a very long time, sometimes years, from the time
a data warehouse was requested till the time it was actually rolled out
to the users.
§ Huge size and scope of data warehouse projects
encompassing the entire organization—such projects were very expensive, costing
millions. They also spanned a very lengthy period and, in some cases, took
years to implement.
§ For enterprise
implementations, which spanned disparate computer systems and numerous
departments, complexity and integration proved to be overwhelming for most IT
departments.
§ Conflicting requirements,
priorities, and the schedules of different departments and business units
derailed many projects.
§ Politics ruled.
§ Cost overruns and delays
were common.
All these problems became
the drivers that have led to the development and widespread growth of data marts in the past couple
of decades.
The following
characteristics define data marts:
§ Do not have to be based on
the enterprise data model
§ Are typically limited to
one or a few subject areas (such as sales and finance)
§ Each contains only a small part of an organization's data; an enterprise can have many
§ Can be implemented as a
small project and within months
§ Are easy to design, build,
and test; are less expensive to implement and maintain
§ Have far fewer users than
a data warehouse
§ Can have one or multiple
sources—a data warehouse, another data mart, or an OLTP database
§ Have databases that are far
smaller—typically only a few gigabytes
§ Require simpler hardware
and supporting technical infrastructure
§ Can be implemented by
staff with less experience and fewer technical skills
§ Are typically built by
different project teams without a common design, process, tools, hardware, or
software
§ Can be built
independently, in a staggered manner, as needs evolve
§ Can make possible future
integration difficult to achieve because of independent construction
31.What
are the characteristic differences between data marts and data warehouses?
The following is a list of
important data mart characteristics and how they compare with those of a
conventional data warehouse:
§ Data marts aim to meet a department's needs, whereas a data warehouse is designed for
the enterprise.
§ IT staff with limited or
no data warehousing skills can implement a data mart, whereas highly skilled IT professionals are required to
build an EDW.
§ A data warehouse spans the entire
organization and covers most subjects, whereas a data mart covers only one or a
few subject areas.
§ A data warehouse is designed
using an enterprise data model and is constructed by a large central specialized
professional team; data marts are constructed by different, small, project teams.
§ In contrast to data warehouses, data marts need not be based on
the enterprise data model structure.
§ Data warehouses are integrated; data marts are not.
§ Data marts can be linked together, if required, but that task can be
challenging.
§ Although a data warehouse is typically
unique, an organization can have many data marts, built by different teams, without a common design.
§ The result of independent
construction is that data mart integration, if required at a later stage, is difficult to
achieve, whereas data warehouses are integrated from the beginning.
§ Data marts are easy to design and build, which makes them less
expensive to implement and maintain; designing data warehouses is very complex
and time consuming.
§ A data mart, although it can
store gigabytes of data, usually contains only a
small segment of an organization's data; most of the enterprise data is stored in an EDW.
§ A data warehouse needs to be
planned and implemented as a huge project, which is not the case with data marts.
§ Uniformity exists within a data warehouse project;
however, it is missing when data marts sprout because various implementation groups might use
different processes, tools, hardware, and software.
Data marts are characterized by many advantages and benefits, which are
listed below:
§ Simpler
§ Less risky
§ More focused
§ Can be implemented in
months instead of years
§ Can be implemented in
phases
§ Scalable
§ Flexible
§ Can obtain easy and fast
approval
§ Not dependent on IT budget
§ More affordable; budget
can be a few hundred thousand dollars instead of millions (required for an EDW)
§ Use low-cost hardware and
software
§ Avoid political conflicts
associated with data warehouse projects
§ Require fewer and less
sophisticated resources
§ Can be linked to other data marts or data warehouses
§ Can improve performance by
storing the data closer to the users
33.What
are the disadvantages of data marts?
Although many benefits are
associated with implementing data marts, there are also some disadvantages:
§ Data marts cannot support a key requirement of an EDW—the ability to
analyze enterprise-wide data across business units.
§ Their development can be
uncoordinated, which creates a hurdle when data marts are used as the building blocks for creating an EDW.
§ Their design is not as
thorough as that of a data warehouse and, consequently, consideration for an ultimate upgrade
to an EDW is inadequate or lacking.
§ Every data mart has its own narrow
view.
§ The age-old problem with
multiple legacy systems can also afflict data marts—a query might yield different answers depending on which
system was accessed, when it was accessed, and how the query was structured and
executed.
§ Redundant and inconsistent data can make "one version
of the truth" a casualty.
§ Growth of data marts creates more
redundant and inconsistent data, which has a cost
associated with it and poses problems during upgrades.
§ Design flaws and the
number of data extracts can restrict scalability; data marts might be unable to
support massive data volumes associated with data warehouses.
§ Clandestine development
and operations might be encouraged.
§ More work is required in
reconciling terms, definitions, data, and business rules when data is migrated to an EDW.
§ They are designed and
built by less experienced personnel, which can affect the quality of the
product.
§ Multiple databases are
required to be maintained, which can be inefficient and could require greater
breadth of technical skills.
§ The extraction process can
be different for each data mart.
§ Activities such as
extraction and processing are not centralized; activities can be duplicated and
additional staff can be required for maintenance and support.
§ Tools, software, hardware,
and processes can be different for each data mart.
§ Knowledge gained by one data mart group may not be
shared with other groups.
§ They can be expensive in
the long run.
However, despite all these
potential issues, when a cost/benefit analysis is performed, the benefits to
the business far outweigh the negatives, which is the reason data marts are so popular and
widely implemented.
34.How
are data marts loaded?
There are two primary ways
in which data is usually loaded into a data mart:
§ Data is fed from an EDW to the data mart(s); any changes to the EDW are propagated to all associated data marts receiving its feeds.
§ Data is fed to the data mart(s) by direct extract(s) from the source system(s).
35.Which
platforms and tools can be used by data marts?
Data warehouses and data marts have been implemented on many platforms including Unix and
Windows. Implementation of such projects involves many technologies, tools,
hardware, and software, which range from the very simple to the very
sophisticated. Although some vendors dominate in specific areas, such as ETL or
databases, and some have offerings across the range of products required for
implementing a data mart or data warehouse, no vendor dominates in all areas. The data mart vendors, with some
exceptions, are the same as the data warehouse vendors.
36.What are the common data warehouse/data mart implementation approaches?
There are two widely used
approaches for building data marts:
§ Top-down
o Build an EDW and then
construct dependent data marts, which are its highly summarized subsets.
§ Bottom-up
o Build independent data marts, whose foundation is
the enterprise data model, which can then be used to construct an EDW.
The other approaches used
to build data warehouses include the hybrid, the federated, and the
hub-and-spoke method.
37.What is
the top-down approach?
In this approach
recommended by Bill Inmon (Figure 6), the data warehouse is planned as the corporate central repository. In this
architecture, a normalized data model is used, with the data being stored at the lowest level of detail or granularity. After
the enterprise data warehouse is developed, dependent data marts are constructed and
are fed by the data warehouse. Whereas a data warehouse is difficult to implement, data marts are relatively
simple to construct because they are developed for specific departments or
business processes.
38.What
are the characteristics of the top-down approach?
The characteristics
defining the top-down approach are as follows:
§ Is methodology-based;
addresses important aspects such as modeling and implementation
§ Provides an
enterprise-wide view of the organization
§ Has a well-planned
architecture
§ Is implemented based on
well-documented requirements
§ Avoids integration issues
that characterize data warehouses derived from data marts
§ Permits better control and
quality, because the EDW drives the construction of dependent data marts
§ Unable to respond to
business needs in time, long delivery cycle, high cost, project delays, cost
overruns, and other issues associated with large projects
39.What
are the drawbacks of the top-down approach?
The primary disadvantage
of the top-down approach is that it requires a major project for
implementation, because of the massive scope of work that is involved.
Consequently, it has a high cost and risk associated with it. The up-front cost
for implementing a data warehouse using the top-down methodology is significant and the
time involved, from the start of project to the time that end-users experience
initial benefits, can be significant. Also, during implementation, this
methodology can be inflexible and unresponsive to changing requirements.
40.What
is the bottom-up approach?
In the bottom-up approach,
independent data marts, whose foundations are the enterprise data model, are built first.
These data marts, usually designed for specific business processes or
departments, can be combined subsequently to construct the enterprise data warehouse as shown in Figure 7. Their integration is
accomplished through what is known as a "data warehouse bus architecture," which involves joining the data marts together by
conforming the dimensions. When the same dimension tables are shared by different dimensional schemas, they are known as conformed dimensions. They enable
drill-down from one schema to another and join measures from different star
schemas. Conformed dimensions can be used across any business area, because
they allow queries to be executed across star schemas. They can be used to
analyze facts from two or more data marts.
This approach is
recommended by Ralph Kimball, the well-known data warehouse expert.
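The following illustrative query sketch shows how a conformed date dimension supports a drill-across: each star schema is aggregated separately, and the results are joined on the shared dimension attribute. The fact and dimension names (fact_sales, fact_shipments, dim_date) are hypothetical.

# Sketch of a drill-across query over two star schemas sharing a conformed date dimension.
drill_across_sql = """
SELECT s.calendar_month, s.sales_usd, h.units_shipped
FROM (SELECT d.calendar_month, SUM(f.sales_usd) AS sales_usd
      FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
      GROUP BY d.calendar_month) AS s
JOIN (SELECT d.calendar_month, SUM(f.units_shipped) AS units_shipped
      FROM fact_shipments f JOIN dim_date d ON f.date_key = d.date_key
      GROUP BY d.calendar_month) AS h
  ON h.calendar_month = s.calendar_month;
"""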
41.What
are the characteristics of the bottom-up approach?
The characteristics
defining the bottom-up approach are as follows:
§ This approach can meet
some enterprise-wide needs, because data marts can be combined seamlessly for reporting and analysis.
§ Seamless and transparent
integration, although possible, is technologically challenging and performance
can be poor, especially if many data marts need to be integrated.
§ The bottom-up approach can
lead to the sprouting of data marts and data redundancy.
§ Lack of adherence to
standards can cause major issues, especially integration problems, when an EDW
is constructed from multiple data marts.
§ Fast implementation
provides less time for analysis.
§ Implementation is not
difficult.
§ Implementation can be
accomplished in phases.
§ Implementation risk is
lower.
42.What
is the hybrid approach?
Both the top-down and
bottom-up approaches have their advantages and disadvantages, leading to the
hybrid approach, which tries to find a middle ground. It provides faster implementation, just as the bottom-up approach does. However, it also leverages the data and design consistency, as
well as the integration benefit, of the top-down approach.
In the hybrid approach,
the requirements are defined at the enterprise level, followed by definition of
the data warehouse architecture. Next, the data content is standardized.
Finally, a number of data marts are implemented in sequence. The enterprise data model is implemented in an
iterative manner, with the heavy duty infrastructure being implemented at a
later stage, only when the need for enterprise reports becomes a requirement.
43.What
is the federated approach?
The federated approach
tries to leverage existing deployed systems by integrating them when required
because of changing business needs or conditions, such as mergers and
acquisitions, cross-functional requirements, organizational changes, and other
events. These systems can include decision support structures, OLTP systems, data warehouses, data marts, and applications. The benefit is that the existing
infrastructure does not need to be discarded and the organization can be more
responsive to changing business needs.
44.What
is the independent approach?
In this approach, data marts are built randomly,
without any enterprise planning or consideration of a common data model. This approach can
lead to anarchy over a period of time.
All the approaches are
characterized by various pros and cons. Each approach meets the needs of many
different types of organizations and users. The decision to select a particular
approach is influenced by many factors, such as the relative importance of the
various selection variables for the organization considering an implementation.
The selection variables
can be diverse and can include project objectives, budget, timeline,
availability of funding and resources, complexity of the environment, new or upgraded
implementation, scalability, state of existing infrastructure, and so forth.
For example, if the objective is to have something up and running quickly, or
the budget is limited, the obvious choice is the data mart bottom-up approach.
However, if strategic concerns and long-term considerations are driving the
requirements, and the company is prepared to spend millions for a well-designed
system, the top-down approach for implementing an EDW will be the appropriate
choice.
46.What is the Inmon
approach?
Bill Inmon is a data warehousing pioneer who provided the widely accepted definition of a data warehouse, which is based
on four characteristics: subject-oriented, integrated, nonvolatile, and time-variant. His basic approach is to plan for an EDW using the top-down approach. In this architecture, the data warehouse can be used to feed any associated data marts. The Inmon methodology, as well as the Kimball methodology,
has been widely used for implementing data warehouses. There are several other characteristics that define
the Inmon approach:
§ "Big-bang"
implementation
§ Use of well-known database
development methodologies and tools
§ Data warehouse—part of the Corporate Information Factory (CIF), the
broader corporate information environment
§ More dependent on IT
professionals, which provides better technical solutions
§ Secondary role of business
users
47.What
is the Kimball approach?
Ralph Kimball is one of
the pioneers of data warehousing. He is credited with dimensional modeling or
the Kimball methodology, which is one of the most widely used for implementing data warehouses. According to
this approach, a data warehouse is the combination of various data marts, all of which are
stored in the dimensional model. The Kimball approach includes other defining
characteristics as well:
§ Conventional database
development methodology not followed
§ Data mart built for each process
§ Multiple data marts to meet all the
needs and to create a data warehouse
§ Data marts connected via data bus
48.What
are the similarities between the Kimball and Inmon approaches?
The following are the
defining similarities between the two approaches:
§ Time-stamped data
o Kimball - Date dimension
o Inmon - Time element
§ ETL process
o Inmon - Data loaded into the data warehouse
o Kimball - Data loaded into the data marts
§ Business requirements
gathered first
§ Users can query by time
period
49.What
are the differences between the Kimball and Inmon approaches?
The following are the
defining differences between the two approaches:
§ Complexity
o Inmon - Complex
o Kimball - Simple
§ Analysis
o Inmon - Extensive analysis
performed
o Kimball - Limited analysis
performed
§ Primary focus
o Inmon - IT professionals
because of nature of methodology, architecture, and development approach
(spiral); leads to IT ownership
o Kimball - Business users
because of simpler development methodology
§ Development methodologies
o Inmon - Top-down
o Kimball - Bottom-up
§ Implementation
o Inmon - Big-bang
o Kimball - Incremental
§ Data-modeling approach
o Inmon - Subject-oriented
or data-driven
o Kimball -
Process-oriented; users can participate actively
§ Tools
o Inmon - ERDs
o Kimball - Dimensional
modeling
§ Data warehouse architecture
o Inmon - EDW serves the
entire enterprise and feeds departmental data marts/databases
o Kimball - Single-process data marts connected via a data bus and conformed
dimensions
Each of the defining
characteristics can favor either Inmon or Kimball, depending on the
requirements and needs of the organization considering the data warehouse/data mart implementation. Therefore, although the answer is
enterprise/project specific, some generalizations can be made regarding the
characteristics that favor Kimball.
The Kimball methodology is
favored when the needs are tactical rather than strategic, implementation is
urgently required, budget and scope are limited, resources are limited and lack
top-notch technical skills, integration across the enterprise is not required,
and focus is on the analysis of business metrics.
51.Which
characteristics favor Inmon?
The Inmon methodology is
favored when the needs are strategic rather than tactical, implementation can
be performed over a very long period, budget is very large and scope is enterprise-wide, resource constraints are minimal and top-notch technical professionals are available, integration across the enterprise is required, and focus is on nonmetric data as well as data that can be used for
varied needs across the enterprise.
52.Which
approach is better?
Both the Kimball and Inmon
approaches have been widely implemented over the past couple of decades. They
have met the varied and specific needs of enterprises as well as departments.
The approach to be selected should meet the users' needs and must not be based
on a dogmatic belief. Depending on the specific requirements, either approach
could be appropriate. For example, if the time frame is short and budget is
limited, Kimball's approach is the obvious way to go. On the other hand, if the
requirement is an EDW, covering the needs of the enterprise spread over many
regions, and the organization is flush with cash, then Inmon's approach should
be selected.
The following are some
items that should be included in the list of variables to be considered when
selecting the approach:
§ Scope (EDW or
departmental)
§ Implementation schedule
§ Budget
§ Environment/infrastructure
§ Technology requirements
§ Availability of technical
expertise
§ Culture
§ Operating cost
§ Data availability and quality
§ Expected growth
54.What is a dimension?
Wikipedia defines a
dimension as "a data element that categorizes each item in a data set into non-overlapping
regions" (http://en.wikipedia.org/wiki/Dimension_(data_warehouse)).
A dimension can also be viewed as a structure, often composed of hierarchies,
that is used to analyze business measures or metrics, which are also known as
"facts." A dimension, examples of which are customer, region, and
time, enables data warehouse data to be easily sliced and diced.
55.What
are the characteristics of a dimension?
A dimension represents an
attribute such as product, region, sales channel, or time. All data warehouses have one common
dimension—time. A spreadsheet is the simplest example of a two-dimensional
model. The spreadsheet row and column names are the "dimensions" and
the numeric data in it are the "facts." A time dimension can include all
months, quarters, and years, whereas a geography dimension can include all
countries, regions, and cities. A dimension acts as an index for identifying
values in a multidimensional array. If the number of dimensions used is increased, the level of detail
that can be queried becomes greater.
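A small pandas sketch may make the spreadsheet analogy concrete: region and quarter act as the dimensions (the row and column labels), and the sales figures are the facts being sliced and diced. The data and column names are invented for illustration.

# Dimensions versus facts: a two-dimensional view of hypothetical sales data.
import pandas as pd

sales = pd.DataFrame({
    "quarter":   ["Q1", "Q1", "Q2", "Q2"],
    "region":    ["West", "East", "West", "East"],
    "sales_usd": [100, 150, 120, 130],
})

# Rows and columns are dimension values; the cells are the facts
cube = sales.pivot_table(index="region", columns="quarter",
                         values="sales_usd", aggfunc="sum")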
56.What is a slowly changing dimension?
A dimension that changes over time is called a slowly changing dimension. For example, the price of a product might change, a region could be renamed, or the address of a customer might change. Such changes are implemented in different ways. For example, the old record can simply be overwritten with the new values (commonly called a Type 1 change), which causes the history to be lost. The two other common methods for capturing attribute changes are adding a new record to the dimension table with the new value (Type 2) and adding a new field to the dimension table in which the old value is retained (Type 3).
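The following sketch illustrates a Type 2 change: the current dimension record is closed out and a new record is added, so history is preserved rather than overwritten. The field names, key values, and dates are hypothetical.

# Hedged sketch of a Type 2 slowly changing dimension update on a customer address.
from datetime import date

dim_customer = [
    {"customer_id": 42, "address": "12 Oak St", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_type2_change(dim_rows, customer_id, new_address, change_date):
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            # Close out the existing record (history is retained, not lost)
            row["valid_to"] = change_date
            row["is_current"] = False
    # Add a new record carrying the new attribute value
    dim_rows.append({"customer_id": customer_id, "address": new_address,
                     "valid_from": change_date, "valid_to": None, "is_current": True})

apply_type2_change(dim_customer, 42, "98 Elm Ave", date(2024, 6, 1))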