Module 4. Database systems

Lesson 8

INTRODUCTION TO DATABASES

8.1  Introduction

Students will learn the concept of a database, the objectives and characteristics of databases, basic terminology, the Database Management System (DBMS), and the components and advantages of a DBMS. These topics will help students build up an understanding of databases and DBMS.

Since the beginning of civilization, data has been collected and processed for drawing conclusions and taking appropriate actions. With the invention of computers in the 1950s, data were collected and organized in computer files and processed to generate information and reports. This was generally faster and more accurate than manual procedures. Groups of data and records were stored in separate files based on their applications and usage. Such systems are known as file processing systems. In a file processing system, each department in an organization traditionally designed and maintained its own data files for specific applications. For example, in a dairy plant, the personnel section maintains details of all employees working in the dairy for purposes such as performance evaluation, timely increments, and training. Similarly, the payroll section, responsible for preparing the pay bills of all employees, maintains employee details separately from the personnel section, and different product manufacturing sections may keep details of the employees working in their respective sections for their own purposes.

It can be observed that this kind of arrangement of data in separate files leads to duplication of data and storage space. The details of employees have to be entered many times for different sections/departments. In this process, new errors may be introduced during entry or updating of data in the different files. In addition to the time consumed by multiple data entry and the extra storage space, when data is allowed to be duplicated, errors can occur if one instance of the data is altered while another instance remains the same. This leads to data inconsistency. Where duplication exists, more maintenance and system resources are required to ensure data integrity. For example, a change in the address or status of an employee recorded by the personnel section may not be reflected simultaneously in the similar files maintained by other sections. File formats may also be different and incompatible, depending on the programming language used to create, read, update and process the data in these files. Therefore, even though the data is available in computers, it is difficult to access from the different files because of incompatible formats. The file processing system is the simplest approach to processing data using computers, and it is easy to manage data for single or personal use.
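The inconsistency problem described above can be sketched in a few lines of Python; the employee record, the two "files" and all values here are purely hypothetical:

```python
# Two departments keep their own copy of the same employee's details,
# modelled here as two separate in-memory "files" (dictionaries).
personnel_file = {"E101": {"name": "A. Kumar", "address": "Sector 5"}}
payroll_file = {"E101": {"name": "A. Kumar", "address": "Sector 5"}}

# The personnel section updates the address, but the payroll section's
# copy is not updated at the same time.
personnel_file["E101"]["address"] = "Sector 9"

# The two copies now disagree about the same employee: data inconsistency.
inconsistent = (personnel_file["E101"]["address"]
                != payroll_file["E101"]["address"])
print(inconsistent)
```

A database avoids this by holding the employee's details in one place, so a single update is seen by every application.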

In view of the above-mentioned problems, and with advances in computer hardware, storage media, and software technologies, researchers developed a new and more efficient approach to organizing data than the file processing system. A new concept of data management was introduced in the form of databases. The term database was first used in 1963 by the System Development Corporation at a symposium, and it became popular in the USA and Europe by the late 1960s. In simple terms, a database is an organized collection of data. According to Charles Bachman, a researcher at General Electric, Arizona, and a pioneer in the field of databases, the aim of developing database systems was to make more effective use of the new direct access storage devices. Before that, data processing was performed using punched cards and magnetic tapes, i.e. serial processing was the dominant activity. Two key data models arose at that time: the network model, based on Bachman's ideas, and (apparently independently) the hierarchical model, used in a system developed by North American Rockwell and later adopted by IBM as the cornerstone of their IMS products. In 1970, E.F. Codd proposed the new concept of the relational data model for data management. He decoupled the schema (logical organization) of a database from the physical storage methods, and criticized the existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms. However, the relational data model remained a topic of academic research for a long time. The first commercial products based on the relational data model (Oracle, Ingres and DB2) appeared during the late 1970s. During the mid-1980s, the first successful database product for microcomputers was launched: dBASE, for the CP/M and PC-DOS/MS-DOS operating systems. After that, a number of software packages based on the relational model appeared in the market for personal computers, such as dBASE IV, FoxPro, Visual FoxPro, MS-Access, SQL Server and MySQL.

In the 1990s, attention shifted to object-oriented databases. These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as spatial databases, engineering data, and multimedia data. Some of these ideas were adopted by the relational vendors, who integrated the new features into their products. As a result, in the 2000s the fascinating area for innovation became the XML database, and it continues to be so. An XML database aims to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not.

Huge (terabyte-scale) systems are now appearing and will require novel means of handling and analyzing large-scale databases, such as genome project, geological, national security, and space exploration data. Clickstream analysis (the process of collecting, analyzing, and reporting aggregate data, for example the order in which visitors viewed the pages of a web site) is already used for traffic analysis and e-commerce analysis. Data mining, data warehousing, and data marts are a few commonly used techniques today. Mobile databases are now coming to market in various forms, and distributed transaction processing is becoming the norm for business planning in many arenas.

8.2  Database

The database approach to managing data overcomes many of the shortcomings of the traditional file processing environment. One of the key features of a database system is that data is stored as a single logical unit called a database. The data may be spread across multiple physical files, but overall it forms a single large repository of data. Organizing data in a single logical repository allows for easy manipulation and querying of the data, in contrast to a traditional file system.

A database may be designed for batch processing, real-time processing, or in-line processing (in which single transactions are processed to completion one at a time but without the tight time constraints of a real-time system). A database can be of any size and varying complexity.

Several authors have defined the term database in their own ways. More rigorously, a database may be defined as follows:

“A database may be defined as a collection of inter-related data stored together without harmful or unnecessary redundancy to serve multiple applications; the data are stored so that they are independent of programs which use the data; a common and controlled approach is used in adding new data and in modifying and retrieving existing data within the database. The data is structured so as to provide a foundation for future application development. One system is said to contain a collection of databases if they are entirely separate in structure.”

8.3  Objectives of a Database

The objective of a database system is to make application development easier; moreover, the development process should be cheap, fast, and flexible. A database system is a repository of an organization's data, and therefore it should be accurate, private and protected from damage. A database should handle many related data items and files in an integrated form so as to ensure data authenticity, reliability and security, with optimum performance, minimal redundancy, and support for many users processing the data concurrently.

8.4  Characteristics of a Database

Some of the desirable characteristics of a database system are as follows:

·         Ability to represent the inherent structure of the data: The database system should be able to represent the true properties of the data. The implementation procedures should not force the data into structures which do not represent its inherent nature. For example, a system which can only represent tree structures is inadequate.

·         Performance: Database applications designed for use by a terminal operator must give a response time appropriate for the man-terminal dialogue. In addition, the database system must be able to handle an appropriate throughput of transactions. The response time depends on the nature of the man-terminal dialogue, the kind of application, and the physical storage devices. Response time is of no concern for databases designed for batch processing applications.

·         Minimum cost: Techniques are used to reduce the cost of storing data and of programming, and also to minimize the cost of making changes.

·         Minimum redundancy: Redundancy means duplication of data. In a file processing system, a lot of duplication occurs. Even with database techniques, as increasing amounts of information are combined to form integrated databases, a great potential exists for the presence of redundant data. As we have already seen, redundant data takes more storage space and requires more than one updating operation, which may give conflicting results. The objective of database organization is to eliminate redundant data wherever possible (and to control the inconsistency that is caused by redundant data values). Some fields must nevertheless be kept in multiple files so that two or more files can be joined (or linked) using common fields; therefore redundancy cannot be completely removed.

·         Search capability: Users of a database may ask a wide variety of questions about the data stored in it; therefore the database should be able to process anticipated as well as unanticipated queries quickly.

·         Constant growth: One of the most important characteristics of databases is that they will need to change and grow. Easy restructuring of the database must be possible as new data types and new applications are added. The restructuring should be possible without having to rewrite the application programs and in general should cause as little upheaval as possible.

·         Data integrity: Data integrity refers to the correctness, completeness, wholeness and soundness of the data, and its compliance with the standards fixed by the organization and the database designers. It is achieved by preventing accidental, or deliberate but unauthorized, insertion, modification or destruction of data in a database. It is important that the data items and the associations between items are not destroyed. Hardware failures and various types of accident will occur occasionally; the storage of data and the updating and insertion procedures must be such that the system can recover from these circumstances without harm to the data. In addition to protecting data from system problems like hardware failure, integrity checks may also be designed to ensure that data values conform to certain specified rules. Test checks (or constraints) may be expressed in the form of rules that check relationships between data values. Example: the age of a student in a B.Tech. programme must not be less than 17 years. Similarly, the milk product code and name sold at a milk parlour must be from the list of products manufactured in the dairy plant. Constraints are important because they help to ensure data integrity.

·         Privacy and security: Data in database systems must be kept secure and private. It must not be stolen or lost, since it is vital to the organization. It must be protected from hardware or software failures, from catastrophes, and from criminals, the incompetent, and people who would misuse it. Data security refers to the protection of data against accidental or intentional disclosure to unauthorized persons, or unauthorized modification or destruction. Data privacy refers to the rights of individuals and organizations to determine for themselves when, how, and to what extent information about them is to be transmitted to others.

·         Interface with the past: When an organization installs new database software, it is important that it can work with the existing programs and procedures, and that the existing data can be converted. This kind of compatibility can be a major constraint in switching over to a new database system.

·         Interface with the future: This is an important feature, since in the future the data and its storage media will change in many ways. An organization has to move along with the advancements taking place in technology for its survival. Moreover, the needs of an organization also grow over the years due to changes in policies, expansion of the business, etc. Under these circumstances the database should therefore be designed in such a way that these changes can be incorporated into it at minimum cost. This can be achieved to some extent by applying the concepts of logical and physical data independence while designing the database.

·         Simplicity: The overall logical view of the data should be conceived in a simple and neat fashion. In many systems, pointers are used in the logical representation to show relationships between data items. But as more and more relationships are added, the representation becomes more complicated, and it becomes difficult to present the overall logical view of the data with clarity. Multiple pointer links can be highly misleading.

·         Data integration: Data integration provides a unified view of data by extracting or combining data available in different files or sources. This process is becoming more important today, since many databases are being developed for commercial and scientific applications by different organizations, and users need to extract data from these sources for their work. It has become the focus of extensive research, and numerous open problems remain unsolved. Data integration is also referred to as "Enterprise Information Integration" (EII).

·         Data consistency: This summarizes the validity, accuracy, usability and integrity of related data between applications and across the organization. It ensures that each user observes a consistent view of the data, including visible changes made by the user's own transactions and by the transactions of other users or processes. Data consistency problems may arise at any time, but are frequently introduced during or following recovery situations, when backup copies of the data are used in place of the original data, or when multiple copies are created for some other reason.

·         Data sharing: Individual pieces of data in a database may be accessed and used by multiple applications. This feature makes a database different from a file processing system.

·         Data independence: Data at different levels should be independent of each other, so that changes at one level do not affect the other levels. The ability to modify the schema definition at one level without affecting the schema at the next higher level is called data independence. There are two types of data independence, namely logical and physical data independence. Physical data independence is the ability to modify the physical description of the data without causing the logical description or the application programs to be rewritten, i.e. the physical layout and organization of the data may be changed without changing either the overall logical structure of the data or the application programs. Modifications at the physical level are occasionally necessary to improve the performance of the database and to keep abreast of the latest developments in storage media. Logical data independence is the ability to modify the logical schema without causing the application programs to be rewritten, i.e. the overall logical structure of the data may be changed without changing the application programs. Modifications at the logical level are necessary whenever the logical structure of the database is altered to incorporate new developments in the organization, for example when the dairy plant starts production of a new milk product or a new branch is opened. Logical data independence is more difficult to achieve than physical data independence, since application programs depend mainly on the logical structure of the data that they access.

·         Centralized control: There is centralized control of the data in a database. All data is collected in one place and distributed or shared from that place. Thus all controls and access rights can be audited from one central place, which makes the database more reliable, trustworthy and easy to use.
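As an illustration of the integrity constraints discussed above, the age rule from the example can be expressed as a CHECK constraint that the database itself enforces. The sketch below uses Python's built-in sqlite3 module; the table and column names are illustrative:

```python
import sqlite3

# An in-memory SQLite database with an integrity rule on the age column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER CHECK (age >= 17)  -- B.Tech. students must be 17+
    )
""")

conn.execute("INSERT INTO student VALUES (1, 'Asha', 18)")      # accepted

try:
    conn.execute("INSERT INTO student VALUES (2, 'Ravi', 15)")  # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)

count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(count)  # only the valid row was stored
```

Because the rule lives in the database itself, every application that inserts student records is subject to the same check.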

8.5  Basic Terminology

The following are a few commonly used terms in the database environment:

                    I.            Byte: A byte is the smallest individually addressable group of bits, generally eight bits.

                  II.            Data item: A data item is the smallest unit of named data. It may consist of any number of bits or bytes. It is also referred to as a field or data element.

                III.            Data aggregate: A collection of data items within a record which is given a name and referred to as a whole. For example, DATE may be composed of the data items MONTH, DAY and YEAR. It is also known as a group or group item.

                IV.            Persistent data: Data that exists from session to session; it is infrequently accessed and not likely to be modified.

                  V.            Transient data: Data that is created within an application session. At the end of the session, it is discarded or reset to its default value and is not stored in a database.

                VI.            Record: A record is a named collection of data items or data aggregates. For example, a record named MILKRECEIPT may be a collection of related data items such as RECEIPT DATE, SOCIETY NUMBER, ROUTE NUMBER, TIME, TYPE OF MILK, QUANTITY, SNF%, FAT%, etc.

              VII.            Segment: A segment contains one or more data items and is the basic quantum of data which passes to and from the application programs under the control of the database management software.

            VIII.            File: A file is a named collection of all occurrences of a given type of record.

                IX.            Database: A database is a collection of the occurrences of multiple record types, containing the relationships between records, data aggregates, and data items.

                  X.            Database system: A collection of databases. A database system should be a repository of the data needed for an organization's data processing. It should be accurate, private and protected from damage, and organized so that diverse applications with different data requirements can employ the data. Different views of the data must be derivable from a common overall data structure, even though the applications' methods of accessing or searching the data will differ.

                XI.            Data model: A data model is a collection of conceptual tools for describing data, data relationships, data semantics, the operations performed on the data, and data constraints. It describes the idea of a complex "real-world" data structure. A data model has structural, manipulative and data integrity parts; the structural part defines the data types, relationships and constraints that hold on the data.

              XII.            Schema or global logical database description: A schema is a chart of the entire logical database. It is the overall view of the data as seen by the database administrator. It gives names to entities and attributes, and specifies the relationships between them. It is a framework into which the values of data items can be filled. The schema is specified during database design and is not expected to change frequently. A displayed schema is called a schema diagram.

            XIII.            Subschema: This refers to an application programmer's view of the data he/she uses. It is a portion of the schema oriented to the needs of one or more application programs. Many different subschemas can be derived from one schema.

            XIV.            Physical database description: This is concerned with the physical representation, layout and organization of data on storage units. It is concerned with indices, pointers, chains, and other means of physically locating records and with overflow areas and techniques used for inserting and deleting records.
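Some of the terminology above can be sketched in a few lines of Python, using the MILKRECEIPT record from the text: a record is a named collection of data items, and a file is a collection of occurrences of that record type. The field values here are hypothetical.

```python
from collections import namedtuple

# A record type named after the MILKRECEIPT example; each field is a
# data item of the record.
MilkReceipt = namedtuple(
    "MilkReceipt",
    ["receipt_date", "society_number", "route_number",
     "time", "milk_type", "quantity", "snf_pct", "fat_pct"])

# A "file" is a named collection of all occurrences of this record type;
# the list plays that role here.
milk_file = [
    MilkReceipt("2024-01-05", 101, 7, "AM", "Buffalo", 120.5, 9.1, 6.8),
    MilkReceipt("2024-01-05", 102, 7, "AM", "Cow", 80.0, 8.4, 4.2),
]

print(len(milk_file))          # number of record occurrences in the "file"
print(milk_file[0].milk_type)  # accessing one data item of one record
```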

8.6  Entities and Attributes

                    I.            Entity: The item about which information is stored is known as an entity. It is a "thing" in the real world with an independent existence. An entity may be a tangible object with a physical existence, like a car, house, employee or student. It may be intangible, such as an event, a bank account, MS-Windows, or an abstract concept.

                  II.            Entity set: A collection of similar entities that share the same properties, for example the students of the B.Tech. (DT) III Year class.

                III.            Weak entity set: An entity set that does not have a primary key is referred to as a weak entity set. It is represented by a double-outlined box.

                IV.            Strong entity set: An entity set with a primary key is referred to as a strong entity set.

                  V.            Attributes: Properties of an entity (e.g. a student), such as the name, age, class, marks obtained, and address of the student, are referred to as attributes. Attributes may be atomic or composite. Composite attributes can be divided into smaller subparts which represent more basic attributes with independent meanings. For example, Address can be divided further into Street, City, State, Zip, etc. Attributes that are not divisible are called simple or atomic attributes.

                VI.            Flat files: This is an arrangement of the attributes, and the values of those attributes for each instance of an entity, in columns and rows. For example, the attributes of a student (the student being an entity), like name, age, class and marks, can be placed in columns, and the values of the attributes for each student are placed row-wise. The related set of values of all attributes in one row, for one instance of an entity, is referred to as a tuple.

              VII.            Primary key: An attribute which uniquely identifies a record from a given set of values of an entity is referred to as the primary key. For example, a student can be uniquely identified by roll number within a class. Some important properties required of a key field are:

·         Unique identification: The value of the key should be able to identify a record in a relation uniquely.

·         Non redundancy: No attribute in the key can be discarded without destroying the property of unique identification.

·         Numeric or text: the key value may be numeric or textual.

·         Meaningless: ideally the key value carries no meaning of its own, so that it need not change when other facts about the entity change.

            VIII.            Candidate key: In a relation there may be more than one attribute possessing the unique identification property. All such keys are suitable candidates for the primary key and are therefore called candidate keys. For example, in the relation SOCIETY, societies can also be identified by Society-Name instead of Society#.

                IX.            Foreign key: An attribute in a relation is called a foreign key if it is the primary key of some other relation. For example, the attribute Supervisor-ID is an ordinary attribute in the relation SOCIETY but is the primary key of the relation EMPLOYEE. Thus primary and foreign keys provide a means of representing relationships between tuples.

                  X.            Alternate key: Out of all the candidate keys, only one can be chosen as the primary key; the remaining candidate keys are alternatives to the primary key and are hence called alternate keys.

                XI.            Composite key: Sometimes a single attribute cannot uniquely identify a record in a relation, so a combination of two or more attributes is used to identify records uniquely. Such a combination is called a composite key. For example, in the MILKRECEIPT relation, the three attributes Receipt-Date + Society# + Milk-Type in combination are used to identify a record uniquely, instead of a single attribute.

              XII.            Secondary key: An attribute which identifies a set of records that share a certain property is referred to as a secondary key, for example the name of a student.

            XIII.            Entity relationship diagram: This diagram shows the relationships between the different entities and their attributes in a database in a graphical way.
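The key concepts above can be sketched with Python's sqlite3 module, using the SOCIETY and MILKRECEIPT relations from the examples (the exact column names and values are illustrative): Society# is a primary key, Society-Name a candidate (alternate) key, and Receipt-Date + Society# + Milk-Type a composite primary key that also contains a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign keys in SQLite

# Society# is the primary key of SOCIETY; Society-Name is a candidate
# (alternate) key, so it is declared UNIQUE.
conn.execute("""
    CREATE TABLE society (
        society_no   INTEGER PRIMARY KEY,
        society_name TEXT UNIQUE
    )
""")

# Receipt-Date + Society# + Milk-Type together form a composite primary
# key; society_no is also a foreign key referencing SOCIETY.
conn.execute("""
    CREATE TABLE milkreceipt (
        receipt_date TEXT,
        society_no   INTEGER REFERENCES society(society_no),
        milk_type    TEXT,
        quantity     REAL,
        PRIMARY KEY (receipt_date, society_no, milk_type)
    )
""")

conn.execute("INSERT INTO society VALUES (101, 'Anand Co-op')")
conn.execute("INSERT INTO milkreceipt VALUES ('2024-01-05', 101, 'Cow', 80.0)")

# A second record with the same composite key value is rejected.
try:
    conn.execute(
        "INSERT INTO milkreceipt VALUES ('2024-01-05', 101, 'Cow', 95.0)")
except sqlite3.IntegrityError:
    print("duplicate composite key rejected")

rows = conn.execute("SELECT COUNT(*) FROM milkreceipt").fetchone()[0]
print(rows)
```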

8.7  Database Management System

A database management system (DBMS) consists of a collection of computer programs for creating, organizing, storing, updating and accessing a database, and for processing queries to produce the desired results. The primary goal of a DBMS is to provide an environment for storing and retrieving information. The DBMS serves as an interface between the users and the database. A DBMS is a system software package that helps to integrate collections of data records and files, known as databases, and allows different user application programs to easily access the same database. It provides facilities for controlling data access, enforcing data integrity, managing concurrency, and restoring the database from backups. These systems also provide a series of privileges and rights to particular users. A DBMS can therefore be described as a system which controls the various functions of file management and also grants administrative rights to users.

A DBMS also provides the ability to present database information logically to users, along with features for application development. There are different types of DBMS, ranging from small systems that run on PCs to large mainframe systems. A few important DBMS used in industry are Oracle, Microsoft Access, SQL Server, MySQL, Ingres, Sybase, dBASE, FoxPro, FileMaker and DB2. In simple terms, a DBMS is a system in which related data is stored in an efficient and compact manner. "Efficient" means that the stored data can be accessed quickly, and "compact" means that the data takes up very little space in the computer's memory. The phrase "related data" means that the stored data pertains to a particular topic.

8.8  Components of DBMS

·         DBMS engine: It accepts logical requests from various other DBMS subsystems / components, converts them into physical equivalents, and actually accesses the database and data dictionary as they exist on a storage device.

·         Data definition language (DDL): It helps the user to create and maintain the data dictionary and define the structure of the files in a database. A database schema is specified by a set of definitions expressed in a special language called DDL.

·         Data manipulation language (DML): Used to add, change, and delete information in a database and to query it for valuable information. The language enables a user to access or manipulate the data as organized by the appropriate data model.

·         DML compiler: It translates DML statements in a query language into low-level instructions that the query evaluation engine can understand.

·         Data manager: It is responsible for converting user queries into operations on the physical file system. It maintains data integrity, consistency and security.

·         File manager: It takes care of file space and file structure, and requests blocks of data from the disk. It supports creating new files, storing a record in a file, retrieving a record from a file, deleting a record from a file, etc.

·         Disk manager: It transfers the blocks requested by user commands to the data manager, and transfers data between main memory and disk.

·         Query processor: This is used to interpret any online query and convert it into a series of efficient operations that are carried out one by one by the data manager.

·         Telecommunication system: This is generally a remote system or an online input device used by the user to send and receive messages and data.

·         Data files: These store the data portion of the database. Data requested by the user is transferred from the data files to main memory.

·         Data dictionary: This is data about data, known as metadata. It contains information about entities and attributes, and their limits, checks and validations. The data dictionary also stores descriptions of the database users and their responsibilities.

·         Report generator: Extracts information from one or more files and presents the information in a specified format. Most report generators (or writers) allow the user to select records that meet certain conditions and to display selected fields in rows and columns. Data can also be formatted into pie charts, bar charts, and other diagrams.
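The division of labour between DDL and DML described above can be sketched with Python's sqlite3 module; the EMPLOYEE table and its columns are illustrative. DDL statements define the structure, while DML statements work on the data itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: defines the structure (schema) of the database.
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        section TEXT
    )
""")

# DML: adds, changes, deletes and queries the data itself.
conn.execute("INSERT INTO employee VALUES (1, 'Meena', 'Personnel')")
conn.execute("INSERT INTO employee VALUES (2, 'Rahul', 'Payroll')")
conn.execute("UPDATE employee SET section = 'Personnel' WHERE emp_id = 2")
conn.execute("DELETE FROM employee WHERE emp_id = 1")

names = [row[0] for row in conn.execute("SELECT name FROM employee")]
print(names)
```

In a full DBMS these statements would pass through the DML compiler and query processor before reaching the data, file and disk managers.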

8.9  Advantages of DBMS

·         Systems integration: In a DBMS, all files are integrated into one system, thus reducing redundancy and making data management more efficient. In addition, the DBMS provides centralized control of the operational data.

·         Controlled redundancy: Redundancy is controlled, i.e. minimized.

·         Easy data retrieval: Retrieval of data is very easy and efficient. "Efficient" means that the stored data can be accessed quickly.

·         Compactness of data: Data stored in DBMS takes very little space in the computer's memory.

·         Restricting unauthorized access: A DBMS provides a security and authorization subsystem for creating user accounts and specifying restrictions on them before the data is accessed. When multiple users share a database, it is likely that some users will not be authorized to access all the information in the database. For example, financial data is often considered confidential, and hence only authorized persons are allowed to access such data.

·         Persistent storage: It provides persistent storage for program objects and data structures.

·         Multiple user interfaces: It provides multiple user interfaces for users with varying levels of technical knowledge.

·         Enforcing integrity constraints: The DBMS enforces certain integrity constraints that hold on data.

·         Faster development of applications: It provides features such as fourth-generation languages (4GLs) for the development of new applications.

·         Representing complex relationships: It represents complex relationships among data in an easy and efficient manner for retrieval and updating.

·         Providing backup and recovery: A DBMS must provide facilities for recovering from hardware or software failures. The backup and recovery subsystem of the DBMS is responsible for recovery.