Module 1. Descriptive statistics
Lesson 2
CLASSIFICATION OF DATA AND FREQUENCY DISTRIBUTION
2.1 Introduction
As discussed in Lesson 1, statistics are a set of numerical statements and facts collected from any field of enquiry for drawing valid inferences. Data collection is, in fact, the most important aspect of a research experiment or statistical survey. After data have been collected, the next step is to present them in an orderly and logical form so that their essential features become explicit. Proper presentation is needed because collected data in raw form are often so voluminous that they cannot be easily comprehended or analysed. Therefore, after collection, the data must be classified and presented so as to bring out their points of similarity and dissimilarity.
2.2 Collection of Data
To study any problem by means of statistical methods, the relevant data are collected first. Sometimes the data are to be collected from a research experiment or from primary sampling units (e.g. households). Sometimes the relevant data already exist in published or unpublished form, having been collected by a private body, a Government agency or a research organization for its own use or for supplying public information. In making use of such data (called secondary data), one has to be particularly careful about the definitions of terms and concepts used by the collecting authority, and also about the method of collection and the reliability of the data. More often, one has to collect data directly from the field of enquiry; the data are then said to be of the primary type. Primary data may be collected by interviewing a number of persons and filling in questionnaires relevant to the problem, e.g. in a family income and expenditure survey, one will generally interview the head of each family. The data collected should be carefully scrutinized before they are subjected to statistical treatment.
2.3 Classification of Data
Classification is the process of arranging the data into different groups or classes according to some common characteristics. According to Connor “Classification may be defined as the process of arranging things in groups or classes according to their resemblances and affinities”. The functions of classification may be summarized as follows:
· It condenses the data.
· It facilitates comparisons.
· It helps to study the relationships.
· It facilitates the statistical treatment of data.
The classification of data is generally done on geographical, chronological, qualitative or quantitative basis on the following lines:
a) In geographical classification, data are arranged according to places, areas or regions.
b) In chronological classification, data are arranged according to time i.e. weekly, monthly, quarterly, half-yearly, annually, etc.
c) In qualitative classification the data are arranged according to attributes like sex, marital status, educational standard, region, farm, breed, disease etc.
d) Quantitative classification means arranging data according to a characteristic that has been measured, e.g. height, weight, milk yield or fat content of a dairy product. In this type of classification, certain classes are formed and each unit is assigned to the class to which it belongs. The quantitative phenomenon under study is known as a variable, and hence this classification is also sometimes called classification by variables.
Variable: The quantitative phenomenon under study, such as wages, barometer readings, rainfall records, heights, weights, milk yield, fat, SNF, age at first calving or first lactation production, is termed a variable or a variate. In other words, a quantity which can vary from one individual to another is called a variable. Variables are of two kinds:
a) Continuous variable: Quantities which can take any numerical value within a certain range are called continuous variables, e.g. the height of a calf at various ages is a continuous variable, since as the calf grows from 200 cm to 300 cm its height assumes all possible values within these limits.
b) Discrete variable: Quantities which are incapable of taking all possible values are called discontinuous or discrete variables e.g. the number of animals in a herd can take only integer values such as 2, 3, 4 etc.
2.4 Frequency Distribution
The frequency distribution is a statistical table which shows the value of a variable in order of magnitude, either individually or in groups, along with the corresponding frequencies side by side. The data pertaining to a quantitative phenomenon can be classified in four ways:
· The set or series of individual observations: ungrouped (raw) or arranged (arrayed) data.
· Discrete or ungrouped frequency distribution.
· Grouped frequency distribution.
· Continuous frequency distribution.
Example 1: The following data pertain to first lactation milk yield (in kg) of 100 Karan Swiss cows
1630 | 1648 | 1663 | 1665 | 1671 | 1677 | 1680 | 1687 | 1690 | 1695
1787 | 1788 | 1790 | 1800 | 1862 | 1855 | 1815 | 1835 | 1845 | 1818
1974 | 1998 | 2000 | 2000 | 2005 | 2031 | 2045 | 2045 | 2050 | 2056
2168 | 2171 | 2180 | 2187 | 2200 | 2218 | 2245 | 2323 | 2372 | 2397
2063 | 2069 | 2085 | 2098 | 2100 | 2100 | 2100 | 2105 | 2117 | 2131
1736 | 1743 | 1760 | 1765 | 1763 | 1767 | 1775 | 1775 | 1776 | 1780
1695 | 1754 | 1698 | 1700 | 1742 | 1732 | 1711 | 1713 | 1718 | 1728
1854 | 1850 | 1855 | 1856 | 1857 | 1860 | 1863 | 1863 | 1875 | 1880
1890 | 1900 | 1910 | 1912 | 1915 | 1918 | 1928 | 1916 | 1915 | 1947
1950 | 1958 | 1951 | 1960 | 1963 | 1968 | 1965 | 1967 | 1970 | 1969
The data given in Example 1 are called raw or ungrouped data; in this form they convey little useful information. Our objective is to express this large mass of data in a suitably condensed form that highlights the significant facts and comparisons and furnishes more useful information, without sacrificing any information of interest about the important characteristics of the distribution.
2.4.1 Array
A better presentation of the above raw data is to arrange them in ascending or descending order of magnitude, which is called arraying the data. Although an array is easier to read than raw data, it does not reduce the volume of the data.
2.4.2 Discrete or ungrouped frequency distribution
A much better way of presenting the data is a discrete or ungrouped frequency distribution, in which we count the number of times each value of the variable occurs in the data. The number of times a variate value is repeated is called the frequency of that value, e.g. if there are seven Karan Fries cows with a first lactation milk yield of 1900 kg, then 7 is the frequency of the first lactation yield of 1900 kg.
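The array and the discrete frequency distribution can be illustrated with a short Python sketch. It is only an illustration, not part of the original lesson: the ten yields are one row of the Example 1 data, and the variable names are chosen here for convenience.

```python
from collections import Counter

# Ten first lactation milk yields (kg) taken from the Example 1 data.
yields = [1890, 1900, 1910, 1912, 1915, 1918, 1928, 1916, 1915, 1947]

# Array: the same observations in ascending order of magnitude.
array = sorted(yields)

# Discrete (ungrouped) frequency distribution:
# each distinct value together with the number of times it occurs.
frequency = Counter(array)

print(array)
for value in sorted(frequency):
    print(value, frequency[value])
```

In this subset only 1915 kg occurs twice, which shows why a discrete frequency distribution condenses very little when most values are distinct.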
2.4.3 Grouped frequency distribution
It is a statistical table which shows the values of the variable in groups, with the corresponding frequencies side by side. Here the data are condensed by dividing the entire range of values of the variable into a suitable number of groups, called classes (or class intervals), and then recording the number of observations in each group. Such a representation of data is called a grouped frequency distribution. The boundary ends of a class are called its class limits, e.g. for the class interval 0 – 10, 0 is the lower limit and 10 is the upper limit. The difference between the upper and lower limits is called the magnitude of the class. The number of observations falling within a particular class is called its frequency or class frequency. The variate value which lies midway between the upper and lower limits is called the mid value or midpoint of that class.
While preparing the frequency distribution the following points must be kept in mind:
1. The class intervals should be uniform, i.e. of equal width. A comparison of different frequency distributions is facilitated if the same class interval is used for all. The class interval should be an integer as far as possible.
2. The class intervals should be so chosen that every observation is covered by the frequency distribution.
3. The class intervals should be continuous; open-end classes such as less than ‘a’ or greater than ‘b’ should be avoided, since they create difficulty in analysis and interpretation.
4. The observations corresponding to the common point between two classes should always be put in the higher class, e.g. an observation with the variate value 30 is to be put in the class 30-40 and not in 20-30.
5. There should be neither too many nor too few classes. The number of classes should never be less than 6 or more than 30. With too few classes accuracy may be lost, and with too many the computations become tedious. The optimum number of classes is generally considered to be 15.
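Point 4 and the definition of the mid value can be sketched in Python as below. This is a minimal illustration under stated assumptions: the helper names class_of and mid_value are hypothetical, and the classes are assumed to be of equal integer width, counted from a chosen starting value.

```python
def class_of(value, start, width):
    """Return the (lower, upper) limits of the class containing value.

    Each class includes its lower limit and excludes its upper limit,
    so a value on a common boundary falls in the higher class.
    """
    index = (value - start) // width
    lower = start + index * width
    return lower, lower + width


def mid_value(lower, upper):
    """Return the value midway between the two class limits."""
    return (lower + upper) / 2


print(class_of(30, 0, 10))   # (30, 40): the boundary value goes to the higher class
print(class_of(29, 0, 10))   # (20, 30)
print(mid_value(0, 10))      # 5.0
```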
2.4.3.1 Number of classes
The following formula due to Sturges may be used to determine the number of classes: k = 1 + 3.322 log10(N), where k is the number of classes and N is the total frequency.
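As a quick check of Sturges' formula, the sketch below evaluates k for a few values of N. The function name sturges_classes is an assumption made for this illustration, and the values of N other than 100 are arbitrary.

```python
import math


def sturges_classes(n):
    """Number of classes suggested by Sturges' formula for total frequency n."""
    return 1 + 3.322 * math.log10(n)


for n in (50, 100, 1000):
    print(n, round(sturges_classes(n), 3))
```

The formula rarely gives a whole number, so in practice the result serves as a guide and is rounded to a convenient number of classes.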
2.4.3.2 Size of class intervals
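The choice of class interval depends on the number of classes for a given distribution and on the size of the data. As far as possible, the class intervals should be of equal size. Prof. Sturges has given the following formula for determining the size of class intervals: i = Range / (1 + 3.322 log10(N)), where i is the size of the class interval, Range is the difference between the largest and smallest observations and N is the total frequency.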
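The calculation of the class size can be sketched as follows, assuming the size is taken as the range divided by Sturges' number of classes. The function name class_width is hypothetical; the figures used (100 observations, largest value 2397, smallest value 1630) are those of the milk yield data in Example 1.

```python
import math


def class_width(largest, smallest, n):
    """Range of the data divided by Sturges' number of classes."""
    k = 1 + 3.322 * math.log10(n)
    return (largest - smallest) / k


width = class_width(2397, 1630, 100)
print(round(width, 2))   # 100.34
```

A value such as this is normally rounded to a convenient whole number for the actual class intervals.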
Example 2: For the data given in Example 1, find the size of the class interval and prepare the frequency distribution.
Solution: Here N = 100, largest value = 2397, smallest value = 1630 and Range = 2397 - 1630 = 767.
Number of classes k = 1 + 3.322 log10(100) = 7.644
Size of class interval = Range / k = 767 / 7.644 = 100.34, which is approximately 100.
Taking class intervals as 1630-1730, 1730-1830, ..., 2330-2430, the frequency distribution of first lactation milk yield of 100 Karan Swiss cows is given in Table 2.1.
Table 2.1 Frequency distribution of First Lactation milk yield of Karan Swiss cows
Class interval (in kg) | Frequency (fi)
1630-1730 | 17
1730-1830 | 19
1830-1930 | 23
1930-2030 | 16
2030-2130 | 14
2130-2230 | 7
2230-2330 | 2
2330-2430 | 2
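The tally behind Table 2.1 can be reproduced with a short Python sketch. It is an illustration only, not part of the original lesson: the variable names are chosen here, and each class is assumed to include its lower limit and exclude its upper limit, in line with the rule that boundary values go to the higher class.

```python
# First lactation milk yields (kg) of the 100 Karan Swiss cows of Example 1.
yields = [
    1630, 1648, 1663, 1665, 1671, 1677, 1680, 1687, 1690, 1695,
    1787, 1788, 1790, 1800, 1862, 1855, 1815, 1835, 1845, 1818,
    1974, 1998, 2000, 2000, 2005, 2031, 2045, 2045, 2050, 2056,
    2168, 2171, 2180, 2187, 2200, 2218, 2245, 2323, 2372, 2397,
    2063, 2069, 2085, 2098, 2100, 2100, 2100, 2105, 2117, 2131,
    1736, 1743, 1760, 1765, 1763, 1767, 1775, 1775, 1776, 1780,
    1695, 1754, 1698, 1700, 1742, 1732, 1711, 1713, 1718, 1728,
    1854, 1850, 1855, 1856, 1857, 1860, 1863, 1863, 1875, 1880,
    1890, 1900, 1910, 1912, 1915, 1918, 1928, 1916, 1915, 1947,
    1950, 1958, 1951, 1960, 1963, 1968, 1965, 1967, 1970, 1969,
]

start = 1630      # lower limit of the first class
width = 100       # size of each class interval
n_classes = 8     # classes 1630-1730 up to 2330-2430

# Tally each observation into its class.
frequencies = [0] * n_classes
for y in yields:
    frequencies[(y - start) // width] += 1

for i, f in enumerate(frequencies):
    lower = start + i * width
    print(f"{lower}-{lower + width} | {f}")
```

The printed frequencies agree with those of Table 2.1 and add up to the 100 observations.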
Advantages of grouping
(i) First advantage of grouping is that in subsequent calculations, much labour is saved in numerical computation by treating all individuals in a class interval as having the value at the centre of that interval.
(ii) The second advantage arises when the observed sample is of moderate size and drawn from a large population. In such a case the frequency table is more likely to exhibit the rise and fall of frequency across the class intervals.
2.5 Cumulative Frequency Distribution
The cumulative frequency of a class is the total frequency up to and including that class. The table of cumulative frequencies is called a cumulative frequency distribution table. There are two types of cumulative frequency distribution. The cumulative frequency of all values greater than or equal to the lower limit of each class gives the more than cumulative frequency distribution. The cumulative frequency of all values less than the upper limit of each class gives the less than cumulative frequency distribution. Let us illustrate this through Example 3.
Example 3: Prepare the cumulative frequency distribution of the frequency distribution of first lactation milk yield of Karan Swiss cows given in table 2.1.
Solution: The less than cumulative frequency and more than cumulative frequency distribution are shown in table 2.2
Table 2.2 Cumulative frequency distribution of first lactation milk yield
Class interval (in kg) | Frequency (fi) | Cumulative frequency (c.f.), less than | Cumulative frequency (c.f.), more than
1630-1730 | 17 | 17 | 100
1730-1830 | 19 | 36 | 83
1830-1930 | 23 | 59 | 64
1930-2030 | 16 | 75 | 41
2030-2130 | 14 | 89 | 25
2130-2230 | 7 | 96 | 11
2230-2330 | 2 | 98 | 4
2330-2430 | 2 | 100 | 2
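Both cumulative columns can be derived from the class frequencies by running totals, as in the Python sketch below. It is a minimal illustration, with variable names chosen here; the frequencies are those of Table 2.1.

```python
from itertools import accumulate

# Class frequencies from Table 2.1, in order of increasing class interval.
frequencies = [17, 19, 23, 16, 14, 7, 2, 2]

# Less than c.f.: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# More than c.f.: running total from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)   # [17, 36, 59, 75, 89, 96, 98, 100]
print(more_than)   # [100, 83, 64, 41, 25, 11, 4, 2]
```

For every class, the less than value plus the more than value of the next class equals the total frequency of 100, which gives a simple check on the table.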