Module 1. Descriptive statistics

Lesson 2

CLASSIFICATION OF DATA AND FREQUENCY DISTRIBUTION

2.1 Introduction

As discussed in Lesson 1, statistics are a set of numerical statements and facts collected from any field of enquiry for drawing valid inferences. Data collection is, in fact, the most important aspect of a research experiment or statistical survey. After the data have been collected, the next step is to present them in an orderly and logical form so that their essential features become explicit. Proper presentation is needed because collected data in their raw form are often so voluminous that they cannot be easily comprehended or analysed. It is therefore imperative that, after collection, the data are classified and presented in a way that brings out the points of similarity and dissimilarity in them.

2.2 Collection of Data

To study any problem by means of statistical methods, the relevant data are first collected. Sometimes the data are collected from a research experiment or from primary sampling units (e.g. households). Sometimes the relevant data already exist in published or unpublished form, having been collected by a private body, a government agency or a research organization for its own use or for supplying information to the public. In making use of such data (called secondary data), one has to be particularly careful about the definitions of terms and concepts used by the collecting authority, and also about the method of collection and the reliability of the data. More often, one has to collect data directly from the field of enquiry; the data are then said to be primary data. Primary data may be collected by interviewing a number of persons and filling in questionnaires relevant to the problem, e.g. in a family income and expenditure survey, one will generally interview the head of each family. The data collected should be carefully scrutinized before they are subjected to statistical treatment.

2.3 Classification of Data

Classification is the process of arranging the data into different groups or classes according to some common characteristics. According to Connor “Classification may be defined as the process of arranging things in groups or classes according to their resemblances and affinities”. The functions of classification may be summarized as follows:

·         It condenses the data.

·         It facilitates comparisons.

·         It helps to study the relationships.

·         It facilitates the statistical treatment of data.

The classification of data is generally done on geographical, chronological, qualitative or quantitative basis on the following lines:

a)     In geographical classification, data are arranged according to places, areas or regions.

b)     In chronological classification, data are arranged according to time i.e. weekly, monthly, quarterly, half-yearly, annually, etc.

c)     In qualitative classification, data are arranged according to attributes such as sex, marital status, educational standard, region, farm, breed, disease, etc.

d)    Quantitative classification means arranging data according to a characteristic that has been measured, e.g. height, weight, milk yield or the fat content of a dairy product. In this type of classification, classes are formed and each unit is assigned to the class to which it belongs. The quantitative phenomenon under study is known as a variable, and hence this classification is sometimes also called classification by variables.

Variable: A quantitative phenomenon under study, such as wages, barometer readings, rainfall records, heights, weights, milk yield, fat, SNF (solids-not-fat), age at first calving or first lactation production, is termed a variable or a variate. In other words, a quantity which can vary from one individual to another is called a variable. Variables are of two kinds:

a)      Continuous variable: Quantities which can take any numerical value within a certain range are called continuous variables, e.g. the height of a calf is a continuous variable, since as the calf grows from 200 cm to 300 cm its height assumes all possible values within these limits.

b)      Discrete variable: Quantities which cannot take all possible values within a range are called discontinuous or discrete variables, e.g. the number of animals in a herd can take only integer values such as 2, 3, 4, etc.

2.4 Frequency Distribution

A frequency distribution is a statistical table which shows the values of a variable in order of magnitude, either individually or in groups, with the corresponding frequencies side by side. The data pertaining to a quantitative phenomenon can be presented in four ways:

·        The set or series of individual observations: ungrouped (raw) or arranged (arrayed) data.

·        Discrete or ungrouped frequency distribution.

·        Grouped frequency distribution.

·        Continuous frequency distribution.

Example 1: The following data pertain to the first lactation milk yield (in kg) of 100 Karan Swiss cows.

1630  1648  1663  1665  1671  1677  1680  1687  1690  1695
1787  1788  1790  1800  1862  1855  1815  1835  1845  1818
1974  1998  2000  2000  2005  2031  2045  2045  2050  2056
2168  2171  2180  2187  2200  2218  2245  2323  2372  2397
2063  2069  2085  2098  2100  2100  2100  2105  2117  2131
1736  1743  1760  1765  1763  1767  1775  1775  1776  1780
1695  1754  1698  1700  1742  1732  1711  1713  1718  1728
1854  1850  1855  1856  1857  1860  1863  1863  1875  1880
1890  1900  1910  1912  1915  1918  1928  1916  1915  1947
1950  1958  1951  1960  1963  1968  1965  1967  1970  1969

 

The data given in Example 1 are called raw or ungrouped data; in this form they convey little useful information. Our objective is to express such bulky data in a suitably condensed form that highlights the significant facts and comparisons and furnishes more useful information, without sacrificing any information of interest about the important characteristics of the distribution.

2.4.1 Array

A better presentation of the above raw data is to arrange them in ascending or descending order of magnitude, which is called arraying the data. Although an array is easier to read than the raw data, it does not reduce the volume of the data.
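Arraying can be sketched in a few lines of Python. The ten values below are the second run of ten yields listed in Example 1; the variable names are illustrative only.

```python
# Ten of the yields (kg) from Example 1, in the order they are listed there.
raw_yields = [1787, 1788, 1790, 1800, 1862, 1855, 1815, 1835, 1845, 1818]

# An array is simply the same observations put in order of magnitude.
ascending = sorted(raw_yields)
descending = sorted(raw_yields, reverse=True)

print(ascending)
# The array has exactly as many entries as the raw data: nothing is condensed.
print(len(ascending) == len(raw_yields))
```

The second printed line is True, which is the point made above: ordering helps the reader but leaves the volume of data unchanged.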

2.4.2 Discrete or ungrouped frequency distribution

A much better way of presenting the data is to express them in the form of a discrete or ungrouped frequency distribution, in which we count the number of times each value of the variable occurs in the data. The number of times a variate value is repeated is called the frequency of that value, e.g. if seven Karan Fries cows have a first lactation milk yield equal to 1900 kg, then 7 is the frequency of the first lactation yield of 1900 kg.
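As a small illustration, the counting can be done with Python's standard `collections.Counter`. The eight values are a subset of Example 1 chosen because some of them repeat; this is a sketch, not a prescribed procedure.

```python
from collections import Counter

# A few yields (kg) taken from Example 1; 2100 occurs three times and 2000 twice.
yields = [2098, 2100, 2100, 2100, 2105, 2000, 2000, 2005]

# Counter maps each distinct variate value to its frequency.
frequency = Counter(yields)

# Print the values in order of magnitude with their frequencies side by side.
for value in sorted(frequency):
    print(value, frequency[value])
```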

2.4.3 Grouped frequency distribution

It is a statistical table which shows the values of the variable in groups, with the corresponding frequencies side by side. Here the data are condensed by dividing the entire range of values of the variable into a suitable number of groups, called classes (or class intervals), and then recording the number of observations in each group. Such a representation of data is called a grouped frequency distribution. The boundary ends of a class are called class limits, e.g. for the class interval 0 – 10, 0 is the lower limit and 10 is the upper limit. The difference between the upper and lower limits is called the magnitude of the class. The number of observations falling within a particular class is called its frequency or class frequency. The variate value which lies midway between the upper and lower limits is called the mid value or midpoint of that class.
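The two quantities defined above are simple arithmetic on the class limits. A minimal sketch, with hypothetical function names, using the class 0 – 10 from the text:

```python
def class_magnitude(lower, upper):
    """Magnitude of a class: upper limit minus lower limit."""
    return upper - lower


def class_midpoint(lower, upper):
    """Mid value of a class: the value midway between its limits."""
    return (lower + upper) / 2


print(class_magnitude(0, 10))  # 10
print(class_midpoint(0, 10))   # 5.0
```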

While preparing the frequency distribution the following points must be kept in mind:

1.      The class interval should be uniform, i.e. all classes should be of equal width. A comparison of different frequency distributions is facilitated if the same class interval is used for all. The class interval should be an integer as far as possible.

2.      The class intervals should be so chosen that every observation falls in some class of the frequency distribution.

3.      The class intervals should be continuous. Open-end classes such as less than ‘a’ or greater than ‘b’ should be avoided, since they create difficulty in analysis and interpretation.

4.      An observation falling at the common point between two classes should always be put in the higher class, e.g. an observation with variate value 30 is to be put in the class 30-40 and not in 20-30.

5.      There should be neither too many nor too few classes. The number of classes should not be less than 6 or more than 30. With too few classes accuracy may be lost, and with too many classes the computations become tedious. The optimum number of classes is generally considered to be 15.
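Point 4 above amounts to treating each class as including its lower limit and excluding its upper limit. The sketch below shows one way to apply that rule with the standard `bisect` module; the three classes and the helper name `class_of` are illustrative assumptions.

```python
from bisect import bisect_right

# Hypothetical classes 20-30, 30-40 and 40-50, described by their lower limits.
lower_limits = [20, 30, 40]
width = 10


def class_of(value):
    """Return (lower, upper) of the class containing value.

    The lower limit is included and the upper limit excluded, so a value
    at the common point of two classes goes to the higher class.
    """
    index = bisect_right(lower_limits, value) - 1
    if index < 0 or value >= lower_limits[-1] + width:
        raise ValueError("value lies outside all classes")
    lower = lower_limits[index]
    return lower, lower + width


print(class_of(30))    # (30, 40)
print(class_of(29.9))  # (20, 30)
```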

2.4.3.1 Number of classes

The following formula due to Sturges may be used to determine the number of classes: $k = 1 + 3.322 \log_{10} N$, where k is the number of classes and N is the total frequency.
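The formula can be evaluated directly, as in the following sketch. The function name is hypothetical, and rounding the result up to a whole number of classes is an assumption of this example rather than part of the formula.

```python
import math


def sturges_classes(n):
    """Number of classes k = 1 + 3.322 * log10(n) for total frequency n."""
    if n < 1:
        raise ValueError("n must be a positive count")
    return 1 + 3.322 * math.log10(n)


k = sturges_classes(100)
print(round(k, 3))   # 7.644
print(math.ceil(k))  # 8 (rounded up to a whole number of classes)
```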

2.4.3.2 Size of class intervals

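The choice of class interval depends on the number of classes chosen for a given distribution and on the size of the data. As far as possible, the class intervals should be of equal size. Prof. Sturges has given the following formula for determining the size of the class interval:

$$i = \frac{\text{Range}}{1 + 3.322 \log_{10} N} = \frac{\text{Largest value} - \text{Smallest value}}{k}$$

where i is the size of the class interval, N is the total frequency and k is the number of classes.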

Example 2: For the data given in Example 1, find the size of the class interval and prepare the frequency distribution.

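Solution: Here N = 100, largest value = 2397 and smallest value = 1630, so that Range = 2397 − 1630 = 767.

Number of classes $k = 1 + 3.322 \log_{10}(100) = 1 + 3.322 \times 2 = 7.644$

The size of the class interval is given by

$$i = \frac{\text{Range}}{k} = \frac{767}{7.644} \approx 100.34 \approx 100$$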

Taking the class intervals as 1630-1730, 1730-1830, ..., 2330-2430, the frequency distribution of first lactation milk yield of the 100 Karan Swiss cows is given in Table 2.1.

Table 2.1 Frequency distribution of first lactation milk yield of Karan Swiss cows

Class interval (in kg)    Frequency (fi)
1630-1730                 17
1730-1830                 19
1830-1930                 23
1930-2030                 16
2030-2130                 14
2130-2230                  7
2230-2330                  2
2330-2430                  2
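The whole of Example 2 can be reproduced from the raw data as a check on Table 2.1. The following is a minimal sketch: starting the first class at the smallest observation and rounding the interval to the nearest integer are choices that mirror the worked solution, and the variable names are illustrative.

```python
import math

# First lactation milk yields (kg) of the 100 cows, copied from Example 1.
yields = [
    1630, 1648, 1663, 1665, 1671, 1677, 1680, 1687, 1690, 1695,
    1787, 1788, 1790, 1800, 1862, 1855, 1815, 1835, 1845, 1818,
    1974, 1998, 2000, 2000, 2005, 2031, 2045, 2045, 2050, 2056,
    2168, 2171, 2180, 2187, 2200, 2218, 2245, 2323, 2372, 2397,
    2063, 2069, 2085, 2098, 2100, 2100, 2100, 2105, 2117, 2131,
    1736, 1743, 1760, 1765, 1763, 1767, 1775, 1775, 1776, 1780,
    1695, 1754, 1698, 1700, 1742, 1732, 1711, 1713, 1718, 1728,
    1854, 1850, 1855, 1856, 1857, 1860, 1863, 1863, 1875, 1880,
    1890, 1900, 1910, 1912, 1915, 1918, 1928, 1916, 1915, 1947,
    1950, 1958, 1951, 1960, 1963, 1968, 1965, 1967, 1970, 1969,
]

n = len(yields)                        # total frequency N
smallest, largest = min(yields), max(yields)
data_range = largest - smallest        # 2397 - 1630 = 767
k = 1 + 3.322 * math.log10(n)          # Sturges: 7.644
width = round(data_range / k)          # 100.34 rounded to 100

# Enough classes of this width, starting at the smallest value, to reach the largest.
num_classes = (largest - smallest) // width + 1

# Lower limit included, upper limit excluded (boundary values go to the higher class).
frequencies = [0] * num_classes
for y in yields:
    frequencies[(y - smallest) // width] += 1

for j, f in enumerate(frequencies):
    lower = smallest + j * width
    print(f"{lower}-{lower + width}: {f}")
```

Run on the data of Example 1, this prints the eight classes of Table 2.1 with frequencies 17, 19, 23, 16, 14, 7, 2 and 2.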

 

Advantages of grouping

     (i)        The first advantage of grouping is that much labour is saved in subsequent numerical computations by treating all individuals in a class interval as having the value at the centre of that interval.

   (ii)       The second advantage arises where the observed sample is of moderate size and is drawn from a large population. In such a case, the frequency table is more likely to show the rise and fall of frequency across the class intervals.
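Advantage (i) can be illustrated with a short calculation on Table 2.1. The arithmetic mean is used here only as an example of a "subsequent calculation"; it is not defined in this lesson, and the result is an approximation because every cow in a class is treated as yielding the class midpoint.

```python
# Lower class limits (kg) and frequencies from Table 2.1; each class is 100 kg wide.
lower_limits = [1630, 1730, 1830, 1930, 2030, 2130, 2230, 2330]
frequencies = [17, 19, 23, 16, 14, 7, 2, 2]
width = 100

# Treat every observation in a class as equal to the class midpoint.
midpoints = [lower + width / 2 for lower in lower_limits]

# Eight products replace a sum over 100 individual observations.
grouped_total = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = grouped_total / sum(frequencies)

print(grouped_mean)  # 1910.0
```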

2.5 Cumulative Frequency Distribution

The cumulative frequency of a class is the total frequency up to and including that class. The table of cumulative frequencies is called a cumulative frequency distribution table. There are two types of cumulative frequency distribution. The cumulative frequency of all values greater than or equal to the lower limit of each class gives the "more than" cumulative frequency distribution. The cumulative frequency of all values less than the upper limit of each class gives the "less than" cumulative frequency distribution (an observation equal to the upper limit belongs to the next higher class). Example 3 illustrates both types.

Example 3: Prepare the cumulative frequency distributions for the frequency distribution of first lactation milk yield of Karan Swiss cows given in Table 2.1.

Solution: The "less than" and "more than" cumulative frequency distributions are shown in Table 2.2.

Table 2.2 Cumulative frequency distribution of first lactation milk yield

Class interval    Frequency    Cumulative frequency (c.f.)
(in kg)           (fi)         Less than    More than
1630-1730         17            17          100
1730-1830         19            36           83
1830-1930         23            59           64
1930-2030         16            75           41
2030-2130         14            89           25
2130-2230          7            96           11
2230-2330          2            98            4
2330-2430          2           100            2
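Both cumulative columns of Table 2.2 are running totals of the class frequencies, taken from the top for "less than" and from the bottom for "more than". A minimal sketch using the standard `itertools.accumulate`:

```python
from itertools import accumulate

# Class frequencies from Table 2.1, lowest class first.
frequencies = [17, 19, 23, 16, 14, 7, 2, 2]

# "Less than": running total from the first class down to each class.
less_than = list(accumulate(frequencies))

# "More than": running total from the last class up to each class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)  # [17, 36, 59, 75, 89, 96, 98, 100]
print(more_than)  # [100, 83, 64, 41, 25, 11, 4, 2]
```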