Module 1.
Descriptive statistics
Lesson 3
MEASURES OF CENTRAL TENDENCY
3.1 Introduction
The collected data as such are not suitable to draw conclusions about the mass from where it has been taken. Some inferences about the population can be drawn from the frequency distribution of the observed values. One of the important objectives of statistical analysis is to determine various numerical measures which describe the inherent characteristics of a frequency distribution. The averages are the measures which condense a huge unwieldy set of numerical data into single numerical value, are representative of the entire distribution. Hence, in finding a central value, the data are condensed into a single value around which most of the values tend to cluster. Commonly, such a value lies in the centre of the distribution and is termed as central tendency.
3.2 Measure of Central Tendency or Average
One of the most important objectives of the statistical analysis is to get one single value that describes the characteristic of the entire mass of unwieldy data. Such a value is called the ‘Central Value’ or ‘an average’. The word ‘Average’ is very commonly used in everyday conversation. For example we talk of average milk yield of a cow, average fat content of milk , average height or life of an Indian, average income etc. When we say ‘he is an average student’ what it means is that he is neither very good nor very bad, just a mediocre type of student. Similarly, when we talk of average size of butter or cheese packet being sold through a retail outlet what we mean is that the size of packet which is being sold to maximum number of individuals by the retail outlet that means it is the modal size. However, in statistics the term average has a different meaning. According to Croxton and Cowden “An average value is a single value within the range of the data and is used to represent all of the values in the series. Since an average is within the range of the data, it is sometimes called a measure of central value.” It may be defined as that value of a distribution which is considered as the most representative of the series or typical value for a group. Such a value is of great significance because
· It depicts the characteristic of the whole group
· It facilitates comparison
Averages are sometime referred to as a measure of location since they enable us to locate the position or place of the distribution in question.
Requisites of a good average
· It should be rigidly defined.
· It should be easy to understand and calculate.
· It should be based on all the observations.
· It should not be unduly affected by extreme observations.
· It should be suitable for further mathematical treatment.
· It should be least affected by fluctuations of sampling.
The various measures of central tendency or averages are discussed below:
3.3 Arithmetic Mean
Its value is obtained by adding together all the items and dividing it by the total number of observations. If X1, X2, . . . . . , Xn are n values of a variable X, then the arithmetic mean (A.M.) in case of raw data, is defined as
In case of frequency distribution
Xi |
X1 |
X2 |
X3 |
--- |
Xn |
fi |
f1 |
f2 |
f3 |
--- |
fn |
If the value X1 occurs f1 times, the value X2 occurs f2 times, …, the value Xn occurs fn times, then the arithmetic mean is given by
where,
3.3.1 Arithmetic mean in case of grouped data
In case of grouped data, arithmetic mean can be calculated by applying any of the following methods:
i) Direct Method ii) Short cut Method iii) Step-deviation Method
3.3.1.1 Direct method
· Multiply each value of Xi (the mid value of the class) by the corresponding frequency fi.
· Obtain the sum of the products .
· Divide this sum of products by the total frequency (N) so as to get mean.
Example 1: Find the mean from frequency distribution given in example 2 of Lesson 2.
Solution: Prepare the following table and calculate arithmetic mean as follows:
Class Interval |
Mid-value
(Xi) |
frequency
(fi) |
fi Xi |
1630-1730 |
1680 |
17 |
28560 |
1730-1830 |
1780 |
19 |
33820 |
1830-1930 |
1880 |
23 |
43240 |
1930-2030 |
1980 |
16 |
31680 |
2030-2130 |
2080 |
14 |
29120 |
2130-2230 |
2180 |
7 |
15260 |
2230-2330 |
2280 |
2 |
4560 |
2330-2430 |
2380 |
2 |
4760 |
Total |
|
100 |
191000 |
3.3.1.2 Short–cut method (Change of Origin)
If the values of X or/and f are large, the calculation of mean by direct method is quite tedious and time consuming. In such a case the calculations can be reduced to a great extent by using short cut method. This method consists in taking deviations of the given observations from any arbitrary value A. The formula for calculation of the arithmetic mean is
Where A is arbitrary mean, di’
= Xi – A i.e.
deviation from the arbitrary or assumed mean.
3.3.1.3 Step-deviation method (Change of origin and scale)
In case of grouped frequency distribution, with class intervals of equal magnitude, the calculations are further simplified by taking; where Xi is the mid value of the class and h is the common magnitude or width of the class intervals. So the formula for calculating mean is. The procedure is illustrated in the example 2. It will be seen that the answer in each of the three cases is the same. The step-deviation method is the most convenient method on account of simplified calculations.
Example 2: Solve example 1 with short-cut and step-deviation method.
Solution: Prepare the following table and calculate arithmetic mean as follows:
a) Short-cut
method: A=2080
b)
Step-Deviation Method: A=2080, h=100
3.3.2 Mathematical properties of arithmetic mean
i) The algebraic sum of the deviations of the given set of observations from their arithmetic mean is zero i.e. .
ii) If
n1 and n2 are the sizes and are
the respective means of two series then the pooled mean of the combined series of size (n1+n2)
observations is given by:
iii) The sum of the squares of deviations of the given set of observations, when taken from their arithmetic mean, is minimum.
3.3.3 Merits of arithmetic mean
· The A.M. is rigidly defined
· It is based on all the observations
· It is easily calculated from the given data
· It is least affected by fluctuations of sampling
· It is suitable for further mathematical treatment. The average of two or more series can be obtained from the averages of the individual series.
3.3.4 Demerits of arithmetic mean
· The strongest drawback of arithmetic mean is that it is very much affected by extreme observations.
· In a distribution with open end classes the value of mean cannot be computed without making assumptions regarding the size of class.
· It can neither be located by inspection nor graphically.
· It cannot be used for qualitative type of data such as intelligence, honesty, flavor, overall acceptability of dairy product etc.
· Arithmetic mean cannot be obtained if a single observation is missing or lost.
· In a skewed distribution, usually arithmetic mean is not representative of the distribution.
3.4 Geometric Mean
The geometric mean (usually denoted by G.M.) of a set of n observations is the nth root of their product. If X1, X2, . . . , Xn are n values of a variable X, none of them being zero, then the geometric mean, G.M. is defined by
Thus logarithm of G.M. of a set of observations is the arithmetic mean of their logarithms.
If
X1, X2, . . . , Xn occurs f1,
f2 . . . , fn times respectively then
3.4.1 Merits of geometric mean
· It is rigidly defined
· The G.M. is based on all observations of a series.
· It is not much affected by fluctuations of sampling.
· It is suitable for further mathematical treatment.
· Unlike arithmetic mean which has a bias for higher values, geometric mean has bias for smaller observations.
· As compared with Arithmetic mean, Geometric mean is affected to a lesser extent by extreme observations.
3.4.2 Demerits of geometric mean
· Computations are difficult
· It is not simple to understand
· It does not give equal weight to every item.
· It cannot be calculated if the number of negative values is odd as well as some value is zero.
3.4.3 Use of geometric mean
It is most appropriate average when dealing with ratios, percentages and rate of increase between two periods. It is applied when increase or decrease in time is proportional e.g. growth of population is proportional to the time, increase in bacterial population is proportional to the time and rate of interest. Geometric Mean is used in the construction of Index numbers.
3.5 Harmonic Mean
If X1, X2,
. . . , Xn are n values of a variable X, then their Harmonic Mean,
abbreviated as H.M. is defined by
In other words Harmonic
Mean is the reciprocal of the arithmetic mean of the reciprocals of the given
observations. In case of grouped frequency distribution, harmonic mean is given
by
3.5.1 Merits of harmonic mean
· It is rigidly defined
· The H.M. is based on all observations of a series.
· It is not much affected by fluctuations of sampling.
· It is suitable for further mathematical treatment.
· Since the reciprocals of the values of the variable are involved, it gives greater weight age to smaller observations and as such is not very much affected by one or two big observations.
3.5.2 Demerits of harmonic mean
· Computations are difficult and not simple to understand.
· It cannot be calculated if any one of the observations is zero.
· It is not a representative figure of the distribution unless the phenomenon requires greater weight age to be given to smaller values.
3.5.3 Use of harmonic mean
H.M. is used in finding averages involving speed, time, price and ratios. It is useful for computing the average rate of increase of profits of a concern or average speed at which a journey has been performed or the average price at which an article has been sold. The rate usually indicates the relation between two different types of measuring units that can be expressed reciprocally. The H.M. is used for the problems about work, time and rate, where the amount of work is held constant and the average rate is required, or in problems about total cost, number of persons and per capita cost is called for or in problems of similar nature involving rates.
The arithmetic mean
(A.M.), the geometric mean (G.M.) and the harmonic mean (H.M.) of a series of n
observations are connected by the relation A.M. ≥ G.M. ≥ H.M.
The computation of G.M. and H.M. is illustrated in example 3.
Example 3: Find G.M. and H.M. of data given in example 1.
Solution: Prepare the following table and calculate G.M. and H.M. as follows:
Class Interval |
Mid-value (Xi) |
frequency (fi) |
log Xi |
fi log Xi |
|
|
1630-1730 |
1680 |
17 |
3.2253 |
54.8303 |
0.0006 |
0.0101 |
1730-1830 |
1780 |
19 |
3.2504 |
61.7580 |
0.0006 |
0.0107 |
1830-1930 |
1880 |
23 |
3.2742 |
75.3056 |
0.0005 |
0.0122 |
1930-2030 |
1980 |
16 |
3.2967 |
52.7466 |
0.0005 |
0.0081 |
2030-2130 |
2080 |
14 |
3.3181 |
46.4529 |
0.0005 |
0.0067 |
2130-2230 |
2180 |
7 |
3.3385 |
23.3692 |
0.0005 |
0.0032 |
2230-2330 |
2280 |
2 |
3.3579 |
6.71587 |
0.0004 |
0.0009 |
2330-2430 |
2380 |
2 |
3.3766 |
6.75315 |
0.0004 |
0.0008 |
Total |
|
100 |
|
327.9316 |
|
0.0528 |
From example 2 and example 3 one
can verify that A.M. ≥ G.M. ≥ H.M.
3.6 Median (Positional Average)
The median is defined the measure of the central value when arranged in ascending or descending order of magnitude. According to L.R. Connor “The median is that value of the variate which divides the group in two equal parts, one part comprising all the values greater and the other, all values less than median”. Thus, as against arithmetic mean which is based on all the items of the distribution, the median is only positional average i.e. its value depends on the position occupied by a value in the frequency distribution.
3.6.1 Calculation of median
3.6.1.1 Ungrouped data
When
the total numbers of observations are odd, then the median is the middle value
after the observations are arranged in ascending or descending order of
magnitude. If the number of observations is equal to n, then the value of
((n+1)/2)th item gives the value of median e.g. the median of 5
observations 65,69,52,58,45 i.e. 45,52,58,65,69 is 58. When the total number of
observations is even then median is obtained as the arithmetic mean of the two
middle observations after they are arranged in ascending or descending order of
magnitude. If number of observations are say 2n, then the arithmetic average of
nth and (n+1)st (central items) gives the value of
median. If it is n then median is the arithmetic average of and
values e.g. the median of 6 observations
65,69,52,58,45,67 i.e. 45,52,58,65,67,69 is arithmetic mean of 58 and 65 which
is equal to 61.5.
3.6.1.2 Grouped data
Steps involved for its computation are:
1) Prepare less than cumulative frequency(c.f. ) distribution table
2) Find N/2.
3) Find cumulative frequency just greater than N/2
4) The class corresponding to step 3 contains the median value and is called the median class.
The median for a grouped series is given by the following formula:
Where: l is the lower limit of the median class
f is the frequency of the median class
h is width of the class interval
c.f. is the cumulative frequency of the class preceding the median class.
The computational procedure is illustrated in Example 3.4.
3.6.2 Merits of median
· It is rigidly defined.
· It is easily understood, very readily calculated and can exactly be located.
· It is readily obtained without the necessity of measuring all the objects.
· It is not affected by abnormally large or small values of the variable.
· Median can be computed while dealing with a distribution with open end classes.
· It can be determined by mere inspections and can be computed graphically.
· The median gives the best results in a study of direct qualitative measurements such as intelligence, honesty etc.
3.6.3 Demerits of median
· The median does not lend itself to algebraic treatment. The median of several series by combining the medians of the component series cannot be computed.
· Median being positional average is not based on each and every item of the observations.
· Median is relatively less stable than mean, particularly for small samples since it is affected more by fluctuations of sampling as compared to arithmetic mean.
3.7 Quartiles
Quartiles
are those values of the variate which divide the total frequency into four equal
parts. Obviously there will be three such points Q1, Q2 and
Q3 such that Q1≤ Q2 ≤ Q3 termed
as the three quartiles. Q1is known as the lower or first
quartile and is the value which has 25% of the items of the distribution below
it and consequently 75 percent of the items are greater than it. Q3
is known as the upper or third quartile and has 75percent of the
observations below it and consequently 25 percent of the observations above it.
3.8 Deciles
Deciles
are those values of the variate which divide the total frequency into 10 equal
parts. The formula for obtaining jth Decile (Dj) in case
of grouped frequency distribution is given as
, j=1, 2, 3, ---, 9
3.9 Percentiles
Percentiles
are the values of the variates which divide the total frequency into 100 equal
parts. The formula for obtaining kth Percentile (Pk) in
case of grouped frequency distribution is given as
, k=1, 2, 3, ---, 99.
3.9.1 Graphical method of locating position values
The various partition values viz., median, quartiles, deciles and percentiles can be located graphically with the help of curve called the cumulative frequency curve or ogive. Draw a perpendicular from the point of the two ogives i.e. more than ogive and less than ogive on the x-axis, the foot of the perpendicular gives the value of median. The points corresponding to N/4, 3N/4, N/10,….., 9N/10, N/100,….., 99N/100 on y-axis with the foot values of the perpendicular on x-axis provide the value of Q1, Q3, D1, ……, D9, P1, ……., P99.
3.10 Mode
Mode is the value which occurs most in a set of observations and around which the other items of the set cluster densely. It is defined to be size of the variable which occurs most frequently or the point of maximum frequency or the point of greatest density. In other words mode is that value of observation for which the height of the ordinate is maximum. Modal value of the distribution is that value of the variate for which frequency is maximum. In the words of Croxton and Cowden “The mode of a distribution is value at the point around which the items tend to be heavily concentrated. It may be regarded as the most typical value of a series of values. ”
3.10.1 Computation of mode
In case of a frequency distribution, mode is the value of the variable corresponding to the maximum frequency. For a continuous frequency distribution, the class corresponding to maximum frequency is called the modal class. The mode is computed by the formula:
Where l = lower limit of the modal class
f1 is the frequency of the modal class.
fo is the frequency of the class just preceding the modal class (pre-modal class).
f2 is the frequency of the class just succeeding the modal class (post-modal class).
h is the magnitude of the modal class.
The computational procedure is illustrated in example 4.
3.10.2 Merits of mode
· It is easily understood.
· It is the most typical value and it is the most descriptive average.
· It is a positional average.
· It can be easily located by mere inspection of certain items.
· It can be easily determined from the graph.
· The extreme items have no effect provided they are not in the modal class.
3.10.3 Demerits of mode
· It is ill defined. A clearly defined mode does not always exist. The value of mode cannot always be determined. A distribution can be bimodal or multimodal.
· It is not based on all the observations of a series.
· It is not suitable of further mathematical treatment.
· As compared to mean, mode is affected to a greater extent by the fluctuations of sampling.
Graphically mode can be located from the histogram of frequency distribution by making use of the rectangles erected on the modal, pre-modal and post modal classes. The method consists of following steps:
a) Join the top right corner of the rectangle erected on the modal class with top left corner of the rectangle erected on the preceding class by means of a straight line.
b) Join the top last corner of the rectangle erected on the modal class with top right corner of the rectangle erected on the succeeding class by means of a straight line.
c) From the point of intersection of the lines in step (i) and (ii) above, draw a perpendicular to the X-axis. The abscissa of the point where this perpendicular meets the X-axis gives the modal value.
Example 4: Find median, first quartile, third quartile and mode of the frequency distribution given in example 1 and obtain them graphically.
Solution: Prepare the following table to calculate median, first quartile, third quartile and mode.
Here N/2=50. Cumulative frequency greater than 50 is 59. Hence the median class is 1830-1930.
N/4=25. Cumulative frequency greater than 25 is 36. Hence the first quartile class is 1730-1830.
Q1 = 1730 + (25-17) × =
1730 + 42.1053 = 1772.1053
3N/4=75. Cumulative frequency greater than 75 is 89. Hence the third quartile class is 2030-2130.
Fig. 3.1 Graphical method to find first quartile, median and third quartile
For computing Mode, the maximum frequency (23) occurs in the class interval 1830-1930, which is called modal class.f1=23, f0=19 and f2=16. Using formula
Fig. 3.2 Graphical method to find mode
Empirical relation between Mean, Median and Mode
In
case of symmetrical distribution mean, mode and median coincide while for
asymmetrical distribution the empirical relationship is Mode = 3 Median -2
Mean.