Review of Research on Criterion-Referenced Tests

Studies Regarding Criterion-Referenced Tests


    Glaser, R. (1963) defined criterion-referenced measurement as indicating the content of the behavioral repertory and the correspondence between what an individual does and the underlying continuum of achievement. He discussed the use of criterion-referenced tests to assess achievement and indicated the need to specify minimum levels of competence or performance in behavioral terms.


    Nunnally, J.C. (1967) suggested a basis for comparing the validity, reliability, and item analysis methods and procedures used for norm-referenced tests with those used for criterion-referenced tests.


    Popham et al. (1969) discussed the issues of reliability, validity, item construction, and item analysis as they apply to criterion-referenced tests. They held that score variability is irrelevant, that is, it is not a necessary condition for a good criterion-referenced test. The authors contended that most of the historical guidelines for test and item construction are not only irrelevant to criterion-referenced tests but can also be injurious to their proper development and use.


    Ivens, S. H. (1970) discussed the problems of developing and empirically evaluating a set of indices that might be used to assess the item and test quality of criterion-referenced tests. Two different approaches to the construction of criterion-referenced measures were considered. The author believed that the reliability of criterion-referenced tests can be assessed using indices based on the concepts of within-subject equivalence of item scores and equivalence of total scores.


    Jackson, R. (1970) suggested some definitions of criterion-referenced tests and noted some of their insufficiencies. The term “criterion-referenced” was applied only to a test designed and constructed in a manner that defines explicit rules linking patterns of test performance to behavioural objectives. The author advanced some empirical methods for dealing with difficulties in item analysis, test reliability, and test validity.


    Dahl, T. (1979) pointed out various shortcomings of norm-referenced standardized tests. Objective-based testing, in which scores are related to previously specified learning criteria, yields test results indicative of student achievement of educational goals. The author considered the attainment and measurement of objective-item congruence essential to the construction and use of objective-based tests.


    Glaser, R. et al. (1971) pointed out many of the differences between criterion-referenced and norm-referenced tests. Criterion-referenced tests are constructed to support generalizations about an individual’s performance relative to a specified domain of tasks. The task domain must be defined in terms of observable behavior, and the test must be a representative sample from which competence is inferred. Item construction, test construction, and domain sampling are some of the areas discussed.


    Popham, W.J. (1971) discussed indices of adequacy for criterion-referenced test items. While discrimination indices and internal consistency estimates abound in the norm-referenced area, few such procedures are available for use with criterion-referenced tests. Some approaches to item writing were discussed, as well as some methods for assessing item adequacy for criterion-referenced measures.


    Livingston, S.A. (1972) developed formulae for computing the variance, covariance, and correlation of criterion-referenced test scores. It was held that a theory of criterion-referenced reliability using the classical test theory model is possible if new concepts parallel to the familiar norm-referenced concepts are defined. The new concepts are based on deviations from the criterion score rather than from the mean. The formulae developed are valid only for data that meet the assumptions of classical test theory.
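
    The coefficient usually attributed to Livingston can be written as k² = (ρσ² + (μ − C)²) / (σ² + (μ − C)²), where ρ is a conventional reliability estimate, σ² the score variance, μ the mean score, and C the criterion score. The following is a minimal sketch of that formula; the function name and the sample data are hypothetical and are not taken from the original paper.

```python
from statistics import mean, pvariance


def livingston_k2(scores, criterion_score, reliability):
    """Criterion-referenced reliability based on deviations from the criterion.

    scores          -- observed total scores for a group of examinees
    criterion_score -- the cut-off (criterion) score C
    reliability     -- a conventional (norm-referenced) reliability estimate
    """
    score_mean = mean(scores)
    score_var = pvariance(scores)  # population variance of observed scores
    # Squared distance of the group mean from the criterion score.
    shift = (score_mean - criterion_score) ** 2
    denominator = score_var + shift
    if denominator == 0:
        raise ValueError("scores have no variance and the mean equals the criterion")
    return (reliability * score_var + shift) / denominator


# Hypothetical data: five examinees, conventional reliability of 0.6.
example_scores = [6, 8, 10, 12, 14]
k2_at_mean = livingston_k2(example_scores, 10, 0.6)
k2_away_from_mean = livingston_k2(example_scores, 14, 0.6)
```

    When the criterion score equals the group mean, the coefficient reduces to the conventional reliability; it increases as the criterion moves away from the mean.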


    Millman, J. (1972) synthesized some of the literature on establishing standards and determining the number of items needed in criterion-referenced measures. A variety of proposed methods for establishing passing scores were reviewed, as well as a number of methods for determining test length. The author presented a table relating test lengths, proficiency standards and required standards.
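
    The relation between test length, passing score, and classification error can be illustrated with a simple binomial model. The sketch below assumes that each item is answered correctly with a probability equal to the examinee's true proficiency and that items are independent; both assumptions, and all function names, are illustrative and are not claimed to reproduce the table in the original paper.

```python
from math import comb


def pass_probability(true_proficiency, test_length, passing_score):
    """Probability of scoring at least passing_score on test_length items."""
    p = true_proficiency
    n = test_length
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(passing_score, n + 1)
    )


def misclassification_probability(true_proficiency, test_length,
                                  passing_score, mastery_standard):
    """Probability that the pass/fail result contradicts true mastery status."""
    p_pass = pass_probability(true_proficiency, test_length, passing_score)
    is_true_master = true_proficiency >= mastery_standard
    # A true master is misclassified by failing, a nonmaster by passing.
    return 1 - p_pass if is_true_master else p_pass


# Hypothetical case: true proficiency 0.9, standard 0.8, pass mark 8 of 10.
error_for_master = misclassification_probability(0.9, 10, 8, 0.8)
```

    Evaluating the function over several test lengths and passing scores gives a rough picture of how quickly classification errors shrink as items are added.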


    Shavelson et al. (1972) presented a critique of Livingston’s paper (1972). The authors contended that, with Livingston’s formula, the reliability of the test is a function of individuals’ responses to items. Therefore, the reliability coefficient is not directly related to the repeatability of the measure.


    Hambleton, R.K. (1972) attempted to synthesize some of the current thinking in the area of criterion-referenced testing as well as to provide the beginning of an integration of theory and method for such testing. A Bayesian procedure for estimating true mastery scores was discussed.


    Hambleton, R.K. (1974) compared several methods for estimating students’ mastery. The methods covered were the proportion-correct score, the classical Model II estimate given by Jackson (1972), and a Bayesian (1972) estimate. The Bayesian solution provides a way of obtaining more precise estimates without requiring the administration of any additional test items.
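
    The general idea behind such estimates can be sketched as follows. The proportion-correct score uses only the examinee's own responses, whereas regression-type and Bayesian estimates also draw on group or prior information. The second and third functions below are simplified stand-ins (a reliability-weighted regression toward the group mean and a beta-binomial posterior mean); they are assumptions made for illustration and are not claimed to be the exact procedures compared in the study.

```python
def proportion_correct(n_correct, n_items):
    """Observed proportion of items answered correctly."""
    return n_correct / n_items


def regressed_estimate(n_correct, n_items, group_mean_proportion, reliability):
    """Weight the observed proportion and the group mean by test reliability."""
    observed = n_correct / n_items
    return reliability * observed + (1 - reliability) * group_mean_proportion


def beta_posterior_mean(n_correct, n_items, prior_alpha, prior_beta):
    """Posterior mean of a mastery score under a Beta(prior_alpha, prior_beta) prior."""
    return (prior_alpha + n_correct) / (prior_alpha + prior_beta + n_items)


# Hypothetical examinee: 6 of 10 correct, in a group averaging 0.8.
observed = proportion_correct(6, 10)
regressed = regressed_estimate(6, 10, 0.8, 0.5)
bayesian = beta_posterior_mean(6, 10, 8, 2)
```

    Both adjusted estimates pull the observed score toward the group or prior mean, which is how additional information is gained without additional items.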


    The University of Texas (1975), in the Adult Performance Level (APL) project summary, specified the competencies that are functional to economic and educational success in society and described devices developed for assessing those competencies. The APL theory of functional competency identifies adult needs in general knowledge areas (consumer economics, occupational knowledge, community resources, health, and government and law), in primary skills (communication skills, computation skills, problem-solving skills), and in interpersonal relations skills. Appended materials include additional notes on goals, objectives and tasks.


    Althey, I. (1975) designed a study to investigate the possibility of constructing a criterion-referenced measure of reading performance. The major inference of this longitudinal study was that criterion-referenced tests proved to be more sensitive measures of growth in reading achievement over a long-term period than norm-referenced standardized tests.


    Wilcox (1983) described and compared seven procedures for estimating the reliability of a CRT. The procedures were based on a single administration of a CRT scored with a latent structure model. Results suggested that the predictive estimate is the most accurate of the procedures.


    If the reliability of a CRT is very low, differences in observed scores can be attributed to errors of measurement rather than to differences in individuals’ levels of mastery of the domain. The analyses by Kane (1986) suggested that if the reliability (defined in terms of internal consistency) is much below 0.5, the test will not provide more accurate estimates of universe scores defined on a domain of items than would a simple a priori procedure based on group performance. Thus, Kane demonstrated the role of ‘classical’ reliability in estimating universe scores on domains of items. He also suggested three solutions for a CRT with low reliability: (a) lengthening the test; (b) defining the domain and item generation procedures more carefully; and (c) estimating the mean universe score for the group. It should be clear that the analyses presented by Kane do not apply to decision accuracy when a cut-off score is used to place students in mastery categories.
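
    The effect of solution (a), lengthening the test, can be approximated with the classical Spearman-Brown formula. This formula is a standard result of classical test theory and is used here as an assumption for illustration; it is not claimed to be the analysis Kane carried out. The function names are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def required_length_factor(reliability, target_reliability):
    """Factor by which a test must be lengthened to reach target_reliability."""
    if not 0 < reliability < 1 or not 0 < target_reliability < 1:
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (target_reliability * (1 - reliability)) / (
        reliability * (1 - target_reliability)
    )


# Hypothetical CRT with reliability 0.4, to be raised to the 0.5 threshold.
factor = required_length_factor(0.4, 0.5)
predicted = spearman_brown(0.4, factor)
```

    Under these assumptions, a test with reliability 0.4 would need to be about one and a half times as long to reach 0.5, provided the added items are comparable to the existing ones.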


    The research on the reliability of CRTs covered in the present review has addressed the following dimensions:


  1. Search for prototype reliability indices for CRTs with cut-off scores (Swaminathan et al., 1974).
  2. Use of ANOVA in estimating the reliability of CRTs (Lovett, 1977).
  3. A procedure for estimating the reliability of a composite of CRTs, where the parts of the composite have different cutting scores (Raju, 1982).
  4. Use of a single administration for estimating the reliability of CRTs (Wilcox, 1983).
  5. The role of the traditional reliability index in estimating universe (domain) scores on CRTs (Kane, 1986).


    Hambleton (1983) used IRT models to obtain accurate examinee domain score estimates and to increase the probability with which examinees are correctly assigned to mastery states on the basis of CRT scores. He compared one-, two-, and three-parameter logistic test models for estimating domain scores and making mastery/nonmastery decisions. The one- and three-parameter models produced highly comparable results for middle- and high-ability examinees, while for low-ability examinees the more general model always performed somewhat better.
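
    The logistic models compared in such work share one general form, P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))), where b is item difficulty, a is discrimination, and c is the lower asymptote. Setting c = 0 gives the two-parameter model, and additionally fixing a gives the one-parameter model. The sketch below computes a domain score estimate as the mean item probability and applies a cut-off; the function names, the item parameters, and the omission of a scaling constant are illustrative assumptions.

```python
from math import exp


def item_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the logistic IRT model.

    Defaults (a=1, c=0) give the one-parameter model; supplying a gives the
    two-parameter model; supplying a and c gives the three-parameter model.
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def domain_score(theta, items):
    """Expected proportion correct over the items that represent the domain."""
    return sum(item_probability(theta, **item) for item in items) / len(items)


def is_master(theta, items, cutoff):
    """Mastery decision: estimated domain score at or above the cut-off."""
    return domain_score(theta, items) >= cutoff


# Hypothetical two-item domain; the second item allows for guessing.
example_items = [{"b": 0.0}, {"b": 0.0, "c": 0.2}]
estimated_domain_score = domain_score(0.0, example_items)
```

    The lower asymptote matters most for low-ability examinees, whose chance of a correct response is dominated by guessing, which is consistent with the more general model fitting that group better.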


    Shanon, A. et al. (1987) noted that several recent papers have argued for the usefulness of item response theory (IRT) methods for assessing the discriminating power of items in criterion-referenced tests (CRTs). To give users information that may help them decide which conventional indices to employ in evaluating CRT items, Spearman rank-order correlations were computed between IRT-derived item information functions (IIFs) and four conventional discrimination indices. Correlations between the phi coefficient and the IIFs were very high, with a median of .96. The remaining conventional indices, with the exception of phi over phi max, also correlated well with the IIFs.
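
    As an illustration of a conventional index of this kind, the phi coefficient for a single item can be computed from a two-by-two table of item response (correct or incorrect) against mastery classification. The sketch below assumes that examinees have already been classified as masters or nonmasters; the function name and the counts are hypothetical and do not come from the study.

```python
from math import sqrt


def phi_coefficient(masters_correct, masters_incorrect,
                    nonmasters_correct, nonmasters_incorrect):
    """Phi coefficient between item response and mastery classification."""
    a, b = masters_correct, masters_incorrect
    c, d = nonmasters_correct, nonmasters_incorrect
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


# Hypothetical item: 8 of 10 masters and 3 of 10 nonmasters answer correctly.
item_phi = phi_coefficient(8, 2, 3, 7)
```

    Values near 1 indicate that the item separates masters from nonmasters well, and values near 0 indicate little discrimination.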


