Evaluation of Criterion Referenced Test : teaching online




Methods of Evaluating the Test Items

    The strength of any test lies in the quality of its individual items, and the quality of an item depends on the quality of its stem, key and distracters. No one sets out to write poor items, yet poor items are sometimes produced. It is well established that there have been many instances in which professional item writers misjudged the quality of their own items. This is not surprising, since there is no such thing as perfection in item writing. Checks and counterchecks are therefore needed to make the selection of items as scientific and rational as possible, and methods are needed for identifying good and poor items in criterion-referenced tests. The methods and procedures employed for criterion-referenced tests are as follows:

Evaluation by Subject Expert

     This approach to evaluating the quality of test items is based on human judgment. The use of subject specialists in judging the quality of test items is recommended. Its proponents hold that subject specialists can complete their ratings not only quickly but also with a high degree of reliability and validity. The method depends on local conditions such as the number of available judges, the number of test items, and so on. In this process of validation, an item is considered satisfactory if it is completely congruent with the test specification. If an item is found incongruent with any of the stipulations set forth in the test specification, the judges must label it as incongruent and record their reasons for doing so.

Empirical Method

     This method of checking the quality of test items is based on empirical evidence, in which one looks at item parameters such as the facility value and the discrimination index. In criterion-referenced tests the situation is altogether different from that in norm-referenced tests, because the emphasis is on bridging the gap between masters and non-masters, which ultimately decreases the response variance of the examinees. If instruction is improved further, the response variance will naturally become lower still.

Facility Value

    The facility value of an item is a simple statistic that indicates how easy or difficult the item has proved to be. It is determined by calculating the percentage of examinees who answered the item correctly: the higher the percentage, the easier the item, and vice versa. It is usually expressed as a percentage:

$$FV = \frac{R}{N} \times 100$$

where R = number of examinees who answered the item correctly,

N = total number of examinees attempting the item.

     Which facility value is to be regarded as acceptable depends, of course, on the kind of test one is trying to construct. The facility value ranges from zero to one hundred. Since criterion-referenced tests measure basic skills that are expected to be mastered by a high percentage of the examinees, items with facility values between 80 and 100 may be accepted as good items in such tests. A low facility value is not always due to poor item quality; sometimes it may be the result of ineffective teaching.
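The facility value calculation can be sketched in a few lines of Python. This is only an illustration: the function names `facility_value` and `acceptable_for_crt` and the sample counts are assumptions made for the example, not part of any standard library.

```python
def facility_value(right: int, attempted: int) -> float:
    """Return FV = (R / N) x 100, the percentage of correct answers."""
    if attempted <= 0:
        raise ValueError("attempted must be a positive count")
    if not 0 <= right <= attempted:
        raise ValueError("right must lie between 0 and attempted")
    # Multiply before dividing so whole-number percentages stay exact.
    return 100 * right / attempted


def acceptable_for_crt(fv: float, low: float = 80.0, high: float = 100.0) -> bool:
    """True when FV falls in the 80 to 100 band suggested in the text."""
    return low <= fv <= high


# Hypothetical item: 34 of the 40 examinees who attempted it were right.
fv = facility_value(right=34, attempted=40)
print(fv)                      # 85.0
print(acceptable_for_crt(fv))  # True
```

The 80 to 100 band is kept as a pair of default arguments so that a different band can be tried for a different kind of test.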

Discrimination Index

The discrimination index is widely used to identify the quality of items. This statistic shows the degree to which a particular item discriminates between bright and poor students. The methods suggested for estimating the discrimination index of criterion-referenced test items are as follows:

I. Upper and Lower Group Index

     This discrimination index may be estimated by subtracting the pass percentage of the lower group from that of the upper group. The groups may be formed from the upper 27% and lower 27%, the upper 33% and lower 33%, or the upper 50% and lower 50%, depending on the sample size available. The two groups should be as distinct as possible to give a better index, and as large as possible to reduce sampling error.

$$DI = \frac{R_u - R_l}{N}$$

where R_u = number of right responses in the upper group,

R_l = number of right responses in the lower group,

N = number of candidates in each group.

     For instance, let us say that in the upper group an item is answered correctly by .95 of the examinees, but in the lower group only .62 of the examinees answer it correctly. The difference, .33, is the item discrimination index.
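As a rough sketch, the upper and lower group index can be computed from each examinee's total score and a 0/1 record of whether the item was answered correctly. The function names, the tie handling (examinees with equal totals keep their input order) and the sample data below are all assumptions made for illustration.

```python
def split_groups(total_scores, fraction=0.27):
    """Return (upper, lower) lists of examinee positions ranked by total score."""
    n = max(1, round(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    return ranked[:n], ranked[-n:]


def upper_lower_index(item_correct, total_scores, fraction=0.27):
    """Discrimination index (R_u - R_l) / N for one item."""
    upper, lower = split_groups(total_scores, fraction)
    r_u = sum(item_correct[i] for i in upper)  # right responses, upper group
    r_l = sum(item_correct[i] for i in lower)  # right responses, lower group
    return (r_u - r_l) / len(upper)


# Hypothetical data: ten examinees, split into upper 50% and lower 50%.
totals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
print(upper_lower_index(item, totals, fraction=0.5))  # 0.4
```

With 100 examinees in each group, 95 right responses in the upper group and 62 in the lower group give (95 - 62) / 100 = 0.33, which matches the worked figure above.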

II. Masters-Non-masters Index

     Criterion-referenced tests are not expected to discriminate among all levels of competence but only between masters (who pass) and non-masters (who fail). Therefore, instead of taking the performance of top 27% and bottom 27% into account, we may compare the performance of masters and non-masters by putting all failures (non-masters) in the lower group and those who pass (masters) in the upper group.

    The discrimination index may be computed with the following formula:

$$DI = \frac{R_p}{n_p} - \frac{R_f}{n_f}$$

where R_p = number of examinees who passed the total test and answered the item correctly,

R_f = number of examinees who failed the total test and answered the item correctly,

n_p = number of examinees who passed the total test,

n_f = number of examinees who failed the total test.

     Since criterion-referenced tests are not expected to discriminate among all levels of competence, one need not strive for high discrimination indices in them, unlike in norm-referenced tests. Therefore, items with discrimination indices between zero and 0.15 may be accepted as good items, while items with negative discrimination indices may be dropped.
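The masters and non-masters index can be sketched in the same way. Here the pass/fail decision is assumed to come from a single cut score on the total test; the names `masters_nonmasters_index` and `accept_item`, the cut score and the data are hypothetical.

```python
def masters_nonmasters_index(item_correct, total_scores, cut_score):
    """Return R_p / n_p - R_f / n_f for one item."""
    passed = [c for c, s in zip(item_correct, total_scores) if s >= cut_score]
    failed = [c for c, s in zip(item_correct, total_scores) if s < cut_score]
    if not passed or not failed:
        raise ValueError("need at least one master and one non-master")
    return sum(passed) / len(passed) - sum(failed) / len(failed)


def accept_item(di, low=0.0, high=0.15):
    """True when the index lies in the zero to 0.15 band given in the text."""
    return low <= di <= high


# Hypothetical data: eight examinees, cut score of 6 on the total test.
totals = [9, 8, 8, 7, 5, 4, 3, 2]
item = [1, 1, 1, 1, 1, 1, 1, 0]
di = masters_nonmasters_index(item, totals, cut_score=6)
print(di)               # 0.25 (4/4 of masters minus 3/4 of non-masters)
print(accept_item(di))  # False, the index lies above the 0.15 upper bound
```

An item with a negative index would also fail `accept_item`, which corresponds to the rule that such items may be dropped.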

    For further reference, two more methods have also been proposed for estimating discrimination indices: the Bayesian method, based on Bayes' theorem, and the Rasch method, based on a unidimensional approach.

Reliability of Criterion-Referenced Tests


    The determination of the reliability of a criterion-referenced test (CRT) is still largely at the theoretical stage. As Stanley (1971) indicates, “criterion referenced measurements are meant to be used in situations in which there may be no variation among the true scores of the examinees, these measures are intended not to discriminate among persons, but to discriminate each person's score from a fixed criterion score”. When the variance of test scores is restricted, correlational estimates of reliability will be low. Classical estimates may support a test, but they are more likely to yield values that give a pessimistic picture of the precision of the measuring instrument. A CRT could be highly consistent without this consistency being reflected in classical reliability indices. Cotton (1971) considers that one means of dealing with this problem is to use the binomial error model; the other is the Bayesian model.


    Popham (1990) makes the following suggestion. Because criterion-referenced tests are often employed in connection with impending decisions about students and instructional programs, a decision-consistency approach to reliability, rather than a score-consistency approach, might sensibly be employed with such tests. In general, test developers who employ a decision-consistency approach would do well to provide decision-consistency percentages based on different cut-off levels. In the present investigation this can be done by the split-half method.
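One possible way to obtain decision-consistency percentages from split halves is sketched below. It assumes that consistency is measured as the simple percentage of examinees who receive the same pass/fail decision on both halves of the test; the text does not specify the statistic, so this choice, the function name and the half-test scores are all illustrative assumptions.

```python
def decision_consistency(half_a, half_b, cut_score):
    """Percentage of examinees given the same pass/fail decision on both halves."""
    if len(half_a) != len(half_b) or not half_a:
        raise ValueError("both halves need the same non-zero number of scores")
    same = sum((a >= cut_score) == (b >= cut_score)
               for a, b in zip(half_a, half_b))
    return 100 * same / len(half_a)


# Hypothetical half-test scores for eight examinees.
half_a = [9, 8, 7, 6, 5, 4, 8, 3]
half_b = [8, 9, 6, 7, 6, 3, 7, 5]

# Report the percentage at several cut-off levels, as the text recommends.
for cut in (5, 6, 7):
    print(cut, decision_consistency(half_a, half_b, cut))
# 5 87.5
# 6 87.5
# 7 75.0
```

Reporting the figure at several cut-off levels shows how sensitive the pass/fail decisions are to where the criterion is placed.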

Validity of Criterion-Referenced Tests

    Validity has been defined in different ways by different authors. Lindquist (1942) has said, “The validity of a test may be defined as the accuracy with which it measures that which it is intended to measure.” This means that to determine how valid a test is, one must compare the reality of what it does measure with some ideal conception of what it ought to measure. The validity of a test is the degree to which we know what the test measures. Validity information permits us to judge whether the test measures the right thing for our purpose. Ebel (1966) has given brief characterizations of several types of validity.

     “Concurrent validity is concerned with the relation of test scores to an accepted contemporary criterion of performance on the variable which the test is intended to measure”.

     “Construct validity is concerned with what psychological qualities a test measures and is evaluated by demonstrating that certain explanatory constructs account to some degree for performance on the test”.

     “Content validity is concerned with the adequacy of sampling of a specified universe of content”.

     “Curricular validity is determined by examining the content of the test itself and judging the degree to which it is a true measure of the important objectives of the course, or a truly representative sampling of the essential materials of instruction”.

     “Empirical validity refers to the relation between test scores and a criterion, the latter being an independent and direct measure of that which the test is designed to predict.”

     “Face validity refers, not to what a test necessarily measures, but to what it appears to measure”.

     “The factorial validity of a test is the correlation between that test and a factor common to a group of tests or other measures of behavior. Such validity is based on factor analysis”.

     “Intrinsic validity involves the use of experimental techniques other than correlation with a criterion to provide objective, quantitative evidence that the test is measuring what it ought to measure”.

     “Predictive validity is concerned with the relation of test scores to measures on a criterion based on performance at some later time”.

     “Validity by definition: for some tests the objective is defined solely in terms of the population of questions from which the sample comprising the test was drawn, for example when the ability to handle the one hundred number facts of addition is tested by a sampling of those number facts.”

     These types of validity are not distinctly different from each other. In fact, one or two of them are practically identical with one or two others. The following types of validity, as suggested by Singh (1983), can be used for competency tests:

Content Validity

     This means that test construction should ensure that the domain is clearly defined and that each item assesses only one thing at a time. All items should be relevant to the content elements of the domain and should adequately represent the segment of knowledge.

Description Validity

     This is the extent to which the test description adequately delimits the nature of a set of items, and the extent to which the test items are congruent with the domain definition.

Domain Selection Validity

     It refers to the relevance of the main attributes constituting the main description or the extent of congruence of measurable behaviors.

Functional Validity

    It refers to the degree to which a criterion referenced test performs a function in addition to describing a function.

Educational Objectives in the Cognitive Domain


    Knowledge is defined as the remembering of previously learnt materials. This represents the lowest level of learning outcomes. “Knowledge, as defined here, involves the recall of specifics and universals, the recall of methods and processes, or the recall of a pattern, structure, or setting.” (Bloom, 1956)


    Comprehension is defined as the ability to grasp the meaning of materials. This may be shown by translating materials from one form to another, by interpreting materials and by estimating future trends. These learning outcomes represent the lowest level of understanding.


     Application refers to the ability to use learned materials in new situations. This may include the application of such things as rules, methods, concepts, principles, laws and theories. Learning outcomes in this area require a higher level of understanding than those under comprehension.


    Analysis refers to the ability to break down material into its component parts so that its organizational structure may be understood. This may include the identification of parts, the analysis of relationships between parts and the recognition of the organizational principles involved. Learning outcomes here represent a higher intellectual level than comprehension and application because they require an understanding of both the content and the structural form of the material.


    Synthesis refers to the ability to put parts together to form a new whole. This may involve the production of a unique communication (theme or speech), a plan of operation (research proposal) or a set of abstract relations (scheme for classifying information). Learning outcomes in this area stress creative behavior, with major emphasis on the formulation of new patterns or structures.


    Evaluation is concerned with the ability to judge the value of material for a given purpose. The judgments are to be based on definite criteria. Learning outcomes in this area are highest in the cognitive hierarchy because they contain elements of all of the other categories plus value judgments based on clearly defined criteria.

About the Author

I am a blogger and have been doing internet marketing for the last 3 years. I am the admin of https://www.teachingonline.net and several other sites. Sincere thanks for your interest in teachingonline.net; we give our visitors' comments the utmost priority. More solved examples will be added shortly. Kindly let me know of any other requirements.