APPLICATIONS OF ITEM RESPONSE THEORY MODELS TO ASSESS ITEM PROPERTIES AND STUDENTS' ABILITIES IN DICHOTOMOUS RESPONSE ITEMS

A test is a tool meant to measure students' ability level and how well they can recall the subject matter, but the items making up a test may be defective.


INTRODUCTION
The main goal of testing is to collect information for making decisions about either students' abilities or the suitability of test items, and different types of information may be needed depending on the kind of decision to be made. Before Item Response Theory (IRT) came the development of Classical Test Theory (CTT), which was a product of Pearsonian statistics, the intelligence-testing movement of the first four decades of the 20th century, and its attendant controversies (Baker and Kim, 2004). Subsequently, Lord (1968) reformulated the basic constructs of CTT using a modern mathematical-statistical approach in which items and their characteristics played a minor role in the structure of the theory. Both psychometric theoreticians and practitioners had grown dissatisfied over the years with this discontinuity between the roles of items and test scores in CTT, holding that a test theory should start with the characteristics of the items composing a test rather than with the resultant scores (Brzezinska, 2017).
Two major theories underlying test development are CTT and IRT (Raykov, 2017). The former centres on reliability and carries substantial limitations: estimates of item parameters are group dependent; how easy or difficult an item functions changes as the sample changes; students' ability estimates are entirely test dependent and change from occasion to occasion, resulting in poor consistency; and p and r, which denote the difficulty index and the number of students who answer an item correctly respectively, depend on the sample of students taking the test. The latter (IRT) is more complicated than CTT: rather than looking at the reliability of the test as a whole, IRT looks at each item that makes up the test (Linden, 2018).

ITEM RESPONSE THEORY
An item is a single question or task on a test or an instrument, and Item Response Theory (IRT) is a theoretical framework organized around the concept of a latent trait. It comprises models and related statistical methods that relate observed responses on an instrument to a student's ability level. It focuses specifically on the items that make up the test, compares them, and then evaluates the extent to which the test measures the student's ability (Raykov and Marcoulides, 2018). IRT models are widely used today in the study of cognitive and personality traits, health responses, item bank development, and computerized adaptive testing (Paek and Cole, 2020). For instance, King and Bond (1996) applied IRT to measure anxiety in the use of computers among grade-school children; Mislevy and Wu (1996) used IRT to assess physical functioning in adults with HIV; Boardley et al. (1999) used IRT to measure the degree of public-policy involvement among nutrition professionals; a comparison of frequentist and Bayesian approaches to IRT (2020) found the Bayesian approach better at estimating three item properties along with students' abilities simultaneously; and Zeigenfuse et al. (2020) extended dichotomous IRT models to account for test-taking behaviour on matching tests, which violates the assumption of local independence. Bonifay and Cai's (2017) findings on the complexity of item response theory models revealed that the functional form of an IRT model, not goodness of fit alone, should be considered when choosing which model to use.
However, Suruchi and Rana (2015) identified two uses of item analysis: identifying defective test items, and identifying the areas that students have and have not yet mastered. IRT is a potent tool for checking flaws in items and finding ways to correct them before the items are finally administered; hence, item moderation needs to follow item analysis. Where an item cannot be moderated, it must be discarded and replaced. Ary et al. (2002) asserted that item analysis should use statistics that reveal important and relevant information for upgrading the quality and accuracy of multiple-choice items. Therefore, IRT plays a central role in the analysis and study of tests and item scores, in explaining student test performance, and in solving test-design problems for tests consisting of several items (Baker and Kim, 2004; Baker and Kim, 2017). A potent advantage of IRT over CTT that propelled us to use IRT is its treatment of reliability and error of measurement through item information functions, which are computed for each item (Hassan and Miller, 2019).

ASSUMPTIONS OF ITEM RESPONSE THEORY
The unidimensionality assumption of IRT implies homogeneity of the test items in the sense that they measure a single ability (Hambleton and Traub, 1973), so that any student's response pattern of 1's and 0's depends on one latent trait. Under local independence, a given student's responses to different test items are statistically independent once ability is taken into account; the implication is that test items are uncorrelated for students at the same ability level (Lord and Novick, 1968). The monotonicity assumption concerns the item response functions, which model the relationship between students' trait levels, item properties, and the probability of endorsing an item, and requires that this probability not decrease as the trait level increases (Rizopoulos, 2006; De Ayala and Santiago, 2016). Finally, the item invariance assumption implies that the item parameters estimated by an IRT model do not change even when characteristics of the students, such as age, change (Paek and Cole, 2019).
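As a minimal illustration of the monotonicity assumption, the following Python sketch (a hypothetical simulation, not the study's data) generates dichotomous responses from a logistic item response function and shows that the observed proportion of correct responses rises with ability:

import numpy as np

rng = np.random.default_rng(42)

def irt_prob(theta, a, b):
    # logistic item response function: P(correct | ability theta)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = rng.standard_normal(2000)                        # latent abilities
correct = rng.random(2000) < irt_prob(theta, a=1.2, b=0.0)

# group students into ability quartiles and compare proportions correct
quartile = np.digitize(theta, np.quantile(theta, [0.25, 0.5, 0.75]))
for q in range(4):
    print(q, correct[quartile == q].mean())              # increases with q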

STATEMENT OF PROBLEM
Every year, university teachers face the challenge of coping with increasing numbers of examination candidates, a challenge that multiple-choice items came to resolve in our educational setting; however, the absence of item analysis in developing these multiple-choice items undermines the integrity of assessment, selection, certification, and placement in our educational institutions. Improper use of item analysis leads to the same fate, while lopsided test items can lead to the wrong award of grades and certificates (Olukoya et al., 2018; Ary et al., 2002). Hundreds of secondary-school students take university entrance examinations, and their results determine entry into universities and possible alternatives (Eli-Uri and Malas, 2013; Cechova et al., 2014). Hence, the need to maintain the validity of tests using IRT models necessitates this study.

RESEARCH JUSTIFICATION
Professionally conducted item analysis, using appropriate statistics, reveals important and relevant information for upgrading the quality and accuracy of multiple-choice items. Its power lies in identifying defective items and the areas that students have and have not yet mastered, so that items can be corrected before they are finally administered, preserving the integrity of assessment, selection, certification, and placement in our educational institutions.

Method I: Rasch/One-parameter Logistic Model
The first model employed basically assesses how difficult an item is perceived to be by the test takers. It was proposed in 1966 by Georg Rasch, a Danish mathematician (Rasch, 1966), and is similar to the one-parameter logistic model (1PL) proposed by Birnbaum (1968). This model, given in equation (1), describes a test item in terms of only one parameter, the difficulty index. The probability that a student with ability \(\theta_k\) will correctly endorse an item with difficulty index \(b_g\) is

\[ P_g(\theta_k) = \frac{e^{a(\theta_k - b_g)}}{1 + e^{a(\theta_k - b_g)}} \quad (1) \]

where \(a\) is the discrimination index, denoting how an item discriminates among students, which is constrained under this model; \(b_g\) is the item difficulty parameter for item \(g\) (\(g = 1, 2, \ldots, n\)), denoting how students perceive the item; and \(\theta_k\) is student \(k\)'s ability (\(k = 1, 2, \ldots, N\)).
Under the model in equation (1), \(a\) is constrained (\(a = 1\) for the Rasch model, and \(a < 1\) for the one-parameter logistic model).
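A minimal Python sketch of equation (1), with illustrative parameter values rather than estimates from this study:

import math

def rasch_prob(theta, b, a=1.0):
    # probability of a correct response under equation (1); a = 1 gives the Rasch model
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

print(rasch_prob(theta=0.0, b=0.0))   # 0.5: average student meets average item
print(rasch_prob(theta=0.0, b=1.0))   # ~0.27: the same student on a harder item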

Method II: Two-parameter Logistic Model
The second model employed, the two-parameter logistic model in equation (2), measures how well an item discriminates between different ability levels near the inflection point of the item characteristic curve (ICC). It estimates varied item difficulty and discrimination indices simultaneously, which makes it useful for determining how items segregate students according to their ability levels. Theoretically, the discrimination index ranges over \((-\infty, \infty)\), but in practice negative discriminations are discarded. The model is obtained from equation (1) by allowing the discrimination parameter to vary across items (\(a_g\), \(g = 1, 2, \ldots, n\)). The probability that a student with ability \(\theta_k\) endorses item \(g\) correctly is

\[ P_g(\theta_k) = \frac{e^{a_g(\theta_k - b_g)}}{1 + e^{a_g(\theta_k - b_g)}} \quad (2) \]

where \(b_g\) and \(\theta_k\) are as defined in equation (1).
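A short sketch of equation (2) showing how the discrimination index controls the steepness of the ICC near its inflection point (illustrative values, not the study's estimates):

import math

def twopl_prob(theta, a, b):
    # probability of a correct response under equation (2)
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for theta in (-1.0, 0.0, 1.0):
    # a weakly vs strongly discriminating item with the same difficulty b = 0
    print(theta, twopl_prob(theta, a=0.5, b=0.0), twopl_prob(theta, a=2.0, b=0.0))

The weakly discriminating item separates these three ability levels far less sharply (probabilities near 0.38, 0.50, 0.62) than the strongly discriminating one (near 0.12, 0.50, 0.88).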

Method III: Three-parameter Logistic Model
Finally, the third model is the three-parameter logistic model in equation (3), which was used to estimate pseudo-guessing indices for the test items. This model describes items in terms of three parameters: difficulty, discrimination, and guessing indices (Lim, 2020). The probability of a correct response to item \(g\) by a student with ability \(\theta_k\) is determined by the item discrimination parameter \(a_g\), the item difficulty parameter \(b_g\), and the guessing parameter \(c_g\) (\(g = 1, 2, \ldots, n\)):

\[ P_g(\theta_k) = c_g + (1 - c_g)\,\frac{e^{a_g(\theta_k - b_g)}}{1 + e^{a_g(\theta_k - b_g)}} \quad (3) \]
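A sketch of equation (3), showing that the pseudo-guessing index \(c_g\) sets a lower asymptote, so even very low-ability students succeed with probability of at least roughly \(c_g\) (illustrative values):

import math

def threepl_prob(theta, a, b, c):
    # probability of a correct response under equation (3)
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

print(threepl_prob(theta=-4.0, a=1.5, b=0.0, c=0.2))  # ~0.20: the guessing floor
print(threepl_prob(theta=4.0, a=1.5, b=0.0, c=0.2))   # ~1.00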

PARAMETER ESTIMATION
To parameterize the models in equations (1), (2), and (3), let \(u_{gk}\) be the observed response to item \(g\) from student \(k\), taking \(u_{gk} = 1\) for a correct option and \(u_{gk} = 0\) for an incorrect option. The probability that the \(k\)th student with ability level \(\theta_k\) responds correctly to item \(g\) is given by equation (4):

\[ \Pr(u_{gk} = 1 \mid \theta_k, a_g, b_g, c_g) = c_g + (1 - c_g)\,\frac{e^{a_g(\theta_k - b_g)}}{1 + e^{a_g(\theta_k - b_g)}} \quad (4) \]

where \(a_g\), \(b_g\), and \(c_g\) are as defined in equations (1), (2), and (3). When the guessing parameter \(c_g\) is constrained to equal zero, equation (4) becomes equation (2) (the two-parameter logistic model); when both \(c_g = 0\) and \(a_g = a\) for all items (with \(a = 1\) or \(a < 1\)), equation (4) reduces to equation (1).
We fit the three-parameter model using the slope-intercept form in equation (5),

\[ \Pr(u_{gk} = 1 \mid \theta_k, \alpha_g, \beta_g, c_g) = c_g + (1 - c_g)\,\frac{e^{\alpha_g \theta_k + \beta_g}}{1 + e^{\alpha_g \theta_k + \beta_g}} \quad (5) \]

and the transformation between the two parameterizations is \(a_g = \alpha_g\) and \(b_g = -\beta_g / \alpha_g\). The slope \(\alpha_g\) (that is, \(a_g\)) can be constrained to be the same across all items. Let \(p_{gk} = \Pr(u_{gk} = 1 \mid \theta_k, a_g, b_g, c_g)\) and \(q_{gk} = 1 - p_{gk}\). Conditional on \(\theta_k\), since item responses are assumed to be independent, the likelihood of student \(k\)'s response vector is \(\prod_{g=1}^{n} p_{gk}^{u_{gk}}\, q_{gk}^{1 - u_{gk}}\), and integrating over the ability distribution gives the marginal likelihood in equation (9):

\[ L_k(\Omega) = \int_{-\infty}^{\infty} \prod_{g=1}^{n} p_{gk}^{u_{gk}}\, q_{gk}^{1 - u_{gk}}\, \phi(\theta)\, d\theta \quad (9) \]

where \(\phi(\cdot)\) denotes the density function of the standard normal distribution and \(\Omega\) collects the item parameters. For \(N\) students, the log likelihood is the sum \(\log L(\Omega) = \sum_{k=1}^{N} \log L_k(\Omega)\). However, the integral for \(L_k(\Omega)\) in equation (9) is generally not available in closed form, so we used numerical methods (adaptive Gauss-Hermite quadrature) implemented with Stata 16 SE on Windows 7.
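To make the estimation step concrete, the following Python sketch carries out marginal maximum likelihood for the Rasch special case of equation (9), using (non-adaptive) Gauss-Hermite quadrature on simulated data; it illustrates the method only and is not the Stata implementation used in the study:

import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Hermite rule
from scipy.optimize import minimize

def rasch_prob(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def neg_marginal_loglik(b, X, nodes, weights):
    # X: N x n matrix of 0/1 responses; b: n item difficulties
    P = rasch_prob(nodes[:, None], b[None, :])        # Q x n node probabilities
    ll = X @ np.log(P).T + (1 - X) @ np.log(1 - P).T  # N x Q log-likelihoods
    marg = np.exp(ll) @ weights                       # integrate over N(0, 1) ability
    return -np.sum(np.log(marg))

# quadrature nodes/weights, normalized to approximate the standard normal prior
nodes, weights = hermegauss(21)
weights = weights / weights.sum()

# simulate responses from known difficulties, then recover them
rng = np.random.default_rng(0)
true_b = np.array([-1.0, 0.0, 1.0])
theta = rng.standard_normal(500)
X = (rng.random((500, 3)) < rasch_prob(theta[:, None], true_b[None, :])).astype(float)

res = minimize(neg_marginal_loglik, x0=np.zeros(3), args=(X, nodes, weights))
print(res.x)  # estimated difficulties, close to true_b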

RESULTS AND DISCUSSION
IRT models are useful in test development because they supply indices of item difficulty, discrimination, and guessing that can be matched to the ability level of a target population. The estimated item difficulty indices (in descending order) from equation (1) for selected items are presented in Table 1, together with their indices of precision (SE), the probability of an average student correctly endorsing each item (Prob), the confidence intervals of the estimates, and a remark on each item's suitability; interpretations of the difficulty indices are displayed in Table 2. The graphical evidence in Figures 1 and 2 buttresses the fact that items 11, 34, and 5 were mainly for less able students, since these items required a low ability level for correct endorsement. Students only needed to be between -5.2 and -4.7 on the trait scale to correctly endorse items 11 and 34 respectively.
On the basis of equation (1), item 8 provided more of its information on able students, while item 7 provided information on students on both sides of its location point (higher- and lower-ability students), as displayed in Figure 2. Difficulty indices describe where an item functions along the ability scale, and the model suggests, through their difficulty indices, that items 15, 5, 3, 13, 28, 34, 23, and 11 displayed in Table 1 need attention, as remarked.
Application of equation (2), which describes test items in terms of two item properties, yields the output displayed in Table 3, with the confidence intervals, the probability of correct endorsement by an average student, the indices of item discrimination (in descending order) that classify students according to their ability, and the precision of the estimates (SE). Item 29 was identified as the most discriminating (a = 1.7889), though not especially difficult (b = -0.8064). An average student would endorse this item correctly with probability 0.7333, meaning about 73% of the students would most likely endorse the correct option; students only need to be at -0.8 on the ability scale to endorse the correct option. Item 29 was followed by item 34, and so on in that order. Another item that drew attention in Table 3 was item 7, which did not discriminate (a = 0.01505) between students of different ability levels. Interpretations of the discrimination indices, not minding the difficulty indices, are displayed in Table 4 (Ebel and Frisbie, 1991). Under the model in equation (2), where items had varied discriminations, a student would have to be at an extremely high ability level (b = 32.6302) to endorse the correct option to item 7, an indication that the item needs attention: either it was poorly written or it contained misinformation. The probability that an average student endorsed item 7 correctly was 0.3697; that is, about 37% of the students got the item. In order of least item discrimination were items 7, 9, 6, and so on. If the purpose of the instrument is to segregate students into those who have and have not yet mastered the material, Table 3 suggests that the items remarked "poor" are likely to be defective and need attention based on their discrimination indices; a poor item should be eliminated or moderated. The graphical displays in Figures 3 and 4 agree with Table 3 in that item 7 did not discriminate and carried no useful information about students, followed by item 5, which was also uninformative.
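The pattern in Figures 3 and 4 can be reproduced in miniature from the two-parameter estimates in Table 3, since under the 2PL the item information function is \(I(\theta) = a^2 P(\theta)(1 - P(\theta))\); the following Python sketch contrasts item 29 and item 7 using the a and b values reported above:

import numpy as np

def twopl_info(theta, a, b):
    # 2PL item information: a^2 * P * (1 - P), peaked at theta = b
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta = np.linspace(-4, 4, 9)
print(twopl_info(theta, a=1.7889, b=-0.8064))   # item 29: a sharp, informative peak
print(twopl_info(theta, a=0.01505, b=32.6302))  # item 7: essentially flat, near zero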
Item 29 was highly informative about students on both sides of the ability continuum, while item 8, which was perceived as the most difficult under equation (1), here gave very little information, and only on high-ability students. A further application of the three-parameter logistic model in equation (3) to the item responses describes each item in terms of three item properties and yields the results in Table 5, with indices of guessing and difficulty estimated while the discriminatory power of all items is held constant. The results suggest that item 5 was the item most likely guessed by the students: an average student would endorse it correctly with probability 0.8649, its pseudo-guessing index was 0.9855, and a student would have to be at a high ability level (0.3029) to select the correct option on merit, yet despite the item's difficulty it was guessed correctly. To improve the quality of the multiple-choice items, items such as 5, 23, and 3, remarked "poor" and suggested to be defective because of their high pseudo-guessing indices, must be revisited and moderated. The remaining items in Table 6 had low pseudo-guessing indices, meaning that students most likely did not guess them; Figures 5 and 6 agree with Table 5 on this, and unsurprisingly these items provided a reasonable amount of information graphically. A careful review of the figures attests that item 5 needs attention, as suggested earlier, its item information function being too narrow; a poor item should be eliminated or moderated. In conclusion, all items must be trial-tested to identify flawed items and make the necessary corrections, a collaborative effort of test developers and psychometricians aimed at improving the quality of selection, certification, and graduates in higher institutions of learning.

CONFLICT OF INTEREST
There is no conflict of interest.