Evaluating Instrument Quality: Rasch Model Analyses of the Post Test of Curriculum 2013 Training. Komalasari, Lembaga Penjaminan Mutu Pendidikan, Kalimantan Tengah, Indonesia

The main purpose of this study was to evaluate the quality of the post test utilized by LPMP Central Kalimantan, Indonesia, in the curriculum 2013 training for X grade teachers. It uses Rasch analysis to explore the item fit, the reliability (item and person), the item difficulty, and the Wright map of the post test. This study also applies Classical Test Theory (CTT) to determine item discrimination and distractors. Following a series of iterative Rasch analyses that adopted the "data should fit the model" approach, the 30-item post test of the curriculum 2013 training was analyzed using ACER ConQuest 4, software based on the Rasch measurement model. All items of the post test show sufficient fit to the Rasch model. The difficulty levels (i.e., item measures) for the 30 items range from -1.746 logits to +1.861 logits. The item separation reliability is acceptable at 0.990, while the person separation reliability is low at 0.485. The Wright map indicates that the test is difficult for the teachers, or that the teachers have low ability in knowledge of curriculum 2013; the post test items cannot cover the full range of the teachers' ability levels. The item discrimination of the post test items falls into two groups: fair discrimination (items 2, 4, 5, 8, 11, 18) and poor discrimination (items 1, 3, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30). Some distractors from items 1, 2, 6, 7, 8, 9, 11, 13, 16, 17, 18, 19, 20, 22, 24, 25, 27, 28, 29, and 30 are problematic. These distractors require further investigation or revision.


Introduction
A new national curriculum for primary and secondary schools was implemented by the Indonesian Government in 2013. The new curriculum, named curriculum 2013, is an improvement of the previous curriculum, known as curriculum 2006. There were several reasons for changing the former curriculum, such as internal challenges, external challenges, the need to change mindsets, strengthening curriculum management, and reinforcement of material.
The curriculum 2013 has been implemented gradually, starting from academic year 2013/2014 until academic year 2018/2019. For academic year 2017/2018, the implementation of curriculum 2013 was extended to every regency/city in Indonesia, covering 4,510 senior high schools, or about 60%, including Central Kalimantan Province. To implement curriculum 2013, the teachers are trained in an activity called "curriculum 2013 training". In every province of Indonesia, the curriculum 2013 training is conducted by LPMP (Educational Quality Assurance Institution). The funds for the curriculum 2013 training come from the state expenditure and revenues budget (APBN).
The curriculum 2013 training is conducted in order to improve the competence of teachers and to prepare for the implementation of curriculum 2013.
The objective of this training is to improve the participants' competence in: a) understanding the dynamics and policies of curriculum development, the policy of strengthening character education, and the application of literacy in learning; b) analyzing the learning objectives, material, learning, and assessment, covering the documents on learning objectives and subject guidelines, the material in textbooks, the implementation of learning models, and the assessment of learning; c) designing the learning implementation plan with 21st century skills (critical thinking, creativity, communication, and collaboration); d) carrying out Higher Order Thinking Skills (HOTS) learning practice and assessment and reviewing the practice results; and e) practicing the process and reporting of the assessment of learning and the introduction to the e-raport application. Assessment should be precise, technically sound, and produce accurate information for decision making in all circumstances (Dubois and Rothwell 2000; Stetz and Chmielewski 2015, in Nornazira S., 2015). It is therefore very important to check the credibility of the instrument utilized in the curriculum 2013 training at LPMP Central Kalimantan. Thus, this study concerns analyzing the post test items of the curriculum 2013 training in order to evaluate the quality of the post test. This is beneficial to ensure that the assessment of teachers in the curriculum 2013 training using the post test instrument can give accurate information about teacher competency.
The analysis is based on the item response theory approach using the Rasch model. The Rasch measurement model was chosen for this study because it offers a sophisticated approach to evaluating patterns of item responses, scale performance, and item performance (Linacre 2002; Bond and Fox, 2015; Chen et al. 2014, in Nornazira S., 2015). This study also uses Classical Test Theory (CTT) to analyze item discrimination and distractors.

Methods
In order to maintain the accuracy of the curriculum 2013 training post test instrument, it is very important to analyze it. The study emphasized four aspects of Rasch analysis diagnoses: (i) item and person reliability, (ii) item fit, (iii) item difficulty, and (iv) the Wright map. This study also used CTT for two aspects: (i) item discrimination, and (ii) distractors. The post test instrument of the curriculum 2013 training was administered to 711 teachers who followed the curriculum 2013 training in 2017. The data for the post test item analysis were secondary data from the Educational Quality Assurance Institution (LPMP) of Central Kalimantan, Indonesia.
This study was a population study. A summary of the data obtained can be seen in the following table. The post test of the curriculum 2013 training was analyzed using ACER ConQuest 4, software based on the Rasch measurement model. The analysis covered:

Item Fit
Bond and Fox (2015) describe the concept of fit as "a quality-control mechanism" (akin to the use of fit in industrial statistics). Fit statistics provide one indication as to whether the researcher has completed a task of sufficient quality to allow the values for persons and items to be represented with interval-level measures. Fit indices help the investigator to ascertain whether the Rasch requirement for unidimensionality holds up empirically. Fit statistics help to determine whether the item estimations may be held as meaningful quantitative summaries of the observations.
Item fit statistics (i.e., infit/outfit and ZSTD) show which items fit the estimated model (Neumann, Neumann and Nehm, 2011). Infit and outfit statistics adopt slightly different techniques for assessing an item's fit to the Rasch model.
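As an illustration of how these two statistics differ for dichotomous data, the following sketch (in Python with NumPy; an illustration of the standard formulas, not part of the ConQuest workflow used in this study) contrasts the unweighted (outfit) and information-weighted (infit) mean squares:

```python
import numpy as np

def rasch_prob(theta, delta):
    """Expected probability of a correct response under the dichotomous
    Rasch model, for person abilities theta and item difficulties delta."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))

def fit_mean_squares(X, theta, delta):
    """Return (infit, outfit) mean squares per item for a persons-by-items
    0/1 response matrix X. Outfit is the plain mean of squared standardized
    residuals; infit weights each residual by its model variance, making it
    less sensitive to unexpected responses from far-off persons."""
    P = rasch_prob(theta, delta)
    W = P * (1.0 - P)                       # model variance of each response
    Z2 = (X - P) ** 2 / W                   # squared standardized residuals
    outfit = Z2.mean(axis=0)                # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)  # weighted mean square
    return infit, outfit
```

Values near 1.0 indicate good fit; the 0.70 to 1.30 working range applied later in this study refers to exactly these quantities.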

Reliability
There are two kinds of reliability in the Rasch model: the person reliability index and the item reliability index. The person reliability index indicates the replicability of the person ordering we could expect if this sample of persons were given another, parallel set of items measuring the same construct (Wright and Masters, 1982, in Bond and Fox, 2015). The item reliability index indicates the replicability of item placements along the pathway if these same items were given to another same-sized sample of persons who behaved in the same way (Bond and Fox, 2015).
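Conceptually, both indices compare the spread of the estimated measures with their measurement error. A minimal sketch of the standard separation-reliability formula (an assumption about the usual computation, not necessarily ConQuest's exact implementation):

```python
import numpy as np

def separation_reliability(measures, std_errors):
    """Rasch separation reliability: the proportion of observed variance in
    the estimated measures (person abilities or item difficulties) that is
    attributable to true differences rather than measurement error."""
    observed_var = np.var(measures)              # variance of the estimates
    error_var = np.mean(np.square(std_errors))   # mean square standard error
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var
```

A value near 1.0 means the ordering of measures would be closely reproduced with a parallel sample, while a low value means the ordering is unstable.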

Wright Map
Rasch analysis software produces some form of map, often called a Wright map. The Wright map reports the relations between two key aspects of the variable: the item difficulty estimates and the person ability estimates. One delightful aspect of this Rasch representation of data analysis is that many of the person and item relations are shown in meaningful pictorial, or "map", form (Bond and Fox, 2015). The distribution of persons (on the left) and the items of the instrument (on the right) are displayed on the same so-called logit scale.

Item Discrimination Index
A good question is one that can distinguish between students with high ability and students with low ability. The index that measures this difference is the item discrimination. The item discrimination can be determined using the discrimination index, the biserial correlation index, the point biserial correlation index, or the alignment index. In this sense, the item discrimination of a question is essentially the same as the question's validity.
According to Rahmah Zulaiha (2012), the item discrimination is the difference between the proportions of correct answers in the group of students with high ability (upper group) and the group of students with low ability (lower group). The item discrimination ranges from -1 to +1. A negative sign means that more students in the low-ability group than in the high-ability group answered the item correctly. The item discrimination index for item i, d_i, is calculated by the formula (Allen & Yen, 1979): d_i = p_iU - p_iL, where p_iU is the proportion of correct answers to item i in the upper group and p_iL is the proportion in the lower group.
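This upper-minus-lower computation can be sketched as follows (the 27% group size is a common convention assumed here, not a value taken from the source):

```python
import numpy as np

def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Item discrimination d_i: proportion correct on the item in the
    top-scoring group minus that in the bottom-scoring group.
    The 27% group size is a common convention, assumed here."""
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores)
    n_group = max(1, int(len(total_scores) * fraction))
    order = np.argsort(total_scores)             # lowest scores first
    lower = item_correct[order[:n_group]]
    upper = item_correct[order[-n_group:]]
    return upper.mean() - lower.mean()           # ranges from -1 to +1
```

An item answered correctly by every top-group examinee and by no bottom-group examinee yields d_i = 1; a negative value flags the inverted pattern described above.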

Distractor Effectiveness
According to Saifuddin Azwar (1996), item difficulty and item discrimination are not sufficient to assess the quality of a test item; how the examinees' answers are distributed over the available answer options also needs to be noted so that the function of a test item can be fulfilled maximally.
The effectiveness of the distractors of a test item is analyzed from the distribution of answers over the provided alternatives. Distractor effectiveness is reviewed to know whether every distractor, that is, every answer option other than the answer key, has functioned as intended. This means those distractors should be chosen by most or all examinees from the low group, while only a few or none of the examinees from the high group choose them.
A distractor can be said to function well if it is chosen by at least 2.5% of examinees (P_J >= 0.025) (Rahmah Zulaiha, 2012). The distribution of answer options is obtained through the following formula: P_J = n_J / n, where P_J is the answer distribution (proportion) for a certain answer option J, n_J is the number of students who chose option J, and n is the total number of students.
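The answer-distribution calculation and the 2.5% working rule can be sketched as:

```python
def distractor_proportions(choices, options=("A", "B", "C", "D"), threshold=0.025):
    """Return, for each answer option, the proportion of examinees choosing
    it (P_J = n_J / n) and whether it meets the 2.5% functioning threshold."""
    n = len(choices)
    return {opt: (choices.count(opt) / n, choices.count(opt) / n >= threshold)
            for opt in options}
```

Any option flagged False is a non-functioning distractor that would warrant the kind of revision discussed later in this study.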
Further, the second analysis was conducted without misfit persons, with 665 cases. The indicator properties of the test are presented in Table 2. The evaluation of goodness of fit to the Rasch model for the post test items of curriculum 2013 was based on the weighted and unweighted mean square (i.e., infit and outfit mean square) statistics and the t-values for each item. As a working guide, infit and outfit values ranging from 0.70 to 1.30 are considered fit, while t-values between -2.00 and +2.00 are considered acceptable, at the 95% confidence level. Table 3 shows that the difficulty levels (i.e., item measures) for the 30 items range from -1.746 logits to +1.861 logits, with standard errors from 0.079 to 0.134 logits. The unweighted and weighted fit MNSQs range from 0.96 to 1.11, with t-values within -2.0 to +2.0, indicating sufficient fit to the Rasch model. Appendix 1 shows the 30 plots produced by the plot icc command. Each ICC plot shows a comparison of the empirical item characteristic curve (the broken line, based directly upon the observed data) with the modelled item characteristic curve (the smooth line).
The item separation reliability is 0.990, which is acceptable. The person separation reliability is low at 0.485. The low person reliability indicates that the individuals who took this test would be likely to get different estimated ability scores if a parallel test were given to them. Figure 1 displays the Wright map, which visually summarizes several aspects of the Rasch analysis: the latent distributions and the response model parameter estimates for the 30 items. Teachers are placed on the left side of the scale according to their ability in knowledge of curriculum 2013, and the item difficulty indicators are shown on the right side. The teachers with the highest ability levels and the items with the highest difficulty levels are located at the top of the map, while the teachers with the lowest ability levels and the easiest items are located at the bottom. The distribution of persons (on the left) and the items of the instrument (on the right) are displayed on the same so-called logit scale. Persons at the same position (or "height") on the scale as a particular item have a 50% chance of answering the item correctly, and questions of equivalent difficulty lie at the same point on the logit scale (e.g., Questions 4 and 21; 7 and 8). Individuals ("persons") located above an item have a greater chance of answering the item correctly (i.e., the item is likely to be easier for such individuals), while persons located below an item have a lower probability of answering it correctly (i.e., the item is more difficult for them).
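The 50% rule used to read the Wright map follows directly from the Rasch model, since ability and difficulty sit on the same logit scale:

```python
import math

def p_correct(theta, delta):
    """Probability that a person with ability theta (logits) answers an item
    of difficulty delta (logits) correctly under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))
```

A person level with an item (theta equal to delta) has exactly a 0.5 chance of success, a person above the item has a higher chance, and a person below it a lower one.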

Wright Map
The mean of the item measures is shown as 0 on the map. It can be seen that the distribution of the teachers' ability tends to be lower than the mean. Only a small portion of the teachers are able to answer the difficult items (from item 7 to item 13) correctly. Items 24, 30, 17, 19, 6, and 22 are considered too difficult for the teachers, and none of the teachers were able to answer them correctly. There is also one noticeable teacher with low ability who was not able to answer any of the items correctly. Overall, the test is difficult for the teachers, or the teachers have low ability in knowledge of curriculum 2013. The post test items cannot cover the full range of the teachers' ability levels.
Based on the facts above, the curriculum 2013 training for X grade teachers implemented by LPMP Central Kalimantan in 2017 could not successfully improve the ability of the teachers. This could be caused by the lack of training facilitators and the lack of teacher competency. It is very important to strengthen the mentoring activities for curriculum 2013 in every school.
According to Penn (2009), the Point Biserial Index (PBI) is defined as the correlation between the score on an item and the score on the exam; it differentiates between those who have high and low test scores and ranges from -1 to +1. A positive PBI indicates that those who scored well on the exam answered the item correctly; the PBI should be positive for the correct answer and negative for the distractors. As a working guide, this study uses the general rules for the PBI: below 0.2: poor, revise the item; 0.2-0.29: fair; 0.3-0.39: good; 0.4-0.7: very good (Penn, 2009; McGahee & Ball, 2009).
The results of the analysis of item discrimination and distractors using ACER ConQuest 4 for the post test items of curriculum 2013 training are summarized in the following table. For each item, the table reports the item discrimination, the frequency or percentage for the correct answer and the distractors, and an interpretation. This information is used to determine the quality of each item.
The item discrimination, represented by a Point Biserial value of 0.1, is very low (below 0.2), indicating poor discrimination. The positive PBI (0.09) on distractor D indicates that teachers who performed well on the post test selected it. This is problematic. For the next item, the item discrimination, represented by a Point Biserial value of 0.1, is also very low (below 0.2), indicating poor discrimination; the positive PBI (0.03) on distractor A indicates that teachers who performed well on the post test selected it. This is problematic, and the item should be examined and revised. For the following item, the item discrimination, represented by a Point Biserial value of 0.11, is very low (below 0.2), indicating poor discrimination.
The positive PBI (0.14) on distractor B indicates that teachers who performed well on the post test selected it. This is problematic, and this item should be examined and revised. The item discrimination represented by a Point Biserial value of 0.15 is very low (below 0.2), indicating poor discrimination; the positive PBI (0.08) on distractor B indicates that teachers who performed well on the post test selected it, so distractor B is problematic and this item should be examined and revised. The Point Biserial value of 0.12 for the correct answer indicates poor discrimination; distractors A, C, and D are considered good distractors since their PBI values are negative. The Point Biserial value of 0.14 for the correct answer indicates poor discrimination for this item; distractors C and D are considered good distractors since their PBI values are negative, while distractor A is problematic since the majority of the students chose this incorrect answer (43.87%) and it has a positive Point Biserial value. The Point Biserial value of 0.11 for the correct answer indicates poor discrimination for this item; distractors C and D are considered good distractors since their PBI values are negative, while the positive PBI (0.03) on distractor A indicates that teachers who performed well on the post test selected it, so distractor A is problematic. The Point Biserial value of 0.13 for the correct answer indicates poor discrimination for this item; distractors B and D are considered good distractors since their PBI values are negative, while the positive PBI (0.05) on distractor A indicates that teachers who performed well on the post test selected it, so distractor A is problematic. The Point Biserial value of 0.24 for the correct answer indicates fair discrimination for this item; distractors B and C are considered good distractors since their PBI values are negative, while the positive PBI (0.01) on distractor D indicates that teachers who performed well on the post test selected it.
Distractor D is problematic. The Point Biserial value of 0.02 for the correct answer indicates poor discrimination for this item; distractors A and C are considered good distractors since their PBI values are negative, while the positive PBI (0.14) on distractor D indicates that teachers who performed well on the post test selected it, so distractor D is problematic. The Point Biserial value of 0.06 for the correct answer indicates poor discrimination for this item; distractors B and D are considered good distractors since their PBI values are negative, while the positive PBI (0.03) on distractor A indicates that teachers who performed well on the post test selected it, so distractor A is problematic. The Point Biserial value of 0.02 for the correct answer indicates poor discrimination for this item; distractor C is considered a good distractor since its PBI value is negative, while the positive PBIs (0.01 and 0.04) on distractors A and B indicate that teachers who performed well on the post test selected them, so distractors A and B are problematic. The Point Biserial value of -0.09 for the correct answer is problematic, as low-performing teachers also selected this answer; this item did not discriminate well. Distractors C and D are considered good distractors since their Point Biserial values are negative, while distractor A is problematic since the majority of the teachers (55.91%) chose this incorrect answer, showing a non-significant Point Biserial value of 0.12. Options A and B require further investigation or revision.
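The option-level PBI values discussed above are plain point-biserial correlations between a "chose this option" indicator and the total score. A minimal sketch:

```python
import numpy as np

def point_biserial(chose_option, total_scores):
    """Point-biserial correlation between a 0/1 indicator (e.g. 'examinee
    chose option D') and total test scores. A positive value for a
    distractor means high scorers tended to select it, flagging the option."""
    x = np.asarray(chose_option, dtype=float)
    y = np.asarray(total_scores, dtype=float)
    p = x.mean()                      # proportion choosing the option
    m1 = y[x == 1].mean()             # mean score of those who chose it
    m0 = y[x == 0].mean()             # mean score of the rest
    return (m1 - m0) / y.std() * np.sqrt(p * (1.0 - p))
```

This is algebraically identical to the Pearson correlation between the indicator and the scores, which is how the sign rules for keys and distractors arise.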
Based on Table 35, the items of the post test of curriculum 2013 training can be grouped into two categories: fair discrimination and poor discrimination. Table 33 shows the items for each category: fair discrimination covers items 2, 4, 5, 8, 11, and 18, while poor discrimination covers items 1, 3, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, and 30. Overall, the post test does not have good discrimination. Item 30 is the most problematic item, as described by the MCC plot in Figure 2 below, which shows the proportion of students in each of a sequence of ten ability groupings that responded with each of the possible responses. The plot indicates that distractor A is problematic, since the majority of the teachers (55.91%) chose this incorrect answer, showing a non-significant Point Biserial value of 0.12. This item did not discriminate well. The negative PBI for the correct answer B indicates that students who performed poorly on the exam answered it correctly, while the positive PBI (0.12) on distractor A indicates that students who performed well on the exam selected it. This item may have the wrong answer key, or the item itself may be wrong. Some distractors are problematic (see Table 37); these distractors require further investigation or revision.

Conclusion
As discussed earlier, the Rasch measurement analysis was initially conducted with the 30 items of the post test of curriculum 2013 training. The study emphasized four aspects of Rasch analysis diagnoses, namely (i) item and person reliability, (ii) item fit, (iii) item difficulty, and (iv) the Wright map, and two aspects based on CTT: (i) item discrimination, and (ii) distractors.
The item separation reliability is 0.990 and is acceptable. The person separation reliability is low at 0.485. All items of the post test of curriculum 2013 training show sufficient fit to the Rasch model, indicated by unweighted and weighted fit MNSQs ranging from 0.96 to 1.11, with t-values within -2.0 to +2.0.
The difficulty levels (i.e., item measures) for the 30 items range from -1.746 logits to +1.861 logits, with standard errors from 0.079 to 0.134 logits.
The Wright map indicates that the distribution of the teachers' ability tends to be lower than the mean. The test is difficult for the teachers, or the teachers have low ability in knowledge of curriculum 2013. The post test items cannot cover the full range of the teachers' ability levels.
Post test analysis of the curriculum 2013 training using the Rasch model is very important for understanding the characteristics of the test items. Based on the results of the analysis, improvements and further development can be made to the post test instruments used in future curriculum 2013 training.
In addition, the results of the analysis show that the test is difficult for the teachers, or the teachers have low ability in knowledge of curriculum 2013. The Educational Quality Assurance Institution of Central Kalimantan needs to reinforce teachers, especially on the difficult materials (items 6, 22, 24, 30, 17, 19, 3, 27), through the curriculum 2013 assistance activities in schools.
The findings of this study could provide a better knowledge basis for interpreting teachers' assessment results in other trainings.