Quasi-adaptive version of SAM tests by Anton Agapov, READ Program Expert
For several years we have been developing the SAM (Student Achievement Monitoring) test of academic competencies and the theoretical framework on which it is based. The SAM framework builds on L.S. Vygotsky's model of cultural development and is designed to capture the mastery of subject concepts at one of three levels: formal, reflective, and functional. Each of these levels corresponds to a specific type and structure of test task. Three tasks of different levels, all testing the mastery of one concept, form a content block.
Formal level
The general criterion for the formal level of mastery of subject content is an orientation toward the external characteristics of the problem situation and toward sample solutions. Action at this level consists of assigning a problem, on the basis of its external features, to a known class (type) with which a ready-made solution scheme (algorithm, technique, rule) is associated.
Indicators of the first level are tasks similar to those teachers use to present and initially practice individual methods of action. Problems of this kind are recognized by students, from their appearance alone, as belonging to a certain class with a known solution scheme (for example, in mathematics: problems “on addition with carrying”, “on objects moving toward each other”, etc.). Such tasks are called standard, or typical. They are distinguished by relatively simple content, unambiguous conditions, and transparent wording.
The first level also includes formally more complex tasks built on the basis of standard ones: for example, tasks that present several situations or objects and ask the student to determine those to which a certain rule of action is applicable (or, conversely, not applicable), or problems whose conditions contain a ready-made sample solution that needs to be reproduced on similar material, i.e., a simple transfer.
In all of these cases the objective relations that are essential for the solution are tied to the external characteristics of the problem situation and, as such, do not require conscious (reflective) establishment.
Reflective level
The general criterion for reflective mastery of subject matter is the ability to orient oneself toward the essential relationships connecting the elements of the task situation. Action at this level includes analyzing the conditions, identifying (modeling) the structure of essential relations (the “skeleton” of the problem), and determining a specific solution scheme on this basis. In other words, the action relies on the internal (essential) plane of the task.
Indicators of this level are tasks that cannot be solved by the direct application of standard rules or procedures and that require the student to construct a scheme (program) of action independently, based on an analysis of the conditions and the identification of essential relationships. These include tasks whose solution involves means of modeling essential relationships (schemes, drawings, formulas, etc.), some transformation of the conditions to bring the problem to a more convenient or standard form, reversal of standard action schemes (switching from a direct to a reverse train of thought, for example from what is sought to the conditions), and so on.
Functional level
The general criterion for mastering the material at the third level is the ability to orient oneself toward the field of variable possibilities of the general method of action – the functional field. The key point of action at this level is the reconstruction of this field and its fitting to the conditions of the problem, i.e., a thought experiment.
The initial version of the SAM test was developed to diagnose educational and subject competencies on the material of the Russian language and mathematics. The test for each subject consisted of 45 tasks combined into three-level blocks and covered the content of the primary school curriculum.
The main complaint about the resulting diagnostic tool was how labor-intensive it was to use. SAM was designed as an international comparative study in the CIS, and accordingly quite high requirements were placed on its validity and reliability. This, in turn, made it necessary to increase the number of tasks and the time allowed for the test. In the original version, completing the test took at least two hours.
Given that the test is designed for the end of primary school, that is, either the end of the fourth grade or the beginning of the fifth, schools had difficulty organizing testing: it is rarely possible to allocate that much time in the schedule to one subject, double lessons are rare in primary schools, and the curriculum is difficult to adjust on the fly. A request therefore arose to design a version of the test that would require no more than one academic hour per subject, yet still give the teacher and the school information about the situation in the classroom. Thus the task was set of designing and studying shortened versions of the test.
The first such version contained only 15 tasks, selected by statistical methods as those most strongly associated with the result on the original test; when calculating the results, the tasks were assigned appropriate weights to improve the prediction. This shortened version predicted the result of the original test satisfactorily, but its accuracy was still insufficient for mass application and for high-quality interpretation of the results.
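The actual item set and weights are not reproduced here, but a weighted short-form prediction of this kind can be sketched as follows (the intercept, weights, and task names below are hypothetical placeholders, not the SAM coefficients):

```python
# Hypothetical sketch of a weighted 15-item short form predicting the 45-point
# original score. The intercept and weights are illustrative placeholders only.

ITEM_WEIGHTS = {
    f"task_{i:02d}": w
    for i, w in enumerate(
        [2.1, 1.8, 2.5, 1.6, 2.0, 1.9, 2.3, 1.7, 2.2, 1.5, 2.4, 1.8, 2.0, 1.9, 2.1],
        start=1,
    )
}
INTERCEPT = 5.0  # placeholder

def predict_full_score(responses: dict[str, int]) -> float:
    """Predict the 0-45 original-test score from 0/1 responses to the 15 tasks."""
    return INTERCEPT + sum(w * responses.get(task, 0) for task, w in ITEM_WEIGHTS.items())
```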
The problem of reducing the labor costs of testing while maintaining the psychometric characteristics of the test could be solved by designing a computerized adaptive test (CAT). This is a form of computer-based testing in which each subsequent task offered to the test taker depends on the results of the previous ones: depending on success or failure, the test taker receives a correspondingly harder or easier task, which makes it possible to pin down his level of preparedness without overloading him with uninformative tasks.
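For illustration, here is a minimal sketch of the general CAT principle just described; it is not the SAM implementation, and the item bank, step size, and stopping rule are assumptions:

```python
# Minimal sketch of the CAT idea: after each response, the next task is chosen
# to be harder after a success and easier after a failure. Difficulties and the
# step size are illustrative.

def run_cat(item_bank, answer_fn, n_items=15):
    """item_bank: dict task_id -> difficulty; answer_fn(task_id) -> True/False."""
    ability = 0.0          # current estimate of preparedness (logit-like scale)
    step = 1.0             # how far to move the estimate after each response
    remaining = dict(item_bank)
    for _ in range(min(n_items, len(remaining))):
        # pick the unused task whose difficulty is closest to the current estimate
        task = min(remaining, key=lambda t: abs(remaining[t] - ability))
        remaining.pop(task)
        if answer_fn(task):
            ability += step    # success: try something harder
        else:
            ability -= step    # failure: try something easier
        step *= 0.8            # narrow the search as information accumulates
    return ability
```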
Designing such tools is technically and organizationally difficult. They require a large bank of tasks, so that tasks can be selected to follow the individual trajectory of each test taker, and a dedicated platform on which to deliver them. It is also obvious that such testing can only be carried out on a computer, which automatically narrows and biases the potential sample: only subjects with free access to a computer, or schools with enough hours available in a computer classroom, can take part.
The task was therefore set of developing a paper version of the test that would use the same principle as CAT, that is, free the test taker from completing tasks that provide no information about his preparedness: very easy tasks for strong students and very difficult ones for weak students.
We mentioned above that one of the obstacles to testing was the need to organize double lessons in primary schools, where this is usually not done. However, two lessons separated in time are a far more feasible arrangement for the participating schools. These considerations led to the following format for administering the test.
The original test of 45 tasks is divided into three equal parts, each performing a different function.
The first part is “distributive”. It consists of the tasks that best predict whether a subject lands in the top or bottom half of the rating (r = 0.8). This subtest is not required to rank test takers precisely or to assess their preparedness; its function is to separate the “most likely strong” from the “most likely weak”. This part of the test is completed within one lesson, so examinees have slightly more time per task than in the original version, which helps reduce the distortions associated with lack of time.
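One way such a selection could be performed is sketched below, under the assumption that a subjects-by-tasks matrix of 0/1 scores from the original test is available; the correlation criterion and the number of retained tasks are illustrative, not the actual SAM procedure:

```python
# Sketch: from original 45-task data, keep the tasks whose correctness is most
# strongly correlated with ending up in the top half of the overall ranking.

import numpy as np

def pick_distributive_tasks(responses: np.ndarray, n_pick: int = 15) -> list[int]:
    """responses: (n_subjects, 45) matrix of 0/1 scores on the original test."""
    totals = responses.sum(axis=1)
    top_half = (totals >= np.median(totals)).astype(float)  # 1 = top half of the rating
    corrs = np.zeros(responses.shape[1])
    for j in range(responses.shape[1]):
        item = responses[:, j].astype(float)
        r = np.corrcoef(item, top_half)[0, 1]
        corrs[j] = 0.0 if np.isnan(r) else abs(r)  # guard against zero-variance items
    # indices of the n_pick tasks with the highest correlation
    return sorted(np.argsort(corrs)[-n_pick:].tolist())
```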
The second part is “easy”. These are the tasks with the best discriminative ability for “weak” subjects. Extremely difficult tasks can reveal differences in the preparedness of “strong” students, but they are essentially useless at the bottom of the ranking, where no one solves them. Less difficult tasks, in turn, provide little information about “strong” subjects, since almost everyone solves them, but they can help distinguish within the weak group.
The third part (more precisely, another version of the second part) is “difficult”: a set of tasks that best distinguishes between “strong” subjects.
Based on the results of the distributive part, the subject falls into either the “strong” or the “weak” group and at the second step receives the corresponding second part of the test. The tasks are selected so that the expected result of “weak” children on the “difficult” part is 2 points out of 15, and the expected result of “strong” children on the “easy” part is 11 out of 15. In other words, a subject's placement in one group or the other allows his results on some of the tasks to be predicted fairly confidently, which frees him from having to complete them.
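Expressed as code, the routing and the reconstruction of a full-scale estimate could look like the sketch below; the cut-off on the distributive part is an assumed placeholder, while the imputed values 11 and 2 are the expected results quoted above:

```python
# Sketch of the two-step routing. DISTRIBUTIVE_CUTOFF is a hypothetical threshold;
# the imputed scores (11/15 and 2/15) are the expected values given in the text.

DISTRIBUTIVE_CUTOFF = 8   # placeholder, out of 15

def route(distributive_score: int) -> str:
    """Assign the subject to the 'strong' or 'weak' group after step one."""
    return "strong" if distributive_score >= DISTRIBUTIVE_CUTOFF else "weak"

def estimated_total(distributive_score: int, second_part_score: int, group: str) -> int:
    """Reconstruct a 45-point estimate: observed parts plus the imputed skipped part."""
    if group == "strong":
        imputed_easy = 11          # expected result of 'strong' subjects on the easy part
        return distributive_score + second_part_score + imputed_easy
    imputed_difficult = 2          # expected result of 'weak' subjects on the difficult part
    return distributive_score + second_part_score + imputed_difficult
```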
Data from the original test administration were used to construct the variants. Different versions of the shortened test were compared with each other by predictive power, that is, by how close the prediction made from each subtest is to the score on the original test. The first abbreviated version, which consisted of 15 tasks and was designed to be completed within one lesson, made it possible, as mentioned above, to draw a general conclusion about the preparedness of the subjects, but it was “wrong” in a fairly large number of cases: in simulation, about 6% of subjects received a shortened-test result that differed from the original by more than 5 points (out of 45).
In the quasi-adaptive version, the proportion of such subjects did not exceed 0.5%. This is not only lower than at the previous step, but also lower than for the best set of the same number of tasks administered all at once, without adaptivity. In addition, the region in which such a test is “wrong” is easier to localize. In the simple abbreviated version the discrepancy can occur at both high and low values of the final score, whereas in the quasi-adaptive version the error is concentrated mainly in the upper part of the rating and stems from the assumption that “strong” subjects perform the “easy” part uniformly well, while in reality they may make mistakes on it due to various random factors.
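The comparison statistic itself is straightforward; a sketch of the quantity reported above (the share of subjects whose predicted score deviates from the original by more than 5 points) might look like this:

```python
# Sketch: share of subjects whose score predicted from a shortened variant differs
# from their original 45-task score by more than a tolerance (5 points in the text).

import numpy as np

def share_of_large_errors(original: np.ndarray, predicted: np.ndarray, tol: int = 5) -> float:
    """Fraction of subjects with |predicted - original| > tol (scores on the 0-45 scale)."""
    return float(np.mean(np.abs(predicted - original) > tol))
```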
The results obtained allow us to take the next step toward an accurate, reliable, and easy-to-use instrument for assessing educational and subject competencies. We expect that, thanks to its lighter organizational requirements compared with the full version of SAM, the tool will be adopted by educators and, at the same time, will provide feedback of the same quality to research commissioners at all levels, from school administrations to national ministries and international organizations. We continue to improve and expand the diagnostic tools within the theoretical framework of SAM, based on feedback from users of the tests.