Assessment in CLIL: Test Development in Content and Language for Teaching Natural Science in English as a Foreign Language

Ongoing bilingual programs implemented without needs analysis, scant research on the actual effects of CLIL in Colombia, and vague awareness of the considerations necessary for effective CLIL programs underpin the need to address a particular curricular issue: summative assessment. This small-scale study takes place in a natural science class using a CLIL approach with third-grade students at the A2 proficiency level who have been progressively immersed in a bilingual program at a private school in Bogotá, Colombia. Regularly scheduled tests were analyzed in order to identify suitable assessment items that simultaneously report on content and language achievement, so as to provide guidelines for developing tests that are aligned with the teaching goals, consistently measure students' progress, and facilitate teaching practices. The study entails a systematic examination of test items using formal item analysis to establish test validity through an assessment grid that integrates content at different knowledge levels, CALP functions, and cognitive skills. The study concludes that the assessment grid is a helpful tool for discriminating language and content achievement in the results of multiple-choice CLIL tests, as it increases teachers' understanding of the language demands of test items and the level of difficulty of content tasks.


Resumen
Current bilingual programs lacking needs analysis, insufficient research on the effects of CLIL in Colombia, and limited awareness and knowledge of the considerations necessary for effective CLIL programs point to the need to focus on a particular curricular aspect: summative assessment. This small-scale study was carried out in a natural science class in which CLIL is the approach chosen for teaching third-grade students at an A2 proficiency level, enrolled in a progressive bilingual program at a private school in Bogotá, Colombia. Regularly scheduled tests were analyzed to identify suitable assessment items that simultaneously report content and language achievement, in order to build guidelines for designing tests that are aligned with the teaching goals, consistently measure students' progress, and facilitate teaching practices. The study involved the systematic examination of test items, using formal item analysis to establish test validity through an assessment grid that integrates content at different knowledge levels, cognitive academic language proficiency (CALP) functions, and cognitive skills. The study concluded that the assessment grid is a useful instrument for discriminating content and language achievement in the results of multiple-choice CLIL tests, as it increases teachers' understanding of the language demands of test items and the level of difficulty of content tasks.

Hence, there is an urgency to focus initially on specific aspects of the curriculum that can provide information about the effectiveness of the program in the short term. Assessment is one alternative used to gather information about the teaching and learning process (Bailey, 1998). In this regard, this study developed an assessment grid adapted from two tools: the CLIL Matrix suggested by Coyle, Hood, and Marsh (2010) and a conceptual framework proposed by the project Assessment and Evaluation in CLIL (AECLIL). The former tool sets the route of difficulty between content and language, is reported on in the literature (Short, 1993; Coyle, Hood, & Marsh, 2010; Lo & Lin, 2014), and the information it provides supports informed decisions by teachers (Coyle, Hood, & Marsh, 2010, p. 68). The latter tool provides the theoretical assumptions needed to define and relate content, cognition, and language skills. The assessment grid seeks to facilitate the process of sorting test items through a route that integrates cognitive and linguistic demands. This study focused on determining to what extent this assessment grid of content and language demands provides a guideline for test development that aligns with the teaching goals, consistently measures students' achievement, and could be implemented under regular teaching conditions. The study entails a systematic examination of test items using Wesche's framework (1983, as cited in Bailey, 1998, p. 13) as the categories to classify items in the assessment grid and ensure test validity.
Finally, this small-scale study aims at impacting curriculum development in the approximately 175 bilingual schools officially registered in Colombia (Ministerio de Educación Nacional, 2009) by providing a guideline for designing multiple-choice tests that simultaneously provide information about content and foreign language development. Valid and reliable assessment items can initially support content teachers in their lesson planning and material design, as they become better informed about the content and language needs of their students.

METHOD
This study examined three tests through the following research design. Firstly, the tests were systematically designed using Wesche's framework (1983, as cited in Bailey, 1998, p. 13) in order to place each test item in the assessment grid. The tests were collaboratively developed by a Content and Language Integrated Learning (CLIL) teacher and an English as a Foreign Language (EFL) teacher in order to ensure construct validity in terms of content and language. Secondly, an item analysis was carried out to determine the reliability of each item. Finally, a report was built to elucidate the items' validity and reliability and define the overall results of each test in terms of content and/or language achievement. The framework provided by Mari Wesche (1983, as cited in Bailey, 1998, p. 13) is a simple yet useful tool for examining tests in four parts: the stimulus material, the task posed to the learner, the learner's response, and the scoring criteria. In particular, this study focused on two aspects of Wesche's framework: the stimulus material, to analyze test input in terms of language demands, and the task posed to the learner, to identify the content demands of each test item. The data provided by this framework allowed for the placement of test items in the assessment grid.

Assessment Grid
The main goal of this study stems from the concern that CLIL, as a dual-focused approach, requires assessing students' achievement in both the content and the language components, so that teachers can identify which area is interfering with students' learning. In order to reach this goal, this study combined two theoretically accepted tools: the CLIL Matrix suggested by Coyle, Hood, and Marsh (2010) and a conceptual framework proposed by the Evaluation and Assessment in CLIL Project (Quartapelle, 2012). The product of this integration is illustrated in Table 1. As seen in Table 1, the CLIL Matrix provides the parameters to place the conceptual framework in four quadrants that make visible the interconnectedness between content and language demands. Each quadrant frames a particular connection of knowledge, thinking skills, and the language necessary for its understanding. Accordingly, Quadrant I (QI) denotes all items that require high content demand at a low language level. Quadrant II (QII) describes items at the highest levels of content and language demands. In contrast, Quadrant III (QIII) corresponds to the lowest content and language demands. Finally, Quadrant IV (QIV) challenges students with high language levels to answer questions with low content demands.

In pedagogical terms, Coyle, Hood, and Marsh (2010) highlight that whilst QIII might build initial confidence in students, in CLIL it is likely to be a transitory step on the way towards QII. However, the transition from QIII to QII or QIV focuses on the progression of individuals and the realization of their potential over time (2010, p. 44).
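As a minimal sketch of the quadrant logic described above, the assignment of an item to a quadrant can be expressed in code. The two-level coding of demands and the function name are illustrative assumptions, not part of the study's materials:

```python
def quadrant(content_demand: str, language_demand: str) -> str:
    """Map an item's demand levels ("low" or "high") to a CLIL Matrix quadrant.

    QI:   high content / low language    QII: high content / high language
    QIII: low content / low language     QIV: low content / high language
    """
    if content_demand == "high":
        return "QII" if language_demand == "high" else "QI"
    return "QIV" if language_demand == "high" else "QIII"
```

For example, an item that demands recalling a specific scientific concept through a single simple sentence would be coded `quadrant("high", "low")` and placed in QI.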

Context of the Study
This study took place at a private school that has established a bilingual program with the characteristics of early partial immersion (Baker, 2006, as cited in Pacific Policy Research Center, 2010), in which students from age 5 or 6 have 50% of the curriculum taught through English as a Foreign Language (EFL) during their elementary education. The program is at a stage of ongoing implementation: students currently in third grade have seen the number of subjects taught in English increase from 2014 to date (2016), when they finally have 50% of their curriculum in English. This study focused on the evaluation of CLIL in science because it is the only content subject assessed by the national standard tests, has a relevant number of hours in the curriculum, and is the second most popular content subject taught in Colombian bilingual schools (McDougald, 2015).
In this context, the bilingual teachers are mainly content specialists with an upper-intermediate mastery of EFL. They tend to be more concerned with the development of content competencies, ignoring the language constraints that regularly affect mixed-ability language learners in CLIL settings. Furthermore, administrators at this private school did not carry out a needs analysis to set specific guidelines for the implementation of CLIL, as suggested by many authors (Coyle, Hood, & Marsh, 2010).

Validation
Validation of the study was underpinned by the use of different sources of analysis in each phase. In Phase I, the collaborative work done by the CLIL teacher and the EFL teacher, through systematic individual and pair analysis using Wesche's framework (1983, as cited in Bailey, 1998, p. 13) and the assessment grid, allowed for a certain degree of quality that could later be assessed during Phase II. In Phase II, item analysis was performed from three different perspectives commonly used to examine the quality of multiple-choice tests in classrooms: item facility, item discrimination, and distractor analysis. The individual results, and their analysis as a whole, provided a holistic picture of each test item and determined whether those items were acceptable for the purpose of the study.
Item Facility (I.F.) is an index that represents the proportion of students who answered each item correctly. It provides a source of analysis to help establish the level of difficulty claimed for each test item according to the assessment grid. In order to uncover the variability in skills and/or knowledge that is assumed to exist in a group of test-takers, a comparison of how the strong students and the weak students perform on each item provides useful information in the discrete-point, norm-referenced approach. Item Discrimination (I.D.) examines test items more precisely, as it shows how the top scorers and the low scorers performed on each item. These statistics allow one to determine whether an item with a low I.F. is actually difficult, or whether other factors might explain the low rate of correct responses. The point-biserial correlation coefficient is the tool Bailey suggests as most appropriate for determining item discriminability. Finally, distractor analysis is a procedure specifically related to multiple-choice formats: it shows how each individual distractor is functioning. An important aspect affecting the difficulty of multiple-choice items is the quality of the distractors; some distractors, in fact, might not be distracting at all and therefore serve no purpose. This approach assumes that there is some variability among test-takers (Bailey, 1998, p. 134).
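All three statistics can be computed from scored answer sheets alone. The following sketch (the function names and the dichotomous 1/0 scoring convention are assumptions for illustration, not part of the study's materials) shows item facility, the point-biserial coefficient, and a simple distractor tally:

```python
import math

def item_facility(item_scores):
    """I.F.: proportion of test-takers who answered the item correctly (1/0 scoring)."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a dichotomous item and the total score:
    r_pbi = ((M_correct - M_total) / SD_total) * sqrt(p / q),
    where p is the item facility and q = 1 - p."""
    n = len(item_scores)
    mean_total = sum(total_scores) / n
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    correct_totals = [t for s, t in zip(item_scores, total_scores) if s == 1]
    p = len(correct_totals) / n
    mean_correct = sum(correct_totals) / len(correct_totals)
    return (mean_correct - mean_total) / sd_total * math.sqrt(p / (1 - p))

def distractor_tally(chosen_options):
    """Count how often each option was chosen; a distractor picked by (almost)
    nobody is not distracting and is a candidate for replacement."""
    counts = {}
    for option in chosen_options:
        counts[option] = counts.get(option, 0) + 1
    return counts
```

A positive point-biserial value indicates that the item separates high and low total scorers in the expected direction; values near zero or negative flag items for revision.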

RESULTS
Three tests were analyzed in order to identify their characteristics in terms of language and content demands and to place their items in the assessment grid, with the intention of discriminating which of the two constructs required more instruction or had been mastered by students.
By and large, test items were mainly placed in QI and QIII, suggesting a strong emphasis on assessing content knowledge with low demands on language. Only Test Three had a valid item in QIV, which draws attention to the difficulty entailed in designing questions with low content demands but high language demands. The number of items needing revision varied from one to three, and a positive improvement was observed in the number of distractors that needed replacement. The assessment yielded a useful categorization of items, in particular when they were related to each other in terms of content components.

Test One
This diagnostic test served as the starting point for how tests were initially developed. At the beginning of the school year, 89 third-grade students in five different classrooms took a 12-item multiple-choice test whose purpose was to determine students' entry levels of content competencies according to the exit outcomes planned for second grade, and the corresponding foreign language understanding. This is shown in Table 2.
The CLIL teacher and the EFL teacher collaboratively wrote the questions and classified each test item in the assessment grid. The process of sorting each item was supported by Wesche's framework (see Appendix A) and resulted in the information shown in Table 3.
It is evident from the assessment grid that the test focused on low language demands, as items were mainly placed in QI and QIII. This could be explained by the diagnostic intention of testing students who had just started their school year and were facing this content class in a foreign language for the very first time. Accordingly, test items that featured cognitive academic vocabulary were placed in QI or QII because they demanded more content knowledge while their language features were mainly illustrated or contextualized. Items 11 and 12 required students to understand complex sentences and to relate cognitive academic vocabulary to the specific concepts and processes of the content subject.
Test One had a total of 12 items: four placed in QI, two in QII, and six in QIII. The content of the items focused on three different components, which affected the analysis of reliability among items. Both Item Facility (I.F.) and Item Discrimination (I.D.) (see Appendix B) showed acceptable values for most of the items, although two items, 4 and 11, were found to need revision. In addition, 17% of the distractors (see Appendix C), corresponding to items 1, 4, 5, 6, 8, and 11, needed to be revised. Special attention should be paid to students while they are taking the exam, because a meaningful number of items had their performance affected by blank or wrong answers that did not follow the item instructions. It is also important to notice that this first test did not include any item in QIV, owing to the teaching tendency to focus more on content demands than on language demands. Table 4 consolidates the overall results around item 12. This review does not include items 4 and 11 because they were found to affect the overall performance. The table shows that 45% of students achieved the high content and language demands of QII. Of these students, only 30% correctly answered the low content and language demands of QIII, and a similar percentage (29%) the high-content, low-language demands of QI. These findings show that the items placed in the assessment grid do not depict the expected discrimination between content and language demands. This outcome might have been influenced by several factors: (a) the test is a diagnosis before instruction, (b) the items measure different content components, and (c) the items are not balanced within the assessment grid. Conclusions on this test are twofold. First, test development needs to be enhanced by clarifying its purposes and content components. Second, students seem to need instruction in test-taking skills and academic language in order to understand test tasks. *Quadrant in the assessment grid.
**Number of students who answered each item correctly. Numbers in the other quadrants are taken from the set of students who answered the QII item correctly. ***n = 89
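The consolidation rule in the note above (percentages in the other quadrants computed over the subset of students who answered the QII anchor item correctly) can be sketched as follows. The data layout, function name, and the per-quadrant averaging are illustrative assumptions, since the study does not specify its aggregation procedure in code:

```python
def quadrant_report(responses, item_quadrants, qii_item):
    """Consolidated grid report. `responses` maps student -> {item: 1/0};
    `item_quadrants` maps item -> quadrant label; `qii_item` is the QII anchor.
    QII is reported over all students; QI/QIII/QIV are reported as the share of
    correct answers among the students who got the QII anchor item right."""
    qii_correct = [s for s, r in responses.items() if r.get(qii_item) == 1]
    report = {"QII": len(qii_correct) / len(responses)}
    for q in ("QI", "QIII", "QIV"):
        items = [i for i, quad in item_quadrants.items() if quad == q]
        if not items or not qii_correct:
            continue  # no valid item in this quadrant, as in Tests One and Two
        hits = sum(1 for s in qii_correct for i in items
                   if responses[s].get(i) == 1)
        report[q] = hits / (len(qii_correct) * len(items))
    return report
```

This mirrors the tables' structure: a QII percentage over the whole cohort, and conditional percentages per quadrant for the QII-correct subset.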

Test Two
Test Two was applied as an achievement measurement at the end of the first school term, which lasted three months. In order to design Test Two, the CLIL teacher defined the content outcomes expected to be achieved and the EFL teacher identified the language components; both are shown in Table 5. Table 6 shows that most of the items in the test included subject-specific vocabulary such as types of cells, domains, and kingdoms. Items 2, 3, and 10 used basic interpersonal vocabulary. In view of the difficulties caused by the different content components assessed in Test One, Test Two involved a single target component, namely identifying and classifying organisms in terms of domains and kingdoms. This was not the case for items 1 and 11, placed in QI because they demand an understanding of specific content terms, such as scientific questions and hypotheses, for the development of general skills in the subject. The Test Two analysis examined each of the 12 items in detail according to the assessment grid, given the test's emphasis on a specific content component. Five items were placed in QI, one in QII, five in QIII, and one in QIV, describing a test with a better distribution of items compared to Test One, which had more items in QIII and none in QIV. Additionally, it is worth noting that the items in Test Two featured more specific content vocabulary, although they also had more context clues. Only item 10 needed replacement or further analysis, due to its low I.F. and I.D. (see Appendix D). The rest of the items yielded difficulty levels consistent with those claimed by each quadrant of the assessment grid. In this test, 50% of the items (six in total) had at least one distractor that needed revision (see Appendix E). Table 7 shows the consolidated results of Test Two within the assessment grid. This time, 61% of students correctly answered items in QII.
Among these students, performance in QIII (39%) showed that they had little difficulty answering questions with low content/language demands, and a little more difficulty with questions in QI (35%). Although there was no valid item with which to compare the levels of language difficulty in QIV, it seems that this group of students requires more language support in order to perform better on the content demands, as they were able to answer items in both QIII and QI, which have a similar level of language demand but different demands in terms of content. * Quadrant in the assessment grid. ** Number of students who answered each item correctly. Numbers in the other quadrants are taken from the set of students who answered the QII item correctly. *** n = 89

Test Three
The last test, Test Three, was applied as an achievement measure of the second term. In this case, 115 students took the test in the same five groups. The test was developed taking into account the information shown in Table 8. The content components were defined according to the school curriculum, and the language components were identified by the EFL teacher taking into account the curriculum and the textbook. This time, the questions clearly differentiated whether students understood what adaptations are and how to explain them, or whether they had difficulties with the language used in the questions.
Items in the assessment grid (Table 9) were carefully assigned to each quadrant in response to the need to examine item performance in terms of the relationships among quadrants, in order to spot the difference between language and content demands. Hence, 50% of the items had contextualized clues, and the other 50% required students to recall concepts or demonstrate understanding without any support. Notably, the assessment grid of Test Three showed a higher level of correspondence among items: each question has at least one other question measuring similar knowledge or skills placed in another quadrant with a different level of demand. For instance, item 1 (QIII), in which the task posed to the learner was to define what an adaptation is, parallels item 7 (QI), which aims at assessing whether students know what an adaptation is by comprehending the concept from a short text. The former item limits its language input to the question and the simple statements of its answer options; the latter demands a similar task but requires reading the text and discarding other concepts from the options. Items 4, 8, and 10 (QIII) similarly correspond to items 5, 6, and 11 in QI. Likewise, items 2 and 3 in QIV correspond to items 9 and 12 in QII.
The previous patterns of test design are relevant for the study because they allow for examining the role of the assessment grid in test development: whether or not it helped to discriminate between the content and language demands of test items. The item analysis that follows addresses this concern and checks the reliability of each item.
Test Three had 12 items placed in the quadrants as follows: items 5, 7, and 11 in QI; items 9 and 12 in QII; items 1, 4, 8, and 10 in QIII; and items 2 and 3 in QIV. A total of three items (2, 6, and 8) were found invalid, requiring further analysis or replacement (see Appendix I). This test had the fewest distractors needing revision in comparison to the previous tests (see Appendix D). Table 10 consolidates the results of Test Three. It is evident that the students who answered items in QII correctly are better discriminated by the other quadrants. In detail, the results show that students performed similarly when language demands were minimal and content demands varied. Performance on item 12 in QII (50%) revealed that students obtained better results when the language was more demanding (QIV, 35%) than the content. A similar pattern is visible with item 9 in QII: 52% of the students performed better at QIV (41%) in comparison to QIII (33%) and QI (35%).
By and large, it is evident that the assessment grid provides a valid framework in which to place the items. This information enriches the test reports by pointing out students' achievement at the levels of difficulty framed by each quadrant.

DISCUSSION
There are two main contributions of this study. Firstly, it describes the summative assessment process that was actually carried out in a CLIL classroom, picturing the state of this curricular aspect from the inside. Although there is a great deal of research on alternative assessment approaches (Short, 1993) aimed at obtaining accurate information about students' learning processes in formal education, summative tests, in their multiple-choice version, are still widely used to make decisions about students' promotion, students' achievement, teacher performance, and even the effectiveness of programs (Short, 1993). This study is evidence of this practice and of how deeply rooted it remains in classroom assessment, even within new curricular approaches such as CLIL.
Assessment practices are sometimes flawed by treating practicality as the main criterion for judging tests, while elements such as validity and washback are only vaguely applied. Given the aforementioned value of tests, this study encourages their careful examination, so that teachers can evaluate their common assumptions systematically from time to time, guide their practice, and carry out their work with less subjectivity. An item analysis is a simple yet helpful instrument for building a set of informed decisions in test development.
Consequently, accepting that multiple-choice tests are pivotal in school dynamics, this study proposes an alternative for enriching this practice by using an assessment grid that reports students' achievement distinctly in terms of content and language demands. Establishing this difference is one of the most critical aspects of CLIL implementation. According to the test reports, the assessment grid generally provides a valid framework for placing test items in four quadrants that combine the possible alternatives among content knowledge, thinking skills, and the language required to understand them, at two levels of difficulty.
It is essential, though, to clarify that the assessment grid must be supported by a clear definition of the content and language components of each test, a consistent criterion to describe test items, and a valid set of items distributed in each of the quadrants. Besides, agreement on the levels of difficulty depends on the curricular outcomes suggested for the grade, in the case of the study, third grade.
In conclusion, the assessment grid allows for reporting in detail the difficulties and strengths of students before or after instruction. This information could help CLIL teachers increase their understanding of the language demands of any test item, address specific strategies that actually attend to students' needs, and afford foreign language learning beyond incidental language gains.

Table 11 shows the analysis of Test One using Wesche's (1983) "Components of a Test", including task examples such as "Identify the cause of an event", "Choose the picture that explains the problem", and "Choose the best description". Table 12 shows item facility and item discriminability for Test One, and Table 13 shows the distractor analysis for Test One. For Test Two, Table 14 shows item facility and item discriminability. For Test Three, Table 16 shows item facility and item discrimination.