The More the Merrier – Revisiting CLIL-Based Vocabulary Growth in Secondary Education

One crucial aspect of CLIL-based foreign language learning in instructional settings is vocabulary growth. As a consequence, research should be interested in how CLIL fosters vocabulary learning. The apparent shortage of data-driven quantitative research on vocabulary growth in CLIL is, therefore, problematic. The present paper reports findings from a mixed-methods study of vocabulary growth in an Austrian lower secondary school CLIL setting, with English as the language of instruction and learning. The aim of the study was to analyse how the use of CLIL in the English classroom could benefit learners in their acquisition of vocabulary in the target language. First, a repeated-measures design with experimental and control groups assessed receptive vocabulary growth by means of a standardised vocabulary size test. Second, students’ questionnaire data as well as vocabulary profiling of the CLIL teachers’ linguistic input explored possible covariates for the vocabulary test scores. We found that CLIL-related effects were only co-determined by input frequency, while extra-mural factors did not play any role in this study. As a consequence, overly optimistic expectations regarding the linguistic impact of CLIL in a mixed-ability setting guided by a predominantly implicit language teaching approach need to be re-evaluated critically.


INTRODUCTION
CLIL is "an educational approach where curricular content is taught through the medium of a foreign language, typically to students participating in some form of mainstream education at the primary, secondary, or tertiary level" (Dalton-Puffer, 2011, p. 183). One of its major premises holds that providing rich amounts of foreign language input in a mostly immersive context will lead to higher proficiency in the target language (Dalton-Puffer, 2011; Pérez-Cañado, 2011). However, some critical voices have been raised lately (see Bruton, 2011; Paran, 2013).
This criticism became the driving force for the present study. It investigated CLIL-based vocabulary growth in lower secondary Austrian students. This sample arguably differed from more typical European CLIL contexts concerning student selection and target language contact time.
The students in this CLIL research project, for example, worked within a non-selective, sub-optimal (Grandinetti, Langellotti & Ting, 2013), rather low-achieving learning background and a modular CLIL approach. The hallmark of modular CLIL is a sequence of various CLIL projects spread out throughout the school year, each interspersed with mother tongue teaching sequences (Krechel, 2005). Therefore, notwithstanding a certain lack of necessary CLIL criteria according to Tedick & Wesely (2015), this setting of non-selectivity of student population and modular input constitutes an "authentic" European CLIL context (Gierlinger, 2007; Denman, Tanner & de Graaff, 2013; Krechel, 2005).
CLIL studies repeatedly report visible growth in areas such as receptive and productive vocabulary (Dalton-Puffer, 2011), and CLIL proponents have pointed out that such growth can be expected to happen even after a comparably short time of exposure within an immersive or incidental language learning approach (Pérez-Cañado, 2011). This approach has had a marked influence on CLIL and foreign language pedagogy (Ellis & Shintani, 2013). Llinares & Whittaker (2009, p. 189), for example, maintain that "in most courses run by content teachers only, a foreign language is only used as a vehicle for learning content with the assumption that this will lead the student to learn the language naturally and incidentally".

GIERLINGER, WAGNER
However, there seem to be three challenges to such a view. First, recent theoretical considerations in second language acquisition (SLA) research cast serious doubt on a simplistic relationship between input and language learning. Ortega (2015, p. 259), for example, considers input, as only one of several ingredients of SLA, to be "necessary but not sufficient, and perhaps not even the most crucial one". Second, positive evidence for vocabulary growth in CLIL appears to be particular to studies coming from environments where access to CLIL depends on school-based selection procedures, such as language proficiency, parental background and school achievement (Bruton, 2011; Dalton-Puffer, 2011; Küppers & Trautmann, 2013; Rumlich, 2013). Third, there is a growing body of evidence on the limited and possibly non-optimal effect of incidental language growth in instructed settings in general (Laufer & Nation, 2012; Leow, 2015; Lyster, 2013; Nation, 2011).
All in all, vocabulary teaching and learning in CLIL is still strongly affected by beliefs in the effectiveness of such a language bath metaphor (Hüttner, Dalton-Puffer, & Smit, 2013), even though Vollmer (2010, p. 50) states that there is "a paucity of representative and empirically valid studies concerning the strengths of CLIL students and thus a lack of evidence concerning the central assumptions about the benefits and the superiority of CLIL programmes". This sentiment is also supported by Bonnet & Dalton-Puffer (2013), and the situation is compounded by the even greater scarcity of research into CLIL for low-achieving populations (Denman, Tanner & de Graaff, 2013; Grandinetti, Langellotti, & Ting, 2013; Schwab, 2013).
Finally, we need to point out that we are fully aware of the long tradition of bilingual and immersion programmes in many different parts of the world (Genesee, Lindholm-Leary, Saunders & Christian, 2006; Tedick & Wesely, 2015). Our focus on mostly European CLIL studies is, in our opinion, warranted by important differences between CLIL and immersion programmes and by the particular research aim and context of our study (Dalton-Puffer, Llinares, Lorenzo, & Nikula, 2014; Lasagabaster & Sierra, 2010). In Austria, for example, CLIL teaching is guided by a highly flexible legal context, which allows schools to set up locally appropriate and tailor-made programmes. Basically, these can range from short, project-based modules to one-year courses in which English is used as a means of instruction for one or more subjects. The modular programme, as described above, has turned into a very popular CLIL approach in lower secondary and primary education in Austria (Gierlinger, 2007).
To get a solid basis for our research design, we reviewed 15 studies of quantitatively measured vocabulary growth in European CLIL classes. These studies, at first sight, unanimously show advantages in vocabulary growth for CLIL. However, a closer look revealed various caveats, and the purported advantages therefore need to be interpreted with caution.
For a start, since CLIL classes in these studies normally received extra language support, possible input frequency effects through additional exposure need to be taken into account (Dalton-Puffer, 2011; Jimenez Catalan & Ruiz de Zarobe, 2009). Second, reporting on absolute vocabulary gain can be misleading. When Pietilä & Merikivi (2014), for instance, described an advantage of CLIL classes with regard to absolute vocabulary growth, they failed to point out that the non-CLIL learners were eventually rapidly catching up, showing a far better relative gain. Third, similar to the above, vocabulary gain may need to be investigated more carefully in terms of its relative growth in comparison to control groups. As Mewald, Prenner, and Sprenger (2004, p. 12) report, significant differences between CLIL and control groups, as assessed in year six, were in fact levelling out by year eight. Therefore, their initially proposed Scherenhypothese had to be discarded. A similar phenomenon was reported in Admiraal, Westhoff, and de Bot (2006), where vocabulary scores ceased to increase after four years of CLIL instruction. Fourth, effects attributed to CLIL exposure might actually be mediated by external factors. Sylven (2007), for example, pointed out that the CLIL-induced advantages she found were in fact co-determined by extra-mural factors. Finally, there is the ongoing methodological issue of finding both an appropriate tool and an appropriate design in order to measure vocabulary growth longitudinally (Schmitt, 2010; Dóczi & Kormos, 2016).
In light of the current state of affairs, the empirical evidence on the superiority of CLIL-induced vocabulary learning remains ambiguous, as voiced, for example, by Vollmer (2010). This paper tries to address this gap by providing empirical data from a project with Austrian lower secondary school students. As already mentioned above, CLIL in these schools was carried out in a modular project format, in non-selective classes, and through mostly implicit language instruction (Ellis et al., 2009; Gierlinger, 2015). We formulated two research hypotheses for this study.
1. After about six months of target language exposure through modular CLIL, CLIL learners will outperform non-CLIL learners with respect to their receptive vocabulary knowledge as measured in relative gains. In other words, CLIL learners will show a significantly higher receptive vocabulary growth in their post-testings.
2. A possible superiority of CLIL-induced receptive vocabulary growth will be, apart from the CLIL intervention, co-determined by extra-mural factors.
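The distinction between absolute scores and relative gains, on which the first hypothesis rests, can be sketched as follows. The text does not give a formula for relative gain, so the definition used here (mean gain divided by the mean starting score) is an assumption, and all scores are invented for illustration rather than taken from the study.

```python
from statistics import mean

def relative_gain(t1_scores, t2_scores):
    """Mean gain between two measurements as a share of the mean t1 score.

    Assumed definition: the paper does not spell out its formula.
    """
    start = mean(t1_scores)
    return (mean(t2_scores) - start) / start

# Hypothetical scores, invented purely for illustration (not the study's data).
clil_t1, clil_t2 = [2500, 3000, 3500], [2600, 3100, 3600]
ctrl_t1, ctrl_t2 = [2000, 2500, 3000], [2300, 2800, 3300]

print(relative_gain(clil_t1, clil_t2))
print(relative_gain(ctrl_t1, ctrl_t2))
```

In this toy data the CLIL group keeps the higher absolute mean at the second measurement, while the control group shows the larger relative gain, which is the pattern the literature review warns about.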

Context and design of the study
A data-driven mixed-methods study was carried out between 2010 and 2011, combining both naturalistic qualitative and manipulated quantitative classroom data. Following a quasi-experimental non-randomised pre/post-test control-group design, vocabulary test scores were taken twice from all students in order to quantify their vocabulary size before and after the instructional intervention. The instructional intervention was the exposure to CLIL teaching. In our setting, CLIL teaching was exclusively done through modular projects. Our CLIL teachers carried out around 5-7 CLIL projects extending for up to 4 weeks each throughout the school year. The overall contact time resulted in either 60 or 80 additional hours of CLIL teaching. The different figures are the result of one class being exposed to CLIL in two subjects and different project lengths.
We are acutely aware that such a research design could be challenged because of confounded variables, namely methodology and language input. However, CLIL (through English) in a European classroom almost necessarily entails additional exposure to the foreign language (Dalton-Puffer, 2011). Therefore, the teaching method (CLIL) and this additional exposure are inevitably confounded. European CLIL research seems to have accepted this dilemma as an intrinsic design problem. In authentic educational settings, a clear-cut separation of these two variables can arguably and regrettably not be modelled as an experimental condition for English.
The following table summarises the instructional and learning environment of all five classes. While the control group had received no extra language input, the students from the CLIL group had gone, depending on the number of CLIL modules throughout the school year, through either 60 or 80 hours of extra CLIL class time within the treatment period. This happened as part of the schools' language enrichment policy. As far as CLIL methodology was concerned, most of the teaching was held in English, and there was hardly any pre-planned and systematic language-focused work. Teachers' language interventions were predominantly reserved for quick content knowledge clarifications, which also resulted in some code switching. The following quote by a CLIL teacher seems to be representative of the language teaching policies: "Of course, students have to learn technical terms but that is not any focused vocabulary work, it is just the German translation so that one knows it when one needs it" (Gierlinger, 2015). Students were encouraged to speak English, but code switching was not strictly forbidden. CLIL class 1 used English subject course books; CLIL class 2 worked with English materials provided by the teachers. These materials were marginally but not systematically enhanced, for example through translations or short definitions. The teacher in CLIL class 1 was a language and subject specialist, whereas the teachers in CLIL class 2 were only subject specialists.

Participants
In our study, 87 students from four different Austrian lower secondary schools, 45 boys and 42 girls (mean age = 13.79 years, SD = 0.59, range = 12-14.5 years), took the standardised vocabulary test twice (t1 = November 2010, t2 = May 2011). Such an interval may appear short for vocabulary growth in quasi-immersive settings, but other studies, such as Grandinetti, Langellotti, and Ting (2013), worked with even shorter intervals and less input.
What is more, our control-group-design was geared towards tracing even minute growth effects across this comparably brief time-span.
Our sample consisted of an experimental group (two CLIL classes, n = 39) and the control group (three regular classes, n = 48). One class attended two CLIL subjects (chemistry and history). The CLIL students were not preselected but formed part of a whole-class and mixed-ability strategy within the school's overall language policy. The students' mother tongues included, apart from German (83%), Albanian, Arabic, Mandarin Chinese, Romanian, Serbo-Croat-Bosnian (SCB), Tagalog, and Turkish. The CLIL students participated in this study as part of their school-wide CLIL enrichment project; the three classes that served as the control group were recruited in order to match the experimental group for type of school, students' L1, age group, exposure to regular English classes, their English textbook, as well as the general communicative language teaching approach.

Materials and procedure
Three different tools were used for data elicitation. First, the effect of the instructional intervention (CLIL teaching) was assessed by measuring vocabulary size before and after exposure. For vocabulary measurement, the standardised and computer-based vocabulary size test X-Lex: The Swansea Levels Test (Meara & Milton, 2003) was chosen. X-Lex measures vocabulary size by prompting students to rate 120 English words from several vocabulary frequency bands as either known or not known (including nonce-words as distractors). From these ratings, a test score is calculated which reflects a student's vocabulary breadth. The rationale for this choice was manifold. First, X-Lex has already been recommended for vocabulary research in CLIL (Canga Alonso, 2013). Second, measuring receptive vocabulary proficiency, as tested by X-Lex, is apparently strongly related to word learning by incidental exposure, which is typical of CLIL environments (Jimenez Catalan & Ruiz de Zarobe, 2009, p. 84). Third, X-Lex has been standardised and validated for speakers of English as a second language in a number of studies, albeit without resorting to one particular norming group (Huibregtse, Admiraal & Meara, 2002; Mochida & Harrington, 2006). Fourth, it measures vocabulary size against pre-defined corpora of different word frequencies, thus tracing growth constrained by word frequency. Fifth, students often react positively to computer-based applications, much more so than to paper-and-pencil designs. Sixth, from a pragmatic view, the schools only allowed for short periods of testing time, which in turn ruled out a more comprehensive assessment tool. Finally, the software automatically produced an output file that was easily saved and fed into spreadsheets and statistical software; this, in turn, ruled out well-known problems with computerisation procedures and the treatment of missing values.
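How a yes/no test with nonce-word distractors can turn ratings into a size estimate may be sketched as below. The weights are assumptions: 50 points per real word rated as known and a deduction of 250 points per nonce-word rated as known is the scheme commonly described for tests of this kind, but the text above does not state how X-Lex computes its score, and the function name is hypothetical.

```python
def yes_no_score(real_words_known, nonce_words_claimed,
                 points_per_hit=50, penalty_per_false_alarm=250):
    """Adjusted vocabulary size estimate from a yes/no checklist test.

    Both weights are assumed values for illustration, not documented
    internals of the X-Lex software.
    """
    raw = real_words_known * points_per_hit
    return raw - nonce_words_claimed * penalty_per_false_alarm

# Hypothetical learner: 70 real words rated "known", 2 nonce-words rated "known".
print(yes_no_score(70, 2))
```

The deduction for claimed nonce-words is what keeps over-confident "yes" responses from inflating the estimate.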
In order to examine the lexical variety of teacher input (more than 11 hours of videoed classroom observation), we ran frequency analyses of the teachers' spoken input using VocabProfile. This software is part of the New General Service List and the New Academic Word List (Browne, Culligan, & Phillips, 2013) and was adapted for online application by Tom Cobb from the Université du Québec à Montréal, Canada (Cobb, 2014). VocabProfile performs lexical text analyses by dividing a given corpus into several categories by frequency. These include k1, the most frequent 1,000 words of English, k2, the second most frequent thousand words of English, up to k25 (with the British National Corpus, BNC20, as the reference corpus), as well as academic words of English and a residual category. It thereby assesses the proportions of low and high frequency vocabulary, indicating lexical variety. In addition to that, the software returns standard lexical statistics, such as type-token ratios of the corpus. The general reliability of this tool was assessed, among others, in studies by Meara and Fitzpatrick (2000) as well as Cobb and Horst (2001). In our study, the teachers' videoed input was transcribed and then fed as a text file into the software.
The third methodological tool was a background questionnaire. Complementing the assessment of vocabulary breadth through X-Lex, all participants from the experimental group filled out a background questionnaire at the time of the first measurement in a paper-and-pencil fashion. The questionnaire explored extra-mural English-related activities along with bio-data such as gender, age, family background, school grades in the CLIL subjects, and self-assessed proficiency in the four skills listening, reading, writing, and speaking.
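The kind of frequency profiling described here can be illustrated with a minimal sketch. It is not the VocabProfile implementation: the band lists are tiny hypothetical stand-ins for real frequency lists, matching is done on exact word forms rather than word families, and the function name is invented.

```python
import re
from collections import Counter

def profile(text, bands):
    """Return (share of tokens per band, type-token ratio) for a non-empty text.

    `bands` maps a band name to a set of word forms; tokens found in no
    band are counted as "off-list".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        band = next((name for name, words in bands.items() if tok in words),
                    "off-list")
        counts[band] += 1
    total = len(tokens)
    coverage = {band: n / total for band, n in counts.items()}
    type_token_ratio = len(set(tokens)) / total
    return coverage, type_token_ratio

# Hypothetical miniature band lists, for illustration only.
toy_bands = {
    "k1": {"the", "water", "is", "a", "we", "heat", "and", "then"},
    "k2": {"liquid"},
}
coverage, ttr = profile("We heat the water and then the liquid is a solution",
                        toy_bands)
print(coverage, ttr)
```

A subject term such as "solution" lands in the off-list category here simply because it is missing from the toy lists, which mirrors how subject-specific words can fall outside general frequency bands.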

RESULTS
The analyses of the vocabulary test first focussed on the overall as well as k1 scores (1,000 most frequent words of English) at t1 and t2. First, it was checked that assumptions of normality within the data were met and that there was no over-homogeneity within the variances. Normality was first checked through the inspection of Q-Q plots. Moreover, both Shapiro-Wilk tests and Anderson-Darling tests confirmed that there was no significant departure from normality (p-values of both groups and both measurements > .05). Variances across both groups were checked through Bartlett's tests of sphericity (all p-values > .05).
A first inspection of the two test scores and a visual display shows that both groups exhibit remarkable similarities (see Table 2, Figure 1). What we can see in Table 2 and the two boxplots to the left of Figure 1 is that the groups' results do not differ much. CLIL students show the highest scores at both t1 and t2, as can be seen in the length of the upper whiskers in the plot. Both groups show a considerable range and variability in test scores (all SDs above 650 points). The proximity of the medians in the boxplots and the overlap of the box notches (quasi confidence intervals) illustrate that group medians do not seem to differ significantly, neither at t1 nor at t2. Independent-samples t-tests confirmed that the CLIL group had a slightly but not significantly better start at t1, with a small standardised effect size of d = 0.35, and behaved similarly to the control group at t2, with a negligible effect size of d = 0.13.
The interaction plot on the right-hand side of Figure 1 illustrates a mild increase in test scores over time for the CLIL group as well as a pronounced increase for the control group. A one-way repeated-measures ANOVA over group interacting with time revealed that there was no significant main effect for the CLIL treatment (F(1, 77) = 1.72, p = .19), nor for the interaction over time (F(1, 77) = 1.22, p = .27). In order to check for a regression effect, the vocabulary scores at t2 were modelled using an OLS regression with the test scores at t1 and the grouping factor as predictors. However, even in such a model with the test scores at t1 held constant (F(2, 76) = 11.9, p < .001, R² = 22%), the treatment effect was still insignificant (β = 49.38, SE = 151.67, t = 0.33, p = .75) and had a negligible magnitude of partial η² = 0.0014. Thus, coming back to hypothesis 1, it is not the CLIL exposure over time that predicts the vocabulary gain between t1 and t2. So far, these results are somewhat at odds with predictions made by many CLIL proponents. While we can see vocabulary growth in both groups, and while the CLIL group outperforms the control group in terms of absolute test scores, the relative gain of the control group exceeds that of the CLIL pupils by far. In order to investigate such idiosyncratic behaviour, two possible explanations will be explored in the following. The first one relates to frequency effects as well as the interaction of the vocabulary input students received and justifiable expectations about vocabulary growth based on this specific input. The second one focuses on extra-mural influences (Sylven, 2007; Sylven, 2013), which might prove to be co-determining factors in our design.
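The standardised effect size reported for the group comparison can be illustrated with a short sketch of Cohen's d using a pooled standard deviation. The pooled-SD variant is an assumption, since the text does not say which variant was computed, and the scores are invented rather than taken from Table 2.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical scores for illustration only.
clil_scores = [3000, 3500, 4000]
control_scores = [2800, 3300, 3800]
print(cohens_d(clil_scores, control_scores))
```

Because d expresses the mean difference in units of the score spread, a 200-point lead counts for little when standard deviations exceed 650 points, as they do in this sample.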
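The negligible effect magnitude reported for the treatment can be checked against the other figures in the regression output. The sketch below uses the standard conversion from a single-coefficient t statistic to partial eta squared, t² / (t² + residual df); that this is how the reported value was obtained is an assumption, and the function name is invented.

```python
def partial_eta_squared(t_value, df_residual):
    """Partial eta squared for one regression coefficient from its t statistic."""
    t_squared = t_value ** 2
    return t_squared / (t_squared + df_residual)

# t = 0.33 with 76 residual degrees of freedom, as reported for the treatment effect.
print(round(partial_eta_squared(0.33, 76), 4))
```

With these inputs the result rounds to 0.0014, in line with the reported magnitude.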

Frequency effects of vocabulary input
Frequency is considered a key determinant of language acquisition, and higher frequency forms in the input are predicted to enable earlier automatisation (Milton, 2009; Ortega, 2015). In order to find out if and to what extent vocabulary input by the CLIL teachers could have stimulated vocabulary growth in the CLIL group, teachers' input from three different subjects across 11 hours of class time was submitted to frequency analyses using VocabProfile. The video data were first transcribed, computerised, and then stripped of all proper names, since those would have skewed the true size of the corpus (Milton, 2009). The three clean corpora were then fed into the software, which segmented them into frequency bands of the first (k1), second (k2), and third thousand (k3) most frequent words, academic words (AWL), and a residual category called off-list. The profiler produced the frequency counts as illustrated in Table 3.
The three different subjects (chemistry, geography, history) show a surprisingly similar picture. First, there is the relatively high percentage (88.11-91.45%) of utterances belonging to the 1,000 most frequent English words (k1). This and the low Guiraud index of 7.54-8.65 indicate the repetitive nature of high-frequency words in the CLIL teachers' classroom language (Milton, 2009). While such a prevalence of basic lexis might well be pedagogically justified with respect to the entrenchment of English BICS vocabulary (Ellis, 2013), the low proportion of k2 to k3 words suggests that more advanced and broader word learning, especially with respect to the neighbouring frequency bands, may not have been fostered through this kind of linguistic input in our CLIL groups.
However, two reservations need to be raised immediately. First, following White (2013), it is still unclear what the optimal ratio of unknown vocabulary items to the word total in a text would be; in other words, how many unknown words does a teacher's input need to exhibit in order to provide stimulating but comprehensible input? This reasoning does not refer to the well-known debate around the minimum vocabulary knowledge required to understand authentic texts in reading comprehension tasks (Hu & Nation, 2002; Nation, 2006). While there seems to be a general consensus that this minimal proportion of known words ranges around 95%, there is much less agreement as to the minimal proportion of known words in teachers' input in order to be both comprehensible and stimulating.
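The Guiraud index mentioned above is the number of types divided by the square root of the number of tokens, which makes it less sensitive to text length than a plain type-token ratio. A minimal sketch follows; whitespace tokenisation and the sample utterance are simplifications chosen for illustration.

```python
from math import sqrt

def guiraud_index(tokens):
    """Guiraud index: distinct word forms divided by the square root of all tokens."""
    return len(set(tokens)) / sqrt(len(tokens))

# Hypothetical, highly repetitive classroom utterance.
utterance = "the water is hot and the water is clear".split()
print(guiraud_index(utterance))
```

The more often the same few words recur, the more slowly the type count grows relative to the token count, and the lower the index.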
Second, a closer look at the off-lists in our research data raises doubts as to how and whether the corpus underlying this analysis reflects the desired subject learning in CLIL. For example, various subject-relevant words were relegated to this obscure category. In other words, the underlying algorithm VocabProfile employs may not align with the instructional goals of CLIL, as a considerable number of the subject-specific words are very important for a full understanding of the CLIL subject. In this respect, Hyland and Tse (2007) call for more discipline-specific studies of vocabulary use in academic settings. Nevertheless, according to teachers' input analysis and the power of frequency effects (Ellis, 2013), vocabulary growth should have happened at least within the 1,000 most frequent English words, since those words featured prominently in the input corpora.

Table 3. Corpus Analysis of Teachers' Vocabulary Input from Three CLIL Classes
Note. k1 = the first 1,000 most frequent words of English, k2 = 1,001-2,000 most frequent words of English, k3 = 2,001-3,000 most frequent words of English, AWL = academic word list.

If we now take those insights from our corpus analysis and re-examine our quantitative vocabulary scores within the k1 band, we get a rather different picture. Consider Figure 3 now, which contrasts the overall test scores with the k1 scores (1,000 most frequent words).

Figure 3. Interaction plots for mean test scores at both measurements for overall (left) and k1 (right) results
While the left panel of Figure 3 illustrates a pronounced increase in vocabulary growth for the control group, the right-hand panel shows that, within the 1,000 most frequent words of English, only the CLIL group benefits. When k1 test scores at t1 were standardised (M = 0, SD = 1) and controlled for, a regression model over t2 test scores (F(3, 75) = 23.79, p < .001, adjusted R² = 47%) showed that this CLIL effect over time is significant (β = 0.44, SE = 0.19, t = 2.30, p = .024). In other words, CLIL exposure is effective among the 1,000 most frequent words when measured against a standardised and constant t1 value.
Coming back to hypothesis 2, our results suggest that, within the k1 vocabulary band, vocabulary development was co-determined by CLIL exposure. Let us now have a look at the influence of background variables as determiners for receptive vocabulary growth.
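Rescaling the t1 scores to a mean of 0 and a standard deviation of 1 before entering them as a covariate can be sketched as follows. Using the sample standard deviation is an assumption, and the scores are invented.

```python
from statistics import mean, stdev

def standardise(scores):
    """Rescale scores to mean 0 and (sample) standard deviation 1."""
    centre, spread = mean(scores), stdev(scores)
    return [(score - centre) / spread for score in scores]

# Hypothetical k1 scores at t1, for illustration only.
z_scores = standardise([2000, 2500, 3000])
print(z_scores)
```

Standardising the covariate puts every learner's starting point on a common scale, so the group coefficient can be read as the difference at an average t1 score.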

The influence of background variables
As Sylven (2007, 2013) and Pietilä and Merikivi (2014) pointed out, extra-mural factors might co-determine CLIL-induced vocabulary growth. Consequently, we also explored possible background variables. These included pupils' sex and age, the families' education level, whether they had spent time abroad, their grade in the CLIL subject (geography), their English proficiency (amalgamation of school grades and self-assessment in the four skills), their English activities outside school, as well as the grouping factor CLIL vs. control. After these dimensions of our questionnaire were pooled using principal component analyses, the vocabulary scores at t2 were examined in regression models with eight predictors. After all predictors were checked for variance inflation factors and collinearity, a step-wise linear regression revealed that in the final model (F(2, 69) = 13.67, p < .001, adjusted R² = 0.26) only pupils' English proficiency (β = -344.64, SE = 74.64, t = -4.62, p < .001, partial η² = 0.24) had a significant and substantial partial effect. The β-coefficient in this model was negative because a lower value for this predictor (fed into the model as the scores from the principal component analysis) corresponded to a high proficiency level.
A complementing classification analysis confirmed that only pupils' English proficiency level predicted the second test scores. Coming back to hypothesis 2, extra-mural factors outside the CLIL setting did not play a significant role in our data.
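The regression figures in this section can be cross-checked by dividing each coefficient by its standard error. The sketch assumes that the values reported alongside each β (151.67, 0.19 and 74.64) are standard errors, which is an interpretation of the reported statistics rather than something the text states; the function name is invented.

```python
def t_statistic(coefficient, standard_error):
    """t statistic of a regression coefficient: estimate divided by its standard error."""
    return coefficient / standard_error

# Coefficients as reported; the second argument is read as a standard error (assumption).
print(round(t_statistic(49.38, 151.67), 2))   # overall treatment effect
print(round(t_statistic(-344.64, 74.64), 2))  # English proficiency
```

Under this reading the overall treatment effect and the proficiency effect reproduce the reported t values of 0.33 and -4.62. The k1 model gives 0.44 / 0.19, roughly 2.32, which is close to the reported 2.30 given that both inputs are rounded.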

DISCUSSION
Vocabulary growth is one of CLIL's major language learning driving forces (Bonnet & Dalton-Puffer, 2013). Thus, the aim of this study was to revisit language growth in CLIL classrooms to find out whether new production data match concepts such as frequency effects in teachers' input (Ellis, 2013) and extra-mural factors (Sylven, 2007; 2013).
Let us now discuss the two major findings from our study. First, CLIL students fail to outperform the controls in terms of overall receptive vocabulary growth. However, the frequency analyses of teachers' input revealed that CLIL exposure actually centred mainly on the 1,000 most frequent words of English (k1). This is probably the reason why it was only within this band that we found significant vocabulary growth for CLIL students. We can think of three possible explanations for this unorthodox result. The first one relates to the power of frequency effects. Since CLIL students were vastly more exposed to vocabulary from the k1 band, deeper learning and entrenchment was to be expected (Ellis, 2013). The reasons for such a high occurrence of k1 may lie in the particular pedagogical context of CLIL, in which subject-specific content comprehension and clarification are considered to be of utmost importance by the teachers (Gierlinger, 2007; 2015; Hüttner et al., 2013; Llurda & Lasagabaster, 2010; Nikula, 2010). One of the strategies to reach this aim is explaining and elaborating on subject-specific concepts through basic, high-frequency vocabulary. The high type-token ratio plus the high coverage of the k1 band in our CLIL teachers' input suggests a deliberate effort towards content comprehensibility and clarification. Other research by Nation & Webb (2011) points out that by keeping the vocabulary load lower and increasing its repetitions, the amount of vocabulary learnt will increase.
Second, the CLIL-specific vocabulary growth may reside more significantly in the area of subject-specific vocabulary, which was not covered by the testing tool. However, this raises the question of why the use of subject-specific vocabulary apparently only had a negligible priming effect (Hoey, Mahlberg, Stubbs, & Teubert, 2007) on academic and general vocabulary. In other words, one would have expected a much more pronounced effect between academic and subject-specific vocabulary within the subject classroom discourses. Arguably, this linguistic puzzle may result from a more general lack of academic language use at this age level and within the context of a broad spectrum of learner achievements. Comparative research between the use of academic language in CLIL classes and in mainstream classes could shed more light on this issue.

Third, as suggested by Zydatiss (2012, pp. 27-28), visible receptive vocabulary growth within CLIL may only be expected after a certain critical mass of treatment exposure. Thus, a period of 5-6 months of project-based exposure to CLIL might simply fail to reach such a critical mass and thereby prove less effective. This critical mass phenomenon may be further aggravated by an implicit teaching approach, which may have a tendency to delay noticing and hence language learning (Svalberg, 2007; Williams, 2013). A possible language threshold for CLIL is also tentatively pointed out by Agustin Llach (2014) in her research on primary CLIL. Summing up, frequency and noticing effects may play a vital role in CLIL vocabulary growth.
The other main finding of our study pertains to the role of extra-mural factors. While Sylven (2007) found that extra-mural factors did play a significant role in her study, in our data only learners' proficiency level in English predicted the final vocabulary results. Notice, however, that Sylven (2013), in a theoretical article, related her research outcomes to the extraordinary linguistic situation of Sweden, where English, in her words, is "omnipresent" (p. 310).
These findings raise at least three more issues. First, the question remains whether richer teacher input may not result in a broader vocabulary gain, at least for more advanced learners. Second, would the results have turned out to be the same in a less immersive, more instructed and vocabulary-focused teaching and learning context? The massive amount of recent SLA literature pertaining to the important role of language awareness, the noticing hypothesis, and explicit knowledge for language learning (de Bot, Lowie, & Verspoor, 2006; Ellis, 2015; Ellis & Shintani, 2013; Leow, 2015; Williams, 2013) suggests that these issues need to be addressed by future CLIL research. Third, on a more general level, the overall heterogeneity of CLIL contexts and implementations makes it dangerous to jump to conclusions, or, as Bonnet and Dalton-Puffer (2013, p. 279) put it, "CLIL is itself subject to existing teaching cultures rather than an omnipotent agent of systemic change".
To sum up, our results remain puzzling but maybe also pioneering for the moment. We believe that parts of these issues can be traced to methodological design problems arising from the difficulty that X-Lex, like other vocabulary tools, has in dealing with subject-essential CLIL vocabulary. Notwithstanding these issues, as implications of our research we propose some tentative recommendations for further research and CLIL practice. Given the importance of vocabulary growth in CLIL, there is ample room for further research into CLIL-induced receptive vocabulary development over time. We believe that researching CLIL's potential over longer periods, together with a careful description of the methodological instantiations, will reveal a more realistic picture of the effect of CLIL on the learning of subject and language content. In addition to this, we need more studies on learner and teacher vocabulary with respect to frequency and typology (general, academic, and technical) but also its relationships to CLIL methodologies, ranging from (totally) immersive to (more) form-focused approaches. Our data suggest that the mainstream language bath CLIL metaphor needs to be complemented by more deliberate and form-focused instructional approaches (Grandinetti et al., 2013; Lyster, 2013; Nation, 2011).
Finally, our research turned out to be intrinsically complex, because it studied the development of a complex phenomenon (vocabulary growth) in complex ecologies (classroom learning) among a multi-variant dynamic population (school learners, teachers). Controlling quasi-experimental conditions in such a setting appears challenging. Such factors can make it extremely difficult to adopt traditionally formulated, linearly framed research methods. By applying a longitudinal and mixed-methods approach, we tried to go beyond a popular but possibly too simplistic comparison of CLIL and non-CLIL outcomes only. Although the majority of these comparative studies paint a positive picture with respect to language growth in CLIL, the results of our study prove to be much less straightforward and point towards a complex set of factors influencing language growth in CLIL. All in all, the benefits of our study lie in its explorative and critically reflective nature. Despite these constraints, we hope that our results prove to be sufficiently interesting to merit further investigations.

Figure 1. Boxplots and interaction plot for the X-Lex scores at t1 and t2 by group

Table 2. T-Test Results for the Groups' Overall X-Lex Vocabulary Test Scores at t1 and t2
Note. M = mean, SD = standard deviation, d = Cohen's d.