Visible Learning Research Hub

The Hub provides a space for the Visible Learning team to share background information on the theory and research that support the core aspects of Visible Learning. It also presents ongoing research being conducted by Professor John Hattie and other researchers on Visible Learning and related educational topics.

One of the purposes of the Visible Learning Research Hub is to allow questions about the Visible Learning research to be submitted directly to the Visible Learning team. This ensures that valid, reliable and current information is provided by the team, and that the frequently asked questions (FAQs) offer guidance that represents and supports Visible Learning philosophy and practice.

For further information on the research, please email 

Visible Learningplus Impact Reports

The Visible Learning Impact Reports bring together data sets from our work with you and your consultant teams. The Visible Learning program, framework and tools have been built on Hattie's research into what works best, and these reports show that when the research is put into practice in schools it does have a positive impact on student learning and achievement.

Professor Hattie’s Research 
Follow the link for details of Professor Hattie's more recent publications, which include research that extends the initial findings from the Visible Learning (2009) meta-analysis.

The Visible Learning Research: Frequently Asked Questions

There are often questions about Visible Learning that relate directly to the research, the methodologies, and the interpretation of data. For this reason, the following statements are given in response to the most common questions about Visible Learning.

Why does VL use effect sizes? 
An effect size can be defined as the degree to which a phenomenon is present in a population; that is, the magnitude of an intervention's effect or impact (Cohen, 1988; Kline, 2004). Although there are many reasons for using effect sizes, two main reasons explain why they have become so popular and widespread across various sectors:

  1. The statistic places emphasis on the magnitude of the difference found between samples, without being confounded by the size of the samples being compared. Tests of statistical significance also estimate the size of an effect, but their results are affected by the size of the sample.
  2. An effect size is a standardised and scale-free measure of the relative size of the effect of an intervention.

The use of effect sizes has grown significantly over the past three decades. Such is their perceived value that, across various disciplines, many professional bodies, journal editors and statisticians have mandated their inclusion in order to clarify and substantiate differences in research findings (for example, American Psychological Association Manual, 2001, 2010a; Baugh & Thompson, 2001; Kline, 2004).

How are effect sizes calculated?
There are various ways to calculate an effect size, and the choice depends on what effect is being assessed: either the effect over time for the same students pre- and post-intervention, or a comparison between those who received the intervention (the experimental group) and those who did not (the control group) [see What type of effect size is appropriate?]. Once the type of effect size is established, there are numerous web pages offering calculators that will generate effect sizes from group means and standard deviations. The VL website also provides an Excel spreadsheet and supporting video for guidance on how to calculate effect sizes that assess the same group of students pre- and post-intervention {insert link}. Alternatively, two of the most popular statistical software packages, SPSS and SAS, have effect size capabilities.
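
As an illustration only, and not the official VL spreadsheet or calculator, the short Python sketch below shows both calculations from raw scores. The scores and function names are invented for the example, and the pre/post version divides by the average of the two standard deviations, which is one common convention (pooled or baseline standard deviations are also used).

```python
# A minimal sketch (not the official VL tool) of the two common effect size
# calculations, using only the Python standard library. All scores are invented.
from statistics import mean, stdev

def effect_size_pre_post(pre_scores, post_scores):
    """Same students tested before and after an intervention:
    (post mean - pre mean) divided by the average of the two standard deviations."""
    average_sd = (stdev(pre_scores) + stdev(post_scores)) / 2
    return (mean(post_scores) - mean(pre_scores)) / average_sd

def effect_size_two_groups(intervention_scores, control_scores):
    """Cohen's d for an intervention group versus a control group,
    using the pooled standard deviation of the two groups."""
    n1, n2 = len(intervention_scores), len(control_scores)
    s1, s2 = stdev(intervention_scores), stdev(control_scores)
    pooled_sd = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(intervention_scores) - mean(control_scores)) / pooled_sd

# Example: one class tested at the start and end of a term (made-up scores).
pre = [42, 55, 48, 60, 51, 47, 58, 44]
post = [50, 61, 55, 68, 59, 52, 66, 49]
print(f"pre/post effect size: {effect_size_pre_post(pre, post):.2f}")
```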

Why are effect sizes used when conducting meta-analysis? 
Meta-analysis is the process of quantitatively synthesizing results from numerous experimental studies. One of the main goals of conducting a meta-analysis is to estimate an overall, or combined, effect of one or more interventions across multiple studies. For example, in Visible Learning (2009) Professor John Hattie conducted a meta-analysis of research from a variety of educational contexts measuring a large number of educational interventions, so as to quantify the effect that each contributor had on student learning and achievement. The broader the pool of research data included, the more accurate the quantitative estimate of how much particular contributors (e.g., teacher feedback) affect student learning and achievement relative to others (e.g., homework).

Using effect sizes is one of the most common ways of robustly assessing the effects of interventions across studies. Further, effect sizes themselves promote scientific inquiry because when a particular experimental study has been replicated, the different effect size estimates from those studies can be easily combined to produce an overall best estimate of the size of the intervention effect.
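
As a simple illustration of how such combining is typically done (the standard fixed-effect, inverse-variance approach used widely in meta-analysis, rather than a procedure specific to Visible Learning), the following Python sketch combines invented study figures:

```python
# A minimal sketch of the standard fixed-effect (inverse-variance) way of
# combining effect sizes from replicated studies. The figures are invented
# and are not taken from Visible Learning.
studies = [
    {"d": 0.45, "se": 0.10},   # each study's effect size and its standard error
    {"d": 0.38, "se": 0.08},
    {"d": 0.52, "se": 0.12},
]

weights = [1 / s["se"] ** 2 for s in studies]              # weight = 1 / variance
combined_d = sum(w * s["d"] for w, s in zip(weights, studies)) / sum(weights)
combined_se = (1 / sum(weights)) ** 0.5                    # SE of the combined effect

print(f"combined d = {combined_d:.2f} (SE = {combined_se:.2f})")
```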

The basis of the method is straightforward, and much of the usefulness of meta-analysis lies in its simplicity. The following link provides an excellent description of meta-analysis:

For further reading and technical details on conducting meta-analysis, follow the link.

Can effect sizes be added (or averaged)? 
As noted above, when a particular experimental study has been replicated, the different effect size estimates from those studies can readily be combined to produce an overall best estimate of the size of the intervention effect. However, it has been claimed that effect sizes cannot merely be averaged, as this ignores possible moderating influences.

One of the most fascinating aspects of meta-analysis is the opportunity to evaluate the influence of moderators. Indeed, the search for moderators has long dominated the educational research literature. For example, does ability grouping work differently in math than in music, or for 5-year-olds compared with 15-year-olds? This search, for what is commonly called Aptitude-Treatment Interactions, is as old as the discipline. These interactions have been actively sought by researchers, as finding them would indeed be powerful. However, very few have been reported, and hardly any replicated. But the search must continue. As was noted in VL there were very few, and where they did exist they were pointed out (e.g., the differential effects of homework in elementary and high school). If there is no evidence for moderators, then the average across all moderators can be used to make statements about the influence.

Similarly, there is an established methodology for assessing whether the variance of the effects is so heterogeneous that the average may not be a good estimator. Conducting such tests is basic practice in meta-analysis, and readers were encouraged to go to the original studies to see these analyses. An estimate of the variance was included for each influence (see the dial for each influence), with appropriate commentary where the variance was large. Much time has been spent studying many of the influences with large variance (e.g., feedback), and the story is indeed more nuanced than the average effect reflects.

How accurate are the conclusions drawn from meta-analysis? 
The findings derived from any meta-analysis are only as good as the individual research studies selected for the review. It is an empirical question whether the quality of the study is a moderator. As stated in VL, Lipsey and Wilson (1993), for example, summarized 302 meta-analyses in psychology and education, and found no differences between studies that only included random versus non-random design studies (d = .46 vs. d = .41), or between high (d = .40) and low (d = .37) quality studies. There was an upward bias from published (d = .53) compared with non-published studies (d = .39), although sample size was unrelated to effect size (d = -.03). Further, Sipe and Curlette (1996) found no relationship between the overall effect size of 97 meta-analyses (d = .34) and sample size, number of variables coded, or type of research design, and only a slight increase for published (d = .46) versus unpublished (d = .36) meta-analyses.

There is one exception, which can be predicted from the principles of statistical power: if the effect sizes are close to zero, then the probability of having high confidence in the effect is related to the sample size (see Cohen, 1988, 1990). The aim should be to summarize all possible studies regardless of their design and then ascertain whether quality is a moderator of the final conclusions. For example, in Visible Learning (2009) Professor Hattie noted, where relevant, when quality was a moderator.
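
For illustration only, a very simple moderator check of this kind can be sketched as below: compare the weighted mean effect of higher- and lower-quality studies and ask whether the difference is large relative to its uncertainty. The figures and function names are invented; real meta-analyses use more formal subgroup or meta-regression methods.

```python
# An invented illustration of a simple moderator check on study quality.
def weighted_mean(studies):
    """studies is a list of (effect size, standard error) pairs."""
    weights = [1 / se ** 2 for _, se in studies]
    mean_d = sum(w * d for w, (d, _) in zip(weights, studies)) / sum(weights)
    return mean_d, (1 / sum(weights)) ** 0.5

high_quality = [(0.42, 0.09), (0.39, 0.11), (0.44, 0.10)]   # made-up figures
low_quality = [(0.35, 0.12), (0.40, 0.10)]

d_high, se_high = weighted_mean(high_quality)
d_low, se_low = weighted_mean(low_quality)

# A small z value (well below roughly 2) suggests study quality is not an
# important moderator for this influence.
z = (d_high - d_low) / (se_high ** 2 + se_low ** 2) ** 0.5
print(f"high quality d = {d_high:.2f}, low quality d = {d_low:.2f}, z = {z:.2f}")
```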

The research that Professor John Hattie used in Visible Learning (2009) was based on over 1,100 carefully selected articles and books that met the criteria of having been conducted using rigorous methodological frameworks and robust analysis of the findings. Details supporting the validity, reliability and error associated with the tools used to measure each intervention were also critically reviewed. Further, the overwhelming majority of the research had been conducted by researchers who are considered experts in their field. Where PhD theses have been used, they have been deemed to be of an exceptional level of robustness. Given these selection criteria, Professor John Hattie has full confidence in the integrity of the data used for his meta-analysis review.

How can the variability associated with each influence be evaluated? 
There is a rich literature on calculating the variance associated with each influence. This is a major concern when conducting a meta-analysis, and the methods include evaluating the degree of heterogeneity across the studies and assessing whether the mean is a reasonable typical measure. To mitigate this issue, it is important that the focus then turns to the search for moderators or mediators that help explain what is happening across the studies. This is the focus that Professor John Hattie adopted in Visible Learning (2009) and subsequent publications, and the following articles and book provide an excellent technical understanding of the approaches that can be used, and those that are presented as optimal (a brief illustrative calculation follows the references).

Takkouche, B., Khudyakov, P., Costa-Bouzas, J., & Spiegelman, D. (2013). Confidence intervals for heterogeneity measures in meta-analysis. American Journal of Epidemiology, kwt060.
Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2011). Introduction to meta-analysis. John Wiley & Sons.
Huedo-Medina, T. B., Sánchez-Meca, J., Marín-Martínez, F., & Botella, J. (2006). Assessing heterogeneity in meta-analysis. Psychological Methods, 11(2), 193.
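
As a brief, illustrative companion to those references, the sketch below computes the two most common heterogeneity statistics, Cochran's Q and the I-squared index, from invented study data:

```python
# An illustrative calculation of Cochran's Q and I-squared. Figures are invented.
effects = [0.20, 0.45, 0.70, 0.55]       # effect sizes from four studies of one influence
ses = [0.10, 0.12, 0.09, 0.15]           # their standard errors

weights = [1 / se ** 2 for se in ses]
mean_d = sum(w * d for w, d in zip(weights, effects)) / sum(weights)

q = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, effects))   # Cochran's Q
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100     # % of variation beyond what chance would produce

print(f"Q = {q:.1f} on {df} df, I-squared = {i_squared:.0f}%")
```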

What type of effect size is appropriate? 
There are two major types of effect sizes. One is appropriate for research that is experimental in nature, such as research based on comparing groups: for example, a class that received a treatment, such as reciprocal teaching, compared with another class that did not receive the intervention. The other is used to establish the impact of an intervention over time (pre-post). They have different interpretations, and it is an empirical question whether they differ. In Visible Learning (2009) the majority of the effect sizes were based on research that investigated group comparisons.

Why is the effect size of .4 seen as a ‘hinge point’? 
Various researchers have shown that teachers can and do get effect sizes greater than .4, and this is seen in our work in schools on many occasions. Similarly, in Visible Learning (2009) the .4 average was derived from research based on students in classrooms where teachers had created their own tests or used standardised ones. In many of the VL schools, both nationally and internationally, the average is much greater than .4, and the .4 effect is often used as the hinge point to better understand the nature of success and the impact of VL programs.

This average has not changed since the first studies were published in 1989, which demonstrates its robustness. However, as stated in Visible Learning, care is needed in using this .4. If the tests are measuring a narrow concept then the effects can be higher than if they are measuring a broad concept. For example, the effects can be higher in junior compared with senior classes (often because of this narrow-to-broad distinction). Any effect is only as robust as the measures it is based on. The time over which any intervention is conducted can also matter: we find that calculations over less than 10-12 weeks can be unstable, the time is too short to effect change, and there is a danger of doing too much assessment relative to teaching. These are critical moderators of the overall effect sizes, and any use of the .4 hinge point must take them into account.

Independent estimates, external to the research included in Visible Learning (2009), indicate that the average growth per year in mathematics and reading is approximately .4. This statistic was derived from an analysis of data from NAEP (USA), NAPLAN (Australia), SATs (UK), and e-asTTle (NZ). There is a clear sense of average change, but it is an average, and as noted above it is always critical to look for moderators (it is more like .5 in primary and .3 in secondary schools); for these estimates it is the interpretation that matters.

Should we only focus on influences with high effect sizes and leave out the low ones?
It has been emphasized to readers of the research that it is the interpretation of the effect sizes that is important, and that the league table was primarily used as an organiser. There are many intriguing influences that Professor John Hattie has continued to research and publish on, and this is particularly the case where effects are low. For example, the effects of class size are much lower (but note, still positive) than many have argued. Further, in some cases small effects can be valuable: they can indicate that the intervention is moving in the right direction, that some deeper processes may be changing, or that more time, implementation press, or adjustment is needed. Therefore, just because an effect is not >.40 does not mean it is not worthwhile.

At the high end of the effect sizes is the influence of feedback, which continues to be an area of research for the VL team. Feedback is among the highest but also most variable effects; for example, while much feedback is positive, much is also negative. Feedback is an effect where the variance is critical. While this variance can be understood, research continues into the important moderators and influences relating to feedback. For example, it is critical to distinguish between giving and receiving feedback, between “how am I going” and “where to next” feedback, and the feedback effects from students to teachers. More research on the intricacy of this influence will be published by Professor John Hattie and the VL team.

The low effect for class size, in particular, has been explored at length, and readers are encouraged to:

  • Accept the evidence that the effects are small
  • Understand the reasons why they are small (see Hattie, J. (2007). The paradox of reducing class size and improved learning outcomes. International Journal of Education, 42, 387-425).

How should research be grouped in meta-analyses?
When conducting a meta-analysis it is important to understand the detail of how research has been grouped for analysis. For example, the research on feedback needs much within-group sorting, and this has led to many key interpretations. Grouping is always based on the criteria of the researcher conducting the analysis, which makes it important to include the justification for the groupings. In Visible Learning (2009) Professor Hattie provides a detailed argument for each of the groupings used, which gives validity to the subsequent effect size analyses.

Is some research just too different to robustly average? 
This is a common criticism of meta-analyses and has been dealt with in many other sources. Certainly, combining remains critical, and this is why care is needed to discover moderators and mediators. Moderators have been continually searched for in the research, and there were very few; where they did exist they were mentioned. In Visible Learning (2009) Professor Hattie points out the need to think about moderators.

One concern considered more important is when two quite divergent effects are combined and the average is then assumed to be a good measure of the “typical value”. For example, in “Inductive Teaching” two meta-analyses with effect sizes of d = .06 and d = .59 are combined to a mean effect size of d = .33. Here, it is important to look less at the numbers and more at the interpretation. An interpretative approach was used when reviewing Lott's (1983) meta-analysis, which compared inductive versus deductive teaching approaches in science education (where it made little difference), and Klauer and Phye (2008), who were more interested in inductive reasoning across all subject areas (and this is where higher effects were found). The details of the interventions matter crucially, which is why the story underlying the data is so critical.

Is it appropriate to rank effect sizes? 
Yes, where it assists understanding of the underlying story and its many nuances. In Visible Learning (2009), Professor Hattie chose to rank the relative effect sizes of 138 influences related to student learning and achievement. This list provided a visual presentation of the effect size for each influence, in order to address and understand those with low effects (e.g., teacher subject matter knowledge), those that are lower than expected (e.g., class size), and those with much variance (e.g., feedback).

Meta-analyses are not as good as the original research 
The reason Glass (1976) distinguished between primary, secondary and meta-analysis is that the original data are often unavailable, for reasons such as anonymity and confidentiality. Meta-analysis overcomes this problem by using each study's reported methodology and findings to investigate and compare the effect(s) of an influence or intervention.

There is a whole compendium of risks to the interpretation of statistics, and these are not unique to meta-analyses. For example, Black and Wiliam (1998a) noted that an effect size can be influenced by the range of achievement in the population: “An increase of 5 points on a test where the population standard deviation is 10 points would result in an effect size of 0.5 standard deviations. However, the same intervention when administered only to the upper half of the same population, provided that it was equally effective for all students, would result in an effect size of over 0.8 standard deviations, due to the reduced variance of the subsample. An often-observed finding in the literature—that formative assessment interventions are more successful for students with special educational needs (for example in Fuchs & Fuchs, 1986)—is difficult to interpret without some attempt to control for the restriction of range, and may simply be a statistical artefact.” This problem with restriction of range can occur in primary, secondary and meta-analyses. Campbell and Stanley (1963) highlight many other possible threats to the validity of interpretation of statistics, no matter whether primary, secondary or meta-analysis is used.
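
The arithmetic behind Black and Wiliam's point can be checked with a small simulation; this sketch is illustrative only and simply confirms the numbers quoted above.

```python
# A small simulation of the restriction-of-range point: the same 5-point gain
# looks far larger when only the upper half of the population is measured,
# because that subsample has a smaller standard deviation.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=100, scale=10, size=1_000_000)   # SD of 10 points
gain = 5.0

d_whole = gain / population.std()                            # roughly 0.5
upper_half = population[population > np.median(population)]
d_upper = gain / upper_half.std()                            # roughly 0.83, i.e. "over 0.8"

print(f"whole population: d = {d_whole:.2f}; upper half only: d = {d_upper:.2f}")
```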

Is the use of effect sizes contingent on a normal distribution?
There is no requirement to assume a normal distribution when calculating and using effect sizes. A normal distribution may be required if statistical probability statements are made about the findings, but that is not the task of effect sizes. Similarly, there is no requirement that the standard deviation is the same across studies; standard deviations depend on the scale of the measures used within each study.

Why were confidence intervals not used to help convey the effect size information?
Confidence intervals can be used and are easily calculated from information supplied in meta-analyses. However, where average effects need careful interpretation, accompanied by appropriate commentary about the moderators that seem to affect the average, providing confidence intervals can give a false sense of precision.
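
For completeness, a confidence interval for a single two-group effect size is easy to compute with the usual large-sample approximation; the sketch below uses an invented effect size and sample sizes. Note how wide the interval is for one small study, which underlines why the interpretation, alongside moderators, carries the message rather than the apparent precision of a single number.

```python
# A sketch of an approximate 95% confidence interval for a two-group effect
# size, using the common large-sample standard error formula. Figures invented.
def cohen_d_confidence_interval(d, n1, n2, z=1.96):
    se = ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))) ** 0.5
    return d - z * se, d + z * se

low, high = cohen_d_confidence_interval(d=0.40, n1=30, n2=30)
print(f"d = 0.40, 95% CI: {low:.2f} to {high:.2f}")
```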

 

Editorial Acknowledgments

Common Language Effect Size Estimates (CLEs)
Due to an editing error, CLEs were presented incorrectly in the earlier edition of Visible Learning (2009). However, the interpretations of the CLEs in the 2009 edition were correct. The correct CLEs are available upon request.

Types of Effect Sizes
As mentioned in the FAQs, there are two main types of effect sizes. When analysing the research for his meta-analysis, Professor John Hattie applied the correct approach for the type of methods used by the researchers. For reference, future editions of Visible Learning will indicate which type of effect size was applied throughout the analyses.

Research used for the meta-analysis 
Some of the meta-analyses will not be included in future editions. This has occurred where subsequent research findings negate their significance or relevance to the initial meta-analysis. However, removing this research has resulted in only minor corrections and, overall, has not changed the messages or the story in Visible Learning (2009).

Corrections 
All corrections, critiques and general comments regarding the Visible Learning (2009) research are welcomed.

Future Questions and Comments 
We encourage questions and comments relating to the technical and methodological aspects of Visible Learning to be posted directly to this site, so that all can learn from the answers and feedback given by Professor John Hattie and the VL team.