By Tim Hannan MAPS
Senior Lecturer, School of Psychology, University of Western Sydney and Director, Australian Standardisation Project for WISC-IV, WIAT-II and CELF-4

Assessing children's cognitive abilities is a large part of the work of many psychologists in educational and health settings. The purpose of an assessment is usually to gather information in order to provide informed advice or recommendations concerning some aspect of the child's educational or psychosocial functioning. A major component of an assessment of cognitive abilities involves the administration of tests of intellectual abilities, academic skills, and other cognitive abilities such as language, memory or executive functions.

The importance of scientific knowledge and skills in the selection, administration and interpretation of psychological tests is well established. Appropriate practice in these areas is not merely a general recommendation, but an obligation for psychologists, as emphasised in the APS 'Code of Ethics': "Psychologists must ensure that assessment procedures are chosen, administered and interpreted appropriately and accurately" (Section A.1).

The proposition that there are basic standards of assessment practice would probably seem uncontroversial to most psychologists; but is there any agreement on the question of what kinds of test selection and interpretation practices should be viewed as "appropriate and accurate"? For example, are common tests so similar that test selection is just a matter of individual preference? Is individual subtest interpretation a useful or even essential part of appropriate test interpretation?

Given the egalitarian and democratic leanings of most psychologists, issues concerning test selection and interpretation might well be commonly viewed as simply matters of personal preference, with little effect on the findings of a psychological assessment. However, the methods employed in gathering and interpreting information may be scrutinised for their accuracy and reliability, and for their contribution to timely and effective diagnostic practice.

There are a few matters that seem particularly worthy of comment.

Normative data

It is well recognised that tests should be selected according to demonstrated evidence of reliability and validity for the specific purpose for which the test is employed. A glance at the psychometric properties of many a common test should prompt considerable caution before assuming that its scores accurately reflect a child's cognitive abilities. Wide confidence intervals are a sure sign that a test is not sufficiently reliable. Some tests, such as the Neale Analysis of Reading Ability - Third Edition, even suggest the use of a 68 per cent confidence interval: the psychologist adopting this recommendation must be prepared for the child's true score to fall outside the reported range in roughly one assessment in every three!
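
To make the arithmetic concrete, the sketch below computes a confidence interval around an observed score from a test's reliability, using the standard error of measurement, SEM = SD × √(1 − r). It assumes the usual standard-score metric (mean 100, SD 15) and a purely hypothetical reliability of 0.85, and it centres the interval on the observed score, which is only one convention (some manuals centre it on the estimated true score instead). At the 68 per cent level the multiplier is about 1, so the interval is roughly one SEM either side of the score, and about 32 per cent of such intervals will miss the true score.

```python
import math
from statistics import NormalDist


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score: float, sd: float, reliability: float, level: float):
    """Interval centred on the observed score (one common convention)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    half_width = z * sem(sd, reliability)
    return score - half_width, score + half_width


# Hypothetical values: standard-score metric, reliability of 0.85.
SD, R = 15.0, 0.85
for level in (0.68, 0.90, 0.95):
    lo, hi = confidence_interval(100.0, SD, R, level)
    print(f"{level:.0%} CI: {lo:.1f} to {hi:.1f}")
```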

Another important aspect of a test's utility is the availability of up-to-date, local normative data. Historically, most of the tests in common use in Australia have been standardised in the USA or UK, and Australian normative data has not been available. One positive development in this area is a research project nearing completion at the University of Western Sydney, which is establishing Australian normative data for the Australian language adaptations of three psychological and language tests: the Wechsler Intelligence Scale for Children - Fourth Edition (WISC-IV), the Wechsler Individual Achievement Test - Second Edition (WIAT-II) and the Clinical Evaluation of Language Fundamentals - Fourth Edition (CELF-4). The project, initiated and sponsored by the tests' publisher, Harcourt Assessments (formerly The Psychological Corporation), has involved psychologists and speech pathologists across Australia administering around 3500 assessments to 2000 participants aged between five and 21 years.

Australian psychologists using earlier versions of these tests have relied on the American data in the manuals when evaluating a child's level of performance. Yet data from previous research, and from the recently completed standardisation of the Wechsler Preschool and Primary Scale of Intelligence - Third Edition (WPPSI-III), have revealed that Australian children score, on average, slightly higher than their American peers on that test. This difference is assumed to reflect a higher mean level of parental education, and a similar effect is expected in the WISC-IV standardisation sample.
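
Real norms are published as age-based conversion tables, but a linear simplification is enough to show why the choice of norm group matters. In the sketch below the raw-score means and SDs are invented for illustration; the only assumption carried over from the text is that the local mean is somewhat higher. The same raw score then earns a lower standard score against local norms, so scoring Australian children against American tables would tend to overstate their standing relative to their Australian peers.

```python
def standard_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Linear conversion of a raw score to the mean-100, SD-15 metric for one norm group."""
    return 100.0 + 15.0 * (raw - norm_mean) / norm_sd


# Hypothetical norm parameters on the same raw-score scale (illustration only).
US_NORMS = {"mean": 40.0, "sd": 8.0}
AU_NORMS = {"mean": 41.6, "sd": 8.0}  # assumed slightly higher local mean

raw = 40.0
us = standard_score(raw, US_NORMS["mean"], US_NORMS["sd"])
au = standard_score(raw, AU_NORMS["mean"], AU_NORMS["sd"])
print(f"US norms: {us:.0f}, Australian norms: {au:.0f}")
```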

Re-testing

Another aspect of test use concerns the ubiquitous practice of re-testing children at regular intervals, often to satisfy an administrative requirement that children continue to meet specified criteria for educational placements or support funding. The problem with re-testing is not limited to the possibility that the child remembers specific test items: prior exposure also reduces the novelty of the tasks and familiarises the child with strategies for solving them.

Re-use of a psychological test should always be avoided. Kaufman's (1994) comments refer to the WISC-III, but the problem is common to all tests:

"After a child has taken a Wechsler scale once, the novelty is gone. The initial evaluation assessed the quickness of the child's ability to generate strategies for solving new problems. The next time, especially just a few months later, the tests are no longer novel; the strategies for solving the problems are remembered to some extent…. The retest Wechsler profile, especially on the Performance Scale, therefore, has additional types of error thrown in, not just the expected kinds of error that are present on the first administration of a subtest. Robert deNiro said to Christopher Walken, in The Deer Hunter, that you get 'one shot.' What applies to deer hunting applies to Wechsler profiles. You get one shot" (p. 31).

Subtest interpretation

As various authors have pointed out, the popularity of the practice of attempting to interpret subtest scores on intelligence scales appears to be inversely related to the evidence in support of its diagnostic utility (Bray, Kehle, & Hintze, 1998; Watkins, 2003).

The practice of individual subtest interpretation relies on the assumption that subtests demonstrate incremental validity: that they predict some diagnostically important criterion above and beyond the information supplied by both a test's composite intelligence score and the narrower cognitive-domain factors identified by factor analysis. However, the research literature of the past two decades suggests that this assumption is not well founded, and that subtest interpretation is rarely a reliable or valid interpretative method.
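
Incremental validity is usually examined with a hierarchical regression: fit the criterion on the composite score, add the subtest, and inspect the change in R². The sketch below does this on simulated data in which a single general factor drives every score; the variable names, effect sizes and sample size are invented for illustration and are not drawn from any Wechsler dataset. Because the simulation is built that way, the subtest adds almost nothing once the composite is in the model. A fuller analysis would enter the factor index scores as a second step, test the increment for significance, and ideally use a diagnostic criterion rather than a continuous one.

```python
import numpy as np


def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 from an ordinary least-squares fit of y on X, with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - residuals.var() / y.var()


# Simulated data (illustration only): one general factor drives every score.
rng = np.random.default_rng(seed=0)
n = 1000
g = rng.normal(size=n)
composite = g + 0.3 * rng.normal(size=n)        # stands in for a full-scale score
subtest = 0.7 * g + 0.7 * rng.normal(size=n)    # a single, noisier subtest
criterion = 0.6 * g + 0.8 * rng.normal(size=n)  # e.g. a later achievement measure

# Step 1: composite only. Step 2: composite plus subtest.
r2_composite = r_squared(composite[:, None], criterion)
r2_both = r_squared(np.column_stack([composite, subtest]), criterion)
print(f"R^2, composite only: {r2_composite:.3f}")
print(f"Increment from adding the subtest: {r2_both - r2_composite:.3f}")
```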

The myths surrounding subtest interpretation may have been frequently debunked, but the practice continues to be encouraged in many popular assessment texts (Kaufman, 1994; Sattler, 2001). Many psychologists may view the practice as harmless, a "shared professional myth" (Faust, 1989) that adds colour to a report but does not lead to misinterpretation. However, research exploring the effect of subtest interpretation on the likelihood of misdiagnosis is warranted: it is quite possible that subtest interpretation is not merely invalid, but a cause of misdiagnosis.

To adapt a line from Jensen's (1965) review of the Rorschach, the rate of scientific progress in psychological assessment might well be measured by the speed and thoroughness with which it gets over IQ scale subtest interpretation.

Measurement of change

In determining the nature of a child's difficulties, psychologists often need to compare scores on two different tests. For example, learning disabilities have traditionally been diagnosed on the basis of a comparison between a child's intelligence and academic achievement, requiring an examination of the difference between an IQ score and measures of reading, spelling and arithmetic. While the theoretical, statistical, and pragmatic flaws in the IQ-achievement discrepancy model have effectively led to its abandonment as a diagnostic marker of learning disabilities by most leading researchers, it remains a popular practice.
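
One of the statistical flaws is easy to see in the arithmetic. A simple difference ignores the fact that IQ and achievement are imperfectly correlated, so children with high IQs will, on average, score somewhat lower in achievement purely through regression to the mean. The sketch below contrasts the simple difference with a regression-based discrepancy, one commonly proposed correction; the correlation of 0.6 and the child's scores are hypothetical. For a child with an IQ of 130 and reading of 110, predicted reading is 100 + 0.6 × 30 = 118, so the discrepancy shrinks from 20 points to 8. Neither version answers the theoretical and pragmatic objections to the model; the example only shows how much the chosen method moves the number.

```python
def simple_discrepancy(iq: float, achievement: float) -> float:
    """IQ minus achievement, both on the mean-100, SD-15 metric."""
    return iq - achievement


def regression_discrepancy(iq: float, achievement: float, r_xy: float) -> float:
    """Predicted achievement minus observed achievement.

    Predicted achievement regresses toward the mean: 100 + r_xy * (iq - 100).
    """
    predicted = 100.0 + r_xy * (iq - 100.0)
    return predicted - achievement


# Hypothetical child and a hypothetical IQ-achievement correlation of 0.6.
iq, reading = 130.0, 110.0
print(simple_discrepancy(iq, reading))
print(regression_discrepancy(iq, reading, 0.6))
```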

The concurrent standardisation of the WISC-IV, WIAT-II and CELF-4 has provided a rare opportunity to explore the relationships between measures of intelligence, achievement and language within a fully stratified sample of children. Most of the children and adolescents who participated in the co-standardisation completed two or three of these tests, which will allow base rates of differences between test scores to be established. Children with specific conditions such as language disorders, reading disorders and ADHD were also examined, and it is hoped that comparisons between test profiles will help establish the diagnostic accuracy of various test score discrepancies.
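
Until co-normed base-rate tables are available, a rough theoretical estimate can be derived from the correlation between two tests, since the standard deviation of the difference between two standard scores is 15 × √(2 − 2r). The sketch below assumes both scores are normally distributed on the mean-100, SD-15 metric; the function name and the correlations are illustrative. Empirical base rates from a co-normed sample, like those the project aims to produce, are preferable because real score distributions need not follow the normal model exactly.

```python
import math
from statistics import NormalDist


def base_rate(difference: float, r: float, sd: float = 15.0, two_tailed: bool = True) -> float:
    """Expected proportion of a population showing a score difference at least this large.

    Assumes both scores are normally distributed on the same metric and correlate r (r < 1).
    """
    sd_diff = sd * math.sqrt(2.0 - 2.0 * r)
    tail = 1.0 - NormalDist().cdf(abs(difference) / sd_diff)
    return 2.0 * tail if two_tailed else tail


# Higher correlations between tests make large differences rarer.
for r in (0.4, 0.6, 0.8):
    print(f"r = {r}: {base_rate(15, r):.1%} show a difference of 15+ points")
```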

Another common task for the child psychologist is examining change in a child's scores on a test over time, most often arising from the re-testing policies of many school systems mentioned above. Accurate interpretation of differences between test scores over time depends on (1) determining that a change in scores has actually occurred, and (2) correctly inferring the source of that change. The first step requires psychometric methods that help the psychologist establish that the difference is not attributable to the variability expected from the instrument's imperfect test-retest reliability. The second step is to consider the range of possible factors that may have contributed to the change.
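
For the first step, one widely used approach (not one the text prescribes) is the Jacobson-Truax reliable change index, which scales the observed difference by the standard error of the difference between two scores. The sketch below assumes the standard-score metric and a hypothetical retest reliability of 0.90; the optional practice-effect term reflects the point made earlier that retest scores carry their own systematic gain, and its value would have to come from the test's own retest data. Here an 8-point gain gives an index of about 1.2, short of the conventional 1.96 cut-off. A non-significant index means the change cannot be distinguished from measurement error; a significant one still leaves the second step, explaining the change, to clinical judgement.

```python
import math


def reliable_change_index(score1: float, score2: float, retest_r: float,
                          sd: float = 15.0, practice_effect: float = 0.0) -> float:
    """Jacobson-Truax style reliable change index.

    practice_effect: expected mean gain on retest, subtracted before scaling
    (an optional adjustment; the value must come from the test's retest data).
    """
    sem = sd * math.sqrt(1.0 - retest_r)   # standard error of measurement
    se_diff = math.sqrt(2.0 * sem ** 2)    # standard error of the difference
    return (score2 - score1 - practice_effect) / se_diff


# Hypothetical example: an 8-point gain on a test with retest reliability 0.90.
rci = reliable_change_index(92.0, 100.0, 0.90)
print(f"RCI = {rci:.2f}; reliable at p < .05: {abs(rci) > 1.96}")
```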

Appropriate and accurate use and interpretation of tests requires understanding these issues, and careful consideration of the properties of tests rather than adherence to a shared professional myth. But what might be some implications of this for practitioners?

Four decades ago, the test reviewer Oscar Buros (1961) offered the following comments on the practice of testing:

"If (counsellors, personnel directors, psychologists and school administrators) were better informed regarding the merits and limitations of their testing instruments, they would probably be less happy and less successful in their work. The test user who has faith - however unjustified - can speak with confidence in interpreting test results and making recommendations. The well informed test user cannot do this; he knows that the best of our tests are still highly fallible instruments which are extremely difficult to interpret with assurance in individual cases. Consequently, he must interpret test results cautiously and with so many reservations that others wonder whether he really knows what he is talking about. Children, parents, teachers and school administrators are likely to have a greater respect and admiration for a school counselor who interprets test results with confidence even though his interpretations have no scientific justification. … It pays to know only a little about testing; furthermore, it is much more fun for everyone concerned - the examiner, examinee and the examiner's employer."

As a scientific discipline, psychology must attend to the efficiency and accuracy of its diagnostic practices by demonstrating that these practices are backed by data. If, as Buros suggested, adopting accurate, appropriate methods of test use and interpretation constrains flights of fancy and limits the fun, that is probably a small price to pay in the pursuit of accurate and efficient diagnostic practice.

References

Australian Psychological Society. (2002). 'Code of Ethics'. Melbourne: APS.

Bray, M. A., Kehle, T. J., & Hintze, J. M. (1998). 'Profile analysis with the Wechsler scales: Why does it persist?' School Psychology International, 19, 209-220.

Buros, O. K. (1961). 'Tests in print: A comprehensive bibliography of tests for use in education, psychology and industry'. Highland Park, NJ: Gryphon Press.

Faust, D. (1989). 'Data integration in legal evaluations: Can clinicians deliver on their premises?' Behavioral Sciences and the Law, 7, 469-483.

Jensen, A. R. (1965). 'The Rorschach'. In O. K. Buros (Ed.), 'The Sixth Mental Measurements Yearbook'. Highland Park, NJ: Gryphon Press.

Kaufman, A. S. (1994). 'Intelligent testing with the WISC-III'. New York: John Wiley.

Sattler, J. M. (2001). 'Assessment of children: Cognitive applications'. San Diego: Jerome M. Sattler, Publisher, Inc.

Watkins, M. W. (2003). 'IQ subtest analysis: Clinical acumen or clinical illusion?' Scientific Review of Mental Health Practice, 2, 1-41.