Intuitive Test Theory

Many of us have an intuitive understanding of physics that works surprisingly well to guide everyday action, but we would not attempt to send a rocket to the moon with it. Unfortunately, Mr. Braun and Mr. Mislevy argue, our policy makers are not as cautious when it comes to basing our school accountability system on intuitive test theory.

By Henry I. Braun and Robert Mislevy

ALONG WITH making sure that our bodily needs are met, one of our first tasks upon entering this world is to try to make sense of it. We do so by continuous observation and generalization, as well as by absorbing the norms of the culture in which we find ourselves. Our understandings typically take the form of stories -- narratives, as the psychologist Jerome Bruner has called them. These stories are attempts to identify why people do what they do -- their beliefs, motives, and plans.

This mode of developing and retaining understanding carries over to the physical world, whether natural or human-made. We hear thunder and see lightning, see objects being thrown and falling to the ground, observe cars and computers working (or not), and we construct stories about causes, patterns, and linkages. Now, we make up these stories whether or not we truly understand what is going on. Adults are driven, in exactly the same way as 5-year-olds are, to express their understanding of what is happening around them in terms of narratives.

As Howard Gardner has pointed out, stories can differ, often substantially.

In most domains of knowledge, we develop very powerful theories when we are very young. . . . No one has to tell a kid that heavy objects fall more quickly than light objects. It's totally intuitive. It happens to be wrong. Galileo showed that it was wrong. Newton explained why it was wrong. But, like others with a robust 5-year-old mind, I still believe heavier objects fall more quickly than lighter objects.

The only people on whom these engravings change are experts. Experts are people who actually think about the world in more sophisticated and different kinds of ways. . . . In your area of expertise, you don't think about what you do as you would when you were five years of age. But I venture to say that if I get to questioning you about something that you are not an expert in, the answers you give will be the answers you would have given before you had gone to school.1

Richard Feynman's story for what happens when we throw a rock might be based on the principle of the path of least action and admit to a rigorous rendering in differential calculus, whereas little Jimmy's story is that the rock wanted to get back down to the ground where it belongs. The point is that people construct plausible stories for actions and events based on what they've experienced themselves and on what they've picked up, however loosely or informally, from the culture around them.

The Gardner quote highlights two other aspects of these narratives. The first is their tendency to persist, even in the face of evidence to the contrary or confrontation with methods of analysis that are much more powerful. Bruner makes the same point with respect to what he calls "folk psychology." He defines folk psychology as a system by which people organize their experience in, knowledge about, and transactions with the social world. We learn our culture's folk psychology along with its language and norms of social behavior. Bruner asserts, "Folk psychology changes but is not displaced by scientific psychology."2 It is the persistence of these narratives (say, in physics) that can be so frustrating for teachers.

The second aspect of these stories is that expertise is often very narrowly focused. That is, outside one's area of specialized training, it is uncommon to do much better than a 5-year-old. Indeed, the situation may be even more dire. In a now classic study, the psychologists Amos Tversky and Daniel Kahneman questioned a large number of research psychologists on various aspects of probability and statistics (the design of experiments and the interpretation of the results) that would ordinarily be relevant to their work. Surprisingly, a majority of the respondents harbored naive (and incorrect) beliefs that, presumably, influenced how they conducted their research.3

What is true of psychology or physics is true of just about every discipline you can think of. It is also true, we will argue, in educational assessment. Before we begin to explore this, our own field, we will examine briefly how people who are not experts in physics think about physical phenomena. This "intuitive physics" is a set of basic premises about how the world works. It consists of story elements or subplots, as it were, called phenomenological primitives (or p-prims, for short), a term coined by psychologist Andrea diSessa. These p-prims are primitive notions in the sense that they "stand without significant explanatory substructure or explanation."4 And just as the idea of p-prims can help explain most people's understanding of the physical world, so too can p-prims help us explain the "intuitive test theory" that nonexperts use to explain the world of assessment.

Perhaps it is not surprising that such p-prims -- and the narratives in which they are embedded -- work well enough for most situations in our everyday lives. After all, they are grounded in the experiences of many people over many, many years. They can lead to trouble, though, when employed in situations that lie outside their range, in which case expert models are indispensable. Unfortunately, unlike prescription drugs, p-prims (in physics or other disciplines) are usually not accompanied by warning labels with contraindications for use. In a fast-changing world, it is increasingly likely that we will find ourselves relying on p-prims that are not up to the task.

Intuitive Physics

One consequence of the "cognitive revolution" in psychology that began in the 1960s was a closer look at how people develop expertise in real-life activities as varied as radiology, writing, chess, and volleyball. A significant finding across domains is that experts don't simply know more facts than novices -- although they usually do -- but that they also organize what they know around deeper principles and relationships. The knowledge novices have is more fragmented and is related to particular situations or organized around surface features of problems.

For example, Micki Chi, Paul Feltovich, and Robert Glaser asked expert physicists and novices to sort a number of problems into groups. The novices produced piles of spring problems, pulley problems, and inclined-plane problems. The experts produced piles associated with equilibrium, Newton's third law, and the conservation of energy, each containing some spring problems, some pulley problems, and some inclined-plane problems. The experts' categorization leads directly to solution strategies for the problems.5

When diSessa introduced the term "p-prims" in 1983, it was expressly to explain nonexperts' ways of reasoning about physics. Familiar examples of such p-prims are "Heavy objects fall faster than light objects," "Things bounce because they are 'springy,'" and "Continuing force is needed for continuing motion." These physical p-prims are based on our everyday experience. A box moves when we push it, and it stops moving when we stop pushing. Cannon balls really do fall faster than feathers. Physicists know this, of course, but, when necessary, they can appeal to a deeper level of explanation, to the more sophisticated primitives of scientific physics. The distinguishing feature of intuitive physics (or intuitive reasoning in any field) is that the p-prims are the bottom line. For nonexperts, they are the final explanation. In other words, sometimes we just have to say, "Well, that's just the way it is."

Some of the p-prims of intuitive physics use such words as force, energy, and momentum, a legacy of the general culture or of a physics class taken long ago. But the terms are not employed in the same way that experts use them. Nonexperts don't sort concepts in the same ways as experts or embed them in the same web of qualitative and quantitative relationships. A set of p-prims is not a coherent system, and a person's set of p-prims can easily contain some that contradict others. They are employed to reason about physical situations, and a model of sorts is assembled to address a given situation. The surface features of a situation tend to elicit some p-prims but not others, so a person's intuitive models can be quite different for two situations that are formally equivalent.

The surprising thing is how well they work for guiding everyday action. You can think you are imparting a substance called "impetus" to the tennis ball when you throw it for your dog. The ball flies until the impetus wears off. You estimate how much of this substance you want to impart to the ball and gauge your throw accordingly -- and, by golly, the ball goes where you want it to. Your impetus theory is wrong, but neither you, nor the dog, nor the ball knows this, and the job gets done just fine.

Intuitive physics works well enough for playing catch with your dog or for building a birdhouse. But it doesn't work for constructing a bridge or shooting a rocket to the moon. One aspect of becoming an expert in physics is learning more sophisticated ways of thinking, but another is knowing when you need to use them, and yet another is recognizing when they fail. (Science is also about telling stories, but they are stories that submit to reality checks.) In scientific physics, concepts and relationships that may be nonintuitive, or even counterintuitive, can be brought to bear on familiar and unfamiliar situations alike. Individuals facing challenges that lie outside everyday experience ignore scientific physics at their peril.

Scientific Test Theory

To Americans who go to school or hold jobs in the 21st century, taking tests is an experience nearly as familiar as pushing boxes or watching things fall. So we need to tell stories about tests -- their purposes, their construction, our performances on them -- and we need concepts to do so. Below, we will briefly sketch how experts in assessment think about these aspects of tests. But unless you are an expert in assessment, it is probably not the way you think about them. Indeed, some of the ideas may be quite foreign to you.

A scientific approach to assessment recognizes that, fundamentally, assessment isn't about items and scores. These are more like the springs and pulleys of testing. Rather, assessment is a special kind of evidentiary argument. Assessment is about reasoning from a handful of particular things students say, do, or make, to more broadly cast inferences about what they know, have accomplished, or are apt to do in the future.6

The starting point for an application of scientific test theory is a clear understanding of the purpose of the assessment and a perspective on the nature of the knowledge or skills that are the focus of attention. Next is the link from this view of knowledge and skills, which you can't see, to things that you can see -- right and wrong answers, problem-solving steps, justifications for building designs, or comparisons of characters in two novels in terms of transaction theory, to cite just a few examples. This analysis resolves into making a case, in light of the purpose of the assessment, for what is meaningful in a student's performance and why. A rationale is also required for the kinds of assignments or challenges that will elicit the evidence to support the intended inferences about students. Conceptual links connect tasks to student performances to judgments about what they know and can do. These are the testing counterparts of Newton's laws.

Now, Newton's laws of motion are deterministic. That is, given a complete description of an object (e.g., its mass, current position, and velocity), we can calculate exactly the effect on its motion of an application of a particular force. In test theory, we can formulate a student model that describes one or more aspects of a student's knowledge or skills. Since the components of the student model cannot be observed directly, we have to use probability theory to express our beliefs about the likely values of these components. As we accumulate more data about the student, we can employ the calculus of probabilities to update our beliefs.
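
To make the idea of updating beliefs concrete, here is a minimal sketch of our own, not a procedure taken from any particular testing program. It assumes a hypothetical student model with just three proficiency levels and made-up probabilities of a correct answer at each level; the names LEVELS, P_CORRECT, update_belief, and belief_after are invented for illustration.

```python
# Hypothetical three-level student model; every number here is an
# illustrative assumption, not an estimate from real data.
LEVELS = ["novice", "intermediate", "proficient"]

# Assumed probability of answering a single task correctly at each level.
P_CORRECT = {"novice": 0.2, "intermediate": 0.5, "proficient": 0.8}


def update_belief(prior, correct):
    """Apply Bayes' rule to one observed response and return the posterior."""
    unnormalized = {}
    for level in LEVELS:
        likelihood = P_CORRECT[level] if correct else 1.0 - P_CORRECT[level]
        unnormalized[level] = prior[level] * likelihood
    total = sum(unnormalized.values())
    return {level: weight / total for level, weight in unnormalized.items()}


def belief_after(responses):
    """Start from a uniform prior and fold in each response in turn."""
    belief = {level: 1.0 / len(LEVELS) for level in LEVELS}
    for correct in responses:
        belief = update_belief(belief, correct)
    return belief


if __name__ == "__main__":
    # Three right answers and one wrong answer, in the order observed.
    for level, prob in belief_after([True, True, False, True]).items():
        print(f"{level}: {prob:.3f}")
```

Each new response shifts the probabilities without ever collapsing them to certainty, which is the sense in which the student model is known only through evidence.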

The use of probability-based models to describe what we know, and what we don't know, about a student is a key tool in scientific assessment. It provides a quantitative basis for planning test configurations, calculating the accuracy and reliability of the measurement process, figuring out how many tasks or raters we need to be sufficiently sure about the appropriateness of decisions based on test scores, or monitoring the quality of large-scale assessment systems. We can also apply the tools of probability to new kinds of testing processes, such as ones that select discrete tasks to present to individual students in light of how well they are doing or their instructional backgrounds, or computer-based tests of problem solving in which the problem itself evolves in response to the student's actions. These probability models and their essential role in reasoning are all but unknown to the nonexpert.

It is worth pointing out that the use of probability models to manage information doesn't restrict the kinds of knowledge and skills we can model. While psychometrics arose around 1900 with the goal of measuring traits such as intelligence, the same modeling approach can be applied with all kinds of psychological perspectives and all kinds of data. The variables in the student model can be many or few; they can be measures or categories; they can concern knowledge, procedures, strategies, or attunement to social situations; they can be as coarse as "verbal reasoning" or as fine-grained as "being able to describe playground situations in terms of Newton's laws."

What is observed and how it is modeled and evaluated will depend partly on a psychological perspective and partly on the job at hand. Designing an assessment is like building a bridge. The evidentiary arguments and the probability models are like Newton's laws in that you have to get them right or the entire structure will collapse. But they aren't sufficient to determine the project. In architecture and engineering, decisions about location, materials, and various features of the design are strongly influenced by the resources available, by the situational constraints, and by the needs of the clients. Similar processes are at work in measurement.7

The typical classroom teacher brings to bear little if any of this machinery in constructing, analyzing, and drawing inferences from Friday's math quiz. Usually, this is perfectly fine and appropriate to the purpose and the context. Assessment practices have evolved into familiar forms of testing that often work well enough in common situations. The principles that account for why they work in the situations for which they evolved are there -- invisible but built into the pieces that we can see. Popular conceptions of how and why familiar tests work hold the same ontological status as impetus theory -- dead wrong in the main, but close enough to guide everyday work in familiar settings. It is when we move beyond the familiar that these notions can betray us.

P-Prims Under Scrutiny

Let us now consider a number of p-prims of test theory. Just as in intuitive physics, these are the underpinnings of the view of testing held by many nonexperts. Our goal is to use the insights of scientific test theory to begin to understand how these beliefs might have arisen and in which situations they can break down. In what follows, we sometimes use the phrase "drop-from-the-sky" to describe a test -- by which we mean a test that is developed outside the school context. The term is meant to connote the remoteness of the test from the day-to-day experiences of the students.

A test measures what it says at the top of the page. It is natural to assume that a name carries meaning. Thus we expect that a test called a history test will measure a student's accomplishments or proficiency in history. However, a student's score on such a test can be determined less by how well a student can analyze or interpret historical materials than by a host of other factors that also influence performance and on which individuals can differ substantially. Such factors include, for example, a student's familiarity with the testing situation, the kind of test and mode of administration, and even what the grader of the test is looking for.

A common manifestation of this p-prim is making inferences from test scores that extend well beyond what can be reasonably supported. Perhaps the most notorious example is the overinterpretation of the results of standardized intelligence tests. Performance on a particular drop-from-the-sky intelligence test does typically indicate a capability to do productive reasoning in certain circumstances. But there are many kinds of intelligent behavior, some of which are predicted pretty well by scores on intelligence tests and others that are not.8 For example, people are good chess players not because they are intelligent in a general sense but because -- through study, practice, and reflection on their performance in many, many games -- they have learned a great deal about the patterns and successful strategies in the domain of chess.9

A test is a test is a test. This p-prim is a corollary of the preceding one. Some tests that are called fourth-grade mathematics tests, for example, focus more on concepts, others focus on computations, and still others focus on using math in real-world situations. They reflect different aspects of what students know about and can do with math. Furthermore, a classroom teacher can build her quiz assuming that students are familiar with her notation, item types, and evaluation standards. This is more difficult for a drop-from-the-sky test. Moreover, assessments in the form of projects requiring extended work in math can be done over time as part of a program of instruction, but they aren't well suited for a drop-from-the-sky test that occurs on a single day.

Each assessment can be described in terms of the skills and knowledge it can tell you about, how much information it provides, its implications for learning, how closely it corresponds to students' background and instruction, and its demands on such resources as equipment, money, and student and teacher time. The trick is to match a test -- with all its many characteristics -- with the purpose of the testing and the context in which it will be used. Getting the proper match can be a delicate balancing act. For any number of reasons, the same test can be exactly right for one purpose and situation but quite useless for another. Good test developers know this, and they design different assessments for different purposes in light of the characteristics of the students, the available resources, and the constraints of the setting.

A particularly dangerous fallacy follows from this p-prim: you can take a drop-from-the-sky test constructed to gauge knowledge in a broad content area, give it to students about whom you know little else, and, by coming up with a different way of scoring it, obtain diagnostic information that will be useful for individual, small-scale instructional decisions. This generally doesn't work, and the problem isn't with the items or the scoring rules. It is that effective information about what to do next requires assessment that takes into account what a teacher already knows about a student and provides information in terms of instructional options -- not necessarily better items or more items, just the right items for the right student at the right time. Good diagnostic information results from good match-ups, not from good one-size-fits-all tests.

A score is a score is a score. With all the criticism that testing attracts, it is remarkable how much credence is typically attached to a single test score. After all, the reasoning goes, how could there be a "truer" score than the score a student actually gets? This p-prim is reinforced by the familiar practice of making decisions on the basis of a single test score without considering what the scores might have been in hypothetical administrations of alternative measures. Measurement experts recognize that different data could have arisen from testing on other occasions; from using more, fewer, or different test items; or from employing more, fewer, or different raters. (Perhaps the best way to bring home the concept of "noise" in test scores is to administer multiple tests and let people see for themselves the surprisingly large differences that result.)

Once we decide what we want to make inferences about from the data available, we can use scientific test theory to gauge how much evidence we have and compare it with what might have occurred in a variety of hypothetical alternative situations. This concept, roughly that of measurement error, is not a natural part of everyday reasoning about test scores (with the major exception that occurs when someone's score is lower than he or she expected). Assessment data are not perfect. Relying on a single score without regard to the uncertainty attached to it may be good enough for typical, low-stakes applications, but it is problematic for more consequential ones. Without scientific test theory, we could neither quantify that uncertainty nor evaluate the validity of the use of a particular test score in a particular setting.
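
The "noise" in a single score can be illustrated with a small simulation. The sketch below is ours and rests on loud simplifying assumptions: a hypothetical student who answers every item correctly with the same fixed probability, items that are independent of one another, and repeated administrations that differ only by chance. The function names are invented for illustration.

```python
import math
import random


def simulate_scores(true_p, n_items, n_administrations, seed=0):
    """Simulate number-correct scores for one student over many
    hypothetical administrations of an n_items test.

    Assumption: each item is answered correctly with probability true_p,
    independently of every other item.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        sum(rng.random() < true_p for _ in range(n_items))
        for _ in range(n_administrations)
    ]


def binomial_sd(true_p, n_items):
    """Standard deviation of the number-correct score under the same
    independence assumption (the binomial formula)."""
    return math.sqrt(n_items * true_p * (1.0 - true_p))


if __name__ == "__main__":
    scores = simulate_scores(true_p=0.7, n_items=20, n_administrations=1000)
    print("lowest score: ", min(scores))
    print("highest score:", max(scores))
    print("theoretical SD:", round(binomial_sd(0.7, 20), 2))
```

Under these assumptions, a student whose knowledge never changes still has a standard deviation of about two points on a 20-item test, so any one observed score is only one draw from a range of plausible scores.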

Any two tests that measure the same thing can be made interchangeable with a little "equating" magic. This is intuitive test theory's equivalent of the perpetual motion machine. Why do people believe it? First, it seems to happen all the time. Almost everyone knows that large-scale testing programs like the SAT I and the Iowa Tests of Basic Skills (ITBS) regularly generate new test forms and that psychometricians routinely equate scores on the new forms to scores on the old ones. Second, it seems to make sense, because it follows from the preceding p-prims. If you think that tests measure what they say they measure and that all tests that measure it are essentially the same, and if you don't concern yourself with measurement error, then there is no apparent reason not to treat evidence from different tests as more or less equivalent.

But the strength of the correspondence between the evidence from one test and that from another, superficially similar, test is determined by the different aspects of knowledge and skills that the two tests tap, by the amount and quality of the information they provide, and by how well they each match the students' instructional experiences. The SAT I and ITBS testing programs can do this not so much because of the equating procedures they use but because they expend considerable effort in creating test forms with very similar combinations of questions (item types, content areas, mix of difficulties), in order to tap the same sets of skills in the same ways. When tests are not designed to be "parallel" in this way, quantifying in what ways information from one test can be used as if it came from another requires expert-level (scientific) test theory. Some inferences across tests will work well, and others will fail.

With legislation mandating the measurement of student progress and the establishment of common standards for achievement, policy makers have expressed considerable interest in linking tests from different states or different test publishers to the National Assessment of Educational Progress (NAEP). There is a long and definitive line of scientific publications pointing out the very real limitations of linking and equating different tests with the same name.10 Unfortunately, the notion that disparate tests can somehow be made equivalent by applying equating magic will not die, because life would be much easier if it were true. And by the reasoning of intuitive test theory, there is no reason why it can't be done.

You score a test by adding up scores for items. Almost all classroom quizzes and tests are graded in this way, and it works just fine for their purposes. Consequently, one can hardly be blamed for holding this p-prim. But it presumes that the target of inference is a student's overall proficiency in some domain and that the tasks on the test are relatively independent positive indicators of that proficiency. Indeed, this is the simplest (and most familiar) case of a relationship between targets of inference and bits of evidence about them. When interest focuses on dependencies among more complex forms of evidence and multifaceted models of knowledge and skill, however, this "natural" approach to scoring is severely deficient.

This approach fails for large integrated performances such as the videotaped lesson plans and teaching sessions of the National Board for Professional Teaching Standards, because multiple, interconnected judgments across many parts of the work are required. It fails for interactive problem-solving simulations (e.g., troubleshooting or patient management), because each action taken changes the situation and constrains or facilitates the next action. It fails for collections of tasks that tap a variety of skills and knowledge in different mixes, such as language tests that assess not only vocabulary and grammar but also how to conduct meaningful conversations, use cultural information, and accomplish real-world aims such as bargaining. Patterns of what is done well and where performance is inadequate are required, with the added complication that people trade off their strengths against their weaknesses when they use language in real life.

This approach also fails for assessments that aim to distinguish conceptions and misconceptions (as opposed to correctness). That is, it fails when the goal isn't to count how many problems a student can solve, but rather to develop a useful description of her thinking -- so that we can better decide what she might work on next to improve her understanding.

In all of these cases, simple scoring rules don't make the "grade" because they extract only a part of the evidence contained in students' responses -- sometimes completely missing the patterns that are most important -- and therefore can't support the nuanced inferences that are desired. Scientific test theory, extended and elaborated as needed to deal with new kinds of data and new kinds of inferences about students, is the best foundation both for effectively designing these more complex assessments and for making sense of the data they produce.

An A is 93%, a B is 85%, a C is 78%, and 70% is passing. This p-prim follows from the previous one, with the additional assumption that the tasks that make up a test have been written so that these percentages line up nicely with the traditional percent-correct metric of satisfaction for how well students have done on tests of materials that were specifically matched to their instruction. It presumes that somehow, for all tests and all uses and all students, the same percentage of correct answers corresponds to the same level of performance.

A colleague who works on certification and licensing tests tells the story of a state legislature that passed a law mandating that "the passing score on the plumber's licensing exam will be 70." Following good test-design practices, our colleague worked with plumbers to determine the kinds of knowledge and skills needed to be a competent plumber, one who is able to ply the craft ably and with due regard to safety. The committee then created a collection of tasks to probe the targeted knowledge and skills and pilot-tested them with groups of competent plumbers and with apprentices who were judged to be not yet ready to practice on their own. A passing score was selected that best differentiated the two groups. This is a sound foundation for creating a valid licensing assessment and setting a defensible level of performance for a high-stakes decision. When they got that number, it shouldn't have mattered what its numerical value was. Within the constraints of the testing program, it had been constructed to be a valid cut point for the purpose of obtaining a license. As a final step, however, the test developers had to add (or subtract) a "fudge factor" to make the passing score exactly 70.
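
The logic of choosing a cut score from two contrasting groups can be sketched in a few lines. This is an illustration under our own assumptions, not the procedure our colleague used: the scores are invented, and "best differentiated" is interpreted here as "classifies the largest number of examinees correctly," which is only one of several reasonable criteria.

```python
def best_cut_score(competent_scores, not_ready_scores):
    """Return (cut, number_correctly_classified).

    An examinee passes when score >= cut. The cut is chosen to maximize
    the count of competent examinees who pass plus not-yet-ready
    examinees who fail. Ties go to the lowest qualifying cut.
    """
    candidates = sorted(set(competent_scores) | set(not_ready_scores))
    best_cut, best_correct = None, -1
    for cut in candidates:
        correct = sum(s >= cut for s in competent_scores)
        correct += sum(s < cut for s in not_ready_scores)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut, best_correct


# Hypothetical pilot-test scores, invented for illustration only.
COMPETENT = [62, 58, 66, 71, 60, 64, 59]
NOT_READY = [41, 48, 52, 45, 57, 50]

if __name__ == "__main__":
    cut, correct = best_cut_score(COMPETENT, NOT_READY)
    print(f"cut score {cut} classifies {correct} of "
          f"{len(COMPETENT) + len(NOT_READY)} examinees correctly")
```

With these made-up data the defensible cut falls at 58, not 70. The number that separates the groups depends on the tasks and the examinees, not on a figure fixed in advance.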

This p-prim is plausible because for many of the tests we took in school, this grading scheme is not a bad choice. But this didn't happen by accident. Good teachers who wanted to use this grading scheme thought carefully about what they wanted students to learn and about the conditions under which students could exhibit that learning. They set up tasks and evaluated them to get data. Then they looked hard at the numbers. If the scores they saw from their students didn't jibe with their expectations, they went back to the drawing board to figure out why. Were the items unreasonable or unclear? If so, then revise or replace them. Were the students just not learning what was intended? If so, then check whether the students have the background they need, verify that they are really working, improve the pedagogy, and so on.

The difficulties encountered in applying this p-prim and the previous one in more complex settings have led to advances in measurement theory. Indeed, it is possible to construct both easy and hard tests from the same collection of items, and the same level of knowledge will produce a higher score on the easy test than on the hard one. Psychometric models based on item response theory originated in the 1960s to characterize items in terms of their difficulty and other features, so that students can be given different sets of items and still be compared on the same scale -- harder ones for fifth-graders and easier ones for third-graders, for example, or computer-administered tests that are customized to each examinee on the basis of his or her performance as it unfolds.11 So what now is an A, a B, or a C? You can't decide just by calculating the percentage of correct answers; you should decide on the basis of the pattern of correct and incorrect answers, taking into account the relative difficulty of the items presented.

Under some circumstances, the results may be reasonably well approximated by a simple sum. But the underlying principles provide a deeper understanding of why the standard procedures work in familiar situations, as well as the machinery for creating new procedures for novel situations -- very different arrangements of springs and pulleys, but undergirded by the same Newtonian laws.
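
The claim that the same level of knowledge yields different percent-correct scores on easy and hard tests can be checked with the simplest item response model, the one-parameter logistic (Rasch) model. The sketch below is ours: the item difficulties are invented, and the Rasch model is chosen only because it is the simplest member of the item response theory family, not because any particular testing program uses it.

```python
import math


def p_correct(theta, difficulty):
    """Rasch (one-parameter logistic) probability of a correct response
    for a student of proficiency theta on an item of given difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_percent_correct(theta, difficulties):
    """Expected percent-correct score on a test made of these items."""
    total = sum(p_correct(theta, b) for b in difficulties)
    return 100.0 * total / len(difficulties)


# Hypothetical item pool, ordered from easiest to hardest.
ITEM_POOL = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
EASY_TEST = ITEM_POOL[:5]  # the five easiest items
HARD_TEST = ITEM_POOL[4:]  # the five hardest items

if __name__ == "__main__":
    theta = 0.0  # one student, one fixed level of proficiency
    print("easy test:", round(expected_percent_correct(theta, EASY_TEST), 1))
    print("hard test:", round(expected_percent_correct(theta, HARD_TEST), 1))
```

For this hypothetical student the expected score is about 71 percent on the easy form and about 29 percent on the hard form, although the proficiency value is identical in both cases. A fixed rule such as "70% is passing" would pass the student on one form and fail the same student on the other.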

Multiple-choice questions measure only recall. This p-prim is often stated as an epithet, as part of a comparison to open-ended questions. Certainly most of the multiple-choice questions that people encounter in school test only recall, and it is surely true for multiple-choice questions written by someone who believes the p-prim. But while factual recall items may be the easiest kinds of multiple-choice items to write, other types are certainly possible. For example, a multiple-choice test of subtraction can be written so that patterns of right answers and wrong answers will reflect particular misconceptions and tell us more about a student's understanding than would overall performance on a test made up of only open-ended items.
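
A rough sketch shows how the wrong options on a subtraction item can carry diagnostic information. The two faulty procedures coded below are illustrative assumptions on our part, as are the items and the function names; a real diagnostic test would draw its options from research on students' actual errors. The sketch handles only two-digit problems in which the first number is larger.

```python
def correct_procedure(a, b):
    return a - b


def smaller_from_larger(a, b):
    """Assumed error rule: in each column, subtract the smaller digit
    from the larger one, whichever number it belongs to."""
    tens = abs(a // 10 - b // 10)
    ones = abs(a % 10 - b % 10)
    return 10 * tens + ones


def borrow_without_decrement(a, b):
    """Assumed error rule: borrow for the ones column but forget to
    reduce the tens digit afterward."""
    ones = (a % 10 - b % 10) % 10
    tens = a // 10 - b // 10
    return 10 * tens + ones


RULES = {
    "correct procedure": correct_procedure,
    "smaller-from-larger": smaller_from_larger,
    "borrow without decrement": borrow_without_decrement,
}


def diagnose(responses):
    """responses is a list of ((a, b), chosen_option) pairs.

    Return the name of the rule that reproduces the most chosen options,
    together with the full tally for every rule.
    """
    counts = {name: 0 for name in RULES}
    for (a, b), chosen in responses:
        for name, rule in RULES.items():
            if rule(a, b) == chosen:
                counts[name] += 1
    return max(counts, key=counts.get), counts


if __name__ == "__main__":
    # Two hypothetical students; both answer all three items incorrectly.
    student_1 = [((62, 38), 36), ((81, 47), 46), ((73, 46), 33)]
    student_2 = [((62, 38), 34), ((81, 47), 44), ((73, 46), 37)]
    print(diagnose(student_1)[0])
    print(diagnose(student_2)[0])
```

Both hypothetical students would receive the same total score of zero, yet the pattern of options chosen points to a different misconception for each, and therefore to different next steps in instruction.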

Similarly, research in physics education sparked by work like diSessa's has led to the development of multiple-choice tests that reveal which p-prims students are using. Rather than the usual open-ended computation and modeling items, the items on the Force and Motion Conceptual Evaluation present descriptions of everyday situations and ask students to choose explanations of what is happening or predict what will happen next.12 Some alternatives reflect Newton's laws, but others reflect p-prims that are more consistent with Galileo's thinking, medieval impetus theory, Aristotle's beliefs, or wholly nonscientific reasoning. The situations vary in ways that research suggests will bring particular p-prims to light.

For example, Newton's third law says that for every action (or force) there is an equal and opposite reaction. If object number 1 exerts a force on object number 2, then object number 2 exerts an equal and opposite force on object number 1. When a car and a small truck of the same weight moving at the same speed collide head-on, most students choose the response that says, "The truck exerts the same amount of force on the car as the car exerts on the truck." That's okay so far, but this is a canonical example for the third law -- it is easy to give the answer Newton would without understanding the underlying principle. When the small pickup truck is replaced with a huge semi traveling only half as fast, more students choose "The truck exerts a larger force on the car" because the truck is larger. Or they choose "The car exerts a larger force on the truck" because the car is going faster. These responses reflect alternative -- and in this case, conflicting -- p-prims.

In and of itself, the format of a task -- be it multiple-choice, open-ended, simulation-based, or hands-on performance -- doesn't fully determine the kind of thinking it will elicit from a student. What's more, the same task can give rise to different kinds of thinking in different students, depending on how it fits with their background and experiences. To a high school algebra student, figuring out the sum of the numbers from 1 to 100 is a simple application of a familiar formula. But rather different cognitive processes were at play when the 7-year-old Karl Friedrich Gauss derived the formula as an original insight.

Multiple-choice items can be used to test recall of facts, and most of them are used in this way. But if one has clearly in mind the concepts and relationships one wants to probe, as well as the kinds of discriminations that an understanding of them entails, then it is possible to write multiple-choice items that go far beyond recall. The principles for creating such items aren't obvious and, unfortunately, aren't a part of most people's theory of tests.

You can tell if an item is good by looking at it. Like most of the others, this p-prim rests on the assumption that items and tests are really simple objects whose essence can be grasped by their surface characteristics. However, for an item to serve a given purpose, there has to be a reasonable coherence between its particular purpose, what the item provides and what it requires, the student's understanding of the context of the item and the scoring rules, and what else the assessor knows about what the student knows. A bad mismatch at any point, and the item may fail to generate the evidence needed, no matter how "good" it looks.

For example, consider an open-ended item devised by a teacher for her Advanced Placement calculus class that uses her notation, will be scored with the rubric her students have become used to, and calls for applying what they've been studying for the last month to a real-world situation that is similar to one discussed in class. This is an ideal probe to elicit their understanding of an important learning objective. However, it would be a poor item to include in the grade-12 NAEP, which presents tasks to a random sample of students across the country -- many of whom would not be familiar with the notation or the grading rubric. Ten minutes of valuable testing time would be wasted for almost everyone who confronted the question. (The converse of this p-prim is more nearly true: You can often tell an item is bad just by looking at it. Logical flaws and confusing instructions, for example, will keep an item from providing useful information for almost any purpose.)

That the appropriateness of an item depends on "more than meets the eye" implies that writing good items is more difficult than most people would imagine. In addition to having a coherent conceptual framework and a strong evidentiary perspective, item writers must also work under constraints of time and money as they build tasks and assemble tests. It is not a vocation for the faint of heart or the novice, as recent missteps in many high-stakes state tests attest. Ironically, the more one knows about writing test items, the more challenging it is to write good ones.

Multiple-choice tests equal standardized tests equal high-stakes tests. Many of the highly visible tests used today for college admissions, for licensure and certification, and for state accountability for public schools are alike in three important ways: they have meaningful consequences for students or schools, they are presented under standard conditions, and they use multiple-choice items. This configuration occurs often enough that these three distinct properties are conflated in the public eye so that the adjectives "multiple-choice," "standardized," and "high-stakes" are thought to be synonymous -- all ways of describing the same familiar package.

But high-stakes tests can be less standardized and require performances, as is the case with doctoral dissertations and solo flights for pilot certification. Multiple-choice items are found as often in low-stakes classroom quizzes as they are in high-stakes assessments. Finally, standardization is not an all-or-nothing quality. For each aspect of an assessment, there are options about how similar to make the experience for different examinees. And, as always, seeking to standardize involves tradeoffs. Greater similarity across examinees in some facets tends to support comparisons and facilitate communication of results across time and distance. More individualization allows the tests to be better targeted to individuals' circumstances, although the interpretation of results is more tightly bound to those circumstances.

Discussion

While intuitive test theory is sufficient for classroom testing and for the quizzes in Seventeen magazine, it gets you into trouble when you want to evaluate performance on simulation-based activities, run a high-stakes testing program, or measure change in populations using an achievement survey like NAEP. There is a strong similarity -- and an important difference -- between intuitive physics and intuitive test theory that has implications for assessment use and policy. As one's understanding and expertise in physics become more profound, the concepts and tools depart from everyday physics. The same is true with assessment design and analysis at the frontiers.

It is generally accepted that this is the case in physics and, moreover, that the complexity must be confronted if one is embarking on a serious undertaking. Consider the paradigmatic example of launching a rocket to the moon. In fact, in 1961, when President Kennedy made his famous promise that by the end of that decade the U.S. would send a man to the moon and return him safely back to Earth, his staff had already consulted with experts about the feasibility of such an endeavor. Two points are noteworthy. First, everyone expected that all the options that would be considered would be in accord with Newton's laws of motion, not Aristotle's. Second, President Kennedy did not assert that, on its flight to the moon, the rocket would have to meet specific milestones that he and his advisors deemed appropriate.

In most issues that involve technical considerations, experts are consulted, and their perspectives become part of the policy debate. They don't make the decisions, and they shouldn't. In any social setting, there are more considerations than purely technical ones. But policy options should be restricted to those that are in accord with basic principles and broadly held standards of practice -- the analogs of Newton's laws of motion.

Unfortunately, this is often not the case in assessment, as a review of the testing policies in many states and the legislative history of the No Child Left Behind Act demonstrates. As assessment-based accountability becomes a more prominent feature of education policy, those standing on the technical side of assessment must confront the reality that critical decisions are made and regulations are drafted on the basis of intuitive test theory, with untoward consequences a likely result. The advent of technology-based assessment may, in many ways, exacerbate the problem. No doubt voluminous data will be produced, but insight will still be in short supply. In fact, a disciplined application of the principles of evidentiary reasoning to design, development, and analysis will be all the more necessary if the investment in technology is to yield meaningful returns.

We remain, then, with the problem that p-prims are both widely held and persistent. What, then, should those of us in educational measurement do? There are at least three lines of attack, one negative and two positive. First, we should not shy away from critiquing policies and programs that are based on intuitive test theory. This involves telling lots of people (some of them very important) that what they want to do won't work and that doing something right is harder or takes longer than they might like.

A second approach is to use scientific test theory, in conjunction with developments in psychology and technology, to achieve goals that could not have been accomplished otherwise -- certainly not by relying on intuitive test theory. These existence proofs are the most compelling argument for test theory as a scientific discipline and for its utility in the setting of education policy.

Finally, we need to do a much better job of communicating to a variety of audiences the basics of testing and the dangers we court when we ignore the principles and methods of educational measurement. Communication is a form of teaching, and we should take the challenge of this kind of teaching more seriously than ever before. Perhaps we should consider using narratives as a framework for this effort. We have an obligation to be as creative in this effort as we pride ourselves on being in our technical research.


1. Howard Gardner, Educating the Unschooled Mind (Washington, D.C.: Federation of Behavioral, Psychological, and Cognitive Sciences, 1993), p. 5.

2. Jerome Bruner, Acts of Meaning (Cambridge, Mass.: Harvard University Press, 1990), p. 14.

3. Amos Tversky and Daniel Kahneman, "Belief in the Law of Small Numbers," Psychological Bulletin, vol. 76, 1971, pp. 105-10.

4. Andrea diSessa, "Phenomenology and the Evolution of Intuition," in Dedre Gentner and Albert L. Stevens, eds., Mental Models (Hillsdale, N.J.: Erlbaum, 1983), p. 15.

5. Micki T. H. Chi, Paul Feltovich, and Robert Glaser, "Categorization and Representation of Physics Problems by Experts and Novices," Cognitive Science, vol. 5, 1981, pp. 121-52.

6. Robert J. Mislevy, "Substance and Structure in Assessment Arguments," Law, Probability, and Risk, December 2003, pp. 237-58.

7. Henry I. Braun, "A Postmodern View of the Problem of Language Assessment," in Antony J. Kunnan, ed., Studies in Language 9: Fairness and Validation in Language Assessment: Selected Papers from the 19th Language Testing Research Colloquium (Cambridge: Cambridge University Press, 2000), pp. 263-72.

8. Howard Gardner, Frames of Mind: The Theory of Multiple Intelligences (New York: Basic Books, 1983); and Robert J. Sternberg, The Triarchic Mind: A New Theory of Human Intelligence (New York: Viking-Penguin, 1988).

9. Adrianus de Groot, Thought and Choice in Chess (The Hague: Mouton, 1965).

10. See, for example, Michael J. Feuer et al., eds., Uncommon Measures: Equivalence and Linkage Among Educational Tests (Washington, D.C.: National Academies Press, 1999).

11. Howard Wainer et al., Computerized Adaptive Testing: A Primer, 2nd ed. (Hillsdale, N.J.: Erlbaum, 2000).

12. Ronald K. Thornton and David R. Sokoloff, "Assessing Student Learning of Newton's Laws: The Force and Motion Conceptual Evaluation," American Journal of Physics, vol. 66, 1998, pp. 338-52.


HENRY I. BRAUN is Distinguished Presidential Appointee at the Educational Testing Service (ETS), Princeton, N.J. ROBERT MISLEVY is a professor in the Department of Measurement, Statistics, and Evaluation, University of Maryland, College Park. They wish to acknowledge Neal Dorans, Paul Holland, and Howard Wainer for stimulating conversations on the topic of this article during its preparation. The research reported here was underwritten by ETS and grants from the Office of Educational Research and Improvement, U.S. Department of Education (No. R305B60002), and from the National Center for Research on Evaluation, Standards, and Student Testing, UCLA. However, the opinions expressed are solely those of the authors.


Last modified 3/2/05
URL: http://www.pdkintl.org/kappan/k_v86/k0503br1.htm
© 2005 Phi Delta Kappa International