Performance Assessment and the New Standards Project: A Story of Serendipitous Success
By Elizabeth Spalding
In trying to invent a system of "tests worth teaching to," the New Standards Project achieved serendipitous success by developing and propagating a model for helping teachers form communities of learners and inquirers, Ms. Spalding points out.
Illustration by Linda Weatherly
HISTORY offers countless examples of explorers like Christopher Columbus and Ponce de León who set out with one destination in mind but arrived at another. Many important scientific and technological advances have also resulted from apparent failures. In 1856, for example, William Perkin's failure to produce synthetic quinine resulted in his discovery of aniline purple, which gave birth to the modern dye industry. Nylon looked like an unpromising product until some DuPont chemists fooling around in the laboratory stretched it out and discovered a process that made it one of the most successful synthetic fibers ever. Spencer Silver, a 3M scientist, failed in his attempt to create a permanent adhesive. The substance he developed adhered only temporarily -- and became the "sticky" in today's ubiquitous "sticky notes."1 The history of exploration and discovery teaches us that failure and success are not absolutes and must be judged in context.
In a special section on performance assessment in the May 1999 issue of the Kappan, Edward Haertel claimed that the movement to develop large-scale, performance-based assessments has failed to bring about "sweeping education reform," as some of its proponents hoped and perhaps rashly promised.2 But is it possible that this movement, while failing to achieve one goal, has achieved others that are equally important, if not more so? To answer this question, it may be helpful to look closely at the case of one organization that aspired to develop "tests worth teaching to" -- the New Standards Project.
Of all the standards and assessment projects of the 1980s and 1990s, the New Standards Project (NSP) was unquestionably the most ambitious. A national coalition of approximately 17 states and seven urban school districts, co-directed by Lauren Resnick of the Learning Research and Development Center of the University of Pittsburgh and Marc Tucker of the National Center on Education and the Economy in Washington, D.C., the New Standards Project explicitly aimed to "create tests worth taking."3 Throughout the first half of the 1990s, the NSP sponsored hundreds of meetings involving thousands of educators in the crusade to build assessments toward which teachers would want to teach.4 Thousands of students in grades 4, 8, and 10 completed New Standards performance assessment tasks and compiled portfolios to demonstrate their ability to meet performance standards that were being developed simultaneously.
New Standards research and development units were housed at various sites across the country, including the University of Pittsburgh and the University of California, Berkeley. The Literacy Unit eventually came to be housed at the National Council of Teachers of English (NCTE) in Urbana, Illinois. Its co-directors were Miles Myers, then executive director of NCTE, and David Pearson, then dean of the College of Education at the University of Illinois, Urbana-Champaign. From 1991 to 1996, I served as onsite coordinator of the Literacy Unit of the New Standards Project.5
Today, New Standards has scaled back its vision considerably. It devotes most of its resources to developing and marketing "reference examinations," which are more efficient to administer and score than its earlier products. Large, national meetings are rare. By many measures, the New Standards Project looks like a classic case of an education reform that failed.
In fact, the experiences of New Standards and several of its partner states and districts serve as the primary sources for some of the major critiques of large-scale performance assessment presented by Haertel and other authors in the May 1999 Kappan. These critiques include:
1. The reliability and generalizability of performance-based assessments are questionable, making them unsuitable for purposes of accountability.
2. In practice, the impact of external performance-based assessment and portfolios on curriculum and instruction has been minimal and, in some cases, negative.
3. Because of their open-ended nature, performance-based assessments and portfolios do not lend themselves to standardization of implementation or scoring. Not only are such assessments unwieldy and intrusive, but they have done little to shrink performance gaps between and within groups, thereby raising questions about equity and opportunity-to-learn.
These critiques are balanced by the authors' cautious, positive claims, which include:
1. Performance-based assessment and portfolios reflect contemporary, constructivist views of knowledge, thinking, and learning. Traditional, multiple-choice testing was based on behaviorist views, which are now passé in many circles.
2. Performance-based assessment and portfolios have the potential to influence curriculum and instruction in a positive way.
3. At the moment, performance-based assessments and portfolios look like our best hope for providing meaningful information about the performance capabilities of students and for bringing about institutional change.
My purpose here is not so much to counter the charges against large-scale performance assessment as to show a dimension of it that is often overlooked when debates focus on technical issues: its impact on the professional growth of classroom teachers. Through their participation in the New Standards Project, teachers significantly advanced their knowledge of the relationship between classroom instruction and internal and external assessment. They engaged in local and national discussions about the purposes of education and the nature of literacy. Perhaps most important, New Standards enabled teachers to escape the isolation historically associated with their profession and to become members of a community of inquirers. New Standards acknowledged that teachers are experts and sent the message that teachers' work and knowledge matter. "Sweeping education reform" -- if and when it does come -- will have to proceed classroom by classroom and thus teacher by teacher. The greatest success of the New Standards Project has been its contribution to teacher professionalization.
The NSP and the Development of Literacy Assessments
By the mid-1980s, states, districts, schools, and teachers from Maine to California were already developing alternatives to standardized, multiple-choice testing of students' literacy skills. The problem was that all those working at these sites were toiling, for the most part, independently on a labor-intensive, costly process. The New Standards Project made it possible for Maine, California, and points in between to meet, pool resources, and share expertise.6
The long-term goal of the NSP was to create an assessment system consisting of "three P's": performance examinations, portfolios, and projects. NSP partners would be able to implement all or parts of this system within their state or district assessments. Initially, NSP decided to focus on creating a "bank" of performance tasks that would yield information about the literacy skills of fourth-, eighth-, and 10th-graders.
While a number of partner states and districts had already developed performance tasks to assess reading and writing, none of these tasks had all the features that seemed desirable to literacy educators. These features included:
A complete task, then, would make visible to students and teachers alike the values and standards embedded in the performance.8 This is a tall order for an on-demand assessment task, particularly since no existing models embodied all these features. Perhaps the NSP would have done better to hire a team of experienced test developers to create such tasks, but it gave the job to teachers -- and therein lay one key to both its "failure" and its success.
Developing English Language Arts Performance Tasks
The NSP initiated a series of national meetings, each focusing on a particular segment of the development process. While the majority of participants in task development were classroom teachers, NSP meetings also drew on the expertise of an array of education professionals, including administrators, assessment directors, teacher educators, and measurement and subject-area specialists. At NSP meetings, attendees sat at assigned tables with teachers from similar grade levels but very different contexts. A typical table might include high school teachers from urban New York, suburban Colorado, rural Oregon, and a California district with a rapidly growing population of non-native speakers of English.
Teachers brought sheaves of student work, produced in response to draft performance tasks or simply in response to a successful classroom assignment that might be transformed into a performance task. They passed around papers and read the work from one another's classrooms. They talked and argued about what constitutes high-quality work and how its characteristics might be articulated in a rubric that could be easily understood by teachers and students across the country. They defined "high expectations" in terms of student performance. In the process, many realized that it was no longer possible to claim, "But my kids just can't do that kind of work!" -- a claim often made by teachers who have looked so long at only their own students' work that they have lost awareness of other possibilities. Emotions could run high at NSP meetings, and tears were not uncommon as teachers, perhaps for the first time in their careers, explained and defended their instructional and assessment practices to a jury of peers.
Literacy Unit co-directors Myers and Pearson modeled and instilled a strong work ethic at these conferences, and, at the end of an eight-hour day, participants often carried their work-in-progress back to their hotel rooms in order to be ready for the next morning's session. Conversations begun around the tables often continued far into the night. For many attendees, this way of working was a sharp and welcome contrast to the familiar mode of after-school inservice training sessions in which teachers, weary from a long day in the classroom, sit and listen to an outside expert demonstrate a pedagogical strategy or explain an educational issue that they may or may not see as relevant to their classroom practice.
At an NSP meeting, the intensity of the work was somewhat mitigated by the amenities of the setting. Generally, teachers who travel for professional development are accustomed to carpooling, doubling or tripling up in a spartan motel room, and skimping on breakfast and lunch in order to dine in a restaurant lit by candles rather than fluorescent tubes. NSP meetings were hardly luxurious. They simply offered the "perks" to which most professionals are accustomed: private rooms, pleasant surroundings, well-prepared meals, and organized social activities. Each meeting kicked off with the distribution of T-shirts embroidered with the NSP logo, the date, and the locale of the meeting. During evening social events -- for example, a clambake or a square dance -- attendees, usually wearing their new NSP apparel, could relax, swap "teacher stories," and vent their frustrations with the day's work. These details are not trivial, for they helped to create an "NSP culture" with its own history, folk tales, rituals, and values. Among these values were collaboration, self-appraisal, a commitment to demystifying assessment for students and teachers, and a determination to blur the distinctions between learning and assessment.9
Performance task development was not a linear process. Although each meeting had a stated focus, the development of new tasks, the revision of tasks-in-progress, and the drafting of scoring rubrics occurred simultaneously. Participants learned to tolerate ambiguity and appreciate uncertainty. NSP staff provided attendees with guidelines, suggestions, and procedures, but the overarching message was clear: "You are the experts. We are depending on you and your practical professional knowledge to create tests worth teaching to."
Not surprisingly, the assessment tasks that resulted looked very much like good instruction. For example, the "Harriet Tubman Task" extended over four class periods. Fourth-graders received readers containing Harriet Tubman's biography. After reading this text and responding in writing to several open-ended questions, students role-played scenes from Tubman's life. The reader also contained the words and music to "Go Down Moses" for a class sing-along. Each class received a copy of Faith Ringgold's beautifully illustrated book The People Could Fly, which the teacher read to the class. (The class was allowed to keep the book.) Finally, after learning about Tubman from this variety of texts and responding in several different modes, students worked in groups to create panels for a mural depicting scenes from her life. The finished murals, carefully folded and labeled by teachers, were returned to the Literacy Unit after piloting, along with students' written responses. The murals were beautiful -- but unscoreable -- and the task was eventually discarded.
Despite many false starts like this one, by the spring of 1994 a number of performance tasks had survived the review and revision process and had been piloted with several thousand students.10 "Mummies and Pyramids" was one of the eighth-grade tasks that weathered numerous revisions and reviews and was widely piloted. In this task, which extended over five class periods, students discussed and recorded their prior knowledge of Egypt, mummies, and pyramids. They watched a video about mummies and took notes on the film. Each student received an illustrated reader with more information on the topic and a Kids Discover: Pyramids magazine to browse for facts and topics of interest. Students took notes and responded to open-ended items about their reading. Students then chose one of two writing prompts, each calling for a persuasive letter, and used their readers and notes as sources. One class period was devoted to drafting and another to revising. In contrast to "Harriet Tubman," "Mummies and Pyramids" did yield useful scores for reading and writing.
At a summer benchmarking meeting in Portsmouth, New Hampshire, NSP attendees identified papers illustrating the various performance levels, refined rubrics for scoring reading and writing, and wrote commentaries linking each paper to the rubrics. NSP staff members assembled these materials into scoring guides for each task, and several weeks later, hundreds of English language arts teachers converged on Snowbird, Utah, to be trained to score student work. While many of the participants had experience with the procedures and conventions for evaluating writing, the holistic scoring of reading was still "terra incognita" to most. To everyone's disappointment and despite an intensive and rigorous training protocol, interrater reliability, particularly for reading scores, did not meet psychometric standards.
After two years of concentrated effort, the NSP had failed to produce performance tasks that met both the Literacy Unit's professional criteria (listed above) and the standards of psychometricians. Partner states and districts were growing impatient and expected some return on their investment -- tasks that could be embedded in their local assessment systems and that would yield, if not individual scores, at least useful information at the school level. But further literacy task development was put on hold, as the NSP turned to the second "P" in its envisioned system: portfolios.
Developing English Language Arts Portfolios
English language arts portfolio development officially began with a 1993 "summit," at which individuals involved in portfolio assessment at the school, district, state, and national levels shared their expertise and advice with the NSP.11 Portfolio development promised to be even more complex than performance task development, so the NSP began by inviting small teams of teachers and others who had some experience using portfolios for assessment to participate in the development and piloting. Elementary, middle, and high school teams met for the first time in Danvers, Massachusetts, and embarked on the two-year quest to design an English language arts portfolio table of contents and scoring system.
The atmosphere at portfolio development meetings, as at the task development meetings, was intense. More so than performance tasks, portfolios evoked emotions in teachers who brought their hard-won wisdom and classroom portfolios to the table, often to discover that what worked so well in their individual classrooms and motivated their students to create impressive demonstrations of their literacy just didn't translate well into other contexts. For example, at one meeting teachers and staff "oohed" and "aahed" over the vividly colored and exquisitely detailed botanical drawings produced by a high school student after reading Hawthorne's "Rappaccini's Daughter" and included in the portfolio as a response to literature. They were beautiful, but, like the Harriet Tubman murals, ultimately unscoreable.
Portfolios in the English language arts classroom are particularly well suited to making visible the processes of reading and writing, to showing growth in these areas, and to inviting reflection on learning. Teachers who had made the shift from a teacher- and product-centered classroom to a learner- and process-centered classroom were justifiably proud of the weighty collections of work their students had produced and understandably chagrined when outside readers found the portfolios incomprehensible or, worse, boring. Clearly, a five- or 10-pound portfolio was not feasible for purposes of large-scale assessment, and growth was a nebulous concept for objective measurement. Teachers struggled with how to incorporate the qualities of literacy they valued, such as versatility and risk-taking, into the emerging portfolio design and scoring rubrics and with how to define these terms in language that students could understand. They left each development meeting with draft portfolio "menus," rubrics, and handbooks and returned several months later with student portfolios-in-progress, student feedback, and suggestions for revision.
This first developmental year culminated in a national portfolio training conference at Indian Wells, California, in July 1994. Again, hundreds of teachers and other education professionals from NSP partner states and districts gathered to be introduced to the portfolio design, the scoring system, and the student handbooks and to examine sample portfolios at a variety of performance levels.12 These teachers then returned to their home sites to train another 2,500 teachers in the portfolio process.
During the 1994-95 school year, these 2,500 teachers and an estimated 50,000 students participated in the English language arts portfolio field trial.13 Each student not only compiled a portfolio but also completed one performance task (for example, "Mummies and Pyramids") and included it in the completed portfolio. The performance tasks were to serve as an audit mechanism to ensure consistency of scoring across sites and the authenticity of the students' work.
This was an eventful year for NSP staff and field trial participants. The effort to develop performance standards was nearing completion, and the standards had a domino effect on the portfolios already in progress: when the standards changed, the portfolio menu changed, thus requiring that the student handbook be changed, as well as the scoring system.14 In addition, concerns about attacks on NSP's standards and assessment activities by political and religious conservatives had grown into full-blown panic, especially after the collapse of the innovative California Learning Assessment System.15 Any materials that might provide fodder for conservative critics were censored, and this made the selection of "exemplar" portfolios to guide students and teachers difficult, especially at the eighth- and 10th-grade levels, since many adolescents use writing to explore experiences and ideas that many adults find objectionable.
Despite the setbacks and changes in course, at the end of the 1994-95 field trial year, each partner state and district held meetings at which teachers evaluated student portfolios according to NSP draft scoring rubrics and selected 10 potential fourth-, eighth-, and 10th-grade exemplars that would be analyzed in a national benchmarking conference in July. At this meeting, held in Ascutney, Vermont, teachers at each grade level reached consensus on a set of portfolios that exemplified each level of the tentative scoring scale. As a result of their close analysis of student work, they recommended additional changes to portfolio contents, scoring, and logistics. The Literacy Unit distributed exemplar portfolios and revised portfolio materials to partner sites in fall 1995, but once again NSP changed its focus:
Responding to a variety of constituents' concerns (e.g., the time-intrusive character of performance tasks. . . , the difficulty and expense of scoring performance tasks and portfolios, the lack of domain coverage permitted by either performance tasks or portfolios, and the lackluster record of both portfolios and performance tasks when subjected to conventional measurement indices), New Standards leaders decided . . . to add a reference examination component to the "portfolio" of examination materials available for use by its partners.16
In English language arts, the reference examination consists of a multiple-choice component of text-editing skills and reading comprehension and a performance component calling for open-ended responses to reading and a response to a writing prompt. As New Standards entered into partnership with a commercial testing company, it became clear that its formal relationship with NCTE, a nonprofit professional organization, was no longer appropriate, and the Literacy Unit in Illinois was disbanded.
The Unforeseen Success of the New Standards Project
The case of the New Standards Project illustrates most, if not all, of the problems that have beset large-scale performance assessment and portfolios. In addition, as Linda Mabry has pointed out, on-demand tasks and portfolios may overwhelm a sound classroom program of curriculum and instruction.17 When teachers are encouraged to apply scoring rubrics mindlessly rather than to use them as starting points for discussions of the quality of student work, then the value of performance assessment as a means of education reform is compromised. As Haertel noted, conservative groups may so influence the content of performance assessment tasks and portfolios as to make them meaningless to students.18 Realistically, there is little chance that any single external assessment (or any single classroom assessment, for that matter) will engage all students' interest and allow them to show what they can do. One middle-schooler vividly made this point in his evaluation of the 1994-95 portfolio field trial: "I would rather shovel coal in hell than put together another one of these portfolios."
New Standards may not have delivered on its promise to create a national system of tests worth teaching to. What it did deliver was high-quality professional development to hundreds of teachers across the country. By high-quality professional development, I mean opportunities -- to participate with other professionals in intense, intellectual discussions over the nature of content and performance standards and the nature of the English language arts; to struggle with the questions of "How good is good enough?" and "Good enough for whom?"; to describe rather than assign a number or letter to a range of performances in reading and writing and to put that description on paper in language a fourth-, eighth-, or 10th-grader can understand; to question one's own instructional and assessment practices and the practices of one's peers.19
Despite their flaws, performance tasks and portfolios demand that teachers look closely and seriously at student work. Rarely, if ever, do teachers gather around a table with a set of multiple-choice tests and answer sheets before them and wonder, "Why in the world did she choose D?" Performance tasks and portfolios invite such study. Sitting together and looking closely at student work opens up doors both real and figurative. A teacher who has spent time studying a student portfolio containing, for example, excerpts of reading-response journals, visual responses to literature, finished pieces of writing with numerous drafts and revisions appended, and an essay reflecting on the learning that has occurred as a result of all this reading, writing, and responding is not likely to go back to the classroom, close the door, and say, "Well, all that was very interesting, but I prefer worksheets." Exchanging teaching ideas and stories, developing lesson and unit plans, aligning curriculum with content and performance standards -- all these are worthwhile activities. But student work keeps us honest because it is the bottom line. If students are the context that matters most to teachers, then student work is a critical feature of that world.20 Measurement issues are probably as irrelevant to teachers as Perkin's formula for aniline purple was to the first customers for lavender, violet, and mauve fabric.
New Standards brought teachers from disparate parts of the country together for extended periods of time and gave them worthy intellectual challenges. At New Standards meetings, teachers learned and practiced ways of working, talking, and thinking, which they took back to their classrooms, schools, and districts and shared.21 Much of what teachers involved in the NSP learned, they taught one another. And many a teacher told me in person or in writing, "New Standards has changed my life."
Did students benefit from the work of New Standards? I think it's safe to assume that many did, although perhaps not in directly measurable or immediately discernible ways. Teachers who participated in New Standards returned to their classrooms intellectually energized, connected to a new network of like-minded colleagues to support them in innovative practices. Students, like the middle-schooler quoted above, may not always have enjoyed piloting tasks and compiling portfolios, particularly if they were not accustomed to these forms of assessment in the classroom, but they certainly took up the challenge. Rarely in the work of reading, benchmarking, and writing commentary on student work did teachers find examples of perfunctory performance. New Standards regularly sought students' evaluations of both performance tasks and the portfolio materials -- a step rarely taken in large-scale testing.
Teachers who piloted tasks and who may not have been directly involved in the development process had their say as well. Their critiques were legitimate (tasks disrupted classroom instruction) and often insightful (how can students who lack basic reading and writing skills hope to do well on such complex assessments?). However, quite a few teachers used the tasks as opportunities to become observers in their own classrooms and to reflect on the degree to which their daily activities resembled those required by the tasks. For example, one high school teacher wrote:
For me the most valuable aspect of this project occurred as I was completing the teacher [questionnaire]. In determining the activities I do and do not do in class, I realized how little time I allow my students for free reading and writing, two of the activities they mentioned most as parts they enjoyed about the task in their evaluations. I plan to incorporate more free reading and writing into the curriculum next semester.22
Or, as another teacher penciled plaintively, recalling the generous buffets at NSP meetings, "Send shrimp!" The point is that the NSP asked.
Certainly, many boxes of NSP tasks ended up gathering dust in supply closets, and exemplar portfolios meant to be shared with students never were, but many more materials did get from the hands of skilled teachers to the hands of students. As information passed from one level to another in the task and portfolio development and piloting process, messages were garbled. As in the game of "Gossip," a suggestion like "you should" could easily be interpreted as the mandate "you must." Even so, the energy of participating students and teachers alike seemed boundless, and the volume and quality of their work impressive.
The New Standards Project treated teachers like professionals, experts in their field. Teachers sensed this and appreciated it. Teachers who participated in the NSP soon lost that "for God's sake, just tell me what to do" attitude and began telling NSP staff how things should be done. A hotel room of one's own and a few chilled cocktail shrimp are a small price to pay for letting teachers know their work is valued and their time is valuable.
Horace Walpole coined the term "serendipity" in describing the characters in a fairy tale who "were always making discoveries, by accidents and sagacity, of things which they were not in quest of."23 In trying to invent a system of "tests worth teaching to," the New Standards Project achieved serendipitous success by developing and propagating a model for helping teachers form communities of learners and inquirers. This is strong evidence for why we need performance assessment and why the complex problems it presents are worth working on. Without it, we may have little to say to inservice or preservice teachers on the subject of assessment beyond, "Make sure they get a good night's sleep, use sharpened number-two pencils, and make their marks dark." The promise of performance assessment for bringing about lasting, beneficial changes in teaching and learning exceeds its limitations.
Last updated 27 June 2000
URL: http://www.pdkintl.org/kappan/kspa0006.htm
Copyright 2000 Phi Delta Kappa International