Reprinted from CHI ‘83 Conference Proceedings, pp. 72-77. © 1983 ACM.
Copying by permission of the Association for Computing Machinery.
Abstract
Integral to the design process of the Xerox 8010 “Star” workstation was
constant concern for the user interface. The design was driven by principles of
human cognition. Prototyping of ideas, paper-and-pencil analyses, and
human-factors experiments with potential users all aided in making design
decisions. Three of the human-factors experiments are described in this
paper: A selection schemes test determined the number of buttons on the mouse
pointing device and the meanings of these buttons for doing text selection.
An icon test showed us the significant parameters in the shapes of objects on
the display screen. A graphics test evaluated the user interface for making line
drawings, and resulted in a redesign of that interface.
1. Introduction
The Xerox 8010 office workstation, known as Star during development, is meant
for use by office professionals. In contrast to word processors which are
largely used by secretarial and administrative personnel, or computer systems
which are largely used by technically-trained workers, Star had to be designed
for casual users who demand extensive functionality at a small training cost.
Since the background of the targeted users was very different from that of
Star’s designers, the designers’ intuitions could not always be
used as the criteria for an acceptable system.
Recognizing that design of the Star user interface was a major undertaking, the
design team approached it using several principles, derived from cognitive psychology:
- There should be an explicit user’s model of the system, and it should be familiar (drawing on objects and activities the user already works with) and consistent.
- Seeing something and pointing to it is easier for people than remembering a name and typing it. This principle is often expressed in the psychological literature as “recognition is generally easier than recall” [Anderson].
- Commands should be uniform across domains, in cases where the domains have corresponding actions (e.g., deleting a word from text, deleting a line from an illustration, and deleting information from a database).
- The screen should faithfully show the state of the object the user is working on: “What you see is what you get.”
Even given these principles, the design space is enormous, and many proposed designs
turned out to be unsatisfactory. Designing Star required more tools than a set of
principles to start from: once a design was proposed, it had to be tested, which
we did in several ways.
First, the general user interface was prototyped in an environment which made it
easy to modify. Care was taken with the user illusion, but not with all the underpinnings
necessary to provide an integrated, robust system. This prototype was used by
Star designers and others to get a “feel” for what they were proposing.
Sometimes a prototype was not appropriate to answer questions arising in the
design, so various analyses were performed. For instance, Card, Moran, and
Newell’s Keystroke Level Model [Card] was used to study the number of
user actions and amount of time required to perform large office tasks, given
a proposed command language. This helped identify bottlenecks and annoyances in
the procedures that would be necessary to perform the tasks.
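To make the flavor of such an analysis concrete, the sketch below sums published Keystroke-Level Model operator times over a short command sequence. The operator values are approximations from [Card]; the task encoding and the code itself are illustrative only, not the Star team’s actual calculations.

```python
# A minimal sketch of a Keystroke-Level Model estimate. Operator times are
# approximate values from the published model [Card]; the task encoding below
# is hypothetical and not the Star team's actual analysis.

KLM_SECONDS = {
    "K": 0.28,  # press a key (average skilled typist)
    "P": 1.10,  # point at a target with the mouse
    "B": 0.10,  # press or release a mouse button
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_estimate(operators):
    """Sum the unit times for a sequence of KLM operators."""
    return sum(KLM_SECONDS[op] for op in operators)

# Hypothetical encoding of "select a word and delete it" in Star's noun-verb
# style: home to mouse, prepare, point, click twice (press/release x2),
# home to keyboard, prepare, press Delete.
delete_word = ["H", "M", "P", "B", "B", "B", "B", "H", "M", "K"]
print(f"estimated time: {klm_estimate(delete_word):.2f} s")
```

Summing such estimates over the steps of a large office task exposes where a proposed command language forces too many pointing or mental-preparation steps.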
Finally, in certain domains where neither analysis nor informal use of prototypes
was sufficient to validate or invalidate proposed designs, the Functional
Test Group (which also did much of the user interface analysis) performed
formal human-factors experiments. Those experiments are the topic of this paper.
In the rest of the paper, we first present the basics of the Star user
interface, to give the reader the context of the tests which were run. Then
we describe three representative experiments which we performed. Finally, we
discuss what sort of things were tested successfully and what sort of things were
not tested, significant features of the testing we did, and the effect the
testing had on the success of Star’s user interface.
2. Background description of Star
The Star user interface has been extensively described in papers which also address
the design philosophy and process (Seybold, Smith1, and Smith2). Here we describe
only enough of Star to motivate the user interface tests we will be covering.
Figure 1. Elements of the Star Workstation
Star is run on a powerful personal computer. It has a 17” diagonal,
high-resolution, bitmapped screen which can display arbitrarily complex images;
a keyboard which has a moderate number of function keys to the left, right,
and above the main typing array; and a pointing device (the mouse).
Figure 1 shows these elements graphically.
Central to the user interface is the office metaphor. Familiar office objects, such
as documents, folders, and file drawers, are represented on the screen by
small pictures called icons. Data icons, such as documents, are mailed,
filed, and printed by moving them to icons representing outbaskets, file drawers,
and printers, respectively, so individual commands are not needed for these
operations. When the content of an object needs to be seen, such as for
editing, the icon is opened to take up a large rectangular area
on the screen called a window.
Star documents include text, graphics, typeset mathematical formulas, and
tables, all freely intermixed. All appear on the screen exactly as they will
appear when they are printed (within the limits of the display resolution), and
all can be edited interactively.
The user performs a Star action by first selecting the object of the action by
pointing to it with the mouse; it video-inverts to give feedback that it is
selected. After making a selection, the user presses the function key indicating
the desired command. Most Star actions can be performed with only four
function keys: Delete, Move, Copy, and Show Properties. These are applied to
all kinds of Star objects from characters and paragraphs to data-driven barcharts
and icons. The function of Delete is clear. Move and Copy, in addition to
allowing rearrangement and replication of objects, perform printing, mailing,
and filing functions, as mentioned above.
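As an illustration of this noun-verb style, the sketch below dispatches a small set of universal commands to whatever kind of object is currently selected. The class and method names are invented for this example; this is not Star’s actual implementation.

```python
# An illustrative sketch of the noun-verb model: the user selects an object,
# then presses one of four universal function keys. Names are invented;
# this is not Star's implementation.

class StarObject:
    """Anything selectable: a character, a paragraph, a barchart, an icon..."""
    def delete(self): ...
    def move(self, destination): ...
    def copy(self, destination): ...
    def show_properties(self): ...

class DocumentIcon(StarObject):
    def move(self, destination):
        # Moving a data icon onto an outbasket, file drawer, or printer icon
        # performs mailing, filing, or printing, so no separate commands are needed.
        destination.accept(self)

def press_function_key(key, selection, destination=None):
    """Dispatch a universal command to the current selection."""
    commands = {
        "Delete": lambda: selection.delete(),
        "Move":   lambda: selection.move(destination),
        "Copy":   lambda: selection.copy(destination),
        "Show Properties": lambda: selection.show_properties(),
    }
    commands[key]()
```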
The Show Properties key brings up a property sheet. Each Star object has
a set of properties displayed on its property sheet. For instance, the properties
of a character are its typeface, size, position with respect to the baseline,
and so forth. The properties of a folder (a collection of documents and other
folders) include its name and the sort order of its contents. The properties
of a data-driven barchart include information on the desired orientation and
shading of the bars, the number of ticks on the axis, and, of course, the
data. The property sheets appear when asked for, let the user select desired
property settings, and then disappear when no longer needed. They offer an
immense flexibility of options for Star objects, without cluttering either a
command language or the screen.
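A property sheet can be thought of as a named set of property/value pairs shown on demand. The sketch below is a minimal model of that idea; the class and the example values are illustrative, not Star’s data structures.

```python
# A minimal sketch of a property sheet as a set of property/value pairs that
# is displayed on request, edited, and dismissed. The class and the example
# values are illustrative only.

class PropertySheet:
    def __init__(self, properties):
        self.properties = dict(properties)   # current settings

    def show(self):
        for name, value in self.properties.items():
            print(f"  {name}: {value}")

    def set(self, name, value):
        self.properties[name] = value        # the user picks a new setting

# Character properties, following the example in the text.
character_sheet = PropertySheet({"typeface": "serif", "size": 10,
                                 "position": "on baseline"})
character_sheet.show()
character_sheet.set("size", 12)              # then the sheet is dismissed
```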
3. Selection Schemes Tests
The goal of the two selection schemes tests was to evaluate methods for selecting
text. The schemes are various mappings of one, two, or three mouse buttons to
the functions needed for indicating what text is to be operated on. The kinds
of selection behavior needed are (1) Point: indicating a point between
two characters, to be used as the destination of a Move or Copy, or the position
where new typed text will be inserted; (2) Select: selecting some
text, possibly in increments of a character, word, sentence, paragraph, or
the whole document; and (3) Extend: extending the selection to include
a whole range of text.
Selection Scheme Test 1
The first test compared six selection schemes. These schemes are summarized in
Figure 2, schemes A through F. The six selection schemes differ in the
mapping between mouse buttons and the three operations. As one example of the
differences among schemes, in two schemes, A and B, different buttons are used
for Point and Select, while in the remaining four schemes the first button is used
for both Point and Select.
| | Scheme A | Scheme B | Scheme C | Scheme D | Scheme E | Scheme F | Scheme G |
| Button 1 | Point | Point | Point; C; Drawthrough | Point; C, W, S, ¶, D; Drawthrough | Point; C, W, S, ¶, D; Drawthrough | Point; C; Drawthrough | Point; C, W, S, ¶, D |
| Button 2 | C; Drawthrough | C, W, S, ¶, D; Drawthrough | W, S, ¶, D; Drawthrough | | Adjust | Adjust | Adjust |
| Button 3 | W, S, ¶, D; Drawthrough | | | | | | |
Key:
Point: Selects a point, i.e., a position between adjacent characters. Used as
destination for Move or Copy. If the button doesn’t also
make a text selection, Point is also used to indicate a destination for type-in.
C, W, S, ¶, D: Selects a character, word, sentence, paragraph, or whole
document, by repeatedly clicking the mouse button while pointing at
something that’s already selected.
Drawthrough: The user holds the button down and moves the mouse. The selection
extends from the button-down position to the button-up
point. The selection is extended in units of whatever was previously selected.
Adjust: The user clicks the mouse button to extend the selection from the
existing selection to the button-up point. The selection is
extended in units of whatever was previously selected.
Figure 2. Description of the Selection Schemes
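To make the click semantics in Figure 2 concrete, the sketch below models Scheme G as described there: repeated clicks of button 1 on an existing selection enlarge the selection unit, and button 2 adjusts (extends) the selection in units of whatever is currently selected. The data model is invented for illustration.

```python
# A sketch of Scheme G's behavior as described in Figure 2. The representation
# of positions and selections is invented; real Star tracked whether a click
# landed anywhere on the current selection, which is simplified here.

UNITS = ["character", "word", "sentence", "paragraph", "document"]

class SchemeG:
    def __init__(self):
        self.unit = None     # unit of the current selection
        self.anchor = None   # where the selection was made
        self.end = None      # far end of the selection

    def button1_click(self, position):
        """Point/select; repeated clicks on the selection enlarge the unit."""
        if position == self.anchor and self.unit != "document":
            # character -> word -> sentence -> paragraph -> document
            self.unit = UNITS[UNITS.index(self.unit) + 1]
        else:
            self.unit = "character"   # a single click selects a character
            self.anchor = position    # (and also serves as Point)
        self.end = position

    def button2_click(self, position):
        """Adjust: extend the selection to the clicked position, in units
        of whatever was previously selected."""
        self.end = position
```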
Methodology. Using a between-subjects paradigm, each of six groups (four subjects
per group) was assigned one of the six schemes. Two of the subjects in each group
were experienced in the use of the mouse, two were not. Each subject was first
trained in the use of the mouse and in basic Star editing techniques. Next, the
assigned scheme was taught. Each subject then performed ten text editing tasks,
each of which was repeated six times. Dependent variables were selection time and
selection errors.
Selection time. Mean selection times are shown in Figure 3. Among these six
schemes, Scheme F was substantially better than the others over all six
trials (p<.001).
| Scheme A | Scheme B | Scheme C | Scheme D | Scheme E | Scheme F | Scheme G |
| 12.25 | 15.19 | 13.41 | 13.44 | 12.85 | 9.89 | 7.96 |
Figure 3. Mean Selection Time (Secs)
Selection Errors. There was an average of one selection error per 4 tasks.
The majority (65%) were errors in drawthrough: either too far or not far enough.
The frequency of drawthrough errors did not vary as a function of selection
scheme. “Too Many Clicks” errors, e.g., the subject clicking through to
a sentence instead of stopping at a word, accounted for 20% of the errors;
schemes that relied less on multiple-clicking fared better. “Click Wrong
Mouse Button” errors accounted for 15% of total errors. These errors also
varied across schemes, with schemes having fewer buttons faring better.
Selection Scheme Test 2
The results of the first test were interpreted as suggesting that the following
features of a selection scheme should be avoided: 1) drawthrough, 2) three buttons,
and 3) multiple-clicking. The second selection scheme test evaluated a scheme
designed with these results in mind. Scheme G is also described in Figure 2.
It is essentially Scheme F with the addition of multiple-clicking. It avoids drawthrough
and uses only two buttons. Multiple-clicking is used because, although 20%
of the errors in the first test were attributable to errors in multiple-clicking,
Star’s designers felt that a selection scheme must provide for quick
selection of standard text units.
The same methodology was used to evaluate the new scheme as in the first test,
except that only one subject was experienced with the mouse and three were not.
Results. The mean selection time for the new scheme was 7.96 sec, the lowest time
so far. The frequency of “Too Many Clicks” errors in Scheme G was
about the same as the frequency observed in the first selection scheme test.
Conclusions. The results of the second test were interpreted as indicating
that Scheme G was acceptable for use in Star, since (1) selection time for Scheme
G was shorter than for all other schemes, and (2) the advantage of providing
quick selection of standard text units through multiple-clicking was judged
sufficiently great to balance the moderate error rate due to multiple-clicking errors.
4. Icon Shape Test
A series of tests helped us decide what the icons should look like, so that they
would be readily identifiable, easy to learn, and distinguishable.
The purpose of the tests was to give some feedback to the icon designers about
probable user response to designs. We did not intend that the tests alone be
used to decide which set of icons was best, but rather to point up difficulties
and preferable design directions.
We did not test icons as commands. These tests did not consider the issues of
whether iconic representation and implicit commands are better than typed names
and typed commands or whether a small set of “universal” commands
(Delete, Move, Copy, Show Properties) applied uniformly across domains (text,
graphics, printing, mailing) are superior to a large number of commands specialized
to each domain.
Methodology and Results
Figure 4. Four sets of icon designs were tested (only nine of the seventeen in each set are shown here). Set 1 was chosen and modified as shown at the right.
Four different sets of 17 icons were designed by four different designers (see
Figure 4). Five subjects were assigned to each set for a total of 20
subjects. A series of paper-and-pencil tests was used to assess familiarity (Naming
Tests); two response-time tests using a computer and display measured recognizability
and distinguishability (Timed Tests); finally, subjects were asked for their
opinions (Rating Tests).
Naming Tests. First the experimenter showed the icons one at a time, each on
a 3×5-inch card, and asked for “a short description of what you think it is.”
Then the entire set was presented and the subjects were allowed to change their
descriptions. Next, names and short descriptions were given and the subject was
asked to “point to the symbol that best fits each description.”
Finally, with all the names available, the subject was asked to put “one
of the names next to each symbol.”
Since Set 2 had each icon named already, the naming tests showed the obvious
value of having labels on icons. The three sets without labels were misinterpreted
about 25% of the time on first sight. A few specific icons were revealed
as most difficult: Printer (Sets 3 and 4), Record File (1, 3, 4), Directory (3,
4), Group (1, 3). For example, the Group from Set 1 was described as “cemetery
plots – to purge information” and as “keyboard – pushbuttons”.
Timed Tests. The two timed tests used a Xerox Alto computer with the icons displayed
on the screen as they would be in Star. For the first timed test, we used
a procedure suggested by Pew and Green [Green]. The subjects were given the name
of an icon and told that it may or may not appear on the display. When an
icon appeared they responded as quickly as possible by pressing a YES- or a NO-button
depending on whether they thought the one presented was the one named. This
test showed no significant differences among the icon sets. We concluded that the
short training involved in the Naming Tests was adequate for any of the sets.
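The timing logic behind these trials is simple to state in code. The sketch below is a stand-in that uses terminal input rather than the Alto display and buttons, with invented function names, just to show what was being measured.

```python
# A sketch of the timing behind a YES/NO recognition trial: an icon name is
# given, an icon appears, and the response plus its latency are recorded.
# Terminal input stands in for the Alto display and buttons.
import random
import time

def recognition_trial(named_icon, icon_pool):
    shown = random.choice(icon_pool)            # may or may not match the name
    print(f"target: {named_icon}    shown: {shown}")
    start = time.monotonic()
    answer = input("is this the named icon? (y/n) ").strip().lower() == "y"
    latency = time.monotonic() - start          # response time in seconds
    correct = answer == (shown == named_icon)
    return latency, correct
```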
Figure 5. Summary of results from two of the icon tests showing eight of the seventeen icons. “Good” icons should be in the lower right of the diagram.
In the second timed test, we asked the subjects to point as quickly as possible
to the named icon in a randomized display of all the icons. Results of
this test, combined with the naming results, are shown in Figure 5. This
test showed some significant differences among sets and icons. Overall, subjects
with Set 2 took roughly 0.5 seconds longer than subjects with the other sets to
find icons (2.5 vs 2.0), and subjects took more than a second longer to find the
Document and Folder than to find the other icons (3.0 vs. 2.0).
Rating Tests. At the end of the tests, subjects were asked to say whether any
of the icons in their set were “easy” or “difficult... to pick
out of the crowd”. Subjects’ opinions corresponded fairly well with
their performance.
When shown all four sets and asked to choose a best icon for each type,
subjects usually chose on the basis of which was most realistically depicted
or because of the labels. Overall preference was given to Set 2 (“most
helpful”) or to Set 4 (“more different shapes”). The
opinions strongly reflect the tasks given in the tests; considerations beyond
the tests would have been difficult for the subjects to judge.
Conclusions
The naming tests pointed out the value of labels (in Set 2), but the YES-NO
response-time test indicated that, once learned, there was little difference among
the sets for recognition. The pointing test, where distinguishability was
important, showed that the sets with more visual variety (Sets 1, 3, and 4)
were more successful. The most useful results from the icon tests were
recommendations about specific icons; those with problems were redesigned.
The final choice of icon designs included a variety of concerns beyond those
that could be addressed by the tests. For example, to give the user feedback
that a particular icon is selected, its image is inverted (everything white
becomes black and vice versa). Set 1, which has every icon predominantly white,
was considered the best at showing selection. Finally, an important consideration
in choosing the icon designs was how refined the set was graphically. With
some redesign, Set 1 was the final choice for Star.
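The inversion feedback mentioned above amounts to flipping every pixel in the icon’s bitmap. A minimal sketch follows; the list-of-rows bitmap representation is invented for illustration.

```python
# A minimal sketch of video-inverting a 1-bit icon bitmap for selection
# feedback: every white (0) pixel becomes black (1) and vice versa.
# The list-of-rows representation is illustrative only.

def invert_bitmap(bitmap):
    return [[1 - pixel for pixel in row] for row in bitmap]

icon = [[0, 0, 1],
        [0, 1, 0],
        [1, 0, 0]]
selected_icon = invert_bitmap(icon)   # predominantly white icons invert legibly
```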
5. Graphics Tests
Unlike the two tests just described, the goal of the graphics testing was
much less clearcut. We simply wanted to find out how easy the user interface
was to learn, and where the difficulties were.
Figure 6. Graphic selections and commands. The new scheme simplified selection by eliminating multiple-clicking and adding graphics-only commands (Stretch and Line).
The Star graphics functionality, described in detail in [Lipkie], involves a
structured graphics approach to making line drawings. Lines and rectangles,
like other Star objects such as icons and characters, are objects that can be
selected, moved, copied, and altered. According to the original user interface
at the time of the tests, selection of graphics objects followed the text
paradigm closely (see Figure 6): clicking the left mouse button once at
an object (such as a line) selected one point on the object (an end of the line);
a second click of the left button enlarged the selection so that it included
the whole object. Because of this richness in selecting, a few function keys
could perform many functions. For instance, a user could lengthen, shorten,
and rotate (“stretch”) a line by selecting only one end and
pressing the Move key. The same key moved the entire line if the whole line
were selected (by clicking twice). New lines were created with the Copy key,
with a special accelerator, available when only an end of a line was selected,
that aided in making connected lines. Captions could be added to the
illustration by copying into the picture a “text frame,” a rectangular
area which was capable of containing text. Prototype examples of all graphics
objects could be obtained from a system-supplied document called a “graphics
transfer sheet.”
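A sketch of the original selection model described above appears below (invented data structures, not Star’s implementation): one click selects the nearer end of a line, a second click selects the whole line, and the Move key either stretches or moves accordingly.

```python
# An illustrative model of the pre-redesign graphics interface: clicking once
# selects an end of a line, clicking again selects the whole line, and the
# Move key does double duty (stretch vs. move). Not Star's implementation.
import math

def _dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

class Line:
    def __init__(self, p1, p2):
        self.p1, self.p2 = p1, p2
        self.selected = None          # None, "end", or "whole"
        self.grabbed = None           # which end is selected

    def click(self, point):
        """First click selects the nearer end; a second click the whole line."""
        if self.selected is None:
            self.selected = "end"
            self.grabbed = "p1" if _dist(point, self.p1) <= _dist(point, self.p2) else "p2"
        elif self.selected == "end":
            self.selected = "whole"

    def move_key(self, destination):
        """Move stretches a selected end, or moves the whole selected line."""
        if self.selected == "end":
            setattr(self, self.grabbed, destination)   # lengthen, shorten, rotate
        elif self.selected == "whole":
            dx = destination[0] - self.p1[0]
            dy = destination[1] - self.p1[1]
            self.p1 = destination
            self.p2 = (self.p2[0] + dx, self.p2[1] + dy)
```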
Methodology
This experiment used a small number (three) of inexperienced subjects, since we were
looking for qualitative behavior rather than statistical significance in these
tests. The subjects had already been through a prototype of Star’s on-line
introduction to the general functions, so they had a background in the use of
Star, and we knew roughly how they fit into the spectrum of Star users. For this
test the subjects read hardcopy graphics training which consisted of explanatory
material, interspersed with exercises done on the machine. At the end of the
training, the subjects were asked to create some illustrations, both from scratch
and by modifying existing illustrations. The test was self-paced. Time and
performance were the dependent variables.
About five weeks later, two of the three subjects returned to do some exercises to
show how much of the training they had retained.
The entire study (taking up to one day for the test, and one hour for the
follow-up) was videotaped. Cameras showed both the user and the screen, along
with the time of day.
Results
Both during the test and follow-up, evaluators recorded the times spent in each
part of the training and exercises, plus critical incidents in the use of
the system. These critical incidents were later catalogued into problems with the
prototype implementation, with the design of the user interface, with the
training, and with the design of the experiment. They were also prioritized
according to how pervasive and persistent they were.
The design problems were described to the Star design group, and were reinforced
by showing the designers clips of the videotape. There were two major
user-interface problems: First, the multiple clicking that distinguished selection of
the end of an object from selection of the whole object was far too error
prone. Selection should be made at one level only. This necessitated addition
of a function key for the Stretch function, since the Move key could no
longer do double duty. Secondly, the Copy method of making a new line was too
awkward. Since making a new line is central to graphics, it was felt that
a function key should be dedicated to this operation.
Redesign. Both of these changes to the user interface involved adding new
function keys. But at that time the number of keys on the Star keyboard
was frozen, and all had assigned meanings. The suggested solution was to
change the meanings of the function keys across the top of the keyboard,
since (a) they were already being changed in another context, and (b) they
were normally just accelerators for text functions and had no use in the graphics
context. The new meanings of the keys would be displayed on the screen whenever
they were in effect. There were eight keys there, but only two were needed
to solve the problems found by testing. However, the inventive designers
quickly found uses for most of the rest.
After this redesign, the graphics user interface was presumably easier to use.
But the new design added complication to Star in general by allowing function
keys to change their meaning in a way much more obtrusive than before. We did
not know whether the overall effect was an improvement or not, so the test
was repeated to compare the new scheme with the old.
Retest. The second graphics test fixed several problems in the experiment
design, and used early versions of the customer training materials. It was run
similarly to the first, with three subjects learning the old user interface and
four learning the new one. The results of the repeated test of the old user
interface were very similar to those of the original test. Both versions took
similar amounts of time in the training portion, but at the end the users of
the new interface were quicker at making illustrations and finished more of the
tasks (see Figure 7). New problems were identified, of course, but
they were relatively easy to fix, so the new user interface was the one
adopted for the product.
| | Old Interface | New Interface |
| Time per training module (min.) | 32 ± 12 | 42 ± 12 |
| Time per task (min.) | 18 ± 5 | 9 ± 5 |
Numbers are given as M ± SD, where M is the mean over all the users and SD is the standard deviation.
Figure 7. Quantitative Comparison of Graphics Interfaces
6. Summary and Conclusions
The three experiments described here run the gamut from formality to
informality, depending on the purposes of the tests and the costs
of the experiments. In general, we were able to be most formal and careful when
the topic of the experiment was well-defined and when the experiments could be
kept short. As the questions to be settled became less well-defined, on the
other hand, experiments took on a flavor of “fishing expeditions”
to see what we came up with. Particularly when we addressed problems relating to
use of a general Star function and the relationship of that function to the rest
of Star, the experiments required large amounts of training. This was very
costly both in setting up the tests and in execution; a consequence was that
fewer subjects were used. Finally, extremely vague questions, such as whether
icons in general provide a better user interface than typing commands, were not
tested at all; icons were shown to be an acceptable user interface, and that
result sufficed for our purposes.
Important points we found in our experimentation are the following:
(1) Videotaping was a very important tool. First, the cameras allowed
us to see and hear the subject without being in the same room. Second, the tape
was a record of all activity, so we didn’t need to take perfect notes during the
experiment. Third, the designers were more convinced by the videotapes than by
our dry numbers that people were having trouble with their system.
(2) All tests were flexible enough to allow the experimenters to observe why
results were coming out the way they were. For example, verbal protocols were
elicited in many of the tests and formal or informal interviews followed all
the tests. This was important in helping us suggest design improvements.
Star was a mammoth undertaking. “The design effort took more than six
years. ... The actual implementation involved from 20 to, eventually, 45 programmers
over 3.5 years producing over 250,000 lines of high-level code.” [Harslem] By
the time of the initial Star release, the Functional Test Group had performed
over 15 distinct human-factors tests, using over 200 experimental subjects and
lasting for over 400 hours (Figure 8). In addition, we applied a standard
methodology to compare Star’s text editing features to those of other
systems [Roberts]. The group averaged 6 people (1 manager, 3 scientists, and
2 assistants) for about 3 years to perform this work.
| Test Topic | No. of Subjects | Total Hours | Impact |
| Selection Schemes | 28 | 64 | Led to new design; validated new scheme |
| Keyboard (6 layouts) | 20 | 40 | Led to design of keyboard |
| Display | 20 | 10 | Specified display phosphor and refresh rate |
| Tab-indent | 16 | 16 | Caused redesign of Tab and Indent functionality |
| Labels | 12 | 6 | Caused change in property sheet and keyboard labels |
| Property Sheets | 20 | 40 | Identified potential interface problems and redesigns |
| Fonts | 8 | 6 | Led to decision on screen-paper coordination |
| Icons | 20 | 30 | Led to design of icons |
| Initial Dialogue | 12 | 36 | Led to design of training facility and materials |
| HELP | 2 | 6 | Validated HELP design ideas |
| Graphics | 10 | 65 | Led to redesign; validated new design |
| Graphic Idioms | 4 | 16 | Contributed to redesigns |
| J-Star Labels | 25 | 25 | Led to design of keyboard labels for Japanese-Star |
Figure 8. Partial listing of Star-1 Functional Tests
The impact of Functional Testing on the Star product has been a pervasive set
of small and large changes to the user interface. The amount of difference these
changes made is, of course, impossible to assess, but the quality of Star’s
user interface is well known. It has won an award as the “friendliest”
computer system of 1982, as judged by Computing magazine. Imitators, led
by Sidereal, Apple’s Lisa, and VisiCorp’s VisiOn,
are starting to have a major impact on the marketplace. We can only take
this as a ratification of Star’s design process, a rich blend of user
interface principles, functional analysis, and human interface testing.
Acknowledgments
The tests described here were carried out by the staff of the Functional
Test Group consisting of W. Bewley, C. McBain, L. Miller, T. Roberts,
D. Schroit, W. Verplank, and R. Walden. Able assistance was provided by
M. Beard, W. Bowman, N. Cox, A. Duvall, W. Judd, J. Newlin and
D. Silva. Star user-interface design was the result of a long process of
innovation at Xerox PARC and elsewhere; however, immediate credit should go
to Eric Harslem, Charles Irby, Ralph Kimball, and David C. Smith.
William L. Bewley, Teresa L. Roberts, David Schroit, William L. Verplank
Xerox Office Systems Division
References
[Anderson] Anderson, J. R. Cognitive psychology and its implications. W.
H. Freeman and Company, San Francisco, 1980.
[Card] Card, S. K., Moran, T. P., and Newell, A. The Keystroke-Level Model for
user performance time with interactive systems. Communications of the ACM,
23, 7, (July 1980), 396-410.
[Green] Green, P. and Pew, R. W. Evaluating pictographic symbols: an automotive
application. Human Factors, 20, 1, (Feb. 1978), 102-114.
[Harslem] Harslem, E., Nelson, L. E.
A retrospective on the development of
Star. Proc. of the 6th International Conference on Software Engineering,
Tokyo, Japan, (Sept. 1982), 377-383.
[Lipkie] Lipkie, D. E., Evans, S. R., Newlin, J. K., Weissman, R. L.
Star graphics: an object-oriented implementation. Computer Graphics, 16,
3, (July 1982), 115-124.
[Roberts] Roberts, T. L. and Moran, T. P. The evaluation of text editors:
methodology and empirical results. Communications of the ACM, 26, 4, (April 1983).
[Seybold] Seybold, J. W. The Xerox Star, a “professional” workstation.
The Seybold Report on Word Processing, 4, 5, (May 1981), 1-19.
[Smith1] Smith, D. C., Irby, C., Kimball, R., Verplank, W., Harslem, E. Designing
the Star user interface. Byte, 7, 4, (April 1982), 242-282.
[Smith2] Smith, D. C., Harslem, E., Irby, C., Kimball, R. The Star user interface:
an overview. Proceedings of the AFIPS 1982 National Computer
Conference, 50, (June 1982), 515-528.