By Julie A. Galli
Developing a test that accurately measures the skills of a test-taker
is at the core of a psychometrician's profession. Computer adaptive testing
allows the psychometrician to take this ambition to new heights. With computer
adaptive testing being a relatively new, but definitely underutilized, mode
of test delivery, it is vital that we explore its vast capabilities.
Computer adaptive testing is not simply a tool to administer a
test. It has the capacity to provide a wealth of data on all types of statistical
measurement, including test-takers' scores, item analysis, validity, and reliability.
It takes computer analysis to another level by supplying more precise scores,
a multitude of data, and tremendous savings of time and money.
We will explore computer adaptive testing using an online skills
assessment as the measurement to test information technology professionals.
Central to the skills assessment is item construction. Without solid, well performing
test items, we have nothing; therefore, we will address test items with respect
to online skills assessments. Next we will move into validity and reliability
and how they impact the assessment. Finally, we will conclude with a culmination
of all factors discussed and how they pull together to form a strong and sound
assessment tool.
Online Skills Assessment
This paper will focus on measuring validity and reliability of
online skills assessments delivered on a computer adaptive test engine. Before
delving into the psychometric aspects of this topic, it is best to define what
encompasses an online skills assessment. With the explosion of information technology
(IT), there is a great need for qualified IT professionals in an array of technological
positions at different levels of proficiency and for a variety of companies.
Fortunately, for any given area of IT expertise, the knowledges and skills required
are generalizable across any position requiring that specific expertise. For
example, a telecom engineer possesses certain knowledges and skills including
maintaining an Internet Protocol network using Cisco routers. There are a great
number of companies needing IT professionals with this background. Rather than
each individual company creating its own parameters for this position, since
each company would be looking for the same parameters, it is simpler to have
one standard to follow. This is where online skills assessments are beneficial.
Many of the IT positions require certification, since there are specific knowledges
and skills these professionals must possess and the industry standards are uniform.
Furthermore, these knowledges and skills are very concrete, and not at all imprecise
as is an ability or a personal characteristic.
To clarify, a knowledge is information learned through training
or experience; a skill is an attribute involving cognitive and motor capabilities
that are acquired through practice; an ability is an innate attribute involving
cognitive or motor capabilities; and a personal characteristic refers to work
ethic, values, interests, preferences and the like (Whetzel & Wheaton, 1997).
The online certifications can focus on the requisite knowledges and skills,
since many companies will have the same needs. However, each company has its
own culture and specific needs and the abilities and personal characteristics
may differ. For those attributes, the company can provide its own assessments,
separate from the online skills assessments.
What makes the pairing of IT professionals and online skills assessments
so ideal is that the IT professional is more than comfortable with a computer
and the Internet in general. IT professionals from all over the world can take
an online skills assessment from any computer, anywhere, at any time, without having
to worry about finding a testing center that has hours conducive to their schedule.
They have no apprehension of the computer or the Internet, so there is no uncomfortable
transition from paper-and-pencil tests to online tests. Given this background
information on online skills assessments, it is necessary to take the next step
in discussing the mode of delivery, and that is computer adaptive testing.
Computer Adaptive Testing
Computer adaptive testing (CAT) is a measurement instrument used
to test individuals using a computer engine that adapts to the skill level of
the test-taker by administering successive test items ranging in difficulty
based on the test-taker's performance on the previous items (Cohen & Swerdlik,
1999; Embretson & Reise, 2000; Federico, 1992; Hambleton, Swaminathan,
& Rogers, 1991; Hulin, Drasgow, & Parsons, 1983; Murphy & Davidshofer,
2001; Wainer, 2000). The origins of adaptive testing date back to 1908 when
Alfred Binet used adaptive testing on his intelligence tests for children where
he asked age-specific questions and ended the examination when a child answered
a certain number of questions in a row incorrectly (Hambleton et al., 1991).
Frederic Lord, who is known as the "Father of Modern Testing," is
the person who developed item response theory and computer adaptive testing
as we use it today (Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980).
In recent years, CAT has grown in popularity mainly due to the
advantages of CAT over traditional paper-and-pencil tests. CAT is not beneficial
in all applications, for instance, personality tests, but in skills tests, it
is unbeatable. In fact, the disadvantages that researchers note are all nullified
when using online skills assessments for IT professionals. For example, some
of these disadvantages are the test-taker's fear of or unfamiliarity with computers,
the expense of building a testing center, and the difficulty in finding an existing
testing center that is proctored and has convenient hours (Kline, 1986; Pinsoneault,
1996; Federico, 1996).
There are many advantages of CAT tests that may make the use of
paper-and-pencil tests obsolete for skills assessments. There is a reduction
in test time, because fewer items are needed to gauge the level of competency
of the test-taker (Embretson & Reise, 2000; Hambleton et al., 1991; Hulin
et al., 1983; Kline, 1986; Roper, Ben-Porath, & Butcher, 1995; Wainer, 2000).
Rather than the test-taker having to answer every item on every level of difficulty,
the CAT engine adapts to the test-taker and feeds a succession of items based
on the test-taker's ability. This leads to another advantage: the
test-taker is likely to experience less test anxiety or, on the other hand,
less boredom, since the level of item difficulty matches the test-taker's skill level
(Hambleton et al., 1991; Hulin et al., 1983; Roper, Ben-Porath, & Butcher,
1995). Along these same lines, the test-taker will not be overly intimidated
by items that are too challenging, nor become careless by rapidly answering less
challenging items (Hambleton et al., 1991; Hulin et al., 1983). Another advantage
to the shorter test length is the elimination or reduction of test fatigue (Hambleton
et al., 1991; Hulin et al., 1983). It is very difficult to maintain an intense
cognitive pace for an extended period of time. Finally, the reduced testing
time results in less monitoring time by a proctor (Cohen & Swerdlik, 1999;
Hambleton et al., 1991; Murphy & Davidshofer, 2001), which results in a
monetary savings.
The reduction of items and the ability to not expose all of the
items in every testing session has inherent advantages for CAT over paper-and-pencil.
One advantage is test security (Hambleton et al., 1991; Murphy & Davidshofer,
2001; Overton, Harms, Taylor, & Zickar, 1997). IT certification is a big
business, and item thieves go to great lengths to memorize the items and sell
them to unscrupulous test-takers. Another advantage is that fewer items means
less overexposure of the items, thereby allowing really good items to remain
in the item bank longer.
CAT has a major advantage over paper-and-pencil tests when it
comes to scoring. Scoring is immediate. When a company needs to hire someone,
or an applicant is looking for a job, having access to the scores immediately
significantly reduces lag time in job placement. Another advantage is that human
data entry errors are eliminated, and so is the cost of hiring someone to input
the data (Cohen & Swerdlik, 1999; Federico, 1992; Hambleton et al., 1991;
Kline, 1986).
The above-mentioned advantages to CAT are all very important,
but what makes CAT such a vital measurement tool is its accuracy in evaluating
the test-taker (Federico, 1992; Hulin et al., 1983; Kline, 1986; Roper et al.,
1995; Wainer, 2000). Psychometricians use CAT as a tool to differentiate the
ability levels of test-takers, so that assessing their skill level is more clearly
defined. If a test item is too easy or too difficult for the test-taker, then the
response provides little or no useful information regarding the test-taker's ability (Drasgow
& Olson-Buchanan, 1999). This accuracy in assessing test-takers is especially
helpful in the IT world where an applicant may claim to possess a certain level
of knowledge and skill, but in fact does not.
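The point about items that are too easy or too difficult can be illustrated
numerically. The sketch below assumes a one-parameter (Rasch) item response
model, which this paper does not specify; the function names and the ability
and difficulty values are hypothetical and chosen only for illustration.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability that a test-taker of ability theta
    answers an item of difficulty b correctly (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Item information under the Rasch model: p * (1 - p).
    It peaks when item difficulty equals the test-taker's ability."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

theta = 0.0  # hypothetical ability of one test-taker
for b in (-3.0, 0.0, 3.0):  # very easy, well matched, very hard
    print(b, round(item_information(theta, b), 3))
```

Under this assumed model, the well-matched item carries several times the
information of either extreme item, which is why an adaptive engine gains
little from administering items far from the test-taker's level.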
The above advantages are more consequential once one reads the
studies comparing CAT to traditional paper-and-pencil tests showing that CAT
equals or outweighs paper-and-pencil in both reliability and validity. Federico
(1992) found no significant difference when measuring the two tests using multivariate
statistics, and that discriminant validity was higher for the CAT test. Overton
et al. (1997) determined that the reliability estimates of the CAT and paper-and-pencil
tests were not significantly different. Roper et al. (1995) found no criterion-related
validity differences between the two assessment formats that would indicate
that paper-and-pencil tests are superior. Finally, Pinsoneault (1996) discovered
that both the validity and reliability of the CAT test were higher than the
identical paper-and-pencil test.
CAT is based on item response theory (IRT), which pinpoints the
test-taker's ability level and tailors the flow of the test according to the
skill level of the test-taker (Embretson & Reise, 2000; Hambleton et al.,
1991; Hulin et al., 1983; Lord, 1980; Wainer, 2000). Since each test-taker presumably
sees a personalized version of the particular test, it is assumed that all items
are measuring the same skill. In essence, all test-takers are receiving a parallel
form of the test. In order for each form of the test to be parallel, there must
be an underlying construct or latent trait for all of the items. Therefore,
the success of the test is contingent upon each item adhering to an underlying
construct (Drasgow & Olson-Buchanan, 1999; Hambleton et al., 1991; Hulin
et al., 1983; Lord, 1980).
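To make the tailoring process concrete, the following is a minimal sketch of
an adaptive loop. It is not the engine described in this paper. It assumes a
Rasch model, a standard normal prior on ability, a grid-based posterior-mean
(EAP) ability estimate, and a "closest difficulty" selection rule; the item
bank, the simulated test-taker, and all function names are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct answer (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Candidate ability values from -4.0 to 4.0 in steps of 0.1.
GRID = [i / 10.0 for i in range(-40, 41)]

def eap_estimate(responses):
    """Posterior-mean ability given (difficulty, correct) pairs,
    using a standard normal prior evaluated on GRID."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized prior
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(GRID, weights)) / total

def next_item(theta, bank, used):
    """Pick the unused item whose difficulty is closest to theta,
    which maximizes Rasch item information."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return min(candidates, key=lambda i: abs(bank[i] - theta))

def run_cat(bank, answer_fn, test_length):
    """Administer test_length items, re-estimating ability after each."""
    used, responses = set(), []
    theta = 0.0  # start at the prior mean
    for _ in range(test_length):
        i = next_item(theta, bank, used)
        used.add(i)
        responses.append((bank[i], answer_fn(bank[i])))
        theta = eap_estimate(responses)
    return theta, responses

# Hypothetical bank of item difficulties and a simulated test-taker who
# answers correctly every item easier than 1.0.
bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
theta, responses = run_cat(bank, lambda b: b < 1.0, test_length=5)
print(round(theta, 2), responses)
```

Because every item is placed on the same difficulty scale, two test-takers who
see different items still receive estimates on one common ability scale, which
is the sense in which each personalized form is treated as parallel.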
Test Items
Before the test is developed and the construct is named, it is
necessary to be familiar with the basics of item construction for a CAT test.
In this instance, we are focusing on the test items most conducive to online
skills assessments. First, the format will be multiple-choice with a clearly
stated and succinct question, and five choices: four well-constructed
distracters and one unambiguously correct choice, scored dichotomously (Kline, 1986; Osterlind, 1989).
The purpose is to measure the test-taker's knowledge and skill, not to make
the test overly complicated by including inadequately constructed items.
The test developer will create six to eight times the number of
items that each test-taker will actually see. There are several reasons for
the abundance of test questions. We want to ensure that the test is secure
and that the items will not be overexposed, and we want to be able to eliminate poorly performing
items without disrupting the composition of the test (Kline, 1986). An example
of an ideal test bank is 300 to 400 items for a test that is 50 items in length.
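The arithmetic behind the bank-size guideline can be sketched as follows. The
function names are hypothetical. The exposure figure is simply the test length
divided by the bank size, that is, the average share of testing sessions in
which any one item appears.

```python
def bank_size_range(test_length, low_multiple=6, high_multiple=8):
    """Bank size implied by the six-to-eight-times guideline."""
    return test_length * low_multiple, test_length * high_multiple

def mean_exposure_rate(test_length, bank_size):
    """Average proportion of sessions in which a given item is seen."""
    return test_length / bank_size

low, high = bank_size_range(50)
print(low, high)  # 300 400
print(round(mean_exposure_rate(50, low), 3),
      round(mean_exposure_rate(50, high), 3))
```

A larger bank lowers the average exposure of each item, which is the link
between bank size, security, and item longevity made above.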
The difficulty level of each item must be carefully considered,
because the test needs to have a balance of difficulty levels for each topic
being tested, so that all forms of the test are parallel (Anastasi & Urbina,
1997; Kline, 1986; Osterlind, 1989). Regardless of the difficulty level, all
test-takers will be tested on the same underlying construct. Also, if there
are several topic areas within the test, they must adhere to the underlying
construct and be balanced across the test (Embretson & Reise, 2000; Kline,
1986; Osterlind, 1989).
Osterlind (1989) states that test items are a unit of measurement
displaying a relationship to a specified construct. This relationship allows
the psychometrician to accurately gauge how much of the construct the test-taker
possesses when using a multiple-choice, dichotomously scored test. Sands, Waters,
and McBride (1997) posit that the item pool, item sampling, and item scoring
are the impetus behind test validity and reliability. The test items are paramount
to all aspects of building an effective assessment.
Validity
Test validity refers to how well a measurement procedure measures
what it intends to measure (Cascio, 1998; Cohen & Swerdlik, 1999; Kerlinger,
1992; Murphy & Davidshofer, 2001). In our case, the measurement procedure is an online
skills assessment. We are not measuring the validity of the skills assessment
itself; rather we are validating the inferences drawn from the assessment scores:
what those scores mean and how the assessment is used (Cascio, 1998; Cohen &
Swerdlik, 1999; Murphy & Davidshofer, 2001). There are three key procedures
for measuring validity: construct, content and criterion-related validity.
Construct Validity
Many theoreticians believe that construct validity is the overarching category
under which all other validity falls (Kerlinger, 1992; Murphy & Davidshofer,
2001; Osterlind, 1989). Previously mentioned is the fact that the inferences
of the skills assessment are validated and not the assessment itself. Construct
validity measures the test score in comparison to the construct on which the
test is based (Cascio, 1998; Kline, 1986). If a telecom engineer scores high
on a test written to the construct of Internet Protocols, then we can interpret
the scores as telling us that the telecom engineer possesses the requisite skills
related to the construct of Internet Protocols. Here we are validating the inference
that the test scores show a high degree of construct relatedness with regard
to Internet Protocols.
This brings us to the point of unidimensionality. With item response
theory and CAT, it is vital that the online skills assessment is unidimensional,
that is, has one construct and only one construct (Drasgow & Olson-Buchanan,
1999; Hambleton et al., 1991; Hulin et al., 1983; Wainer, 2000). This is to
ensure that each form of the test is measuring the exact same construct. Otherwise,
each test-taker will be taking a different test, and we will not know what exactly
each test is measuring, because the tests are no longer parallel (Drasgow &
Olson-Buchanan, 1999; Embretson & Reise, 2000; Hulin et al., 1983; Wainer,
2000). This does not mean that a skills assessment cannot have more than one
topic, just that each topic has one underlying construct.
Content Validity
Content validity concerns how well a test samples the universe of content
that it is meant to represent (Cascio, 1998; Cohen & Swerdlik, 1999;
Kerlinger, 1992). Sampling from a universe sounds like a daunting task, but
it can be brought into perspective by conducting a job analysis. A job analysis
is a dynamic, yet systematic set of procedures used to determine job duties
and tasks and to determine the knowledge, skills, and abilities (KSAs) needed
to perform the job (Whetzel & Wheaton, 1997). For the job of a telecom engineer,
a psychologist would ascertain the knowledges and skills required for one to
be successful in that job. The knowledges and skills would become the skills
assessment subtopics. Related groupings of those knowledges and skills are
called competencies, and the competencies become the test topics.
The topics and subtopics serve as the content of the test. Notice
that the word topic is plural. A skills assessment can have more than one topic,
as long as all topics within the assessment are unidimensional. The job analysis
focuses strictly on the job of a telecom engineer. Anything that could be shared
by another position, or is not shared by all telecom engineers would not be
included. Otherwise, the skills assessment would not be generalizable to all
telecom engineers or the construct would overlap with another skills assessment
and therefore another construct (Whetzel & Wheaton, 1997). The skills assessment
can have more than one topic, provided that each topic is equally balanced with
test items from each subtopic, and equally balanced levels of difficulty per
item per subtopic. This allows the CAT engine to randomly draw from each topic
area, while confirming that all forms of the test are parallel (Drasgow &
Olson-Buchanan, 1999; Hambleton et al., 1991; Hulin et al., 1983; Lord, 1980).
By including a variety of topics, the psychometrician increases
the richness of the online skills assessment. With the computer scoring the
test, a test-taker will be evaluated not only on the construct, but on the multiple
topic areas as well. When psychometricians talk about the accuracy of assessing
a test-taker through CAT, one point they are referring to is the capability
of the computer to score each individual topic area, in addition to the test
as a whole (Hambleton et al., 1991; Hulin et al., 1983; Lord, 1980). This is
especially beneficial if a company needs to hire a telecom engineer with a specific
knowledge and skill set in an area such as protocol interoperability, but not
in multicast protocols. The level of specificity of the scoring in a CAT test
can reveal so much about the depth and type of knowledge a test-taker possesses
that a paper-and-pencil test could never do (Hambleton et al., 1991).
Criterion-Related Validity
Criterion-related validity is useful when one wants to predict a behavior from
a test (Anastasi & Urbina, 1997; Cascio, 1998; Kerlinger, 1992). Cascio
(1998) explains that we would compare two sets of scores - the test score and
a criterion - to test the hypothesis that the test score is positively correlated
to the criterion in order to predict future performance. Concurrent validity
and predictive validity are the two ways to measure criterion-related validity.
Concurrent validity is when we compare a test-taker's score to
another currently available criterion, for instance, scores from a performance
evaluation. The two scores are compared and from that we can assess one's performance
(Anastasi & Urbina, 1997; Cascio, 1998; Kerlinger, 1992). A typical way
to perform concurrent validity is to have existing employees take the online
skills assessment, and then compare the scores of the assessment to their scores
on the performance evaluation. From there we would create a cut-off score that
indicates what level of knowledge and skill one must possess for that person
to succeed on the job. Future candidates would need to score above the cut-off
point to be eligible for the position being offered.
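The comparison of assessment scores with performance evaluation scores is
commonly summarized as a correlation coefficient. The sketch below computes a
Pearson correlation for six incumbents; the scores are invented for
illustration only, and the helper name is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one assessment score and one performance rating
# per existing employee.
assessment = [62, 70, 75, 81, 88, 93]
performance = [2.8, 3.1, 3.0, 3.9, 4.2, 4.6]

validity_coefficient = pearson_r(assessment, performance)
print(round(validity_coefficient, 2))
```

A strong positive coefficient would support using the assessment score to
infer job performance; the cut-off score itself is a separate judgment that
this sketch does not attempt to set.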
Concurrent validity has some problems in that if an incumbent
who is taking the skills assessment feels very secure in the job, the incumbent
may not try as hard to score well on a test as an applicant. Also, the incumbent
may feel more comfortable in the position and have greater years of experience
than the applicant, and that may affect test scores. The subjective nature of
performance evaluations may not give an accurate estimate of the true performance
of the incumbent (Wainer, 2000). If the evaluator scores people too high, too
low, or always in the middle, then we cannot accurately gauge the true performance
of the employee.
Predictive validity has a time lapse between the two comparison
scores. First, an applicant takes the test and is hired using the normal rules
of hiring the company employs. Next, once the applicant has been in the job
for approximately six months, a performance review is implemented. Finally,
one compares the test score to the performance evaluation and a metric is established
(Cascio, 1998; Kerlinger, 1992).
There are some issues related to predictive validity. With the
time lapse, the company may have undergone some changes and experienced turnover.
Not only are there fewer employees, but also the sample size may be too small
to provide worthwhile data. The apprehension one experiences when taking a test,
especially for a new position, may negatively affect the test scores. Conversely,
a confident test-taker may score higher than a nervous test-taker, falsely amplifying
the differences in the two scores. Similar to concurrent validity, if the performance
evaluation produces an inaccurate measure, then using the scores as a criterion
has no meaning when using it as a comparison (Wainer, 2000).
Reliability
Test reliability refers to the accuracy or precision of a measurement
tool. There are two ways to look at reliability. One is the stability of the
measurement tool over repeated trials, or the replicability of a score, and
the other is the measurement tool being free from unsystematic errors, or the
internal consistency of the scores (Anastasi & Urbina, 1997; Cascio, 1998;
Kerlinger, 1992; Kline, 1986; Wainer, 2000). In both cases, reliability is about
obtaining the true score of a test-taker. When a test-taker takes the same test
repeatedly, ideally the test score should remain the same. That is the test-taker's
true score. However, for various reasons, there is variance in the test-taker's
score, and the amount of variance is how we measure the reliability of the test.
The accuracy of an assessment tool can be measured several ways. Here we will
consider test-retest reliability, alternate form reliability, split-half reliability,
and inter-item consistency.
Test-Retest Reliability
In test-retest reliability, the same form of the test is given over two different
sessions with a time lapse between the two test administrations (Anastasi &
Urbina, 1997; Cascio, 1998; Kerlinger, 1992). Error in test-retest is due to
temporal stability. The longer the time between testing, the greater the error
and the lower the reliability score (Cascio, 1998). On the other hand, if the
test-taker studies for the second administration or goes through training, the
variance will be inaccurately less, because the conditions are not the same
(Wainer, 2000). If the time lapse is shorter, then something else will affect
the accuracy of the test-retest reliability and that is the memorization of
the test item choices (Kline, 1986).
Alternate Form Reliability
Alternate form reliability is also known as parallel form, because two forms
of the same test are administered either with or without a time lapse (Cascio,
1998; Kerlinger, 1992). Whereas test-retest reliability has error variance due to
temporal stability, alternate form reliability has error variance due to content
sampling, or content and time if there is a time interval between administrations.
Content sampling refers to the subject matter or content of the items in the
topic areas. If a telecom engineer is being tested on routing protocols, he
may know some aspects of routing protocols more than others. For example, if
the topic covers distance vector versus link state and he has only cursory knowledge
of link state but a command of distance vector, and one item is focused on distance
vector and the other on link state, then his scores will be skewed based on
the content covered in that topic area. Cascio (1998) does not recommend using
alternate form reliability, because of the expense and difficulty in constructing
two parallel forms, plus the correlation between the two assessment forms provides
a very conservative reliability estimate.
Split-Half Reliability
Split-half reliability is used when a single form of a test is given in a single
administration and then the test is statistically split for the correlation
of the two scores (Anastasi & Urbina, 1997; Cascio, 1998; Cohen & Swerdlik,
1999; Murphy & Davidshofer, 2001). This type of reliability is also referred
to as internal consistency, given that it is one set of items compared to another
set of items from the same test (Cascio, 1998; Cohen & Swerdlik, 1999).
Since time is not a factor in split-half reliability, temporal stability is
not an issue here; however, content sampling is the contributory cause of error
variance.
One consideration for split-half reliability is how to divide the test into
two statistically equal halves. Some psychometricians choose to split into even
and odd items, so that test anxiety at the beginning or burnout at the end
of the test will not contribute to error variance (Anastasi & Urbina, 1997;
Murphy & Davidshofer, 2001). Cascio (1998) argues that the method of dividing
even and odd items will not actually measure internal consistency, but rather
equivalence, which will erroneously inflate the reliability estimate. Another
approach is to randomly select items for the split halves (Cascio, 1998; Cohen
& Swerdlik, 1999).
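An odd-even split can be sketched as follows. The response matrix is
hypothetical (rows are test-takers, columns are items scored 0 or 1), and the
function names are illustrative. The sketch also applies the standard
Spearman-Brown correction, which estimates full-length reliability from the
half-test correlation; that step is an addition here and is not described in
the paragraphs above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_reliability(score_matrix):
    """Return (half-test correlation, Spearman-Brown corrected value)."""
    odd = [sum(row[0::2]) for row in score_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in score_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown for doubled length
    return r_half, r_full

# Hypothetical responses: 6 test-takers by 6 dichotomously scored items.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(scores)
print(round(r_half, 3), round(r_full, 3))
```

Replacing the two slices with a random partition of the item indices would
give the random-split alternative mentioned above.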
Inter-Item Consistency
Inter-item consistency refers to the correlation of each individual item, rather
than all of the items combined, as we see in internal consistency (Cohen &
Swerdlik, 1999). Cohen and Swerdlik (1999) explain that inter-item consistency
is most valuable in evaluating the homogeneity or unidimensionality of a test.
Like internal consistency, inter-item consistency is calculated using one form
of a test in one administration. Unlike internal consistency, each item, not
each half of the test is correlated. This provides a more accurate metric of
each item and of unidimensionality in general. Since content and unidimensionality
are key to a higher reliability score, error variance is attributable to content
sampling and content multidimensionality or heterogeneity (Cascio, 1998).
The Kuder-Richardson formula 20 (KR-20) and coefficient alpha
are the principal methods of calculating inter-item consistency. KR-20 is the
mean of all possible split halves of a test, with the KR-20 being preferred
over the split-half method (Cohen & Swerdlik, 1999). Having the ability
to compare at the item level is superior to comparing at the test level when
evaluating homogeneity. The more homogeneous the items, the closer the KR-20
and split-half calculations will be. The more heterogeneous the items, the more
the split-half estimate will exceed the KR-20 (Anastasi & Urbina, 1997;
Cascio, 1998; Cohen & Swerdlik, 1999). KR-20 is ideal for dichotomously
scored tests, like our multiple choice online skills assessment (Cascio, 1998;
Cohen & Swerdlik, 1999).
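As a minimal sketch, KR-20 for a dichotomously scored test can be computed
from a response matrix as k/(k-1) * (1 - sum(p*q) / variance of total scores),
where k is the number of items, p is the proportion passing an item, and
q = 1 - p. The data below are hypothetical, the function name is illustrative,
and the sketch assumes the population form of the variance (dividing by the
number of test-takers).

```python
from statistics import pvariance

def kr20(score_matrix):
    """Kuder-Richardson formula 20 for 0/1 item scores.
    Rows are test-takers; columns are items."""
    n_people = len(score_matrix)
    k = len(score_matrix[0])
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n_people  # proportion passing
        sum_pq += p * (1.0 - p)                             # p * q for this item
    total_var = pvariance([sum(row) for row in score_matrix])
    return (k / (k - 1)) * (1.0 - sum_pq / total_var)

# Hypothetical responses: 6 test-takers by 6 items.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(kr20(scores), 3))
```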
The only difference between KR-20 and coefficient alpha is in
the formula. In KR-20, we sum, across items, the product of the proportion of
test-takers who pass each item and the proportion who do not. In coefficient alpha, that piece is replaced
by the sum of the variances of the item scores. In other words, we calculate
the variance of the test-takers' scores on each item, and then we add these variances
across all of the items (Cascio, 1998; Murphy & Davidshofer, 2001). This
formula change allows both dichotomous and continuous items to be correlated
using coefficient alpha. Cohen and Swerdlik (1999) define coefficient alpha
as the mean of all possible split halves, just like the KR-20, but with the
correlations corrected using the Spearman-Brown formula.
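The formula change can be sketched directly: coefficient alpha replaces the
sum of p*q with the sum of item variances. The function name and both data
sets below are hypothetical, and the population form of the variance is
assumed throughout. With that choice, alpha computed on 0/1 items reproduces
KR-20, because the variance of a 0/1 item is p*q.

```python
from statistics import pvariance

def coefficient_alpha(score_matrix):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance).
    Rows are test-takers; columns are items (any numeric scoring)."""
    k = len(score_matrix[0])
    item_vars = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    total_var = pvariance([sum(row) for row in score_matrix])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Hypothetical dichotomous responses (6 test-takers by 6 items).
dichotomous = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Hypothetical partial-credit items scored 0 to 4 (4 test-takers by 3 items).
ratings = [
    [4, 3, 4],
    [3, 3, 2],
    [2, 1, 2],
    [1, 0, 1],
]
print(round(coefficient_alpha(dichotomous), 3))
print(round(coefficient_alpha(ratings), 3))
```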
Cohen & Swerdlik (1999) recommend using coefficient alpha
over all other measures of error variance. Cascio (1998) suggests using caution
with coefficient alpha, because it is sensitive to the number of items (the
more items, the greater the reliability), item intercorrelations, and
dimensionality. Cortina as cited in Cascio (1998) advises the psychometrician
to establish the unidimensionality of the test, presumably through factor analysis,
before incorporating coefficient alpha as a metric of reliability.
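The sensitivity to test length and item intercorrelations can be seen in the
standardized form of alpha, k*r / (1 + (k - 1)*r), where k is the number of
items and r is the mean inter-item correlation. The sketch below holds r fixed
at a hypothetical value of 0.2 and varies k; the function name is illustrative.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized coefficient alpha from the number of items and the
    mean inter-item correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)

for k in (10, 25, 50):
    print(k, round(standardized_alpha(k, 0.2), 3))
```

With the same modest intercorrelation, a longer test yields a noticeably
higher alpha, so a high value alone does not establish unidimensionality.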
Building a Solid Online Skills Assessment
Now that we have a basic understanding of validity and reliability,
it is time to bring the information together to create a well respected and
effective online skills assessment using CAT. Unidimensionality is a common
theme running through validity and reliability, so our first step is to demonstrate
that our test has one construct. Performing a factor analysis will allow us
to see mathematically the homogeneity of our test items (Sands et al., 1997).
It is important to have an objective method to assess the sampling dimensionality,
because when we do validation studies, the information will be much more subjective
(Anastasi & Urbina, 1997). Once we establish what items belong to a specific
assessment, then we can move on to reviewing our assessment.
Several subject matter experts would review each item of the
test to evaluate the content. Necessary revisions would be made based on the
subject matter expert feedback. The factor analysis and content review would
take place during a pre-test or beta test period. We have to conduct a beta
test with actual test-takers, so that the feedback and statistics would reflect
that seen in the actual test. Having a large population to work with is important,
because the sample size for the factor analysis must be ten times greater than
the number of items (Kline, 1986). From the front end, the intended users of
the test take the beta test online as if it were the actual test and evaluate
the content of the items. From the back end, the psychometricians evaluate the
results of the factor analysis and the reliability of the test items and the
test as a whole.
In discussing reliability, many options were presented. With the
online skills assessment being administered on a CAT engine, it appears that
using KR-20 would be most useful. Since the online skills assessment is delivered
via computer, we have the ability to continually monitor the test items. With
increasing iterations of the online skills assessment being taken, the reliability
statistics will continue to generate. As the sample size increases, so should
our reliability. Also, we will be able to evaluate the items on an ongoing basis.
This may mean that more items will be eliminated, if their reliability is poor,
which will allow the test to continue to improve.
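One way such ongoing item monitoring might be operationalized is with a
corrected item-total correlation, which correlates each item with the total
score on the remaining items. This statistic, the 0.2 threshold, the response
data, and the function names are all assumptions made for illustration; the
paper does not prescribe a specific item-level statistic.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def flag_weak_items(score_matrix, threshold=0.2):
    """Return (item index, correlation) for items whose corrected
    item-total correlation falls below the threshold."""
    flagged = []
    for j in range(len(score_matrix[0])):
        item = [row[j] for row in score_matrix]
        rest = [sum(row) - row[j] for row in score_matrix]  # total without item j
        r = pearson_r(item, rest)
        if r < threshold:
            flagged.append((j, r))
    return flagged

# Hypothetical responses: the last item runs against the other six.
scores = [
    [1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
]
flagged = flag_weak_items(scores)
print(flagged)
```

Rerunning a check like this as responses accumulate would surface items for
review by a psychometrician before any decision to remove them.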
As we remove items, our item pool is decreasing. KR-20 can handle
the decrease in the item pool, better than coefficient alpha (Cascio, 1998).
We want to have stringent standards, which is why we selected KR-20 over split-half,
but we do not want to jeopardize our test, because of a sensitive reliability
measure. Kerlinger (1992) discusses the principle behind improving reliability.
He calls it the "maxmincon principle" and defines it by saying that
we must strive to "maximize the variance or the individual differences
and minimize the error variance." CAT permits us to accurately assess each
test-taker who will have a pinpointed score based on his knowledges and skills,
thereby maximizing the individual differences. The error variance will be assessed
on an ongoing basis, which will allow us to make continual improvements to our
test items. This will result in minimizing the error variance.
Once our beta test has provided us with the necessary information,
it is time to post the online skills assessment as the finished product. We
will have made all of the changes that were brought to our attention during
beta testing, and the final product will be ready to assess the knowledges and
skills of the test-takers. As previously mentioned, the CAT format allows us
to make constant improvements to the assessment, even as it continues to operate.
The factor analysis provided evidence of construct validity, and
the pre-beta reviewers and the aforementioned job analysis provided evidence of
content validity. Employers need to conduct their own validation studies to protect
themselves against adverse impact as well as to interpret the scores validly
(Whetzel & Wheaton, 1997). Employers can examine criterion-related validity
by correlating the test scores with a criterion measure of job performance. Whetzel and Wheaton (1997)
suggest using scores from a behaviorally anchored rating scale from a performance
review as the criterion, as they have a numeric rating and would provide a somewhat
objective metric for evaluating criterion-related validity.
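As an illustration of the correlation an employer might compute, here is a minimal Python sketch of a Pearson correlation between assessment scores and performance-review ratings. The variable names and the numbers are hypothetical, invented only for the example; a real validation study would also need an adequate sample and attention to issues such as range restriction, which this sketch does not address.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: assessment scores and behaviorally anchored ratings
# for the same six employees, in the same order.
test_scores = [52, 61, 70, 74, 83, 90]
bars_ratings = [2, 3, 3, 4, 4, 5]
validity_coefficient = pearson_r(test_scores, bars_ratings)
```

The resulting coefficient is the criterion-related validity estimate for that sample; the closer it is to 1, the more strongly the assessment scores track the performance ratings.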
Conclusion
CAT has opened up an entirely new way of measuring the knowledges
and skills of test-takers, one that traditional paper-and-pencil tests could never
offer. Not only do validity and reliability not suffer with CAT, they often
improve. The advantages of CAT include reduced test administration time,
fewer items exposed, increased security, increased assessment accuracy,
immediate scoring, and continual statistical monitoring. CAT has changed
the way psychometricians conduct assessments. Now we can not only computerize a
skills assessment and make it adaptive, but also deliver
it online via the Internet. This eliminates the need for centralized testing
centers and test administrators, which is an enormous financial savings. Test-takers
can take an online assessment at any time from any computer that has Internet
access. The convenience of online skills assessments and CAT is astounding.
As more psychometricians focus on CAT and its immense capabilities, we are sure
to see vast improvements in the quality of testing instruments in the years
to come.
References
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th
ed.). Upper Saddle River, NJ: Prentice-Hall.
Cascio, W. F. (1998). Applied psychology in human resource management (5th ed.).
Upper Saddle River, NJ: Prentice-Hall.
Cohen, R. J., & Swerdlik, M. E. (1999). Psychological testing and assessment:
An introduction to tests and measurements (4th ed.). Mountain View, CA: Mayfield.
Drasgow, F., & Olson-Buchanan, J. B. (1999). Innovations in computerized
assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Federico, P. A. (1992). Assessing semantic knowledge using computer-based and
paper-based media. Computers in Human Behavior, 8, 169-181.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of
item response theory. Newbury Park, CA: Sage.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory:
Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Kerlinger, F. N. (1992). Foundations of behavioral research (3rd ed.). Fort
Worth: Harcourt Brace.
Kline, P. (1986). A handbook of test construction: Introduction to psychometric
design. London: Methuen.
Lord, F. M. (1980). Applications of item response theory to practical testing
problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Murphy, K. R., & Davidshofer, C. O. (2001). Psychological testing: Principles
and applications (5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Osterlind, S. J. (1989). Constructing test items. Boston: Kluwer Academic.
Overton, R. C., Harms, H. J., Taylor, L. R., & Zickar, M. J. (1997). Adapting
to adaptive testing. Personnel Psychology, 50, 171-185.
Pinsoneault, T. B. (1996). Equivalency of computer-assisted and paper-and-pencil
administered versions of the Minnesota Multiphasic Personality Inventory-2.
Computers in Human Behavior, 12(2), 291-300.
Roper, B. L., Ben-Porath, Y. S., & Butcher, J. N. (1995). Comparability
and validity of computerized adaptive testing with the MMPI-2. Journal of Personality
Assessment, 65(2), 358-371.
Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive
testing: From inquiry to operation. Washington, DC: American Psychological Association.
Wainer, H. (2000). Computer adaptive testing: A primer (2nd ed.). Mahwah, NJ:
Lawrence Erlbaum Associates.
Whetzel, D. L., & Wheaton, G. R. (1997). Applied measurement methods in
industrial psychology. Palo Alto, CA: Davies-Black.