Brainbench - a Previsor company

Measuring Validity and Reliability of Computer Adaptive Online Skills Assessments

September 24, 2001

By Julie A. Galli

Developing a test that accurately measures the skills of a test-taker is at the core of a psychometrician's profession. Computer adaptive testing allows the psychometrician to take this ambition to new heights. With computer adaptive testing being a relatively new, but definitely underutilized, mode of test delivery, it is vital that we explore its vast capabilities.

Computer adaptive testing is not simply a tool to administer a test. It can provide abundant data on all types of statistical measurement, including test-takers' scores, item analysis, validity, and reliability. It takes computer analysis to another level by supplying more precise scores, a multitude of data, and substantial savings of time and money.

We will explore computer adaptive testing using an online skills assessment as the measurement to test information technology professionals. Central to the skills assessment is item construction. Without solid, well performing test items, we have nothing; therefore, we will address test items with respect to online skills assessments. Next we will move into validity and reliability and how they impact the assessment. Finally, we will conclude with a culmination of all factors discussed and how they pull together to form a strong and sound assessment tool.

Online Skills Assessment

This paper will focus on measuring validity and reliability of online skills assessments delivered on a computer adaptive test engine. Before delving into the psychometric aspects of this topic, it is best to define what encompasses an online skills assessment. With the explosion of information technology (IT), there is a great need for qualified IT professionals in an array of technological positions at different levels of proficiency and for a variety of companies. Fortunately, for any given area of IT expertise, the knowledges and skills required are generalizable across any position requiring that specific expertise. For example, a telecom engineer possesses certain knowledges and skills including maintaining an Internet Protocol network using Cisco routers. There are a great number of companies needing IT professionals with this background. Rather than each individual company creating its own parameters for this position, since each company would be looking for the same parameters, it is simpler to have one standard to follow. This is where online skills assessments are beneficial. Many of the IT positions require certification, since there are specific knowledges and skills these professionals must possess and the industry standards are uniform. Furthermore, these knowledges and skills are very concrete, and not at all imprecise as is an ability or a personal characteristic.

To clarify, a knowledge is information learned through training or experience; a skill is an attribute involving cognitive and motor capabilities that are acquired through practice; an ability is an innate attribute involving cognitive or motor capabilities; and a personal characteristic refers to work ethic, values, interests, preferences and the like (Whetzel & Wheaton, 1997). The online certifications can focus on the requisite knowledges and skills, since many companies will have the same needs. However, each company has its own culture and specific needs and the abilities and personal characteristics may differ. For those attributes, the company can provide its own assessments, separate from the online skills assessments.

What makes the union of IT professionals and online skills assessments so ideal is that the IT professional is more than comfortable with a computer and the Internet in general. IT professionals from all over the world can take an online skills assessment from any computer, anywhere, at any time, without having to worry about finding a testing center that has hours conducive to their schedule. They have no apprehension about the computer or the Internet, so there is no uncomfortable transition from paper-and-pencil tests to online tests. Given this background information on online skills assessments, it is necessary to take the next step in discussing the mode of delivery, and that is computer adaptive testing.

Computer Adaptive Testing

Computer adaptive testing (CAT) is a measurement instrument used to test individuals using a computer engine that adapts to the skill level of the test-taker by administering successive test items ranging in difficulty based on the test-taker's performance on the previous items (Cohen & Swerdlik, 1999; Embretson & Reise, 2000; Federico, 1992; Hambleton, Swaminathan, & Rogers, 1991; Hulin, Drasgow, & Parsons, 1983; Murphy & Davidshofer, 2001; Wainer, 2000). The origins of adaptive testing date back to 1908, when Alfred Binet used adaptive testing on his intelligence tests for children: he asked age-specific questions and ended the examination when a child answered a certain number of questions in a row incorrectly (Hambleton et al., 1991). Frederic Lord, who is known as the "Father of Modern Testing," is the person who developed item response theory and computer adaptive testing as we use it today (Hambleton et al., 1991; Lord, 1980).

In recent years, CAT has grown in popularity, mainly due to its advantages over traditional paper-and-pencil tests. CAT is not beneficial in all applications (personality tests, for instance), but for skills tests it is unbeatable. In fact, the disadvantages that researchers note are all nullified when using online skills assessments for IT professionals. For example, some of these disadvantages are the test-taker's fear of or unfamiliarity with computers, the expense of building a testing center, and the difficulty in finding an existing testing center that is proctored and has convenient hours (Kline, 1986; Pinsoneault, 1996; Federico, 1996).

There are many advantages of CAT that may make the use of paper-and-pencil tests obsolete for skills assessments. There is a reduction in test time, because fewer items are needed to gauge the level of competency of the test-taker (Embretson & Reise, 2000; Hambleton et al., 1991; Hulin et al., 1983; Kline, 1986; Roper, Ben-Porath, & Butcher, 1995; Wainer, 2000). Rather than the test-taker having to answer every item on every level of difficulty, the CAT engine adapts to the test-taker and feeds a succession of items based on the test-taker's ability. This leads to another advantage: the test-taker will experience less test anxiety or, on the other hand, less boredom, since the level of item difficulty matches the test-taker's skill level (Hambleton et al., 1991; Hulin et al., 1983; Roper et al., 1995). Along these same lines, the test-taker will not be overly intimidated by items that are too challenging, nor become careless by rapidly answering less challenging items (Hambleton et al., 1991; Hulin et al., 1983). Another advantage of the shorter test length is the elimination or reduction of test fatigue (Hambleton et al., 1991; Hulin et al., 1983). It is very difficult to maintain an intense cognitive pace for an extended period of time. Finally, the reduced testing time results in less monitoring time by a proctor (Cohen & Swerdlik, 1999; Hambleton et al., 1991; Murphy & Davidshofer, 2001), which results in a monetary savings.

The reduction of items and the ability to not expose all of the items in every testing session has inherent advantages for CAT over paper-and-pencil. One advantage is test security (Hambleton et al., 1991; Murphy & Davidshofer, 2001; Overton, Harms, Taylor, & Zickar, 1997). IT certification is a big business, and item thieves go to great lengths to memorize the items and sell them to unscrupulous test-takers. Another advantage is that fewer items means less overexposure of the items, thereby allowing really good items to remain in the item bank longer.

CAT has a major advantage over paper-and-pencil tests when it comes to scoring. Scoring is immediate. When a company needs to hire someone, or an applicant is looking for a job, having access to the scores immediately significantly reduces lag time in job placement. Another advantage is that human data entry errors are eliminated, and so is the cost of hiring someone to input the data (Cohen & Swerdlik, 1999; Federico, 1992; Hambleton et al., 1991; Kline, 1986).

The above-mentioned advantages of CAT are all very important, but what makes CAT such a vital measurement tool is its accuracy in evaluating the test-taker (Federico, 1992; Hulin et al., 1983; Kline, 1986; Roper et al., 1995; Wainer, 2000). Psychometricians use CAT as a tool to differentiate the ability levels of test-takers, so that assessing their skill level is more clearly defined. If a test item is too easy or too difficult for the test-taker, the response tells us little, since such items provide no useful information regarding the test-taker's ability (Drasgow & Olson-Buchanan, 1999). This accuracy in assessing test-takers is especially helpful in the IT world, where an applicant may claim to possess a certain level of knowledge and skill, but in fact does not.

The above advantages are more consequential once one reads the studies comparing CAT to traditional paper-and-pencil tests, which show that CAT equals or outperforms paper-and-pencil in both reliability and validity. Federico (1992) found no significant difference when measuring the two tests using multivariate statistics, and found that discriminant validity was higher for the CAT test. Overton et al. (1997) determined that the reliability measurements of the CAT and paper-and-pencil tests were not significantly different. Roper et al. (1995) found no criterion-related validity differences between the two assessment formats that would indicate that paper-and-pencil tests are superior. Finally, Pinsoneault (1996) discovered that both the validity and reliability of the CAT test were higher than those of the identical paper-and-pencil test.

CAT is based on item response theory (IRT), which pinpoints the test-taker's ability level and tailors the flow of the test according to the skill level of the test-taker (Embretson & Reise, 2000; Hambleton et al., 1991; Hulin et al., 1983; Lord, 1980; Wainer, 2000). Since each test-taker presumably sees a personalized version of the particular test, it is assumed that all items are measuring the same skill. In essence, all test-takers are receiving a parallel form of the test. In order for each form of the test to be parallel, there must be an underlying construct or latent trait for all of the items. Therefore, the success of the test is contingent upon each item adhering to an underlying construct (Drasgow & Olson-Buchanan, 1999; Hambleton et al., 1991; Hulin et al., 1983; Lord, 1980).
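
To make the adaptive step concrete, here is a minimal sketch of how an engine might choose the next item. The paper does not say which IRT model or selection rule is used, so the two-parameter logistic (2PL) model, the maximum-information rule, and the three-item bank below are all assumptions made for illustration only.

```python
import math

def prob_correct(theta, a, b):
    """2PL probability that a test-taker of ability theta answers correctly.
    a = item discrimination, b = item difficulty (assumed parameters)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Information a 2PL item provides at ability theta: a^2 * P * (1 - P)."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Return the id of the unused item that is most informative at theta."""
    candidates = [item_id for item_id in bank if item_id not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

# Hypothetical item bank: item id -> (discrimination a, difficulty b)
bank = {"easy": (1.0, -2.0), "medium": (1.0, 0.0), "hard": (1.0, 2.0)}

# With a current ability estimate near the middle of the scale, the item
# whose difficulty is closest to that estimate carries the most information.
next_item = select_next_item(theta=0.1, bank=bank, administered=set())
```

Under this model an item whose difficulty sits far from the test-taker's ability contributes little information, which is one way to read the earlier point that items that are too easy or too difficult tell us little about the test-taker.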

Test Items

Before the test is developed and the construct is named, it is necessary to be familiar with the basics of item construction for a CAT test. In this instance, we are focusing on the test items most conducive to online skills assessments. First, the format will be multiple-choice with a clearly stated and succinct question, and five dichotomous choices with four well-constructed distracters and one unambiguously correct choice (Kline, 1986; Osterlind, 1989). The purpose is to measure the test-taker's knowledge and skill, not to make the test overly complicated by including inadequately constructed items.

The test developer will create six to eight times the number of items that each test-taker will actually see. There are several reasons for the abundance of test questions. We want to ensure that the test is secure, that the items will not be overexposed, and that poorly performing items can be easily eliminated without disrupting the composition of the test (Kline, 1986). An example of an ideal test bank is 300 to 400 items for a test that is 50 items in length.

The difficulty level of each item must be carefully considered, because the test needs to have a balance of difficulty levels for each topic being tested, so that all forms of the test are parallel (Anastasi & Urbina, 1997; Kline, 1986; Osterlind, 1989). Regardless of the difficulty level, all test-takers will be tested on the same underlying construct. Also, if there are several topic areas within the test, they must adhere to the underlying construct and be balanced across the test (Embretson & Reise, 2000; Kline, 1986; Osterlind, 1989).

Osterlind (1989) states that test items are a unit of measurement displaying a relationship to a specified construct. This relationship allows the psychometrician to accurately gauge how much of the construct the test-taker possesses when using a multiple-choice, dichotomously scored test. Sands, Waters, and McBride (1997) posit that the item pool, item sampling, and item scoring are the impetus behind test validity and reliability. The test items are paramount to all aspects of building an effective assessment.

Validity

Test validity refers to how well a measurement procedure measures what it intends to measure (Cascio, 1998; Cohen & Swerdlik, 1999; Kerlinger, 1992; Murphy & Davidshofer, 2001). In our case, the measurement procedure is an online skills assessment. We are not measuring the validity of the skills assessment itself; rather, we are validating the inferences drawn from the assessment scores, what those scores mean, and how the assessment is used (Cascio, 1998; Cohen & Swerdlik, 1999; Murphy & Davidshofer, 2001). There are three key procedures for measuring validity: construct, content, and criterion-related validity.

Construct Validity
Many theoreticians believe that construct validity is the overarching category under which all other validity falls (Kerlinger, 1992; Murphy & Davidshofer, 2001; Osterlind, 1989). Previously mentioned is the fact that the inferences of the skills assessment are validated and not the assessment itself. Construct validity measures the test score in comparison to the construct on which the test is based (Cascio, 1998; Kline, 1986). If a telecom engineer scores high on a test written to the construct of Internet Protocols, then we can interpret the scores as telling us that the telecom engineer possesses the requisite skills related to the construct of Internet Protocols. Here we are validating the inference that the test scores show a high degree of construct relatedness with regard to Internet Protocols.

This brings us to the point of unidimensionality. With item response theory and CAT, it is vital that the online skills assessment is unidimensional, that is, has one construct and only one construct (Drasgow & Olson-Buchanan, 1999; Hambleton et al., 1991; Hulin et al., 1983; Wainer, 2000). This is to ensure that each form of the test is measuring the exact same construct. Otherwise, each test-taker will be taking a different test, and we will not know what exactly each test is measuring, because the tests are no longer parallel (Drasgow & Olson-Buchanan, 1999; Embretson & Reise, 2000; Hulin et al., 1983; Wainer, 2000). This does not mean that a skills assessment cannot have more than one topic, just that each topic has one underlying construct.

Content Validity
Content validity concerns how well the test samples the universe of content from which it is drawn (Cascio, 1998; Cohen & Swerdlik, 1999; Kerlinger, 1992). Sampling from a universe sounds like a daunting task, but it can be brought into perspective by conducting a job analysis. A job analysis is a dynamic, yet systematic, set of procedures used to determine job duties and tasks and to determine the knowledge, skills, and abilities (KSAs) needed to perform the job (Whetzel & Wheaton, 1997). For the job of a telecom engineer, a psychologist would ascertain the knowledges and skills required for one to be successful in that job. The knowledges and skills would become the skills assessment subtopics. Related groupings of those knowledges and skills are called competencies, and the competencies become the test topics.

The topics and subtopics serve as the content of the test. Notice that the word topic is plural. A skills assessment can have more than one topic, as long as all topics within the assessment are unidimensional. The job analysis focuses strictly on the job of a telecom engineer. Anything that could be shared by another position, or is not shared by all telecom engineers would not be included. Otherwise, the skills assessment would not be generalizable to all telecom engineers or the construct would overlap with another skills assessment and therefore another construct (Whetzel & Wheaton, 1997). The skills assessment can have more than one topic, provided that each topic is equally balanced with test items from each subtopic, and equally balanced levels of difficulty per item per subtopic. This allows the CAT engine to randomly draw from each topic area, while confirming that all forms of the test are parallel (Drasgow & Olson-Buchanan, 1999; Hambleton et al., 1991; Hulin et al., 1983; Lord, 1980).

By including a variety of topics, the psychometrician increases the richness of the online skills assessment. With the computer scoring the test, a test-taker will be evaluated not only on the construct, but on the multiple topic areas as well. When psychometricians talk about the accuracy of assessing a test-taker through CAT, one point they are referring to is the capability of the computer to score each individual topic area, in addition to the test as a whole (Hambleton et al., 1991; Hulin et al., 1983; Lord, 1980). This is especially beneficial if a company needs to hire a telecom engineer with a specific knowledge and skill set in an area such as protocol interoperability, but not in multicast protocols. The level of specificity of the scoring in a CAT test can reveal so much about the depth and type of knowledge a test-taker possesses that a paper-and-pencil test could never do (Hambleton et al., 1991).

Criterion-Related Validity
Criterion-related validity is useful when one wants to predict a behavior from a test (Anastasi & Urbina, 1997; Cascio, 1998; Kerlinger, 1992). Cascio (1998) explains that we would compare two sets of scores, the test score and a criterion, to test the hypothesis that the test score is positively correlated with the criterion in order to predict future performance. Concurrent validity and predictive validity are the two ways to measure criterion-related validity.

Concurrent validity is when we compare a test-taker's score to another currently available criterion, for instance, scores from a performance evaluation. The two scores are compared and from that we can assess one's performance (Anastasi & Urbina, 1997; Cascio, 1998; Kerlinger, 1992). A typical way to perform concurrent validity is to have existing employees take the online skills assessment, and then compare the scores of the assessment to their scores on the performance evaluation. From there we would create a cut-off score that indicates what level of knowledge and skill one must possess for that person to succeed on the job. Future candidates would need to score above the cut-off point to be eligible for the position being offered.
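
The comparison of the two sets of scores is typically summarized with a correlation coefficient. The sketch below computes a Pearson correlation between assessment scores and performance ratings for five incumbents. The data are invented for illustration, and the choice of Pearson's r is an assumption; the paper does not name a specific statistic.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data for five incumbents (not taken from any study).
assessment_scores = [2.1, 2.8, 3.4, 3.9, 4.5]    # online skills assessment
performance_ratings = [2.0, 3.0, 3.0, 4.0, 5.0]  # supervisor evaluation

validity_coefficient = pearson_r(assessment_scores, performance_ratings)
```

A coefficient close to 1 would indicate that incumbents who score well on the assessment also tend to be rated well on the job; with a sample this small the estimate would be far too unstable to set a cut-off score in practice.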

Concurrent validity has some problems. If an incumbent who is taking the skills assessment feels very secure in the job, the incumbent may not try as hard to score well on a test as an applicant would. Also, the incumbent may feel more comfortable in the position and have more years of experience than the applicant, and that may affect test scores. The subjective nature of performance evaluations may not give an accurate estimate of the true performance of the incumbent (Wainer, 2000). If the evaluator scores people too high, too low, or always in the middle, then we cannot accurately gauge the true performance of the employee.

Predictive validity has a time lapse between the two comparison scores. First, an applicant takes the test and is hired using the normal rules of hiring the company employs. Next, once the applicant has been in the job for approximately six months, a performance review is implemented. Finally, one compares the test score to the performance evaluation and a metric is established (Cascio, 1998; Kerlinger, 1992).

There are some issues related to predictive validity. With the time lapse, the company may have undergone some changes and experienced turnover. Not only are there fewer employees, but also the sample size may be too small to provide worthwhile data. The apprehension one experiences when taking a test, especially for a new position, may negatively affect the test scores. Conversely, a confident test-taker may score higher than a nervous test-taker, falsely amplifying the differences in the two scores. Similar to concurrent validity, if the performance evaluation produces an inaccurate measure, then using the scores as a criterion has no meaning when using it as a comparison (Wainer, 2000).

Reliability

Test reliability refers to the accuracy or precision of a measurement tool. There are two ways to look at reliability. One is the stability of the measurement tool over repeated trials, or the replicability of a score, and the other is the measurement tool being free from unsystematic errors, or the internal consistency of the scores (Anastasi & Urbina, 1997; Cascio, 1998; Kerlinger, 1992; Kline, 1986; Wainer, 2000). In both cases, reliability is about obtaining the true score of a test-taker. When a test-taker takes the same test repeatedly, ideally the test score should remain the same. That is the test-taker's true score. However, for various reasons, there is variance in the test-taker's score, and the amount of variance is how we measure the reliability of the test. The accuracy of an assessment tool can be measured several ways. Here we will consider test-retest reliability, alternate form reliability, split-half reliability, and inter-item consistency.
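
The idea of a true score can be written compactly in the notation of classical test theory. The symbols below are standard textbook notation rather than notation used in this paper, and the decomposition assumes that true scores and errors are uncorrelated.

```latex
% Observed score X is the true score T plus random error E.
X = T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2

% Reliability is the share of observed-score variance due to true scores.
r_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Read this way, each reliability method discussed here is an attempt to estimate the same ratio, with the methods differing in which sources of error (time, content sampling, heterogeneity) end up in the error variance.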

Test-Retest Reliability
In test-retest reliability, the same form of the test is given over two different sessions with a time lapse between the two test administrations (Anastasi & Urbina, 1997; Cascio, 1998; Kerlinger, 1992). Error in test-retest is due to temporal stability. The longer the time between testing, the greater the error and the lower the reliability score (Cascio, 1998). On the other hand, if the test-taker studies for the second administration or goes through training, the variance will be inaccurately less, because the conditions are not the same (Wainer, 2000). If the time lapse is shorter, then something else will affect the accuracy of the test-retest reliability and that is the memorization of the test item choices (Kline, 1986).

Alternate Form Reliability
Alternate form reliability is also known as parallel form reliability, because two forms of the same test are administered either with or without a time lapse (Cascio, 1998; Kerlinger, 1992). Whereas test-retest reliability has error variance due to temporal stability, alternate form reliability has error variance due to content sampling, or to content and time if there is a time interval between administrations. Content sampling refers to the subject matter or content of the items in the topic areas. If a telecom engineer is being tested on routing protocols, he may know some aspects of routing protocols better than others. For example, if the topic covers distance vector versus link state and he has only cursory knowledge of link state but a command of distance vector, and one item is focused on distance vector and the other on link state, then his scores will be skewed based on the content covered in that topic area. Cascio (1998) does not recommend using alternate form reliability, because of the expense and difficulty in constructing two parallel forms, and because the correlation between the two assessment forms provides a very conservative reliability estimate.

Split-Half Reliability
Split-half reliability is used when a single form of a test is given in a single administration and then the test is statistically split for the correlation of the two scores (Anastasi & Urbina, 1997; Cascio, 1998; Cohen & Swerdlik, 1999; Murphy & Davidshofer, 2001). This type of reliability is also referred to as internal consistency, given that it is one set of items compared to another set of items from the same test (Cascio, 1998; Cohen & Swerdlik, 1999). Since time is not a factor in split-half reliability, temporal stability is not an issue here; however, content sampling is the contributory cause of error variance.

One consideration for split-half reliability is how to divide the test into two statistically equal halves. Some psychometricians choose to split into even and odd items, so that test anxiety at the beginning or burnout at the end of the test will not contribute to error variance (Anastasi & Urbina, 1997; Murphy & Davidshofer, 2001). Cascio (1998) argues that the method of dividing even and odd items will not actually measure internal consistency, but rather equivalence, which will erroneously inflate the reliability estimate. Another approach is to randomly select items for the split halves (Cascio, 1998; Cohen & Swerdlik, 1999).
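
As an illustration of the odd-even approach, the sketch below scores each half, correlates the two half scores, and then steps the result up to full test length. The Spearman-Brown step-up is the standard correction for the fact that each half is only half as long as the test; its use here is an assumption, as is the small response matrix, which is invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def split_half_reliability(responses):
    """Odd-even split-half reliability with the Spearman-Brown correction.
    responses: one list of 0/1 item scores per test-taker, same item order."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return 2 * r_half / (1 + r_half)  # step up to full test length

# Hypothetical data: 5 test-takers by 6 dichotomously scored items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = split_half_reliability(responses)
```

A random split would only change how `odd_totals` and `even_totals` are formed; the correlation and correction steps stay the same.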

Inter-Item Consistency
Inter-item consistency refers to the correlation of each individual item, rather than all of the items combined, as we see in internal consistency (Cohen & Swerdlik, 1999). Cohen and Swerdlik (1999) explain that inter-item consistency is most valuable in evaluating the homogeneity or unidimensionality of a test. Like internal consistency, inter-item consistency is calculated using one form of a test in one administration. Unlike internal consistency, each item, not each half of the test, is correlated. This provides a more accurate metric of each item and of unidimensionality in general. Since content and unidimensionality are key to a higher reliability score, error variance is attributable to content sampling and content multidimensionality or heterogeneity (Cascio, 1998).

The Kuder-Richardson formula 20 (KR-20) and coefficient alpha are the principal methods of calculating inter-item consistency. KR-20 is the mean of all possible split halves of a test, and it is preferred over the split-half method (Cohen & Swerdlik, 1999). Having the ability to compare at the item level is superior to comparing at the test level when evaluating homogeneity. The more homogeneous the items, the closer the KR-20 and split-half calculations will be. The more heterogeneous the items, the more the split-half estimate will exceed the KR-20 estimate (Anastasi & Urbina, 1997; Cascio, 1998; Cohen & Swerdlik, 1999). KR-20 is ideal for dichotomously scored tests, like our multiple-choice online skills assessment (Cascio, 1998; Cohen & Swerdlik, 1999).
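
For readers who want to see the calculation, the following is a minimal sketch of KR-20 for a matrix of dichotomous (0/1) responses. It uses the textbook form of the formula with the population variance of total scores; the response matrix is invented, and a production implementation would also need to handle missing responses and zero total-score variance.

```python
def kr20(responses):
    """KR-20 for dichotomously scored items.
    responses: one list of 0/1 item scores per test-taker, same item order."""
    n = len(responses)     # number of test-takers
    k = len(responses[0])  # number of items

    # Variance of total scores (population form, to match p * q below).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    # Sum over items of p * q, where p is the proportion answering
    # the item correctly and q = 1 - p.
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 5 test-takers by 6 items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = kr20(responses)
```

Because the only inputs are item-level pass rates and the spread of total scores, the statistic can be recomputed cheaply each time new response data arrive.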

The only difference between KR-20 and coefficient alpha is in the formula. In KR-20, we multiply, for each item, the proportion of test-takers who pass the item by the proportion who do not, and then sum these products across all of the items. In coefficient alpha, that piece is replaced by the sum of the variances of the item scores. In other words, we calculate the variance of the test-takers' scores on each item, and then we add these variances across all of the items (Cascio, 1998; Murphy & Davidshofer, 2001). This formula change allows coefficient alpha to be used with both dichotomous and continuous items. Cohen and Swerdlik (1999) define coefficient alpha as the mean of all possible split halves, just like the KR-20, but with the correlations corrected using the Spearman-Brown formula.
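
Written out in standard textbook notation (the symbols are not taken from this paper), the two formulas differ in a single term. Here k is the number of items, p_i is the proportion of test-takers passing item i, q_i = 1 - p_i, and the sigma terms are the variance of item i and of the total score X.

```latex
\mathrm{KR\text{-}20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)
\qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)

% For an item scored 0 or 1, the item variance is exactly p_i q_i,
% so alpha reduces to KR-20 on dichotomous data.
\sigma_i^2 = p_i q_i \quad \text{when item } i \text{ is scored } 0/1
```

For a dichotomously scored multiple-choice assessment the two coefficients therefore give the same value; they diverge only in that alpha can also be applied to items with more than two score levels.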

Cohen and Swerdlik (1999) recommend using coefficient alpha over all other measures of error variance. Cascio (1998) suggests using caution with coefficient alpha, because it is sensitive to the number of items (the more items, the greater the reliability), item intercorrelations, and dimensionality. Cortina, as cited in Cascio (1998), advises the psychometrician to establish the unidimensionality of the test, presumably through factor analysis, before incorporating coefficient alpha as a metric of reliability.
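
The sensitivity to test length and item intercorrelations is easiest to see in the standardized form of coefficient alpha, which depends only on the number of items k and the mean inter-item correlation. This form is a standard result and an assumption here; the paper does not present it, and the numbers below are purely illustrative.

```latex
\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

% Illustration with a modest mean inter-item correlation of 0.2:
% k = 10: \alpha_{\text{std}} = 2 / 2.8 \approx 0.71
% k = 50: \alpha_{\text{std}} = 10 / 10.8 \approx 0.93
```

The same weakly correlated items look far more reliable simply because there are more of them, which is one reason a high alpha alone does not demonstrate unidimensionality.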

Building a Solid Online Skills Assessment

Now that we have a basic understanding of validity and reliability, it is time to bring the information together to create a well respected and effective online skills assessment using CAT. Unidimensionality is a common theme running through validity and reliability, so our first step is to demonstrate that our test has one construct. Performing a factor analysis will allow us to see mathematically the homogeneity of our test items (Sands et al., 1997). It is important to have an objective method to assess the sampling dimensionality, because when we do validation studies, the information will be much more subjective (Anastasi & Urbina, 1997). Once we establish what items belong to a specific assessment, then we can move on to reviewing our assessment.
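
A full factor analysis is normally run in a statistics package, but a rough first look at dimensionality can be sketched in a few lines. The code below is not a factor analysis: it is a principal-component style check, assumed here as a stand-in, that reports what share of the total variance in the inter-item correlation matrix is carried by the largest eigenvalue. The data are invented, and the code assumes every item has some variance (no item that everyone passes or everyone fails).

```python
import math

def correlation_matrix(responses):
    """Inter-item Pearson correlation matrix for a test-taker by item table."""
    n = len(responses)
    k = len(responses[0])
    cols = [[row[i] for row in responses] for i in range(k)]
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
           for c, m in zip(cols, means)]
    return [[sum((cols[i][t] - means[i]) * (cols[j][t] - means[j])
                 for t in range(n)) / (n * sds[i] * sds[j])
             for j in range(k)] for i in range(k)]

def largest_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric matrix by power iteration."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k  # unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue

def first_component_share(responses):
    """Share of total variance carried by the first principal component."""
    r = correlation_matrix(responses)
    return largest_eigenvalue(r) / len(r)

# Hypothetical data: 5 test-takers by 6 items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
share = first_component_share(responses)
```

A share near 1 means a single dimension dominates, while a share near 1/k means the items have little in common. With real data the sample would need to be far larger than this toy table, and the result should be confirmed with a proper factor analysis.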

Several subject matter experts would review each item of the test to evaluate the content. Necessary revisions would be made based on the subject matter expert feedback. The factor analysis and content review would take place during a pre-test or beta test period. We have to conduct a beta test with actual test-takers, so that the feedback and statistics reflect what would be seen in the actual test. Having a large population to work with is important, because the sample size for the factor analysis must be at least ten times the number of items (Kline, 1986). From the front end, the intended users of the test take the beta test online as if it were the actual test and evaluate the content of the items. From the back end, the psychometricians evaluate the results of the factor analysis and the reliability of the test items and the test as a whole.

In discussing reliability, many options were presented. With the online skills assessment being administered on a CAT engine, it appears that using KR-20 would be most useful. Since the online skills assessment is delivered via computer, we have the ability to continually monitor the test items. As more test-takers complete the online skills assessment, the reliability statistics will continue to accumulate. As the sample size increases, so should our reliability. Also, we will be able to evaluate the items on an ongoing basis. This may mean that more items will be eliminated if their reliability is poor, which will allow the test to continue to improve.

As we remove items, our item pool decreases. KR-20 can handle the decrease in the item pool better than coefficient alpha (Cascio, 1998). We want to have stringent standards, which is why we selected KR-20 over split-half, but we do not want to jeopardize our test with an overly sensitive reliability measure. Kerlinger (1992) discusses the principle behind improving reliability. He calls it the "maxmincon principle" and defines it by saying that we must strive to "maximize the variance or the individual differences and minimize the error variance." CAT permits us to accurately assess each test-taker, who will have a pinpointed score based on his knowledges and skills, thereby maximizing the individual differences. The error variance will be assessed on an ongoing basis, which will allow us to make continual improvements to our test items. This will result in minimizing the error variance.

Once our beta test has provided us with the necessary information, it is time to post the online skills assessment as the finished product. We will have made all of the changes that were brought to our attention during beta testing, and the final product will be ready to assess the knowledges and skills of the test-takers. As previously mentioned, the CAT format allows us to make constant improvements to the assessment, even as it continues to operate.

The factor analysis provided evidence of construct validity, and the pre-beta reviewers and the aforementioned job analysis provided evidence of content validity. Employers need to conduct their own validation studies to protect themselves against adverse impact as well as to interpret the scores validly (Whetzel & Wheaton, 1997). Employers can examine criterion-related validity by correlating the test scores with a criterion measure. Whetzel & Wheaton (1997) suggest using scores from a behaviorally anchored rating scale from a performance review as the criterion, because such scores are numeric and would provide a somewhat objective metric for evaluating criterion-related validity.
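The correlation step described above can be sketched as follows. This is a minimal example, assuming an employer has paired each employee's assessment score with a numeric performance rating; the Pearson coefficient is used here as one common choice, and the function name `pearson_r` and all values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 values")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the variables has zero variance")

    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data: assessment scores and performance ratings.
test_scores = [2.1, 3.4, 2.8, 4.0, 3.1]
ratings = [3, 4, 3, 5, 4]
print(round(pearson_r(test_scores, ratings), 2))
```

A coefficient computed this way is only as meaningful as the sample and the criterion behind it, which is why each employer's own validation study matters.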

Conclusion

CAT has opened up an entirely new way of measuring the knowledges and skills of test-takers, one that traditional paper-and-pencil tests could never offer. Not only do validity and reliability measures not suffer with CAT, they often improve. The advantages of CAT include reduced test administration, fewer items exposed, increased security, increased assessment accuracy, immediate scoring, and continual statistical monitoring. CAT has changed the way psychometricians conduct assessments. Now we can take a skills assessment and not only computerize it and make it adaptive, but also deliver it online via the Internet. This eliminates the need for centralized testing centers and test administrators, which is an enormous financial savings. Test-takers can take an online assessment at any time from any computer that has Internet access. As more psychometricians focus on CAT and its capabilities, we are sure to see substantial improvements in the quality of testing instruments in the years to come.


References

Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice-Hall.
Cascio, W. F. (1998). Applied psychology in human resource management (5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Cohen, R. J., & Swerdlik, M. E. (1999). Psychological testing and assessment: An introduction to tests and measurements (4th ed.). Mountain View, CA: Mayfield.
Drasgow, F., & Olson-Buchanan, J. B. (1999). Innovations in computerized assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Frederico, P. A. (1992). Assessing semantic knowledge using computer-based and paper-based media. Computers in Human Behavior, 8, 169-181.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Kerlinger, F. N. (1992). Foundations of behavioral research (3rd ed.). Fort Worth, TX: Harcourt Brace.
Kline, P. (1986). A handbook of test construction: Introduction to psychometric design. London: Methuen.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Murphy, K. R., & Davidshofer, C. O. (2001). Psychological testing: Principles and applications (5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Osterlind, S. J. (1989). Constructing test items. Boston: Kluwer Academic.
Overton, R. C., Harms, H. J., Taylor, L. R., & Zickar, M. J. (1997). Adapting to adaptive testing. Personnel Psychology, 50, 171-185.
Pinsoneault, T. B. (1996). Equivalency of computer-assisted and paper-and-pencil administered versions of the Minnesota Multiphasic Personality Inventory-2. Computers in Human Behavior, 12 (2), 291-300.
Roper, B. L., Ben-Porath, Y. S., & Butcher, J. N. (1995). Comparability and validity of computerized adaptive testing with the MMPI-2. Journal of Personality Assessment, 65 (2), 358-371.
Sands, W. A., Waters, B. K., & McBride, J. R. (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
Wainer, H. (2000). Computer adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Whetzel, D. L., & Wheaton, G. R. (1997). Applied measurement methods in industrial psychology. Palo Alto, CA: Davies-Black.

 
