Psychometric Assessment – A Brief Discussion

Psychometric tests are standardized, well-established methods of predicting how a person is likely to behave in different situations. In the context of a job application, such tests can ascertain a person's personality traits, aptitudes, and ability to handle the demands of the job. By using scientific methods and statistical techniques, assessments can be kept simple for the candidate while still producing valid, reliable, and precise results. This combination can improve the quality of the recruitment and selection process.

Psychometric Assessment Properties

There is a multitude of assessments to choose from when it comes to employee selection and development, but choosing the correct set of assessments is key. From a psychometric standpoint, an assessment must fulfill two essential requirements: validity and reliability. Identifying whether the chosen assessment measures what it is supposed to measure, and whether its measurements are accurate, is half the battle. Apart from these two principal properties, an assessment can also be normed against aspects such as success in a role, gender, age, education, employability, and occupation.

Validity versus Reliability

Validity

The validity of a psychometric test, or any test for that matter, lies in the test's ability to measure what it is designed to measure. A highly valid test consistently yields results that are closely aligned with the intended goal of the test. A simple example is a weighing machine. If a person weighs the same object from time to time and the scale reports the same number, the weighing machine is considered reliable; however, the result is only considered valid if the reported weight is accurate.
When it comes to the validity of psychometric tests, there are three major validity types:

  • Predictive Validity
  • Convergent Validity
  • Construct Validity

Predictive Validity

This validity measures a tool's precision at predicting specific outcomes in certain scenarios. Experts consider predictive validity the most useful form of validity for a psychometric assessment. A predictively valid assessment stands a better chance of predicting an individual's performance in a specific role.

Prediction from Test | Actual Performance | Predictive Validity
Recommend            | High               | ✓
Not recommend        | Low                | ✓
Recommend            | Low                | ✗
Not recommend        | High               | ✗

For example, if a college entrance test is assessed for validity, a correlation between students' entrance test scores and their undergraduate scores is drawn. If the students who scored high on the entrance test also scored high in their undergraduate programs, then the test is predictively valid. If a student scores well on the entrance test but does not do well in the course, then the test is predictively invalid.
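
As a rough sketch of this correlational check, assuming a small set of invented scores (the numbers below are illustrative, not real data), the correlation can be computed with NumPy:

    import numpy as np

    # Invented scores for eight students: entrance test vs. later undergraduate GPA.
    entrance_scores = [72, 85, 60, 90, 78, 66, 88, 70]
    undergrad_gpas = [3.0, 3.6, 2.5, 3.8, 3.2, 2.8, 3.7, 2.9]

    # Pearson correlation between the two score sets; a strong positive
    # value supports the test's predictive validity.
    r = np.corrcoef(entrance_scores, undergrad_gpas)[0, 1]
    print(round(r, 2))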

Convergent Validity

The usefulness of convergent validity is second only to predictive validity. This validity measurement works by drawing correlations between multiple tests that measure the same or similar constructs and traits. The evidence gathered from all the sources converges toward a consistent pattern, hence the name.

Test Prediction | Another Valid Test Prediction | Convergent Validity
Recommended     | Recommended                   | ✓
Not recommended | Not recommended               | ✓
Recommended     | Not recommended               | ✗
Not recommended | Recommended                   | ✗

Construct Validity

Construct validity is typically examined after predictive and convergent validity measurements are complete. This methodology is concerned with the interpretation of test scores, ensuring that related observational and theoretical terms are considered. It also establishes whether the attribute the assessment claims to measure actually exists.

Test Prediction | Theory Prediction | Construct Validity
Recommended     | Recommended       | ✓
Not recommended | Not recommended   | ✓
Recommended     | Not recommended   | ✗
Not recommended | Recommended       | ✗

Construct validity measurement is crucial if you are trying to establish a new measure, such as a measure of a cognitive skill. By ensuring that the findings of the theory match the predictions made by the test, we can establish construct validity. Other correlational methods, as well as factor analysis of personality inventories, are often used to verify the construct validity of an assessment.
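
As a rough illustration of the factor-analysis approach, the sketch below fits a two-factor model with scikit-learn; the simulated item responses and the two-trait structure are assumptions made purely for demonstration:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)

    # Simulate 200 respondents answering 6 items, where items 0-2 are
    # driven by one latent trait and items 3-5 by another.
    trait_a = rng.normal(size=(200, 1))
    trait_b = rng.normal(size=(200, 1))
    noise = rng.normal(scale=0.5, size=(200, 6))
    responses = np.hstack([trait_a.repeat(3, axis=1),
                           trait_b.repeat(3, axis=1)]) + noise

    fa = FactorAnalysis(n_components=2, random_state=0)
    fa.fit(responses)

    # If the assessment has construct validity, items designed to tap the
    # same trait should load heavily on the same factor.
    print(np.round(fa.components_.T, 2))  # rows = items, columns = factors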

Reliability

The reliability of an assessment is defined by the consistency of test scores when the same test is repeatedly administered to the same subject under identical test conditions. But since such ideal testing conditions are impossible to create, the reliability of an assessment is expressed as a coefficient ranging from 0.00 to 1.00, such as Cronbach's Alpha. A test with perfect reliability would have a coefficient of 1.00, while a perfectly unreliable test would have a coefficient of 0.00.

The reliability of a test is usually measured via Test-Retest Reliability and Internal Consistency.

Internal Consistency

Within a test, several items may be designed to measure the same or similar factors, information, or other objectives. Tests are designed this way to ensure higher levels of accuracy. As an example, if you visit a fine dining restaurant, you might be asked to fill out a customer satisfaction form that rates your overall satisfaction on a scale ranging from 'strongly disagree' to 'strongly agree' (i.e., a Likert scale). But it might also have questions asking how likely you are to recommend the restaurant, whether you liked the ambiance, how the food was, and so on.

If this customer satisfaction survey has the same or similar answers to all questions (Agree, for example) and if that rating matches the rating given for overall satisfaction, then the survey is internally consistent.
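
Internal consistency is commonly summarized with Cronbach's Alpha. Below is a minimal sketch of the standard formula, applied to a made-up respondents-by-items matrix of Likert ratings:

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """Cronbach's Alpha for a (n_respondents, n_items) score matrix."""
        k = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)
        total_variance = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Five diners rating four related satisfaction items on a 1-5 scale.
    scores = np.array([
        [4, 4, 5, 4],
        [2, 3, 2, 2],
        [5, 5, 5, 4],
        [3, 3, 4, 3],
        [4, 5, 4, 4],
    ])
    print(round(cronbach_alpha(scores), 2))  # closer to 1.00 = more consistent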

Test-Retest Reliability

This is a temporal method of measuring the reliability of a test. When the same test is administered to the same subject multiple times, and the person approaches the test the same way each time, there should not be drastic differences between the scores obtained on each attempt. The scores from the first administration to the last are correlated using Pearson's correlation coefficient. For example, to measure the reliability of an IQ test, the same person should be tested over several months using the same test. As a person's IQ does not change dramatically over a short period, we should expect nearly the same test score each time; otherwise, the test should be deemed unreliable.
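
A minimal sketch of this check, assuming made-up IQ scores for the same ten people tested twice, several months apart:

    from scipy.stats import pearsonr

    first_sitting = [102, 98, 115, 88, 130, 95, 110, 121, 100, 105]
    second_sitting = [100, 99, 117, 85, 128, 97, 108, 119, 103, 104]

    # A correlation near 1.00 suggests the test is reliable over time.
    r, _ = pearsonr(first_sitting, second_sitting)
    print(round(r, 2))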

Standardizations or Norms

Psychometric tests are standardized against the scores of different test-taker groups for comparison, instead of focusing on the performance or score of a single individual. These norms allow us to compare an individual's performance against that of a representative sample. For the comparison to be meaningful, the representative sample must consist of people who are successful in the role the individual is being considered for. For example, when a test is developed for salespeople, the representative sample must consist only of successful salespeople. As we create samples of different test-taker groups, we can see that the performance and scores of one group vary significantly from another. Standardization allows us to:

  1. Make the assessment more valuable for selection processes
  2. Compare an individual's performance or ability level against a similar pool of candidates
  3. Generate a relative performance report against the representative group
  4. Interpret the test results better

Following are some of the parameters used in norming with a representative sample (a short norming sketch follows the list):

  • Success in the role
  • Relevant work experience
  • Current employment status
  • Area or domain of occupation
  • Educational background
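
Here is a minimal norming sketch, assuming an invented norm group of successful salespeople; the z-score and percentile place a candidate's raw score relative to that group (the percentile conversion assumes roughly normal scores):

    import statistics
    from math import erf, sqrt

    norm_group = [52, 61, 58, 70, 66, 55, 63, 59, 68, 60]  # successful salespeople
    candidate_score = 67

    mean = statistics.mean(norm_group)
    sd = statistics.stdev(norm_group)

    z = (candidate_score - mean) / sd
    percentile = 0.5 * (1 + erf(z / sqrt(2)))  # normal-approximation percentile

    print(f"z-score: {z:.2f}, percentile: {percentile:.0%}")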

Candidates Cheating on Psychometric Tests

Some candidates try to cheat on psychometric tests, which is why modern tests factor in all the response styles that participants may use. The better tests expect cheating behavior and account for it while interpreting results, since cheating can itself indicate a form of competence.

To interpret test results correctly, we must understand the difference between response bias and response style. Response bias is the propensity of a participant to answer questions based on factors external to the test's designated content. Response style is a conscious or unconscious bias in how a participant responds to test questions while still staying within the test content.

To interpret test results correctly and weed out faked, careless, or distorted answers, there are a few ways to identify and prioritize genuine candidates over the others. These methods are based on the following response styles (a simple screening sketch follows them):

Extreme Response Style

Certain participants tend to answer each question to the extreme no matter what their actual feelings are. On a Likert scale test, they might answer all questions with strongly agree or strongly disagree.

Social Desirability Response Style

These candidates try to depict themselves on the right side of every social issue or socially relatable scenario. Hence, when administered a test, they might respond 'most like me' to a majority of socially relatable questions. Such a candidate may be cautioned to answer candidly.

Careless Response Style

Such a candidate does not evaluate the question or the answers properly and tends to provide the same answer to all questions. Hence these test results are not useful.

Central Tendency Response Style

These candidates are overly cautious and tend to choose the center or middle answer for every question. This could be due to a desire to play it safe or to being unsure of the answers. While they are honest during the test, the results are not useful.

Genuine Response Style

Test results that do not fall into any of the above four categories are considered genuine, and the participant is considered a trustworthy candidate.
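
As a simple screening sketch, the rules below flag the styles that can be read directly from an answer pattern on a 1-5 Likert scale; the cutoffs are illustrative assumptions, not established psychometric standards, and social desirability generally requires dedicated items to detect:

    def flag_response_style(answers: list[int]) -> str:
        # Overlapping styles are resolved in this order; each rule mirrors
        # one of the response styles described above.
        if all(a in (1, 5) for a in answers):
            return "extreme response style"
        if all(a == 3 for a in answers):
            return "central tendency response style"
        if len(set(answers)) == 1:
            return "careless response style (identical answer throughout)"
        return "genuine"

    print(flag_response_style([5, 1, 5, 5, 1, 5, 1, 5]))  # extreme response style
    print(flag_response_style([3, 3, 3, 3, 3, 3, 3, 3]))  # central tendency response style
    print(flag_response_style([4, 2, 3, 4, 5, 2, 3, 4]))  # genuine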

Scaling Methods

Apart from the response styles, the scaling methods of tests also play a role in selecting the appropriate psychometric test. Some of the popular scales are:

The Likert Scale

This is the scale one often encounters when providing feedback to service providers, restaurants, and organizational surveys. It is a five-point scale running from extreme negative to extreme positive; the verbiage on the scale changes depending on the test. It is used to find out a participant's level of agreement with a given item. This scale is prone to social desirability and central tendency errors.

Semantic Scale

This scale provides two positive yet contrasting responses with a neutral option. For example, one may be asked how they want to spend their holidays at a particular resort. The options could be to go out to the beach or spend time at the spa. Both are valid choices with a neutral option in the middle. This test is prone to social desirability, careless responding, and extreme response styles.

Forced Choice

This type of scaling does not allow the participant to select the same answer more than once during the test. This reduces the chances of extreme responding, central tendency, social desirability, and careless responding.
