Dr Anna Brown

Reader in Psychological Methods and Statistics
PGR Progression Monitoring lead


Dr Anna Brown is a psychometrician with an established reputation and extensive industry experience. She currently teaches psychological methods and conducts research in psychometrics at the School of Psychology. Previously, Anna taught short courses in applied psychometrics at the University of Cambridge, where she also conducted research on modelling response biases in questionnaire data. Anna's industry experience includes research and test development at the research division of the UK's largest occupational test publisher, SHL Group, where she worked as Principal Research Statistician for many years.

Anna holds an MSc degree in Mathematics with distinction and a PhD in Psychology with distinction. Anna's PhD research led to the development of the Thurstonian IRT model, described as a breakthrough in the scoring of forced-choice questionnaires, and received the 'Best Dissertation' award from the Psychometric Society. Applications of this methodology include the development of an IRT-scored version of the Occupational Personality Questionnaire (OPQ32r).

Research interests

Anna’s research focuses on psychological measurement and psychometric testing, particularly issues in test validity and test fairness. She uses latent variable models including Multidimensional Item Response Theory (MIRT) to model responses to typical performance tests including ipsative questionnaires, and to model response biases in self-report measures and in feedback reports to individuals and organisations.

Key publications

  • Brown, A. & Maydeu-Olivares, A. (2018). Ordinal Factor Analysis of Graded-Preference Questionnaire Data. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 516-529. DOI: 10.1080/10705511.2017.1392247
  • Brown, A. (2016). Item Response Models for Forced-Choice Questionnaires: A Common Framework. Psychometrika, 81(1), 135–160. DOI: 10.1007/s11336-014-9434-9
  • Brown, A. & Maydeu-Olivares, A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1), 36-52. DOI: 10.1037/a0030641
  • Brown, A. & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460-502. DOI: 10.1177/0013164410375112


Past research students

  • Dr Yin Lin, PhD Psychology (2020), ESRC-funded: Asking the right questions: Increasing fairness and accuracy of personality assessments with Computerised Adaptive Testing
  • Dr Ana Carla Crispim, PhD Psychology (2018), self-funded: Exploring the validity evidence of core affect

Anna welcomes contact from potential Doctoral students interested in modern psychometric modelling (structural equation modelling, questionnaire design, latent trait modelling, and similar).


Anna is happy to supervise final year and MSc projects related to:

  1. Faking and impression management in high-stakes assessments, for example: situational and personal characteristics linked to applicant ‘faking good’ on employment tests; or patient ‘faking bad’ on diagnostic tests for access to treatments; etc.
  2. Unmotivated response biases and their impact on test validity, for example: acquiescence, leniency/severity, halo/horn effects, etc.
  3. Measurement of individual differences, for example: factorial structure of personality constructs, equivalence of measurement models across groups, etc.


Anna is an elected member of the Council of the International Test Commission, chairing its Research and Guidelines Committee. She has served as a reviewer for many journals in the area of psychological measurement, and is a member of the Editorial Board of the International Journal of Testing and of the advisory council of the journal Advances in Methods and Practices in Psychological Science (AMPPS).

Grants and Awards

2015 Alzheimer's Society
"C-DEMQOL – Measurement of quality of life in family carers of people with dementia: development of a new instrument for evaluation"
2014 Department of Health
"Systematic Review of the Psychometric Properties of ASQ (ASQ-SE)"
2014 ESRC CASE Studentship
"Asking the right questions: Increasing fairness and accuracy of personality assessments with Computerised Adaptive Testing"
2011 The Psychometric Society Dissertation Award (best dissertation)
2010-2011 The Isaac Newton Trust grant
"Modern Psychometrics: theoretical and empirical contributions using item response models" (over two financial years)
2010 Dissertation support award from the Society of Multivariate Experimental Psychology, $1,000


Showing 50 of 52 total publications in the Kent Academic Repository.


  • Budgett, J., Brown, A., Daley, S., Page, T., Banerjee, S., Livingston, G., & Sommerlad, A. (2019). The social functioning in dementia scale (SF-DEM): Exploratory factor analysis and psychometric properties in mild, moderate and severe dementia. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 11, 45-52. doi:10.1016/j.dadm.2018.11.001
    Introduction: The psychometric properties of the social functioning in dementia scale (SF-DEM) over different dementia severities are unknown.
    Methods: We interviewed 299 family carers of people with mild, moderate or severe dementia from two UK research sites; examined acceptability (completion rates); conducted exploratory factor analysis and tested each factor’s internal consistency and construct validity.
    Results: 285/299 (95.3%) carers completed questionnaires. Factor analysis indicated three distinct factors with acceptable internal consistency: spending time with other people, correlating with overall social function (r=0.56, p<0.001) and activities of daily living (ADLs) (r=-0.48, p<0.001); communicating with other people correlating with ADLs (r=-0.66, p<0.001); and sensitivity to other people correlating with quality of life (r=0.35, p<0.001) and inversely with neuropsychiatric symptoms (r=-0.45, p<0.001). The three factors’ correlations with other domains were similar across all dementia severities.
    Discussion: The SF-DEM carer version measures three social functioning domains and has satisfactory psychometric properties in all severities of dementia.
  • Brown, A., & Fong, S. (2019). How valid are 11-plus tests? Evidence from Kent. British Educational Research Journal, 45, 1235-1254. doi:10.1002/berj.3560
    Despite profound influence of selection-by-ability on children’s educational opportunities, empirical evidence for validity of 11-plus tests is scarce. This study focused on secondary selection in Kent, the largest grammar school area in England. We analysed scores from the ‘Kent Test’ (the 11-plus test used in Kent), Cognitive Assessment Tests (CAT4), and Key Stage 2 Standardised Assessment Tests (KS2) using longitudinal data of two year cohorts (N1=95, N2=99) from one primary school. All the assessment batteries provided highly overlapping information, with the decisive effect of content area (e.g. verbal versus maths) over task type (e.g. knowledge-loaded versus knowledge-free). Thus, the value in differentiating ‘pure’ (i.e. knowledge-free) ability in 11-plus testing is questionable. KS2 and Kent Test aggregated scores overlapped very strongly, sharing nearly 80% of variance; moreover, KS2-based eligibility decisions had higher sensitivity than the Kent Test in predicting the actual admissions to grammar schools after Head Teacher Assessment (HTA) appeals have taken place. Finally, the use of multiple pass marks for each Kent Test component as well as the total score was found to increase the chance of false rejection. This study provides preliminary evidence that national examinations could be a good basis for selection to grammar schools; it challenges the use of complex admission rules and multiple decisions and questions the value of 11-plus tests.
  • Brown, A., Page, T., Daley, S., Farina, N., Bassett, T., Livingston, G., Budgett, J., Gallaher, L., Feeney, I., Murray, J., Bowling, A., Knapp, M., & Banerjee, S. (2019). Measuring the quality of life of family carers of people with dementia: Development and validation of C-DEMQOL. Quality of Life Research, 28, 2299-2310. doi:10.1007/s11136-019-02186-w
    Purpose. We aimed to address gaps identified in the evidence base and instruments available to measure the quality of life (QOL) of family carers of people with dementia, and develop a new brief, reliable, condition-specific instrument.
    Methods. We generated measurable domains and indicators of carer QOL from systematic literature reviews and qualitative interviews with 32 family carers and 9 support staff, and two focus groups with 6 carers and 5 staff. Statements with five tailored response options, presenting variation on the QOL continuum, were piloted (n = 25), pre-tested (n = 122) and field-tested (n = 300) in individual interviews with family carers from North London and Sussex. The best 30 questions formed the C-DEMQOL questionnaire, which was evaluated for usability, face and construct validity, reliability, and convergent/discriminant validity using a range of validation measures.
    Results. C-DEMQOL was received positively by the carers. Factor analysis confirmed that C-DEMQOL sum scores are reliable in measuring overall QOL (omega = 0.97) and its five subdomains: ‘meeting personal needs’ (omega = 0.95); ‘carer wellbeing’ (omega = 0.91); ‘carer-patient relationship’ (omega = 0.82); ‘confidence in the future’ (omega = 0.90), and ‘feeling supported’ (omega = 0.85). The overall QOL and domain scores show the expected pattern of convergent and discriminant relationships with established measures of carer mental health, activities, and dementia severity and symptoms.
    Conclusions. The robust psychometric properties support the use of C-DEMQOL in evaluation of overall and domain-specific carer QOL; replications in independent samples and studies of responsiveness would be of value.
  • Daley, S., Murray, J., Farina, N., Page, T., Brown, A., Bassett, T., Livingston, G., Bowling, A., Knapp, M., & Banerjee, S. (2018). Understanding the quality of life of family carers of people with dementia: Development of a new conceptual framework. International Journal of Geriatric Psychiatry, 1-8. doi:10.1002/gps.4990
    Dementia is a major global health and social care challenge, and family carers are a vital determinant of positive outcomes for people with dementia. This study’s aim was to develop a conceptual framework for the Quality of Life (QOL) of family carers of people with dementia.
    We studied family carers of people with dementia and staff working in dementia services iteratively using in-depth individual qualitative interviews and focus group discussions. Analysis used constant comparison techniques underpinned by a collaborative approach with a study-specific advisory group of family carers.
    We completed 41 individual interviews with 32 family carers and 9 staff and two focus groups with 6 family carers and 5 staff. From the analysis, we identified 12 themes that influenced carer QOL. These were organised into three categories focussing on: person with dementia, carer, and external environment.
    For carers of people with dementia, the QOL construct was found to include condition-specific domains which are not routinely considered in generic assessment of QOL. This has implications for researchers, policy makers and service providers in addressing and measuring QOL in family carers of people with dementia.
  • Guenole, N., Brown, A., & Cooper, A. (2018). Forced Choice Assessment of Work Related Maladaptive Personality Traits: Preliminary Evidence from an Application of Thurstonian Item Response Modeling. Assessment, 25, 513-526. doi:10.1177/1073191116641181
    This article describes an investigation of whether Thurstonian item response modeling is a viable method for assessment of maladaptive traits. Forced-choice responses from 420 working adults to a broad-range personality inventory assessing six maladaptive traits were considered. The Thurstonian item response model’s fit to the forced-choice data was adequate, while the fit of a counterpart item response model to responses to the same items but arranged in a single-stimulus design was poor. Mono-trait hetero-method correlations indicated corresponding traits in the two formats overlapped substantially, although they did not measure equivalent constructs. A better goodness of fit and higher factor loadings for the Thurstonian item response model, coupled with a clearer conceptual alignment to the theoretical trait definitions, suggested that the single-stimulus item responses were influenced by biases that the independent clusters measurement model did not account for. Researchers may wish to consider forced-choice designs and appropriate item response modeling techniques such as Thurstonian item response modeling for personality questionnaire applications in industrial psychology, especially when assessing maladaptive traits. We recommend further investigation of this approach in actual selection situations and with different assessment instruments.
  • Brown, A., & Maydeu-Olivares, A. (2018). Ordinal Factor Analysis of Graded-Preference Questionnaire Data. Structural Equation Modeling: A Multidisciplinary Journal, 25, 516-529. doi:10.1080/10705511.2017.1392247
    We introduce a new comparative response format, suitable for assessing personality and similar constructs. In this “graded-block” format, items measuring different constructs are first organized in blocks of 2 or more; then, pairs are formed from items within blocks. The pairs are presented one at a time, to enable respondents expressing the extent of preference for one item or the other using several graded categories. We model such data using confirmatory factor analysis (CFA) for ordinal outcomes. We derive Fisher information matrices for the graded pairs, and supply R code to enable computation of standard errors of trait scores. An empirical example illustrates the approach in low-stakes personality assessments and shows that similar results are obtained when using graded blocks of size 3 and a standard Likert format. However, graded-block designs may be superior when insufficient differentiation between items is expected (due to acquiescence, halo or social desirability).
  • Wetzel, E., Brown, A., Hill, P., Chung, J., Robins, R., & Roberts, B. (2017). The narcissism epidemic is dead; long live the narcissism epidemic. Psychological Science, 28, 1833-1847. doi:10.1177/0956797617724208
    Are recent cohorts of college students more narcissistic than their predecessors? To address debates about the so-called “narcissism epidemic,” we used data from three cohorts of students (N1990s = 1,166; N2000s = 33,647; N2010s = 25,412) to test whether narcissism levels (overall and specific facets) have increased across generations. We also tested whether our measure, the Narcissistic Personality Inventory (NPI), showed measurement equivalence across the three cohorts, a critical analysis that had been overlooked in prior research. We found that several NPI items were not equivalent across cohorts. Models accounting for nonequivalence of these items indicated a small decline in overall narcissism levels from the 1990s to the 2010s (d = –0.27). At the facet-level, leadership (d = –0.20), vanity (d = –0.16), and entitlement (d = –0.28) all showed decreases. Our results contradict the claim that recent cohorts of college students are more narcissistic than earlier generations of college students.
  • Page, T., Farina, N., Brown, A., Daley, S., Bowling, A., Bassett, T., Livingston, G., Knapp, M., Murray, J., & Banerjee, S. (2017). Instruments Measuring the Disease-Specific Quality of Life of Family Carers of People with Neurodegenerative Diseases: A Systematic Review. BMJ Open, 7. doi:10.1136/bmjopen-2016-013611
    Objective: Neurodegenerative diseases, such as dementia, have a profound impact on those with the conditions and their family carers. Consequently, the accurate measurement of family carers’ quality of life (QOL) is important. Generic measures may miss key elements of the impact of these conditions so using disease-specific instruments has been advocated. This systematic review aimed to identify and examine the psychometric properties of disease-specific outcome measures of QOL of family carers of people with neurodegenerative diseases (Alzheimer’s disease, other dementias; Huntington’s disease; Parkinson’s disease; Multiple Sclerosis; and Motor Neurone Disease).
    Design: Systematic review.
    Methods: Instruments were identified using five electronic databases (PubMed, PsycINFO, Web of Science, Scopus and IBSS) and lateral search techniques. Only studies which reported the development and/or validation of a disease-specific measure for adult family carers, and which were written in English, were eligible for inclusion. The methodological quality of the included studies was evaluated using the COSMIN checklist. The psychometric properties of each instrument were examined.
    Results: Six hundred and seventy six articles were identified. Following screening and lateral searches, a total of eight articles were included; these reported seven disease-specific carer QOL measures. Limited evidence was available for the psychometric properties of the seven instruments. Psychometric analyses were mainly focused on internal consistency, reliability and construct validity. None of the measures assessed either criterion validity or responsiveness to change.
    Conclusions: There are very few measures of carer QOL that are specific to particular neurodegenerative diseases. The findings of this review emphasise the importance of developing and validating psychometrically robust disease-specific measures of carer QOL.
  • Farina, N., Page, T., Daley, S., Brown, A., Bowling, A., Bassett, T., Livingston, G., Knapp, M., Murray, J., & Banerjee, S. (2017). Factors associated with the quality of life of family carers of people with dementia: A systematic review. Alzheimer’s & Dementia, 13, 572-581. doi:10.1016/j.jalz.2016.12.010
    INTRODUCTION: Family carers of people with dementia are their most important support in practical, personal and economic terms. Carers are vital to maintaining the quality of life (QOL) of people with dementia. This review aims to identify factors related to the QOL of family carers of people with dementia.
    METHODS: Research databases were searched on terms including ‘carers’, ‘dementia’, ‘family’ and ‘quality of life’. Findings were synthesised inductively, grouping factors associated with carer QOL into themes.
    RESULTS: 909 abstracts were identified. Following screening, lateral searches and quality appraisal, 41 studies (n=5,539) were included for synthesis. Ten themes were identified: demographics; carer-patient relationship; dementia characteristics; demands of caring; carer health; carer emotional wellbeing; support received; carer independence; carer self-efficacy; and future.
    DISCUSSION: The quality and level of evidence supporting each theme varied. We need further research on what factors predict carer QOL in dementia and how to measure it.
  • Brown, A., Inceoglu, I., & Lin, Y. (2016). Preventing Rater Biases in 360-Degree Feedback by Forcing Choice. Organizational Research Methods, 20, 121-148. doi:10.1177/1094428116668036
    We examined the effects of response biases on 360-degree feedback using a large sample (N=4,675) of organizational appraisal data. Sixteen competencies were assessed by peers, bosses and subordinates of 922 managers, as well as self-assessed, using the Inventory of Management Competencies (IMC) administered in two formats – Likert scale and multidimensional forced choice. Likert ratings were subject to strong response biases, making even theoretically unrelated competencies correlate highly. Modeling a latent common method factor, which represented non-uniform distortions similar to those of “ideal-employee” factor in both self- and other assessments, improved validity of competency scores as evidenced by meaningful second-order factor structures, better inter-rater agreement, and better convergent correlations with an external personality measure. Forced-choice rankings modelled with Thurstonian IRT yielded as good construct and convergent validities as the bias-controlled Likert ratings, and slightly better rater agreement. We suggest that the mechanism for these enhancements is finer differentiation between behaviors in comparative judgements, and advocate the operational use of the multidimensional forced-choice response format as an effective bias prevention method.
  • Velikonja, T., Edbrooke-Childs, J., Calderon, A., Sleed, M., Brown, A., & Deighton, J. (2016). The psychometric properties of the Ages & Stages Questionnaires for use as population outcome indicators at 2.5 years in England: A systematic review. Child: Care, Health and Development, 43, 1-17. doi:10.1111/cch.12397
    Background: Early identification of children with potential development delay is essential to ensure access to care. The Ages & Stages Questionnaires (ASQ) are used as population outcome indicators in England as part of the 2.5 year review.
    Method: The aim of this article was to systematically review the worldwide evidence for the psychometric properties of the ASQ third edition (ASQ-3™) and the Ages & Stages Questionnaires®: Social-Emotional (ASQ:SE). Eight electronic databases and grey literature were searched for original research studies available in English language, which reported reliability, validity, or responsiveness of the ASQ-3™ or ASQ:SE for children aged between 2 and 2.5 years. Twenty studies were included. Eligible studies used either the ASQ-3™ or the ASQ:SE and reported at least one measurement property of the ASQ-3™ and/or ASQ:SE. Data were extracted from all papers identified for final inclusion, drawing on Cochrane guidelines.
    Results: Using ‘positive’, ‘intermediate’, and ‘negative’ criteria for evaluating psychometric properties, results showed ‘positive’ reliability values in 11/18 instances reported, ‘positive’ sensitivity values in 13/18 instances reported, and ‘positive’ specificity values in 19/19 instances reported.
    Conclusions: Variations in age or language versions used, quality of psychometric properties, and quality of papers resulted in heterogeneous evidence. It is important to consider differences in cultural and contextual factors when measuring child development using these indicators. Further research is very likely to have an important impact on the interpretation of the ASQ-3™ and ASQ:SE psychometric evidence.
  • Chua, K., Brown, A., Little, R., Matthews, D., Morton, L., Loftus, V., Watchurst, C., Tait, R., Romeo, R., & Banerjee, S. (2016). Quality-of-life assessment in dementia: the use of DEMQOL and DEMQOL-Proxy total scores. Quality of Life Research, 25, 3107-3118. doi:10.1007/s11136-016-1343-1
    There is a need to determine whether health-related quality-of-life (HRQL) assessments in dementia capture what is important, to form a coherent basis for guiding research and clinical and policy decisions. This study investigated structural validity of HRQL assessments made using the DEMQOL system, with particular interest in studying domains that might be central to HRQL, and the external validity of these HRQL measurements.
    HRQL of people with dementia was evaluated by 868 self-reports (DEMQOL) and 909 proxy reports (DEMQOL-Proxy) at a community memory service. Exploratory and confirmatory factor analyses (EFA and CFA) were conducted using bifactor models to investigate domains that might be central to general HRQL. Reliability of the general and specific factors measured by the bifactor models was examined using omega (ω) and omega hierarchical (ωh) coefficients. Multiple-indicators multiple-causes models were used to explore the external validity of these HRQL measurements in terms of their associations with other clinical assessments.
    Bifactor models showed adequate goodness of fit, supporting HRQL in dementia as a general construct that underlies a diverse range of health indicators. At the same time, additional factors were necessary to explain residual covariation of items within specific health domains identified from the literature. Based on these models, DEMQOL and DEMQOL-Proxy overall total scores showed excellent reliability (ωh > 0.8). After accounting for common variance due to a general factor, subscale scores were less reliable (ωh < 0.7) for informing on individual differences in specific HRQL domains. Depression was more strongly associated with general HRQL based on DEMQOL than on DEMQOL-Proxy (–0.55 vs –0.22). Cognitive impairment had no reliable association with general HRQL based on DEMQOL or DEMQOL-Proxy.
    The tenability of a bifactor model of HRQL in dementia suggests that it is possible to retain theoretical focus on the assessment of a general phenomenon, while exploring variation in specific HRQL domains for insights on what may lie at the ‘heart’ of HRQL for people with dementia. These data suggest that DEMQOL and DEMQOL-Proxy total scores are likely to be accurate measures of individual differences in HRQL, but that subscale scores should not be used. No specific domain was solely responsible for general HRQL at dementia diagnosis. Better HRQL was moderately associated with less depressive symptoms, but this was less apparent based on informant reports. HRQL was not associated with severity of cognitive impairment.
  • Lin, Y., & Brown, A. (2016). Influence of Context on Item Parameters in Forced-Choice Personality Assessments. Educational and Psychological Measurement, 77, 389-414. doi:10.1177/0013164416646162
    A fundamental assumption in computerized adaptive testing (CAT) is that item parameters are invariant with respect to context – items surrounding the administered item. This assumption, however, may not hold in forced-choice (FC) assessments, where explicit comparisons are made between items included in the same block. We empirically examined the influence of context on item parameters by comparing parameter estimates from two FC instruments. The first instrument was compiled of blocks of three items, whereas in the second, the context was manipulated by adding one item to each block, resulting in blocks of four. The item parameter estimates were highly similar. However, a small number of significant deviations were observed, confirming the importance of context when designing adaptive FC assessments. Two patterns of such deviations were identified, and methods to reduce their occurrences in a FC CAT setting were proposed. It was shown that with a small proportion of violations of the parameter invariance assumption, score estimation remained stable.
  • Brown, A. (2016). Thurstonian Scaling of Compositional Questionnaire Data. Multivariate Behavioral Research, 51, 345-356. doi:10.1080/00273171.2016.1150152
    To prevent response biases, personality questionnaires may use comparative response formats. These include forced choice, where respondents choose among a number of items, and quantitative comparisons, where respondents indicate the extent to which items are preferred to each other. The present article extends Thurstonian modeling of binary choice data (Brown & Maydeu-Olivares, 2011a) to “proportion-of-total” (compositional) formats. Following Aitchison (1982), compositional item data are transformed into log-ratios, conceptualized as differences of latent item utilities. The mean and covariance structure of the log-ratios is modelled using Confirmatory Factor Analysis (CFA), where the item utilities are first-order factors, and personal attributes measured by a questionnaire are second-order factors. A simulation study with two sample sizes, N=300 and N=1000, shows that the method provides very good recovery of true parameters and near-nominal rejection rates. The approach is illustrated with empirical data from N=317 students, comparing model parameters obtained with compositional and Likert scale versions of a Big Five measure. The results show that the proposed model successfully captures the latent structures and person scores on the measured traits.
  • van Damm, N., Brown, A., Mole, T., Davis, J., Britton, W., & Brewer, J. (2015). Development and Validation of the Behavioral Tendencies Questionnaire. PLoS ONE, 10, 1-21. doi:10.1371/journal.pone.0140867
    At a fundamental level, taxonomy of behavior and behavioral tendencies can be described in terms of approach, avoid, or equivocate (i.e., neither approach nor avoid). While there are numerous theories of personality, temperament, and character, few seem to take advantage of parsimonious taxonomy. The present study sought to implement this taxonomy by creating a questionnaire based on a categorization of behavioral temperaments/tendencies first identified in Buddhist accounts over fifteen hundred years ago. Items were developed using historical and contemporary texts of the behavioral temperaments, described as “Greedy/Faithful”, “Aversive/Discerning”, and “Deluded/Speculative”. To both maintain this categorical typology and benefit from the advantageous properties of forced-choice response format (e.g., reduction of response biases), binary pairwise preferences for items were modeled using Latent Class Analysis (LCA). One sample (n1 = 394) was used to estimate the item parameters, and the second sample (n2 = 504) was used to classify the participants using the established parameters and cross-validate the classification against multiple other measures. The cross-validated measure exhibited good nomothetic span (construct-consistent relationships with related measures) that seemed to corroborate the ideas present in the original Buddhist source documents. The final 13-block questionnaire created from the best performing items (the Behavioral Tendencies Questionnaire or BTQ) is a psychometrically valid questionnaire that is historically consistent, based in behavioral tendencies, and promises practical and clinical utility particularly in settings that teach and study meditation practices such as Mindfulness Based Stress Reduction (MBSR).
  • Megreya, A., Bindemann, M., & Brown, A. (2015). Criminal thinking in a Middle Eastern prison sample of thieves, drug dealers and murderers. Legal and Criminological Psychology, 20, 324-342. doi:10.1111/lcrp.12029
    Purpose: The Psychological Inventory of Criminal Thinking Styles (PICTS) has been applied extensively to the study of criminal behaviour and cognition. This study aimed to explore the psychometric characteristics (factorial structure, reliability and external validity) of an Arabic version of the PICTS, to explore cross-cultural differences between a sample of Middle-Eastern (Egyptian) prisoners and Western prison samples, and to examine the influence of type of crime on criminal thinking styles.
    Method: A group of 130 Egyptian male prisoners who had been sentenced for theft, drug dealing or murder completed the PICTS. Their scores were compared with the reported data of American, British, and Dutch prisoners.
    Results: The Arabic PICTS showed scale reliabilities estimated by coefficient alpha comparable to the English version, and reliabilities estimated as test-retest correlations were high. Confirmatory factor analysis showed that the PICTS subscale scores of Egyptian prisoners best fitted a two-factor model, in which one dimension comprised mollification, entitlement, superoptimism, sentimentality and discontinuity, and the second dimension reflected the thinking styles of power orientation, cut-off and cognitive indolence. Observed levels of thinking styles varied by type of crime, specifically between prisoners sentenced for theft, drug dealing, and murder. Cultural differences in criminal thinking styles were also found, whereby the Egyptian prisoners recorded the highest scores in most thinking styles, while American, Dutch and English prisoners were more comparable to each other.
    Conclusions: This study provides one of the first investigations of criminal thinking styles in a non-Western sample and suggests that cross-cultural differences in the structure of these thinking styles exist. In addition, the results indicate that criminal thinking styles need to be understood by the type of crime for which a person has been sentenced.
  • Wetzel, E., Roberts, B., Fraley, C., & Brown, A. (2015). Equivalence of Narcissistic Personality Inventory constructs and correlates across scoring approaches and response formats. Journal of Research in Personality, 61, 87-98. doi:10.1016/j.jrp.2015.12.002
    The prevalent scoring practice for the Narcissistic Personality Inventory (NPI) ignores the forced-choice nature of the items. The aim of this study was to investigate whether findings based on NPI scores reported in previous research can be confirmed when the forced-choice nature of the NPI’s original response format is appropriately modeled, and when NPI items are presented in different response formats (true/false or rating scale). The relationships between NPI facets and various criteria were robust across scoring approaches (mean score vs. model-based), but were only partly robust across response formats. In addition, the scoring approaches and response formats achieved equivalent measurements of the vanity facet and in part of the leadership facet, but differed with respect to the entitlement facet.
  • Brown, A. (2014). Item Response Models for Forced-Choice Questionnaires: A Common Framework. Psychometrika, 81, 135-160. doi:10.1007/s11336-014-9434-9
    In forced-choice questionnaires, respondents have to make choices between two or more items presented at the same time. Several IRT models have been developed to link respondent choices to underlying psychological attributes, including the recent MUPP (Stark, Chernyshenko & Drasgow, 2005) and Thurstonian IRT (Brown & Maydeu-Olivares, 2011) models. In the present article, a common framework is proposed that describes forced-choice models along three axes: 1) the forced-choice format used; 2) the measurement model for the relationships between items and psychological attributes they measure; and 3) the decision model for choice behavior. Using the framework, fundamental properties of forced-choice measurement of individual differences are considered. It is shown that the scale origin for the attributes is generally identified in questionnaires using either unidimensional or multidimensional comparisons. Both dominance and ideal point models can be used to provide accurate forced-choice measurement; and the rules governing accurate person score estimation with these models are remarkably similar.
  • Guenole, N., & Brown, A. (2014). The consequences of ignoring measurement invariance for path coefficients in structural equation models. Frontiers in Psychology, 5, 1-16. doi:10.3389/fpsyg.2014.00980
    We report a Monte Carlo study examining the effects of two strategies for handling measurement non-invariance – modeling and ignoring non-invariant items – on structural regression coefficients between latent variables measured with item response theory models for categorical indicators. These strategies were examined across four levels and three types of non-invariance – non-invariant loadings, non-invariant thresholds, and combined non-invariance on loadings and thresholds – in simple, partial, mediated and moderated regression models where the non-invariant latent variable occupied predictor, mediator, and criterion positions in the structural regression models. When non-invariance is ignored in the latent predictor, the focal group regression parameters are biased in the opposite direction to the difference in loadings and thresholds relative to the referent group (i.e., lower loadings and thresholds for the focal group lead to overestimated regression parameters). With criterion non-invariance, the focal group regression parameters are biased in the same direction as the difference in loadings and thresholds relative to the referent group. While unacceptable levels of parameter bias were confined to the focal group, bias occurred at considerably lower levels of ignored non-invariance than was previously recognized in referent and focal groups.
  • Hill, A., Stoeber, J., Brown, A., & Appleton, P. (2014). Team perfectionism and team performance: A prospective study. Journal of Sport & Exercise Psychology, 36, 303-315. doi:10.1123/jsep.2013-0206
    Perfectionism is a personality characteristic that has been found to predict sports performance in athletes. To date, however, research has exclusively examined this relationship at an individual level (i.e., athletes’ perfectionism predicting their personal performance). The current study extends this research to team sports by examining whether, when manifested at team level, perfectionism predicts team performance. A sample of 231 competitive rowers from 36 boats completed measures of self-oriented, team-oriented, and team-prescribed perfectionism prior to competing against one another in a 4-day rowing competition. Strong within-boat similarities in the levels of team members’ team-oriented perfectionism supported the existence of collective team-oriented perfectionism at the boat level. Two-level latent growth curve modeling of day-by-day boat performance showed that team-oriented perfectionism positively predicted the position of the boat in mid-competition and the linear improvement in position. The findings suggest that imposing perfectionistic standards on team members may drive teams to greater levels of performance.
  • Brodbeck, J., Bachmann, M., Brown, A., & Znoj, H. (2014). Effects of depressive symptoms on antecedents of lapses during a smoking cessation attempt: An ecological momentary assessment study. Addiction, 109, 1363-1370. doi:10.1111/add.12563
    AIMS: To investigate pathways through which momentary negative affect and depressive symptoms affect risk of lapse during smoking cessation attempts.

    DESIGN: Ecological Momentary Assessment was carried out during two weeks after an unassisted smoking cessation attempt. A three-month follow-up measured smoking frequency.

    SETTING: Data were collected via mobile devices in German-speaking Switzerland.

    PARTICIPANTS: A total of 242 individuals (age 20-40, 67% men) reported 7,112 observations.

    MEASUREMENTS: Online surveys assessed baseline depressive symptoms and nicotine dependence. Real-time data were collected on negative affect, physical withdrawal symptoms, urge to smoke, abstinence-related self-efficacy, and lapses.

    FINDINGS: A two-level structural equation model suggested that on the situational level, negative affect increased the urge to smoke and decreased self-efficacy (β = .20; β = -.12, respectively), but had no direct effect on lapse risk. A higher urge to smoke (β = .09) and lower self-efficacy (β = -.11) were confirmed as situational antecedents of lapses. Depressive symptoms at baseline were a strong predictor of a person's average negative affect (β = .35, all p < .001). However, the baseline characteristics influenced smoking frequency three months later only indirectly, through influences of average states on the number of lapses during the quit attempt.

    CONCLUSIONS: Controlling for nicotine dependence, higher depressive symptoms at baseline were strongly associated with a worse longer-term outcome. Negative affect experienced during the quit attempt was the only pathway through which the baseline depressive symptoms were associated with a reduced self-efficacy and increased urges to smoke, all leading to the increased probability of lapses.
  • Brown, A., Ford, T., Deighton, J., & Wolpert, M. (2014). Satisfaction in Child and Adolescent Mental Health Services: Translating Users’ Feedback into Measurement. Administration and Policy in Mental Health and Mental Health Services Research, 41, 434-446. doi:10.1007/s10488-012-0433-9
    The present research addressed gaps in our current understanding of validity and quality of measurement provided by Patient Reported Experience Measures (PREM). We established the psychometric properties of a freely available Experience of Service Questionnaire (ESQ), based on responses from 7,067 families of patients across 41 UK providers of Child and Adolescent Mental Health Services (CAMHS), using two-level latent trait modeling. Responses to the ESQ were subject to strong ‘halo’ effects, which were thought to represent the overall positive or negative affect towards one’s treatment. Two strongly related constructs measured by the ESQ were interpreted as specific aspects of global satisfaction, namely Satisfaction with Care, and with Environment. The Care construct was sensitive to differences between less satisfied patients, facilitating individual and service-level problem evaluation. The effects of nesting within service providers were strong, with parental reports being the most reliable source of data for the between-provider comparisons. We provide a scoring protocol for converting the hand-scored ESQ to the model-based population-referenced scores with supplied standard errors, which can be used for benchmarking services as well as individual evaluations.
  • Stoeber, J., Kobori, O., & Brown, A. (2014). Examining mutual suppression effects in the assessment of perfectionism cognitions: Evidence supporting multidimensional assessment. Assessment, 21, 647-660. doi:10.1177/1073191114534884
    Perfectionism cognitions capture automatic perfectionistic thoughts and have explained variance in psychological adjustment and maladjustment beyond trait perfectionism. The aim of the present research was to investigate whether a multidimensional assessment of perfectionism cognitions has advantages over a unidimensional assessment. To this aim, we examined in a sample of 324 university students how the Perfectionism Cognitions Inventory (PCI) and the Multidimensional Perfectionism Cognitions Inventory (MPCI) explained variance in positive affect, negative affect, and depressive symptoms when factor or subscale scores were used as predictors compared to total scores. Results showed that a multidimensional assessment (PCI factor scores, MPCI subscale scores) explained more variance than a unidimensional assessment (PCI and MPCI total scores) because, when the different dimensions were entered simultaneously as predictors, perfectionistic strivings cognitions and perfectionistic concerns cognitions acted as mutual suppressors thereby increasing each other’s predictive validity. With this, the present findings provide evidence that, regardless of whether the PCI or the MPCI is used, a multidimensional assessment of perfectionism cognitions has advantages over a unidimensional assessment in explaining variance in psychological adjustment and maladjustment.
  • Stoeber, J., Kobori, O., & Brown, A. (2014). Perfectionism cognitions are multidimensional: A reply to Flett and Hewitt (2014). Assessment, 21, 666-668. doi:10.1177/1073191114550676
    We reply to Flett and Hewitt’s (2014) commentary on our findings (Stoeber, Kobori, & Brown, 2014) focusing on the multidimensionality of the Perfectionism Cognitions Inventory (PCI) and the question of whether the Multidimensional Perfectionism Cognitions Inventory (MPCI) represents an alternative to the PCI. In addition, we reiterate the importance of considering suppression effects when examining different dimensions of perfectionism and, in concluding, invite researchers to join forces to further advance the assessment of multidimensional perfectionism cognitions.
  • Deighton, J., Tymms, P., Vostanis, P., Belsky, J., Fonagy, P., Brown, A., Martin, A., Patalay, P., & Wolpert, M. (2013). The Development of a School-Based Measure of Child Mental Health. Journal of Psychoeducational Assessment, 31, 247-257. doi:10.1177/0734282912465570
    Early detection of child mental health problems in schools is critical for implementing strategies for prevention and intervention. The development of an effective measure of mental health and well-being for this context must be both empirically sound and practically feasible. This study reports the initial validation of a brief self-report measure for child mental health suitable for use with children as young as eight (“Me and My School” (M&MS)). After factor analysis, and studies of measurement invariance, two subscales emerged: emotional difficulties and behavioral difficulties. These two subscales were highly correlated with corresponding constructs of the Strengths and Difficulties Questionnaire (SDQ) and showed correlations with attainment, deprivation and educational needs similar to ones obtained between these demographic measures and the SDQ. Results suggest that this school-based self-report measure is psychometrically sound, and has the potential of contributing to school mental health surveys, evaluation of interventions, and recognition of mental health problems within schools.
  • Brown, A., & Maydeu-Olivares, A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18, 36-52. doi:10.1037/a0030641
    In multidimensional forced-choice (MFC) questionnaires, items measuring different attributes are presented in blocks, and participants have to rank-order the items within each block (fully or partially). Such comparative formats can reduce the impact of numerous response biases often affecting single-stimulus items (a.k.a. rating or Likert scales). However, if scored with traditional methodology, MFC instruments produce ipsative data, whereby all individuals have a common total test score. Ipsative scoring distorts individual profiles (it is impossible to achieve all high or all low scale scores), construct validity (covariances between scales must sum to zero), criterion-related validity (validity coefficients must sum to zero), and reliability estimates.
    We argue that these problems are caused by inadequate scoring of forced-choice items, and advocate the use of item response theory (IRT) models based on an appropriate response process for comparative data, such as Thurstone’s Law of Comparative Judgment. We show that by applying Thurstonian IRT modeling (Brown & Maydeu-Olivares, 2011), even existing forced-choice questionnaires with challenging features can be scored adequately and that the IRT-estimated scores are free from the problems of ipsative data.
  • Brodbeck, J., Bachmann, M., Croudace, T., & Brown, A. (2013). Comparing Growth Trajectories of Risk Behaviors From Late Adolescence Through Young Adulthood: An Accelerated Design. Developmental Psychology, 49, 1732-1738. doi:10.1037/a0030873
    Risk behaviors such as substance use or deviance are often limited to the early stages of the life course. Whereas the onset of risk behavior is well studied, less is currently known about the decline and timing of cessation of risk behaviors of different domains during young adulthood. Prevalence and longitudinal developmental patterning of alcohol use, drinking to the point of drunkenness, smoking, cannabis use, deviance, and HIV-related sexual risk behavior were compared in a Swiss community sample (N = 2,843). Using a longitudinal cohort-sequential approach to link multiple assessments with 3 waves of data for each individual, the studied period spanned the ages of 16 to 29 years. Although smoking had a higher prevalence, both smoking and drinking up to the point of drunkenness followed an inverted U-shaped curve. Alcohol consumption was also best described by a quadratic model, though largely stable at a high level through the late 20s. Sexual risk behavior increased slowly from age 16 to age 22 and then remained largely stable. In contrast, cannabis use and deviance linearly declined from age 16 to age 29. Young men were at higher risk for all behaviors than were young women, but apart from deviance, patterning over time was similar for both sexes. Results about the timing of increase and decline as well as differences between risk behaviors may inform tailored prevention programs during the transition from late adolescence to adulthood.
  • Brown, A., & Maydeu-Olivares, A. (2012). Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behavior Research Methods, 44, 1135-1147. doi:10.3758/s13428-012-0217-x
    To counter response distortions associated with the use of rating scales (a.k.a. Likert scales), items can be presented in a comparative fashion, so that respondents are asked to rank the items within blocks (forced-choice format). However, classical scoring procedures for these forced-choice designs lead to ipsative data, which presents psychometric challenges that are well described in the literature. Recently, Brown and Maydeu-Olivares (Educational and Psychological Measurement 71: 460-502, 2011a) introduced a model based on Thurstone's law of comparative judgment, which overcomes the problems of ipsative data. Here, we provide a step-by-step tutorial for coding forced-choice responses, specifying a Thurstonian item response theory model that is appropriate for the design used, assessing the model's fit, and scoring individuals on psychological attributes. Estimation and scoring are performed using Mplus, and a very straightforward Excel macro is provided that writes full Mplus input files for any forced-choice design. Armed with these tools, using a forced-choice design is now as easy as using ratings.
  • Brown, A., & Maydeu-Olivares, A. (2011). Item Response Modeling of Forced-Choice Questionnaires. Educational and Psychological Measurement, 71, 460-502. doi:10.1177/0013164410375112
    Multidimensional forced-choice formats can significantly reduce the impact of numerous response biases typically associated with rating scales. However, if scored with classical methodology, these questionnaires produce ipsative data, which lead to distorted scale relationships and make comparisons between individuals problematic. This research demonstrates how item response theory (IRT) modeling may be applied to overcome these problems. A multidimensional IRT model based on Thurstone’s framework for comparative data is introduced, which is suitable for use with any forced-choice questionnaire composed of items fitting the dominance response model, with any number of measured traits, and any block sizes (i.e., pairs, triplets, quads, etc.). Thurstonian IRT models are normal ogive models with structured factor loadings, structured uniquenesses, and structured local dependencies. These models can be straightforwardly estimated using structural equation modeling (SEM) software Mplus. A number of simulation studies are performed to investigate how latent traits are recovered under various forced-choice designs and provide guidelines for optimal questionnaire design. An empirical application is given to illustrate how the model may be applied in practice. It is concluded that when the recommended design guidelines are met, scores estimated from forced-choice questionnaires with the proposed methodology reproduce the latent traits well.
  • Lievens, F., Sanchez, J., Bartram, D., & Brown, A. (2010). Lack of consensus among competency ratings of the same occupation: Noise or substance? Journal of Applied Psychology, 95, 562-571. doi:10.1037/a0018035
    Although rating differences among incumbents of the same occupation have traditionally been viewed as error variance in the work analysis domain, such differences might often capture substantive discrepancies in how incumbents approach their work. This study draws from job crafting, creativity, and role theories to uncover situational factors (i.e., occupational activities, context, and complexity) related to differences among competency ratings of the same occupation. The sample consisted of 192 incumbents from 64 occupations. Results showed that 25% of the variance associated with differences in competency ratings of the same occupation was related to the complexity, the context, and primarily the nature of the occupation's work activities. Consensus was highest for occupations involving equipment-related activities and direct contact with the public.
  • Maydeu-Olivares, A., & Brown, A. (2010). Item Response Modeling of Paired Comparison and Ranking Data. Multivariate Behavioral Research, 45, 935-974. doi:10.1080/00273171.2010.531231
    The comparative format used in ranking and paired comparisons tasks can significantly reduce the impact of uniform response biases typically associated with rating scales. Thurstone's (1927, 1931) model provides a powerful framework for modeling comparative data such as paired comparisons and rankings. Although Thurstonian models are generally presented as scaling models, that is, stimuli-centered models, they can also be used as person-centered models. In this article, we discuss how Thurstone's model for comparative data can be formulated as item response theory models so that respondents' scores on underlying dimensions can be estimated. Item parameters and latent trait scores can be readily estimated using a widely used statistical modeling program. Simulation studies show that item characteristic curves can be accurately estimated with as few as 200 observations and that latent trait scores can be recovered to a high precision. Empirical examples are given to illustrate how the model may be applied in practice and to recommend guidelines for designing ranking and paired comparisons tasks in the future.
  • Brown, A., & Maydeu-Olivares, A. (2010). Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy. Industrial and Organizational Psychology, 3, 489-493. doi:10.1111/j.1754-9434.2010.01277.x
  • Bartram, D., Warr, P., & Brown, A. (2010). Let’s Focus on Two-Stage Alignment Not Just on Overall Performance. Industrial and Organizational Psychology, 3, 335-339. doi:10.1111/j.1754-9434.2010.01247.x
  • Brown, A. (2010). Doing less but getting more: Improving forced-choice measures with Item Response Theory. Assessment and Development Matters, 2, 21-25. Retrieved from http://shop.bps.org.uk/assessment-development-matters-vol-2-no-1-spring-2010.html
    Forced-choice tests, despite being resistant to response biases and showing good operational validities, have psychometric problems if scored traditionally. These questionnaires are generally longer than their normative counterparts, and more cognitively challenging.
    The OPQ32i was shortened and re-scored using the latest advances in IRT. One item was removed from each block, making completion quicker and less cognitively complex. The shortened version (OPQ32r) shows good reliability, equivalent or better validity than the full ipsative version, and produces scale scores with normative properties.
    Results suggest that the IRT methodology can significantly improve the efficiency of existing forced-choice measures so that test takers can do less (complete a shorter and easier questionnaire) and test users can get more (a bias-resistant instrument of superior psychometric quality).
  • Bywater, J., & Brown, A. (2010). Shorter Personality Questionnaires—A User’s Guide Part 1. Assessment and Development Matters, 2, 15.
    In this two-part series, James Bywater and Anna Brown summarise some of the issues involved in determining the correct length of assessment in a personality questionnaire (PQ). In the first instalment they discuss the general issues that test designers face, and in the second they cover some more modern solutions to these, with associated disadvantages.

    It is aimed at practitioners rather than hard-core psychometricians and cannot be exhaustive. However, wherever possible, it attempts to distil practical messages for the audience.
  • Bywater, J., & Brown, A. (2010). Shorter Personality Questionnaires—A User’s Guide Part 2. Assessment and Development Matters, 2, 10.
    In this two-part series, James Bywater and Anna Brown summarise some of the issues involved in determining the correct length of assessment in a personality questionnaire (PQ). In the last edition of Assessment & Development Matters they discussed the general issues that test designers face, and in this one they cover some more modern solutions to these.

    It is aimed at practitioners rather than hard-core psychometricians and cannot be exhaustive. However, wherever possible, it attempts to distil practical messages for the audience.
  • Warr, P., Bartram, D., & Brown, A. (2005). Big Five validity: Aggregation method matters. Journal of Occupational and Organizational Psychology, 78, 377-386. doi:10.1348/096317905X53868
    Correlations between Big Five personality factors and other variables have been examined in three different ways: direct scoring of items within a factor, application of a composite score formula, and taking the average of single-scale correlations. Those methods were shown to yield consistently different outcomes in four sets of data from sales-people and managers. Factor correlations with job performance were greatest for direct scoring, and were reduced by half when scale correlations were averaged. The insertion of previously suggested estimates into the composite score formula yielded intermediate correlations with performance. It is necessary to interpret summary accounts of correlations with a compound construct in the light of the aggregation method employed.

Book section

  • Brown, A., & Maydeu-Olivares, A. (2018). Modeling forced-choice response formats. In P. Irwing, T. Booth, & D. Hughes (Eds.), The Wiley Handbook of Psychometric Testing (pp. 523-570). London: Wiley-Blackwell.
    To counter response distortions associated with the use of rating scales in personality and similar assessments, test items may be presented in so-called ‘forced-choice’ formats. Respondents may be asked to rank-order a number of items, or distribute a fixed number of points between several items – therefore they are forced to make a choice. Until recently, basic classical scoring methods were applied to such formats, leading to scores relative to the person’s mean (ipsative scores). While interpretable in intra-individual assessments, ipsative scores are problematic when used for inter-individual comparisons. Recent advances in estimation methods enabled rapid development of item response models for comparative data, including the Thurstonian IRT model (Brown & Maydeu-Olivares, 2011a), the Multi-Unidimensional Pairwise Preference model (Stark, Chernyshenko & Drasgow, 2005), and others. Appropriate item response modeling enables estimation of person scores that are directly interpretable for inter-individual comparisons, without the distortions and artifacts produced by ipsative scoring.
  • Brown, A. (2018). Item Response Theory approaches to test scoring and evaluating the score accuracy. In P. Irwing, T. Booth, & D. Hughes (Eds.), The Wiley Handbook of Psychometric Testing (pp. 607-638). London: Wiley-Blackwell.
    The ultimate goal of psychometric testing is to produce a score by which people can be differentiated. Item Response Theory (IRT) devises methods for estimating a person’s score on one or more psychological constructs (traits) from his/her responses to test items. This chapter gives an overview of scoring methods applicable to situations when the test items indicate one trait only; or a set of related traits but each item contributes to measurement of one trait; or when each item indicates multiple traits. We consider scoring methods based on item responses only, as well as Bayesian methods, which use prior knowledge of the trait distribution. Much of this chapter is devoted to methods for assessing measurement precision provided by individual items, the whole test, and the prior distribution. In IRT, this precision can be evaluated for each individual response pattern. All described methods are illustrated with a single empirical example.
  • Wetzel, E., Böhnke, J., & Brown, A. (2016). Response biases. In F. T. Leong & D. Iliescu (Eds.), The ITC International Handbook of Testing and Assessment (pp. 349-363). New York: Oxford University Press.
  • Brown, A., & Croudace, T. (2015). Scoring and estimating score precision using multidimensional IRT. In S. P. Reise & D. A. Revicki (Eds.), Handbook of Item Response Theory Modeling: Applications to Typical Performance Assessment (pp. 307-333). New York: Taylor & Francis (Routledge). Retrieved from http://www.routledge.com/books/details/9781848729728/
    The ultimate goal of measurement is to produce a score by which individuals can be assessed and differentiated. Item response theory (IRT) modeling views responses to test items as indicators of a respondent’s standing on some underlying psychological attributes (van der Linden & Hambleton, 1997) – we often call them latent traits – and devises special algorithms for estimating this standing. This chapter gives an overview of methods for estimating person attribute scores using one-dimensional and multi-dimensional IRT models, focusing on those that are particularly useful with patient-reported outcome (PRO) measures.
    To be useful in applications, a test score has to approximate the latent trait well, and importantly, the precision level must be known in order to produce information for decision-making purposes. Unlike classical test theory (CTT), which assumes that the precision with which a test measures is the same for all trait levels, IRT methods assess the precision with which a test measures at different trait levels. In the context of patient-reported outcomes measurement, this enables assessment of the measurement precision for an individual patient. Knowing error bands around the patient’s score is important for informing clinical judgments, such as deciding upon the significance of any change, for instance in response to treatment (Reise & Haviland, 2005). At the same time, summary indices are often needed to summarize the overall precision of measurement in a research sample, population group, or in the population as a whole. Much of this chapter is devoted to methods for estimating measurement precision, including the score-dependent standard error of measurement and appropriate sample-level or population-level marginal reliability coefficients.
    Patient-reported outcome measures often capture several related constructs, a feature that may make the use of multi-dimensional IRT models appropriate and beneficial (Gibbons, Immekus & Bock, 2007). Several such models are described, including a model with multiple correlated constructs, a model where multiple constructs are underlain by a general common factor (second-order model), and a model where each item is influenced by one general and one group factor (bifactor model). To make the use of these models more easily accessible for applied researchers, we provide specialized formulae for computing test information, standard errors and reliability. We show how to translate a multitude of numbers and graphs conditioned on several dimensions into easy-to-use indices that can be understood by applied researchers and test users alike. All described methods and techniques are illustrated with a single data analysis example involving a popular PRO measure, the 28-item version of the General Health Questionnaire (GHQ28; Goldberg & Williams, 1988), completed in mid-life by a large community sample as a part of a major UK cohort study.
  • Brown, A. (2015). Personality Assessment, Forced-Choice. In J. Wright (Ed.), International Encyclopedia of the Social and Behavioural Sciences, 2nd Edition. Elsevier. doi:10.1016/B978-0-08-097086-8.25084-8
    Instead of responding to questionnaire items one at a time, respondents may be forced to make a choice between two or more items measuring the same or different traits. The forced-choice format eliminates uniform response biases, although the research on its effectiveness in reducing the effects of impression management is inconclusive. Until recently, forced-choice questionnaires were scaled in relation to person means (ipsative data), providing information for intra-individual assessments only. Item response modeling enabled proper scaling of forced-choice data, so that inter-individual comparisons may be made. New forced-choice applications in personality assessment and directions for future research are discussed.
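
The ipsative-data problem that recurs in the forced-choice entries above can be illustrated with a small simulation. The sketch below is not the Thurstonian IRT model and is not taken from any of the listed publications; it only shows why classical scoring of ranked blocks gives every respondent the same total score and forces the covariances between scales to sum to zero. The block size (3), the rule that item k in every block measures trait k, the normally distributed utilities, and all function names are assumptions made for illustration.

```python
import random

def rank_block(utilities):
    """Return rank points (0 = least preferred) for each item in one block."""
    order = sorted(range(len(utilities)), key=lambda k: utilities[k])
    ranks = [0] * len(utilities)
    for points, item in enumerate(order):
        ranks[item] = points
    return ranks

def ipsative_scores(person_blocks):
    """Classical scoring: sum rank points per trait (item k measures trait k)."""
    n_traits = len(person_blocks[0])
    scores = [0] * n_traits
    for block in person_blocks:
        for k, points in enumerate(rank_block(block)):
            scores[k] += points
    return scores

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

random.seed(0)
N_PERSONS, N_BLOCKS, N_TRAITS = 300, 10, 3  # hypothetical design

sample = []
for _ in range(N_PERSONS):
    # Hypothetical latent traits; item utility = trait + random error.
    traits = [random.gauss(0, 1) for _ in range(N_TRAITS)]
    blocks = [[t + random.gauss(0, 1) for t in traits] for _ in range(N_BLOCKS)]
    sample.append(ipsative_scores(blocks))

# Every respondent's scale scores add up to the same constant.
totals = {sum(scores) for scores in sample}

# Each row of the scale covariance matrix sums to (numerically) zero.
columns = list(zip(*sample))
cov_row_sums = [sum(covariance(ci, cj) for cj in columns) for ci in columns]

print(totals)
print([round(v, 10) for v in cov_row_sums])
```

Because each block always distributes the same rank points (0 + 1 + 2), the total is fixed by design regardless of the simulated trait levels, which is why such scores cannot show a respondent as high (or low) on all scales at once.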

Conference or workshop item

  • Brown, A. (2016). Response distortions in self-reported and other-reported measures: Is there light at the end of the tunnel? In 10th conference of the International Test Commission (ITC). Vancouver, Canada.
    Asking people to assess themselves or others on a set of psychological characteristics is by far the most popular method of gathering data in our field. We use this method either because it is the cheapest, or because it is the best that currently exists for measuring the target characteristic. However, respondent-reported data are commonly affected by conscious and unconscious response distortions. Examples include individual styles in using rating options, inattention or cognitive difficulties in responding to reversed items, a tendency to present oneself in a positive light, halo effects, distortions driven by political pressures, etc. The extent to which respondents engage in such behaviors varies, and if not controlled, the biases alter the true ordering of respondents on traits of interest. Response distortions, therefore, should concern everyone who uses respondent-reported measures.

    This talk provides an overview of research on biasing factors evoked by responding to questionnaire items with different features and in different contexts, discussing the evolution of views on the problem. I will discuss the emerging methods of statistical control, which explicitly incorporate biases in the models of item-level response processes (e.g. Böckenholt, 2012; Johnson & Bolt, 2010). These methods offer great promise but also have natural limitations in their applicability and scope. Alternatives to statistical control include prevention, and there have been advances in this area too. Special response formats are one of the bias prevention methods, with the forced-choice format being particularly promising. During the past 10-15 years we have acquired methodology that enables modelling of forced-choice data. This makes it possible to compare the effectiveness in bias control of the two methods: statistical control versus prevention. I will report the latest findings in this regard and share some of my own views and recommendations for the use of these methods depending on the context and stakes of assessments.

    I will argue that despite some significant progress, we are still far from bias-proof assessments. In order to create a breakthrough in this area, we must invest in research on test taker cognitions, mixing qualitative and quantitative methods. The few available studies of test taker behavior (e.g. Kuncel & Tellegen, 2009; Robie, Brown, & Beaty, 2007) show that test takers have conflicting motives and complex cognitions when it comes to sitting our assessments. Only when we understand these factors can we hope to create better assessments.
  • Seager, E., Abbot-Smith, K., & Brown, A. (2014). Concurrent validity for 2-year olds of three nursery-worker completed language screening measures with a direct measure of receptive language. In British Psychological Society, Developmental Psychology Section Annual Conference. Amsterdam. Retrieved from https://www.bps.org.uk/events/conferences/developmental-section-annual-conference-2014
  • Brown, A., & Bartram, D. (2009). Doing less but getting more: Improving forced-choice measures with IRT. In Society for Industrial & Organizational Psychology annual conference. New Orleans.
    Multidimensional forced-choice (MFC) questionnaires typically show good validities and are resistant to impression management effects. However, they yield ipsative data, which distorts scale relationships and makes comparisons between people problematic. Depressed reliability estimates also led developers to create tests of potentially excessive length. We apply an IRT Preference Model to make more efficient use of information in existing MFC questionnaires. OPQ32i, used for selection and assessment internationally, is examined using this approach. The latent scores recovered from a much reduced number of MFC items are superior to the full test's ipsative scores, and comparable to unbiased normative scores.


  • Brown, A., & Maydeu-Olivares, A. (2011). Forced-choice Five Factor markers. doi:10.1037/t05430-000
    The Big Five Questionnaire (Brown & Maydeu-Olivares, 2011) was developed in the context of a study researching item response theory (IRT) modeling of forced-choice questionnaires. The purpose of the questionnaire is to measure the Big Five personality factor markers. Items were drawn from the 100 items of the International Personality Item Pool. The authors selected 60 items so that 12 items would measure each of the five marker traits, with 8 positively and 4 negatively keyed items per trait, combined so that an equal number of pairwise comparisons occurs between items keyed in the same direction and items keyed in opposite directions. Each block of the questionnaire was presented in two formats. First, participants rated the three items using a 5-point rating scale, ranging from "very accurate" to "very inaccurate." This single-stimulus presentation was immediately followed by the forced-choice presentation, where the participants were asked to select one "most like me" item and one "least like me" item out of the same block of three items. A total of 438 volunteers from the United Kingdom completed the questionnaire online. The reliability estimates ranged from .775 to .844 for the single-stimulus data and from .601 to .766 for the forced-choice data. The maximum a posteriori estimated trait scores for individuals based on single-stimulus and forced-choice responses correlated strongly, with correlations ranging from .69 for Agreeableness to .82 for Extraversion.


  • Brown, A., & Bartram, D. (2011). OPQ32r Technical Manual. SHL Group Ltd., Thames Ditton, Surrey.
    A new SHL approach to designing and scoring forced-choice questionnaires using Item Response Theory (IRT) has enabled a revolutionary improvement in efficiency, accuracy and scaling properties of OPQ32 trait scores, leading to the new OPQ32r. This manual presents technical information relating to the IRT scoring model and the new OPQ32r instrument. It should be read in conjunction with the more detailed and extensive OPQ32 Technical Manual (SHL, 2006), which covers the design, development and technical characteristics of the OPQ32i and OPQ32n.
  • Brown, A., & Bartram, D. (2009). The Occupational Personality Questionnaire Revolution: Applying Item Response Theory to questionnaire design and scoring. SHL Group, Thames Ditton, Surrey.
  • Bartram, D., Brown, A., Fleck, S., Inceoglu, I., & Ward, K. (2006). OPQ32 Technical Manual. SHL Group Ltd., Thames Ditton, Surrey.
    This Technical Manual is intended to be read in conjunction with the OPQ32 User Manual. The content of the latter focuses on administration, scoring, norming and interpretation issues, and is intended to cover all the matters one needs to refer to when using the OPQ32. The technical manual is intended for reference purposes and provides all the technical backup needed when evaluating the OPQ32 in terms of its suitability for use.


  • Crispim, A. (2017). Exploring the validity evidence of core affect.
    Core affect is an elementary affective state expressed through subjective feelings. Nonetheless, despite extensive empirical evidence in the field, researchers still disagree about its dimensionality. Thus, the present thesis aims to verify the validity evidence of existing models of core affect, overcoming the methodological issues of previous studies, and establishing the dimensionality of core affect. First, theoretical contributions are presented, and both conceptual issues (e.g. what is core affect?) and methodological issues (e.g. how is core affect measured?) are discussed. Following that, two empirical studies are presented. The first study explores the dimensionality of core affect and provides validity evidence for a new core affect measure. In the second study, a robust-to-biases core affect measure is developed and tested. In addition, the relationships between core affect, contextual variables (e.g. mood) and personality traits are studied in a longitudinal design. Item formats (e.g. rating scales, forced-choice items) and their consequences for the measurement of core affect are debated. Finally, theoretical and methodological advances are discussed, as well as limitations and future directions.
Last updated