Evaluating Agreement: Conducting a Reliability Study

Demographic differences between the two subgroups were assessed first. Reliability, consistency, and correlations within and across the two rating subgroups were then analyzed. The analysis methods and the corresponding research questions are summarized in Figure 1.

Figure 3. Scatter plot of the children's ratings. Each point represents the two ratings available for one child. For the parent-teacher subgroup, parent ratings are plotted on the x-axis and teacher ratings on the y-axis; for the parent-parent subgroup, paternal ratings are on the x-axis and maternal ratings on the y-axis. Ratings for bilingual children are shown as grey dots, those for monolingual children as black dots. The dotted lines enclose statistically identical ratings calculated from the test-retest reliability reported in the manual (differences below 3 T-points; 23 of 53 pairs). The solid lines enclose statistically identical ratings calculated from the inter-rater reliability (ICC) found in our study (differences below 12 T-points).

Another way to illustrate the magnitude of the differences is to plot their distribution, with the differences plotted against the mean T-values, as proposed by Bland and Altman (1986, 2003). This plot (see Figure 4) shows that 18 of the 30 observed differences (60%) fall within 1 SD of the differences (SD = 5.7). The limits of agreement in this study, defined by Bland and Altman (2003) as the interval expected to contain 95% of the differences in comparable populations, are -12.2 to 10.2 T-points, an interval that contains all differences observed in this study. The graphical approach to assessing the size of the differences thus mirrors the finding of 100% agreement obtained when the ICC is used to calculate reliable differences. The term "agreement" describes the degree to which ratings are identical (see, e.g., de Vet et al., 2006; Shoukri, 2010; Kottner et al., 2011). Many studies that claim to assess agreement on expressive vocabulary rely (only) on measures of the strength of association, such as linear correlations (e.g., Bishop and Baird, 2001; Janus, 2001; Van Noord and Prevatt, 2002; Bishop et al., 2006; Massa et al., 2008; Gudmundsson and Gretarsson, 2009). In other studies, raw scores are used as benchmarks and critical differences are ignored (e.g., Marchman and Martinez-Sussmann, 2002; McLeod and Harrison, 2009). However, absolute differences between raw scores or percentiles carry no information about their statistical significance.
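As a sketch of how such limits of agreement are obtained, the following Python snippet computes the Bland-Altman interval (mean difference ± 1.96 × SD of the differences). Note that the mean difference of -1.0 T-points used here is not stated in the text; it is back-calculated from the reported limits of -12.2 and 10.2 together with SD = 5.7.

```python
import numpy as np

def limits_of_agreement(x, y, level=1.96):
    """Bland-Altman limits of agreement for two paired ratings.

    Returns the mean difference and the interval expected to contain
    roughly 95% of differences in comparable populations
    (mean difference +/- 1.96 * SD of the differences).
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)  # sample SD of the paired differences
    return mean_diff, (mean_diff - level * sd_diff,
                       mean_diff + level * sd_diff)

# Reconstructing the interval in the text from its summary statistics:
# an (assumed) mean difference of -1.0 T-points with SD = 5.7 yields
# limits of about -12.2 and 10.2 T-points.
mean_d, sd_d = -1.0, 5.7
lower, upper = mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d
print(round(lower, 1), round(upper, 1))  # -12.2 10.2
```

With the study's actual rating pairs in place of the summary statistics, `limits_of_agreement` would return the same interval directly from the raw data.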

We demonstrated the use of the Reliable Change Index (RCI) to identify statistically significant differences between pairs of ratings. We obtained two different RCIs based on two reliability measures: the test-retest reliability reported in the ELAN manual (Bockmann and Kiese-Himmel, 2006) and the inter-rater reliability derived from our sample (expressed as an ICC). This dual approach was adopted to highlight the effect of more or less conservative reliability estimates on measures of rating agreement. We found that ratings differ reliably if the absolute difference between them is 3 or more T-points when the reliability reported in the ELAN manual is used. Based on the reliability found in our study, however, the difference required to establish a reliable divergence between two ratings is much larger, namely 12 T-points or more.

Kottner, J., Audigé, L., Brorson, S., Donner, A., Gajewski, B.
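The two critical differences can be reproduced with the standard RCI-based formula, z × SD × sqrt(2 × (1 − r)), where SD = 10 for T-scores. A minimal sketch follows; the reliability coefficients used below are assumptions back-calculated from the reported thresholds of about 3 and 12 T-points, not the published values from the manual or the study.

```python
import math

def critical_difference(sd, reliability, z=1.96):
    """Smallest absolute difference between two scores that counts as
    statistically reliable (p < .05) under the Reliable Change Index:
    z * sd * sqrt(2 * (1 - reliability))."""
    return z * sd * math.sqrt(2.0 * (1.0 - reliability))

SD_T = 10.0  # T-scores have a population SD of 10

# Assumed coefficients, chosen to reproduce the reported thresholds:
for label, r in [("test-retest (manual)", 0.99), ("inter-rater ICC (study)", 0.81)]:
    print(label, round(critical_difference(SD_T, r), 1))
# -> about 2.8 and 12.1 T-points, consistent with the thresholds of
#    roughly 3 and 12 T-points discussed above
```

The design point here is that the critical difference grows quickly as reliability drops: a very high test-retest coefficient yields a narrow band of "statistically identical" ratings, while a moderate ICC widens it fourfold.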