Validity of Echocardiographic Measurement in an Epidemiological Study
Abstract—In Project HeartBeat!, a longitudinal study of cardiovascular disease risk factors in healthy children and adolescents, 3 samples of 40, 80, and 182 echocardiograms, respectively, were randomly selected and reread to evaluate intraobserver and interobserver variabilities and comparability between measurements of field echocardiographic technicians and reference readings at Texas Children’s Hospital. Included in the evaluation were 8 M-mode echocardiographic measurements, ie, aortic root diameter, left atrial diameter, and end-diastolic and end-systolic measurements of interventricular septal thickness, left ventricular (LV) diameter, and LV posterior wall thickness; 8 Doppler measurements; and a calculated LV mass. Means and SDs of the differences of the paired measurements were used to assess the relative bias and random error of the measurements. For the intraobserver comparison, means and SDs of the differences were very small, indicating that the echo measurements were performed consistently by each project echo technician. Interobserver comparison showed statistically but not clinically significant differences between the paired readings of end-diastolic septal thickness, end-systolic LV posterior wall thickness, and 5 Doppler measurements. Comparison with reference readings at Texas Children’s Hospital showed significant differences in diastolic LV diameter, systolic septal thickness, and right ventricular ejection time. These differences, however, were minimal with limited clinical significance. Mean differences in LV mass for the corresponding comparisons were –1.82, 4.50, and 0.0013 g, and the SDs were 18.79, 24.16, and 12.35 g, respectively. We conclude that the echocardiographic measurements taken from healthy children in a longitudinal study can be made accurately with acceptable reproducibility.
The use of M-mode, 2-dimensional (2D), and Doppler echocardiography is an established, noninvasive clinical diagnostic method to examine cardiac structure and function. It has been increasingly used in population studies,1 2 including pediatric epidemiological studies,3 to assess the role of the heart in human hypertension and other cardiovascular disease (CVD). Accurate and reproducible measurements are crucial to the success of this method in population studies and may be obtained only by both careful performance of echocardiographic imaging and consistent interpretation of the echocardiogram. Uniformity in the definition of an adequate echocardiogram and standardization of measurement methods are necessary to minimize intraobserver and interobserver variabilities and to facilitate interstudy comparison.4 Although these issues have been addressed previously, the magnitude of the intraobserver and interobserver measurement variabilities in population studies and their impact on interstudy comparison have not yet been explored systematically.
In Project HeartBeat!, a population-based, intensive longitudinal study to evaluate CVD risk factors as an interrelated set of growth processes in healthy children and adolescents, echocardiographic measurement of the cardiac geometry and function was an integral component. This provision allows assessment of the morphological and functional growth of the heart and determinants of different aspects of this growth process. A detailed quality assurance protocol was developed and implemented for echocardiographic measurements, including training, certification, and recertification of the echocardiographic technicians and continuous monitoring of data quality and measurement accuracy by a single pediatric echocardiographer or experienced technicians at Texas Children’s Hospital. This report presents the results of a quality assessment study designed to evaluate observer variability in echocardiographic measurements. Three aspects of the echocardiographic measurements were assessed: intraobserver variability, interobserver variability within Project HeartBeat! staff, and interinstitutional measurement variability between project echocardiographic technicians and experienced technicians or a pediatric echocardiographer at Texas Children’s Hospital. The latter also served as validation for echocardiographic measurements collected in Project HeartBeat!.
Data collection for Project HeartBeat! began October 1, 1991. Six hundred seventy-eight children in 3 cohorts, aged 8, 11, and 14 years, respectively, were enrolled in the study from The Woodlands and Conroe, Montgomery County, Tex.5 They were followed and examined at 4-month intervals. Data collected included hemodynamics (blood pressure, heart rate, echocardiography), blood lipids, smoking (habits, cotinine levels), body size and composition, maturation, diet and nutrition, physical fitness and activity, and personal and family health history and health-related behavior.
Echocardiograms were performed with the Interspec XL (Apogee) Annular Phased Array echocardiographic machine with either a 5- or 3.5-MHz transducer and recorded on VHS videocassettes. The participants were required to rest for 5 minutes before data collection. Echocardiographic examinations were done with the participants in supine position with a pillow under the right shoulder. The heart was imaged with 2D echocardiography in the parasternal long-axis view, parasternal short-axis view, apical view, subxiphoid views, and suprasternal notch image. M-mode echocardiography, 2D, and 2D directed pulsed-wave Doppler recordings were obtained by standard methods,6 7 and measurements were made online with the Interspec Apogee measurement software package. M-mode measurements followed the standards of the American Society of Echocardiography (ASE).8 Eight M-mode echocardiographic measurements and 8 Doppler measurements were specified as the core measurements in the study protocol. They are aortic root diameter, left atrial diameter, end-diastolic interventricular septal thickness, end-diastolic left ventricular (LV) diameter, end-diastolic LV posterior wall thickness, end-systolic interventricular septal thickness, end-systolic LV diameter, end-systolic LV posterior wall thickness, right ventricular (RV) preejection period, RV ejection time, isovolumetric relaxation time (IVRT), aortic peak velocity, aortic time-velocity integral, heart rate, LV preejection period, and LV ejection time. These 16 core original measurements and LV mass (LVM), calculated from the formula reported by Devereux et al,9 10 were included for quality assessment.
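As an illustration of how LVM can be derived from the M-mode core measurements, the following is a minimal sketch. The text does not print the formula it used, so the sketch assumes the widely cited ASE-corrected cube formula attributed to Devereux et al (0.8 times the ASE-cube estimate plus 0.6 g, with dimensions in centimeters). The function and parameter names are hypothetical.

```python
def lv_mass_grams(ivs_d_cm: float, lvid_d_cm: float, pw_d_cm: float) -> float:
    """Estimate LV mass (g) from end-diastolic M-mode dimensions in centimeters.

    ASSUMPTION: ASE-corrected cube formula commonly attributed to Devereux et al.
    The article cites the formula but does not reproduce it.
    """
    # ASE-cube term: myocardial shell volume (cm^3) times muscle density 1.04 g/cm^3.
    shell_volume = (ivs_d_cm + lvid_d_cm + pw_d_cm) ** 3 - lvid_d_cm ** 3
    ase_cube_mass = 1.04 * shell_volume
    # Correction that scales the unmodified ASE-cube estimate down.
    return 0.8 * ase_cube_mass + 0.6


# Hypothetical dimensions: septum 0.8 cm, LV diameter 4.5 cm, posterior wall 0.8 cm.
example_mass = lv_mass_grams(0.8, 4.5, 0.8)
```

Because all 3 dimensions enter the formula through a cubed sum, small reading differences in wall thickness or cavity diameter are amplified in the calculated mass, which is one reason LVM is assessed separately from the 16 directly measured quantities.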
Quality assessment was based on samples reviewed from 3600 studies completed by October 1994 and recorded on videotapes. Altogether, 4 persons were trained and certified as project echocardiographic technicians who performed the studies, although only 2 were active in the project at any given time. Three samples of echocardiograms were chosen for quality assessment to evaluate (1) intraobserver variability (sample 1), (2) interobserver variability (sample 2), and (3) comparability between measurements of field echocardiographic technicians and reference readings by experienced technicians or the pediatric echocardiographer at Texas Children’s Hospital (sample 3). No single echocardiographic recording was included in >1 sample. For sample 1, 20 echocardiograms from each of the 2 current echocardiographic technicians (40 total) were randomly selected to be reread by the same project technician. For sample 2, a total of 80 echocardiograms, 20 from the files for each of the 4 echo technicians, were selected and remeasured by 1 of the 2 current technicians, assigned to exclude their own originally measured echocardiograms. For sample 3, 5% of the echocardiograms from each of the 4 echo technicians, 182 in all, were randomly selected and reviewed at Texas Children’s Hospital by an experienced technician or a pediatric echocardiographer. All remeasurements were made with the technician blinded to the original results.
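The sampling scheme above can be sketched as a stratified random draw without overlap. This is only an illustrative sketch: the recording identifiers, the equal file sizes per technician, and the function name are hypothetical, and the assignment of rereaders for sample 2 is not modeled.

```python
import random

def draw_disjoint_samples(ids_by_tech, current_techs, n_intra=20, n_inter=20,
                          frac_ref=0.05, seed=0):
    """Draw three non-overlapping random samples of echocardiogram IDs."""
    rng = random.Random(seed)
    # Shuffle a private copy of each technician's file so draws are random.
    pool = {tech: rng.sample(list(ids), len(ids)) for tech, ids in ids_by_tech.items()}
    # Size of the reference sample is a fraction of each technician's full file.
    n_ref = {tech: round(frac_ref * len(ids)) for tech, ids in ids_by_tech.items()}

    def take(tech, n):
        # Remove and return the next n IDs, so no recording can be drawn twice.
        picked, pool[tech] = pool[tech][:n], pool[tech][n:]
        return picked

    techs = list(pool)
    sample1 = [e for t in current_techs for e in take(t, n_intra)]  # intraobserver
    sample2 = [e for t in techs for e in take(t, n_inter)]          # interobserver
    sample3 = [e for t in techs for e in take(t, n_ref[t])]         # reference reading
    return sample1, sample2, sample3


# Hypothetical files: 4 technicians with 900 recordings each (3600 in total).
files = {tech: [f"{tech}-{i:04d}" for i in range(900)] for tech in "ABCD"}
s1, s2, s3 = draw_disjoint_samples(files, current_techs=["A", "B"])
```

With these hypothetical equal-sized files the third sample has 180 recordings rather than the 182 reported, because the actual number of studies per technician is not given in the text.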
Completeness and quality of all echocardiograms were determined at the end of each study. Among the total 302 echocardiograms selected for quality assessment, 3 at original measurement and 6 at repeated measurement were rated clinically as suboptimal because of poor acoustic windows in the participants. Pediatric cardiologists at Texas Children’s Hospital concluded that even those studies considered to have imperfect image quality clinically permitted various measurements that provided data of acceptable quality. Thus, all 302 echocardiograms were included in the quality assessment.
Analyses were performed with the SPSS statistical package.11 Differences between the repeated measures (observation 1 minus observation 2) were first plotted against the mean of the repeated measures (observation 1 plus observation 2) divided by 2.12 Means and SDs of the differences were then calculated, and the corresponding paired t tests were performed. Means and SDs were also computed for the original and repeated measurements.
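The analysis steps above can be sketched in a few lines. The study used SPSS; the Python below is only an illustrative equivalent using the standard library, and the function name, the returned keys, and the example readings are hypothetical.

```python
import math
import statistics

def paired_agreement(first, second):
    """Summarize agreement between two readings of the same echocardiograms."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]        # observation 1 minus observation 2
    means = [(a + b) / 2 for a, b in zip(first, second)]  # x-axis of the difference plot
    mean_diff = statistics.mean(diffs)                    # relative bias
    sd_diff = statistics.stdev(diffs)                     # random error (sample SD, n - 1)
    n = len(diffs)
    # Paired t statistic with n - 1 degrees of freedom; undefined if the SD is zero.
    t_stat = mean_diff / (sd_diff / math.sqrt(n)) if sd_diff > 0 else float("nan")
    return {
        "points": list(zip(means, diffs)),  # (mean, difference) pairs to plot
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "t": t_stat,
        "df": n - 1,
    }


# Hypothetical end-diastolic LV diameters (cm), original and repeated readings.
result = paired_agreement([4.5, 4.7, 4.4, 4.9], [4.4, 4.8, 4.4, 4.7])
```

The `points` list gives the coordinates for the difference-versus-mean plot, and the t statistic would be compared with a t distribution on `df` degrees of freedom to obtain a P value.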
All 3 samples consisted of echocardiograms from project participants 10 to 17 years of age (mean, 12.7 to 13.2 years). Twenty-two (55%) were male and 34 (85%) were nonblack in sample 1. Thirty-six (45%) were male and 75 (93.8%) were nonblack in sample 2. The corresponding numbers in sample 3 were 93 (51%) and 163 (89.6%), respectively.
Seventeen plots of the differences between repeated measures (observation 1 minus observation 2) versus the mean of the repeated measures (observation 1 plus observation 2) divided by 2 for each of the 3 samples were generated (plots not shown). These plots provided visual information on the magnitude of disagreement, both random error and systematic bias, and on the relationship of the differences and size of the measurements. Most plots revealed uniform distribution patterns with most points on or near the zero-difference reference line. No easily discernible dependence of the differences on measurement size was observed.
Table 1 displays the results of a comparison of original and repeated measurements by the same Project HeartBeat! echocardiographic technicians for the 16 core measurements. All the means of differences in Table 1 were very small compared with the magnitude of the measurements, and paired t tests suggested no statistically significant systematic differences between the original and repeated measurements. SDs of the differences were also small, indicating high reproducibility of within-observer measurements.
Results of interobserver comparisons are shown in Table 2. As expected, most means and SDs of the differences were larger than those of intraobserver differences. Differences between the first and second readings of end-diastolic septal thickness, end-systolic LV posterior wall thickness, and 5 Doppler measurements were statistically significant. However, all these differences were small with very limited clinical significance. Relatively large SDs of the differences were found for end-diastolic LV posterior wall thickness, end-systolic septal thickness, and end-systolic LV posterior wall thickness.
Repeated measurements done at Texas Children’s Hospital were compared with the original measurements by the Project HeartBeat! echo technicians. Results are shown in Table 3. The means and SDs of the differences were very close to those of intraobserver differences and were smaller than those of the interobserver comparison. Differences of 0.19 mm in end-diastolic LV diameter, −0.25 mm in end-systolic septal thickness, and −0.003 second in RV ejection time were found to be statistically significant. These differences, however, were minimal with limited clinical significance. The small SDs of the differences also suggested high comparability between the echocardiographic measurements made by project echo technicians and those made by experienced technicians or the pediatric echocardiographer at Texas Children’s Hospital.
The intraobserver, interobserver, and intersite comparisons of LVM are presented in Table 4. Means of differences were –1.82, 4.50, and 0.0013 g for the paired measurements by the same project observers, by different project observers, and by project echo technicians and Texas Children’s Hospital echocardiographers, respectively. Mean differences were small, and none was statistically significant. The corresponding SDs were 18.79, 24.16, and 12.35 g, respectively, which were smaller than the corresponding SDs of original and repeated LVM measurements.
Measurements from a valid measuring process should be both accurate and reproducible. In epidemiology, accuracy of a measurement process is defined as the degree to which a measurement represents the true value of the attribute being measured. Reliability and, in common usage, both repeatability and reproducibility refer to the capacity to produce the same result on each occasion under identical conditions.13 A measurement process is accurate, or unbiased, if the expected value of the measurement is the true value of the parameter being estimated. A reliable process may produce the same result each time but may not give accurate measurements. A measure of accuracy is the mean difference between a series of replicate determinations and the same true quantity. The reproducibility of the process is measured by the SD of the differences between the replicate determinations and the true quantity. A larger mean difference indicates larger systematic bias and lower accuracy of the process. Similarly, a larger SD suggests larger random errors and lower reproducibility.
The accuracy of a particular measurement process may be assessed only if the “true” value or “gold standard” value is known. Such true values were not known in the present study. An alternative method was then used to estimate the “relative” accuracy of the measurements. A series of paired determinations were obtained, and mean and SD of the differences were calculated. If the paired observations were the same except for random error, the mean of the differences would be expected to be 0, and the paired t test was used to test this hypothesis. Thus, the mean of differences offered a measure for average systematic differences (relative bias) between original and repeated measurements. The SD of the paired differences, which indicated variability of the difference between the first and second measurements and thus provided estimates of random errors, was the measure of reproducibility of the measurement process. Independence of the differences and size of the measurements is the prerequisite for the analyses described above.12
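The quantities described in this paragraph can be written out explicitly. The notation below is an assumption introduced for illustration and is not taken from the original text: $x_{i1}$ and $x_{i2}$ denote the original and repeated measurements on echocardiogram $i$, and $n$ is the number of pairs.

```latex
\begin{align}
  d_i     &= x_{i1} - x_{i2}, \qquad i = 1, \dots, n
          && \text{(paired difference)} \\
  \bar{d} &= \frac{1}{n} \sum_{i=1}^{n} d_i
          && \text{(relative bias)} \\
  s_d     &= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( d_i - \bar{d} \right)^2}
          && \text{(random error, reproducibility)} \\
  t       &= \frac{\bar{d}}{s_d / \sqrt{n}}
          && \text{(paired $t$ statistic, $n-1$ degrees of freedom)}
\end{align}
```

Under the null hypothesis that the paired observations differ only by random error, the expected value of $\bar{d}$ is 0 and $t$ follows a $t$ distribution with $n-1$ degrees of freedom.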
The correlation coefficient between original and repeated measurements, often used in reproducibility studies of echo measurements, was not adopted in the present study because it is a measure of association, which is dependent on both the variation between study subjects (ie, between the true values) and the variation within study subjects (measurement error).12 A high correlation coefficient does not necessarily indicate good agreement. An example of this is the unmodified ASE-cube LVM formula, which systematically overestimated LVM by 25%, with a correlation coefficient between calculated and necropsy LVM of 0.90.9
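A small numerical sketch makes the distinction between association and agreement concrete. The values are hypothetical and deliberately idealized: a method that overestimates every value by exactly 25% is perfectly correlated with the reference (the necropsy comparison cited above reported 0.90), yet its mean difference shows a large systematic bias.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


reference = [80.0, 100.0, 120.0, 140.0, 160.0]  # hypothetical reference LVM values, g
inflated = [1.25 * v for v in reference]        # method that overestimates by 25%

r = pearson_r(reference, inflated)  # association between the two methods
mean_diff = statistics.mean([b - a for a, b in zip(reference, inflated)])  # bias, g
```

Here the correlation is perfect while the mean difference is 30 g, so the correlation coefficient alone would hide the bias that the mean of the paired differences reveals.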
The present quality assessment addressed 3 questions: How consistent were the Project HeartBeat! field echo observers in measuring the cardiac structure and function (intraobserver variation); were there any differences in measurements between field observers (interobserver variation); and to what extent did the measurements by the field observers agree with those by experienced clinical echocardiographic technicians? For all 3 questions, both bias and random error were at issue. Our study demonstrated that the echocardiographic measurements were performed by each project echo technician in a highly consistent manner, the extent of interobserver variation was acceptable, and measurements by Project HeartBeat! technicians agreed with those by experienced technicians and the pediatric echocardiographer at Texas Children’s Hospital.
Available results from other studies on intraobserver and interobserver variations of echocardiographic measurements are limited. Most studies conducted earlier varied greatly in study design and analysis method, making direct comparison difficult.8 14 15 16 17 18 Ladipo et al,14 evaluating measurements of 10 blind duplicated tracings by 3 observers, reported intraobserver mean absolute difference of 0.7 to 1.2, 0.2 to 0.8, 0.3 to 0.4, and 0.4 to 0.8 mm for diastolic and systolic LV diameters, diastolic LV posterior wall thickness, and diastolic interventricular septal thickness, respectively. Having 3 investigators measure 7 ventricular parameters of 20 randomly selected echocardiograms twice, Valdez et al16 showed that significant intraobserver difference was found in only 1 person in the measurement of end-diastolic LV posterior wall thickness. Small means and SDs of intraobserver differences in our study (Table 1) showed that each technician read the same echocardiograms consistently the second time, achieving a high degree of agreement. Schieken et al17 obtained intraobserver measurement errors of aortic root diameter, left atrial diameter, end-diastolic interventricular septal thickness, end-diastolic LV diameter, end-diastolic LV posterior wall thickness, end-systolic LV diameter, and LV ejection time in 20 healthy children 6 to 16 years of age. The errors were reported as 0.5, 0.6, 0.6, 1.3, 0.6, and 1.0 mm and 0.01 second, respectively.17 Although analysis methods differed, the SDs reported here should be comparable to about twice the errors reported by Schieken et al.17 Thus, the “errors” for intraobserver variability (SDs in Table 1 divided by 2) for the same measurements in the present study were either smaller or similar compared with their findings.
Previous studies on interobserver variation of echocardiographic measurement have shown different results.8 15 16 17 18 De Leonardis and Cinelli15 compared measurements of aortic root diameter, left atrial diameter, end-diastolic septal and posterior wall thicknesses, and end-diastolic and end-systolic LV diameters by 2 experienced interpreters on 50 routinely performed M-mode echocardiograms and concluded that no significant interobserver variability was found for any of the measured echocardiographic parameters. Valdez et al16 found statistically significant differences in measurements of end-diastolic septal thickness, end-diastolic and end-systolic LV posterior wall thicknesses, and end-diastolic and end-systolic LV diameters by 3 observers on 20 echocardiograms. The maximum mean difference was 2 mm. They concluded that the differences were not clinically significant. In our interobserver comparison, differences of 0.39 mm for end-diastolic septal thickness and 1.28 mm for end-systolic LV posterior wall thickness showed statistical significance. Comparison between Project HeartBeat! observers and Texas Children’s Hospital observers revealed statistically significant differences of 0.19 mm for end-diastolic LV diameter and −0.25 mm for end-systolic septal thickness. Magnitudes of all these differences, however, were small compared with available results.
Sahn et al8 evaluated measurements on 5 echocardiograms by 76 observers for aortic root diameter, left atrial diameter, diastolic and systolic LV diameters, diastolic interventricular septal thickness, and diastolic LV posterior wall thickness and showed minimum mean percent uncertainties of 13.5%, 11.2%, 8.2%, 14%, 19.5%, and 23.4%, respectively, when the ASE convention was used for measurement. The percent uncertainty was calculated for each measurement on each recording as the 95th percentile confidence limit, determined as 1.97 SD, divided by the mean for the measurement times 100. Schieken et al17 reported interobserver measurement precision for aortic root diameter, left atrial diameter, end-diastolic interventricular septal thickness, end-diastolic LV diameter, end-diastolic LV posterior wall thickness, end-systolic LV diameter, and LV ejection time of 0.5, 0.6, 0.9, 2.3, 1.6, and 1.1 mm and 0.1 second, respectively. Again, the SDs reported here should be comparable to about twice the precision reported by Schieken et al.17 The estimates of “precision” for interobserver variability (SDs in Table 2 divided by 2) were larger in the present study for aortic root diameter and left atrial diameter but smaller for end-diastolic septal thickness and end-diastolic and end-systolic LV diameters, whereas they were similar for end-diastolic LV posterior wall thickness. By the same comparison, the estimates of precision for intersite measurement variability (SDs in Table 3 divided by 2) for the same echo parameters in the present study were either similar or smaller.
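The percent uncertainty described above can be computed as in the following sketch. The multiplier 1.97 is the one quoted in the text; the use of the sample SD (n − 1 denominator), the function name, and the example readings are assumptions made for illustration.

```python
import statistics

def percent_uncertainty(readings, k=1.97):
    """Return k * SD / mean * 100 for one measurement on one recording.

    k = 1.97 is the multiplier quoted in the text for the 95% limit.
    ASSUMPTION: the SD is the sample SD (n - 1 denominator).
    """
    mean = statistics.mean(readings)
    sd = statistics.stdev(readings)
    return k * sd / mean * 100.0


# Hypothetical aortic root diameters (mm) read by 5 observers on one recording.
uncertainty = percent_uncertainty([30.0, 32.0, 28.0, 31.0, 29.0])
```

Because the result is expressed relative to the mean, small structures such as wall thicknesses yield larger percent uncertainties than cavity diameters for the same absolute spread between observers.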
LVM has been repeatedly associated with CVD death in adults. Use of echocardiographic measurement of LVM as an outcome measure in epidemiological investigation of hypertension still poses a challenge regarding measurement precision and comparability across studies.19 Reproducibility of measurement of LVM has been studied by use of a variety of methods.4 15 20 A recent report from the Treatment of Mild Hypertension Study showed acceptable measurement accuracy and reproducibility in adults.18 The means and SDs of intraobserver difference in LVM were reported from that study to be –0.0 and 20.4 g for 1 cardiologist and –6.1 and 26.8 g for another. The means and SDs for interobserver difference were 7.9 and 34.7 g between the 2 cardiologists and 5.7 and 46.1 g between the cardiologist and echo technicians. Means and SDs of intraobserver and interobserver measurement differences from our study in healthy children and adolescents were either similar or smaller. Minimal mean differences and small variation of the paired measurements by project echo technicians and experienced technicians or pediatric echocardiographers at Texas Children’s Hospital further suggest that echocardiographic measurement of LVM from population studies could be comparable to that from a clinical setting.
Doppler measurements of RV preejection period, RV ejection time, IVRT, aortic peak velocity, aortic time-velocity integral, heart rate, LV preejection period, and LV ejection time were included in the present analysis. Except for LV ejection time, no earlier results were available for between-study comparison. In our results, no significant difference was found for intraobserver comparison. Interobserver comparison showed significant differences only for RV preejection period, RV ejection time, IVRT, LV preejection period, and LV ejection time, and the interinstitutional comparison showed a significant difference only for RV ejection time. The magnitudes of these differences, however, were trivial. Overall, the results showed good agreement between original and repeated measurements.
Variation of echocardiographic measurements arises from a variety of sources.1 4 Several factors can affect image quality and thus influence the definition of anatomic structures: participant’s body habitus; respiratory status and cooperation; the technician’s experience in recognizing the correct image signal and Doppler position and envelope, along with transducer orientation and placement; and the technician’s familiarity with echocardiographic equipment. Although criteria regarding these factors had been defined in the study protocol, their effects on measurement variability were not evaluated in the present study. The proportion of adequate echocardiograms in population studies has been reported to range from a minimum of 28% during the first 5 months of a population study to 93% in a recent report.2 4 20 Although several individual measurements were not possible and thus not included in the present analysis, all echocardiograms were included in the quality assessment study. This fact, with the intention to include as many measures as possible for each cardiac parameter, may have sacrificed reproducibility of measurements from a few technically imperfect echocardiograms, resulting in increased differences of the paired measurements and SDs of the differences.
We conclude that the echocardiographic measurements taken from healthy children in a longitudinal study can be made accurately with acceptable reproducibility. Echocardiographic measurements from an epidemiological study can compare favorably with those taken in a clinical setting with experienced technical support. Thus, these measurements can be applied meaningfully to clinical observation.
Cooperative Agreement U01-HL-41166, from the National Heart, Lung, and Blood Institute, provided major funding for the project. Support of the Centers for Disease Control and Prevention, through the Southwest Center for Prevention Research (U48/CCU609653), and Compaq Computer Corporation is also gratefully acknowledged, as is that of the University of Texas–Houston Health Science Center, School of Public Health. We acknowledge with gratitude the contribution of time and dedication of each Project HeartBeat! participant and family. Cooperation of the Conroe Independent School District and generous support of The Woodlands Corporation are deeply appreciated. The Woodlands and Conroe advisory committees have assisted greatly in the planning and conduct of the project.
- Received December 8, 1998.
- Revision received January 5, 1999.
- Accepted March 22, 1999.
Devereux RB, Liebson PR, Horan MJ. Recommendations concerning use of echocardiography in hypertension and general population research. Hypertension. 1987;9(suppl II):II-97–II-104.
Savage DD, Garrison RJ, Kannel WB, Anderson SJ, Feinleib M, Castelli WP. Considerations in the use of echocardiography in epidemiology: the Framingham study. Hypertension. 1987;9(suppl II):II-40–II-44.
Schieken RM. Measurement of left ventricular wall mass in pediatric populations. Hypertension. 1987;9(suppl II):II-47–II-52.
Wallerson DC, Devereux RB. Reproducibility of echocardiographic left ventricular measurements. Hypertension. 1987;9(suppl II):II-6–II-18. Review.
Labarthe DR, Nichaman MZ, Harrist RB, Grunbaum JA, Dai S. Development of cardiovascular risk factors from age 8 to 18 in Project HeartBeat!: study design and patterns of change in plasma total cholesterol concentration. Circulation. 1997;95:2636–2642.
Feigenbaum H. Echocardiography. 4th ed. Philadelphia, Pa: Lea and Febiger; 1986.
Snider AR, Serwer GA. Echocardiography in Pediatric Heart Disease. St Louis, Mo: Mosby Yearbook; 1990.
Sahn DJ, DeMaria A, Kisslo J, Weyman A. Recommendations regarding quantitation in M-mode echocardiography: results of a survey of echocardiographic measurements. Circulation. 1978;58:1072–1083.
Devereux RB. Detection of left ventricular hypertrophy by M-mode echocardiography: anatomic validation, standardization, and comparison to other methods. Hypertension. 1987;9(suppl II):II-9–II-26. Review.
SPSS Base System: Syntax Reference Guide. Release 6.0. Chicago, Ill: SPSS Inc; 1993.
Last JM. A Dictionary of Epidemiology. New York, NY: Oxford University Press; 1993.
Ladipo GIA, Dunn FG, Pringle TH, Bastian B, Lawrie TDV. Serial measurements of left ventricular dimensions by echocardiography: assessment of week-to-week, inter- and intraobserver variability in normal subjects and patients with valvular heart disease. Br Heart J. 1980;44:284–289.
Valdez RS, Motta JA, London E, Martin RP, Haskell WL, Farquhar JW, Popp RL, Horlick L. Evaluation of the echocardiogram as an epidemiologic tool in an asymptomatic population. Circulation. 1979;60:921–929.
Schieken RM, Clarke WR, Mahoney LT, Lauer RM. Measurement criteria for group echocardiographic studies. Am J Epidemiol. 1979;110:504–514.
Mahoney LT, Clarke WR, Knoedel D, Lauer RM. Echocardiographic reproducibility and precision among multiple sonographers: implications for population studies. Circulation. 1989;80(suppl II):II-543. Abstract.