Dr Martin Frankel, Dr Julian Baim, Dr Michal Galin, Joe Agresti and Konstantin Augemberg, Mediamark Research & Intelligence
The visual capabilities offered by the internet provide a platform by which magazine readers may be queried about their viewing, noting and recognizing of ad copy appearing in specific magazine issues. However, it is well known that “samples” used in these studies may be subject to substantial bias arising from the non-probability nature of the sample selection process. Furthermore, when correctly computed, the response rates on many internet panels are quite low.
In those situations where certain key variables are statistically linked (i.e. strongly correlated) with both sample selection bias and key substantive outcomes, these variables may be used to adjust or calibrate the estimates. This adjustment is sometimes known as post-stratification in traditional full-probability sampling and as model-based estimation in model-based (non-probability) sampling.
In examining a large number of internet samples used to collect data on ad-noting and ad recognition, it has been found that these outcome measures are correlated, to varying degrees, with gender, time spent reading, place of reading, percent of pages opened, and frequency of reading. Furthermore, we have found that the distribution of these variables among internet respondents is substantially different from that found in traditional full-probability surveys.
We have developed a series of sample weighting procedures to remove a substantial amount of the “selection bias” linked to these reading qualities. This bias reduction step results in meaningful changes in readership ad-noting and ad identification.
This paper will show, using actual data, how our approach to bias reduction weighting was developed and how it impacts the outcomes of ad-noting and identification. In deciding to apply these weights we have adopted a standard minimization-of-mean-squared-error perspective. That is, any weighting which increases random error must be offset by bias reduction. Bias reduction occurs when changes in the survey estimates are observed. Within a single magazine issue, the overall changes in ad noting scores are not typically large. However, there are ads for which noting scores do show substantial change. These changes are consistent with expectations linked to the adjustment measures. Furthermore, while an outside validation of the model-based estimates has not been undertaken, our examination of the overall impact across magazines is highly consistent with what would be expected on the basis of the variables involved. Thus, while we do not claim that our results are externally validated, we are comfortable in saying that the adjustments are in the expected direction and appear to make sense.
“Starch Scores” for print advertising have been part of the advertising vocabulary since the 1920s.1
When first introduced, Starch Scores were obtained from an in-person sample of individuals who were asked to indicate if a print ad had been noted and associated with a particular (i.e. the actual) ad sponsor. More recently, the Starch methodology has been adapted for use on the internet. Since internet surveys are not generally based on probability samples of the full population, but rather samples of individuals who have “opted in” or agreed to be part of a sample panel, an obvious question is that of sample validity. Specifically, can we have confidence that the results of these samples are consistent with the results that would be obtained from a full population based probability sample? Based on our review of the literature it appears that in person Starch administration was based on the quota sampling that was in common use for marketing research in the early 1940s.2
1 William Leiss, Stephen Kline et al., “Social Communication in Advertising,” in Richard W. Pollay, ed., Information Sources in Advertising History, Westport, CT: Greenwood Press, 1979.
2 T. Mills Shepard, “The Starch Application of the Recognition Technique,” The Journal of Marketing, Vol. 6, No. 4, Part 2 (Apr., 1942), pp. 118-124.
One obvious approach to assessing the comparability of an internet-based administration with a door-to-door full-probability administration would involve pairing the current methodology with a large-scale door-to-door full-probability study. While this approach is theoretically correct, such a study would be not only cost prohibitive but, most likely, operationally impossible. As a more feasible alternative we undertook an examination of the statistical behavior of the basic Starch measures, with the goal of understanding the basic reading-behavior covariates that seem to drive (or at least vary with) ad recognition and noting. On the basis of this examination we determined that, without adjustment, our basic Starch levels were not consistent with those that would be obtained from a door-to-door full-probability sample. Furthermore, these analyses suggested that estimates obtained in door-to-door administrations might also have suffered from some bias due to the timing of the interview relative to the publication date.
The results of our analyses suggest a weighting process that adjusts the sample of data collected from our internet administration to conform more closely to what would be obtained from a strict probability sample implemented under ideal conditions. The development of these weights was based on our examination of the sample characteristics that are drivers of ad recognition.
From the standpoint of statistical and sampling theory, the approach outlined above falls under the heading of Model Based (or Assisted) sampling and estimation. The translation of this statistical theory into a sampling and estimation approach involves the selection of a sample from two different internet sample panels based on the reporting of prior reading in the appropriate magazine category. It also involves the use of the MRI full-probability national readership survey and the MRI “issue specific” survey to develop respondent-level estimation weights that “reduce” the sample selection bias of the basic Starch internet sample. While we know that the use of this model-based adjustment weighting cannot fully account for the lack of full-probability randomization, we believe that these statistical methods may be used to reduce both sample selection and sample estimation bias.
We address this issue by describing a series of estimation procedures that attempt to “correct” for observable “sample bias” and to examine the results of these procedures in terms of our expectations about the readership experience.
The development, implementation and evaluation of the model based (bias reducing) adjustment procedure took place in three basic phases or steps.
Step 1: Understanding the Drivers of Ad Recognition and Noting.
We made use of a large number of Starch studies to explore and understand some of the basic drivers of ad recognition and noting. This step involved the use of multivariate regression (GLM-OLS) on a data set of more than n=30,000 respondents across multiple issues of 100 magazines.
Step 2: Developing the weighting process.
This step involved comparing the distribution of the “drivers” of ad noting found in the internet samples with the distribution of these drivers obtained in full-probability sample and developing a weighting scheme that adjusts the distribution of drivers in the internet sample so they mirror those found in a full-probability sample.
Step 3: Examination of the results of the weighting adjustment.
We examined the magnitude of the adjustment across all magazines as well as within various segments of magazines and types of ads.
STEP 1: UNDERSTANDING THE DRIVERS OF AD RECOGNITION AND NOTING
The basic rationale for Starch studies is the belief that not all advertisements in a given magazine have the same impact on all readers. In translating this observation into actual survey measurements, the Starch approach has focused on three basic steps: Ad Noting, Recognition and Actions Taken.
In order to understand some of the ways in which internet samples and in-person samples might produce different Starch measures we focused on readership characteristics associated with the first step in this measurement process, that of Ad-Noting.
It should be understood that our basic assumption is that the single most important driver of ad-noting is the creative content of the ad itself. This includes both the topic and how it is presented on the page (in terms of pictures, text, and layout). However, based on general beliefs among individuals involved in print media and on respondent reports about how they read both magazine editorial and ads, we also assumed (subject to our analyses) that there would be a number of secondary factors (in addition to the creative content) influencing and associated with Ad-Noting and subsequent Recognition.
With this observation in mind we examined the variation in the propensity to note ads among more than 30,000 respondents in approximately 100 recently conducted Starch studies. This examination made use of OLS regression analysis in which the outcome variable was the probability of ad noting and the predictor variables were basic respondent demographics, reported readership behavior, and the individual magazine titles.
The basic regression model is of the form
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
where Y is the outcome measure, “propensity to note an ad,” and X1, X2, X3, …, Xk are the predictor variables (demographics, reading behaviors, and magazine titles).
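As a hedged illustration of fitting a model of this form, the sketch below uses synthetic data (not the actual Starch respondent file; all coefficients and dimensions are made up) and ordinary least squares:

```python
import numpy as np

# Sketch only: synthetic data standing in for the respondent file.
rng = np.random.default_rng(0)
n, k = 1000, 4                       # respondents, predictor variables
X = rng.random((n, k))               # stand-ins for demographics/behaviors
beta_true = np.array([0.30, 0.10, 0.05, -0.04, 0.08])  # B0..B4 (invented)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.05, size=n)

# Append an intercept column and solve the least-squares problem.
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# R-squared: share of the variance in noting propensity the model explains.
resid = y - Xd @ beta_hat
r_squared = 1 - resid.var() / y.var()
```

With real data the interest lies in the coefficients (which reading behaviors drive noting) and in the R-squared, exactly as discussed in the analysis below.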
The respondents used in this analysis were those who participated in one of approximately 100 Starch studies conducted over the internet. The respondents for these studies are selected from among opt-in internet panel members who indicated that they generally read or subscribe to certain magazines (both E-Rewards and Survey Sampling International panels are used).
For a particular study of the ads in a specific magazine issue, panel members who indicated that they are readers of the magazine are sent an email invitation to take a screening interview. This screening interview is used to determine if the respondent read the particular issue that is being studied.
Those who qualify as readers are shown a series of 25 ads that appeared in the issue and are asked if they remember seeing each ad and whether they associate it with a particular (correct) advertiser. Those who give a positive first response are counted as “noting the ad.” Those who give two positive responses are counted as having “associated the ad.” Since most magazines carry more than 25 ads, separate qualifying samples of individuals are used for each group of up to 25 ads. The typical Starch study uses a sample of 125 respondents for each group of ads.
For the regression analysis, an average noting score was calculated for each respondent by dividing the number of ads noted by the number of ads shown to the respondent. This score may be viewed as a probability or propensity that the respondent noted an ad in the particular issue of the magazine. This score has a range of 0.0 (no ads noted) to 1.0 (all ads noted). This was the outcome measure Y.
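The respondent-level outcome measure is a simple ratio; a minimal sketch:

```python
def noting_score(ads_noted: int, ads_shown: int) -> float:
    """Per-respondent noting propensity: ads noted / ads shown (0.0-1.0)."""
    if ads_shown <= 0:
        raise ValueError("respondent must have been shown at least one ad")
    return ads_noted / ads_shown

# e.g. a respondent who noted 5 of the 25 ads shown scores 0.2
```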
The potential predictor variables to be examined and assessed in the regression consisted of basic demographics, readership characteristics and behaviors, as well as the individual magazine titles.
The demographic variables reflected the demographic characteristics of the respondent. These characteristics were gender, age, education level, income, marital status, race, Hispanic origin, and employment status.
The readership characteristics reflected the respondents’ reported behavior with respect to frequency of reading the particular magazine title (1, 2, 3, or 4 of 4 issues on average), time spent reading the issue (under 30 minutes, 30-60 minutes, more than 60 minutes), how many of the pages were opened (from just skimmed to the entire issue), the source of the copy (subscriber, newsstand, other), where the issue was read (in home, out of home), and how long the respondent has been a reader of the magazine.
Finally, each of the 59 different magazine titles was entered into the equation as a dummy variable. For example, the dummy variable People was set to 1 for respondents who were asked about People Magazine and set to 0 for respondents who were asked about some other magazine. As is standard practice, only fifty-eight (58), rather than fifty-nine (59), magazine variables were created in order to avoid the singularity condition.
The overall regression model involved 75 variables and was based on a sample of 30,555 respondents. The regressions were run both using all variables and using a stepwise regression procedure. The results of this regression (coefficients) are shown in Appendix A. To conserve space only the results of the full regressions are shown. In general, however, because of the large sample sizes the results of the full and stepwise regressions were almost identical.
For the full set of variables, both full and stepwise regressions produced a multiple R-squared (adjusted) of 0.20 or 20%. This is an important result since it indicates that a large portion (the remaining 80%) of the variation in ad-noting is the result of factors other than demographics, reading behavior and magazine context. It is assumed that a substantial portion of this remaining variation is due to the ads themselves.
However, we have also learned that about 20% of the variation is due to factors that we might call “non-ad-creative” drivers. This suggests that differences between the sample and population with respect to these “non-ad-creative” drivers will probably produce bias3 in our sample estimates. However, we may be able to reduce some of this bias by appropriate sample weighting; that is, by making the sample more closely resemble the population with respect to these drivers.
For example, our analysis shows that factors such as time spent reading, percentage of pages opened, number of issues out of four read, and gender are “drivers” of ad noting. If the representation of these factors in the sample is not in line with that of the full population, our estimates will probably be “biased.” If we are able to correct the sample distribution of some of these drivers, we may be able to eliminate some of this bias.
3 The term bias is used in the standard statistical sense. For a particular estimator f of population parameter F, the bias of f is defined as Bias(f) = E(f) – F, where E(f) is the expectation of f over the full sampling distribution.
Once we identified the non-ad-creative drivers of ad-noting we were ready to move to our next step. This step consisted of first determining whether our internet samples were properly representative of these driver distributions and, if not, developing corrective weights.
STEP 2: DEVELOPING THE WEIGHTING PROCESS
SELECTION OF VARIABLES FOR WEIGHTING
On the basis of previous research we were aware that the demographic composition of internet samples is generally not the same as that found in well-executed full-probability surveys of the full population. We were also aware that the demographic composition of sample subsets of readers of specific magazines did not agree with the compositions found in full-probability samples. We also had reason to suspect that internet surveys would be somewhat skewed with respect to non-demographic readership characteristics of specific magazines as well.
With this in mind we examined, on a magazine by magazine basis, the distribution of readership characteristics among readers in our full-probability survey “The Survey of the American Consumer” and those found among Starch respondents.
In general, we found that when compared to full-probability samples, the internet tended to produce samples of readers who were more likely to be in-home readers, more frequent readers, and readers who looked into more of the magazine. Furthermore, readers in internet samples tended to spend more time reading than those found in full-probability samples. Additionally, depending upon the genre of the magazine, the gender distribution in internet samples tended to favor the dominant gender relative to what is found in full-probability samples. Finally, we found that there were sample composition differences (internet versus full-probability) with respect to readers’ education, employment, marital status and, to some degree, race/ethnicity.
Once we had established that internet samples produced distributions of magazine readers that were different from those found in full-probability samples, our next objective was to focus on those differences that were important to ad-noting. In carrying out this process under “ideal conditions” we would first rank-order variables by a measure of their importance as drivers and then examine the degree to which the distribution of these drivers differed between our internet samples and the full-probability sample standard. Typically this ordering is accomplished by examining the size of the “standardized” regression coefficients. In Table 1 we show both unstandardized and standardized coefficients for all demographic and readership characteristics variables. For this analysis we excluded the individual magazine titles.
TABLE 1 Coefficients for Regression excluding Individual Magazine Titles
| Model R-Square = 15.8% | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| Time Spent Reading | 6.731 | .241 | .180 | 27.969 | .000 |
| No. issues (0-4) | 4.099 | .261 | .105 | 15.714 | .000 |
| How Long Reader | .456 | .151 | .017 | 3.016 | .003 |

a. Dependent Variable: ad score
As the magnitude of the standardized coefficients indicates, our order of examination in a theoretically ideal world would start with % of Pages and continue with Time Spent Reading, Subscriber Status, No. of Issues Read out of 4, etc. However, two other considerations were taken into account (in our real world). First, while we knew that bias reduction based on sample weighting was possible, we were also aware that the number of respondents in the Starch surveys was only moderately large (approximately 125 respondents are asked about a specific ad). As a result we decided to limit the number of variables used in weighting to a maximum of three.
Second, from our development of an issue-specific magazine measure we were very aware that, along with variation in audience size, the demographic composition of a particular magazine’s audience varies from issue to issue. We found that the “readership characteristics” of readers varied from issue to issue as well. Thus, to the extent that either a demographic or a readership behavior characteristic was to be used for “weighting” the sample to agree with a more appropriate parameter, the estimate of that parameter could not properly be based on an average-issue value. Rather, the parameter estimate had to reflect the particular readership of the issue in which the ad appeared. This requirement restricted our choice of variables to those that were “consistently” measured in our national study, our issue specific study, and the Starch study.
Given these conditions, our initial choice of variables for post-adjustment weighting was number of issues read, place of reading, and gender.4 Based on a regression restricted to the variables selected for weighting plus the individual magazines (shown in Appendix B), an R-squared value of 10.0% indicates that we have captured approximately 50% of the potential “bias reduction” available through weighting (final variables-plus-magazines R-squared of 0.10 versus full variables-plus-magazines R-squared of 0.20). Furthermore, each of the three basic weighting variables shows a statistically significant impact on ad noting.
DEVELOPMENT OF POPULATION TARGETS (PARAMETERS)
While our basic analysis compared the distribution of readers found in our national probability-based Survey of the American Consumer with those obtained on the internet, we recognized that these average distributions might, in fact, change from issue to issue. We had ample evidence of this from our Issue Specific study, where large issue-to-issue differences in the gender distribution are both observed and make a great deal of sense.5 The same holds with respect to type of reader (readership behavior).
As we have noted above, rather than weighting the distribution among key drivers of noting to “averages” for the magazine, we felt it would be more appropriate to make use of the MRI issue specific study to produce target distributions for the specific issue where ads were measured. The methodology for using the issue specific study is similar to that used to derive issue specific audiences.
Specifically, the issue specific study relies on internet samples, and our estimation algorithm for deriving issue specific audiences uses not the “absolute” readership levels but rather the issue-to-issue differences in those levels. The same general method is applied in order to derive our required “issue specific” weighting parameters. In recognition of the fact that our weighting parameters of gender, number of issues read and place of reading refer to the specific issue that was used in the Starch study, the term “Composition Targeting” has been adopted to describe this issue specific process.
One of the standard outputs of the issue specific study is the issue specific gender distribution. Thus the gender distribution is available for our composition targeting weighting system without further processing. Development of the frequency of reading distribution (number of issues out of four) as well as the place of reading distribution makes use of an estimation process similar to that used to develop issue specific age and gender distributions. Specifically, the distributions of frequency of reading (among readers) and place of reading from the Survey of the American Consumer are adjusted based on the relative changes from issue to issue found in the issue specific study. If a particular issue of a magazine tends to attract less frequent readers (as it often does when it draws a larger-than-average audience), this is reflected in the target distribution for frequency of reading.
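One standard way to adjust a sample’s marginals toward such target distributions is raking (iterative proportional fitting). The sketch below is illustrative only: the respondent records, target shares, and variable names are invented, and this is not MRI’s production algorithm.

```python
# Raking (iterative proportional fitting): repeatedly rescale weights so
# each variable's weighted marginal matches its issue-specific target.
# All data below are invented for illustration.
respondents = [
    {"gender": "F", "freq": "4of4"}, {"gender": "F", "freq": "1of4"},
    {"gender": "M", "freq": "4of4"}, {"gender": "M", "freq": "1of4"},
    {"gender": "F", "freq": "4of4"}, {"gender": "F", "freq": "4of4"},
]
targets = {                      # hypothetical issue-specific targets
    "gender": {"F": 0.55, "M": 0.45},
    "freq":   {"4of4": 0.40, "1of4": 0.60},
}

weights = [1.0] * len(respondents)
for _ in range(50):              # a few dozen passes suffice here
    for var, dist in targets.items():
        total = sum(weights)
        current = {level: sum(w for w, r in zip(weights, respondents)
                              if r[var] == level) / total
                   for level in dist}
        # Rescale each respondent's weight by target share / current share.
        weights = [w * dist[r[var]] / current[r[var]]
                   for w, r in zip(weights, respondents)]
```

After convergence the weighted gender and frequency marginals match the targets simultaneously, which is the property the composition targeting weights need.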
It should be noted that the development of these issue specific targets and their application in the weighting involve time-bound processing intervals, since final results are typically delivered 6-8 weeks after the publication date of a weekly magazine.
STEP 3: EXAMINATION OF THE RESULTS OF THE WEIGHTING ADJUSTMENT
Our goal in applying weights to the internet-based Starch sample data is the reduction of bias. Based on the Mean Squared Error Model (MSEM) for evaluating the impact of weighting, we expect that a weighting process that reduces bias will result in changes in the estimates produced.6 That is, the survey estimates produced by a weighting that reduces bias will be different from those obtained without weighting. Since the application of weights increases the “random error” associated with an estimate, if there is no change in the estimate itself, then the increased error due to weighting cannot be justified.
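This trade-off can be made concrete with two small helper functions. The Kish variance-inflation approximation below is a standard survey-statistics device used here for illustration; it is not something computed in this paper.

```python
import statistics

def variance_inflation(weights):
    """Kish's approximation to the variance inflation from weighting:
    1 + (coefficient of variation of the weights) squared."""
    mean_w = statistics.fmean(weights)
    return 1 + statistics.pvariance(weights) / mean_w ** 2

def mean_squared_error(variance, bias):
    """MSE = variance of the estimate plus squared bias (see footnote 6)."""
    return variance + bias ** 2

# Equal weights inflate nothing; unequal weights raise the variance term,
# which must be offset by a reduction in the bias term for the MSE to fall.
```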
We have examined the overall impact on ad noting scores over 194 Starch studies (magazine issues) covering more than 40,000 ads. The average unadjusted ad noting score is 50.45% and the average weighted (composition-targeted) ad noting score is 48.83%. This change is not large on average, but changes in individual scores may be more substantial (upward of 10% in either direction and occasionally greater than 15% in either direction). The direction of the change between unadjusted and adjusted ad scores is entirely consistent with our expectations, since we observed that internet-based samples tend to over-represent both in-home and more frequent readers, and these two groups tend to produce higher ad noting scores. Given the corrective down-weighting of these sample groups, a decline in ad noting is entirely consistent with our expectations.
We have found that the magnitude of the overall magazine adjustment level for ad noting scores varies by publication interval. (Table 2 shows the average unadjusted and adjusted ad noting scores by publication interval.) Also shown are the minimum and maximum average (by magazine) changes in score associated with composition targeting weighting.
4 We are in the process of examining the possibility of changing the wording of certain questions in our issue specific and Starch studies in order to increase our choice of bias correction variables.
5 For example, issues of the same weekly news magazine that focus on family topics seem to attract a larger proportion of female readers while those focusing on war tend to disproportionately skew toward males.
6 The Mean Square Error Model evaluates the random error portion and the bias portion of an estimate. The mean squared error is equal to the variance of the estimate (standard error squared) plus the squared bias. Given a weighting model, the difference between the unweighted and weighted estimate provides an estimate of the bias term.
TABLE 2 Impact of Composition Targeting on Ad Noting Score Averages (by Magazine Publication Frequency)

| Frequency of publication | No. of Issues | Average Unweighted | Average Weighted | Change from Weighting | Largest Negative | Largest Positive |
|---|---|---|---|---|---|---|
We have also examined the degree of adjustment by magazine genre. In this case we find that while there is some independent impact of genre, much of the difference is driven by frequency of publication. This can be seen in the fact that for Newspaper Distributed magazines there is virtually no change, and in News and Entertainment Weeklies the overall change is minimal. We find that within these titles, changes in particular ad scores seemed to be the result of bias correction with respect to gender and frequency of reading. Ads which differentially appeal to more frequent and on-gender readers show declines, while ads that appeal to less frequent and off-gender readers show increases.
Impact of Composition Targeting on Ad Noting Score Average by Magazine Genre

| Magazine Genre | No. of Issues | Average Unweighted | Average Weighted | Change from Weighting |
|---|---|---|---|---|
| Brides, Babies and Parents | 6 | 52.75 | 50.88 | -1.87 |
| News and Entertainment | | | | |
| Science and Tech | 2 | 44.13 | 41.10 | -3.03 |
SUMMARY, CONCLUSIONS and FURTHER WORK
The Starch Ad Measure study makes use of an internet (non-probability) sample. We have shown how an analysis of the drivers of ad-noting was used to derive adjustment (model-based) weights for ad-noting estimates derived from this sample. Our results provide evidence that the weighted (composition-targeted) estimates are subject to less sample selection bias than those derived without adjustment.
We will be continuing our examination of ways by which this process might be improved by changes in question wording to make the use of additional or alternative ad-noting drivers possible.
APPENDIX A – FULL MODEL READING ATTRIBUTES MAGAZINES-DEMOS
| Model R² = 0.20 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B, Lower | 95% CI for B, Upper |
|---|---|---|---|---|---|---|---|
| Time Spent Reading | 6.275 | .241 | .168 | 26.090 | .000 | 5.804 | 6.747 |
| No. issues (0-4) | 4.382 | .255 | .112 | 17.168 | .000 | 3.882 | 4.883 |
| How Long Reader | .781 | .158 | .029 | 4.935 | .000 | .471 | 1.092 |

a. Dependent Variable: Noting ad score
APPENDIX B – WEIGHTING VARIABLES AND MAGAZINES
| Model R² = 0.10 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B, Lower | 95% CI for B, Upper |
|---|---|---|---|---|---|---|---|
| No. issues (0-4) | 7.488 | .236 | .192 | 31.757 | .000 | 7.026 | 7.951 |

a. Dependent Variable: ad score