www.freebookslides.com
Business Statistics
For these Global Editions, the editorial team at Pearson has collaborated with educators across the world to address a wide range of subjects and requirements, equipping students with the best possible learning tools. This Global Edition preserves the cutting-edge approach and pedagogy of the original, but also features alterations, customization and adaptation from the North American version.
Global edition
Business Statistics THIRD edition
Sharpe • De Veaux • Velleman
This is a special edition of an established title widely used by colleges and universities throughout the world. Pearson published this exclusive edition for the benefit of students outside the United States and Canada. If you purchased this book within the United States or Canada you should be aware that it has been imported without the approval of the Publisher or Author. Pearson Global Edition
Business Statistics 3rd Edition Global Edition
Business Statistics
3rd Edition
Global Edition
Norean R. Sharpe, Georgetown University
Richard D. De Veaux, Williams College
Paul F. Velleman, Cornell University
With Contributions by David Bock
Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montréal Toronto Delhi Mexico City São Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo
Editor in Chief: Deirdre Lynch Head of Learning Asset Acquisition, Global Editions: Laura Dent Senior Content Editor: Chere Bemelmans Assistant Editor: Sonia Ashraf Senior Marketing Manager: Erin Lane Marketing Assistant: Kathleen DeChavez Senior Managing Editor: Karen Wernholm Associate Managing Editor: Tamela Ambush Acquisitions Editor, Global Editions: Subhasree Patra Senior Production Project Manager: Peggy McMahon Project Editor, Global Editions: K.K. Neelakantan Senior Production Manufacturing Controller, Global Editions: Trudy Kimber Digital Assets Manager: Marianne Groth Senior Author Support/Technology Specialist: Joe Vetere
Associate Director of Design, USHE EMSS/HSC/EDU: Andrea Nix Program Design Lead: Barbara T. Atkinson Cover Photo: © yienkeat/Shutterstock Text Design: Studio Montage Image Manager: Rachel Youdelman Permissions Liaison Manager: Joseph Croscup Media Producer: Aimee Thorne Media Production Manager, Global Editions: Vikram Kumar Project Supervisor, MyStatLab: Robert Carroll QA Manager, Assessment Content: Marty Wright Procurement Specialist: Debbie Rossi Full-Service Project Management, Composition, and Illustrations: Lumina Datamatics, Inc.
Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook appear on the appropriate page within text or in Appendix C, which is hereby made part of this copyright page.

Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England
and Associated Companies throughout the world

Visit us on the World Wide Web at: www.pearsonglobaleditions.com

© Pearson Education Limited 2015

The rights of Norean R. Sharpe, Richard D. De Veaux, and Paul F. Velleman to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Authorized adaptation from the United States edition, entitled Business Statistics, 3rd edition, ISBN 978-0-321-92583-1, by Norean R. Sharpe, Richard D. De Veaux, and Paul F. Velleman, published by Pearson Education © 2015.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a license permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

10 9 8 7 6 5 4 3 2 1
15 14 13 12 11

ISBN 10: 1-292-05869-2
ISBN 13: 978-1-292-05869-6

Typeset by Lumina Datamatics, Inc.
Printed and bound by Courier Kendallville in The United States of America.
MICROSOFT AND/OR ITS RESPECTIVE SUPPLIERS MAKE NO REPRESENTATIONS ABOUT THE SUITABILITY OF THE INFORMATION CONTAINED IN THE DOCUMENTS AND RELATED GRAPHICS PUBLISHED AS PART OF THE SERVICES FOR ANY PURPOSE. ALL SUCH DOCUMENTS AND RELATED GRAPHICS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. MICROSOFT AND/OR ITS RESPECTIVE SUPPLIERS HEREBY DISCLAIM ALL WARRANTIES AND CONDITIONS WITH REGARD TO THIS INFORMATION, INCLUDING ALL WARRANTIES AND CONDITIONS OF MERCHANTABILITY, WHETHER EXPRESS, IMPLIED OR STATUTORY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL MICROSOFT AND/OR ITS RESPECTIVE SUPPLIERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF INFORMATION AVAILABLE FROM THE SERVICES. THE DOCUMENTS AND RELATED GRAPHICS CONTAINED HEREIN COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN. MICROSOFT AND/OR ITS RESPECTIVE SUPPLIERS MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED HEREIN AT ANY TIME. PARTIAL SCREEN SHOTS MAY BE VIEWED IN FULL WITHIN THE SOFTWARE VERSION SPECIFIED. MICROSOFT®, WINDOWS®, and MICROSOFT OFFICE® ARE REGISTERED TRADEMARKS OF THE MICROSOFT CORPORATION IN THE U.S.A. AND OTHER COUNTRIES. THIS BOOK IS NOT SPONSORED OR ENDORSED BY OR AFFILIATED WITH THE MICROSOFT CORPORATION.
To my husband, Peter, for his patience and support
—Norean
To my family
—Dick
To my father, who taught me about ethical business practice by his constant example as a small businessman and parent
—Paul
Meet the Authors

Norean Radke Sharpe (Ph.D. University of Virginia) has developed an international reputation as an educator, administrator, and consultant on assessment and accreditation. She is currently a professor at the McDonough School of Business at Georgetown University, where she is also Senior Associate Dean and Director of Undergraduate Programs. Prior to joining Georgetown, Norean taught business statistics and operations research courses to both undergraduate and MBA students for fourteen years at Babson College. Before moving into business education, she taught statistics for several years at Bowdoin College and conducted research at Yale University. Norean is coauthor of the recent text, A Casebook for Business Statistics: Laboratories for Decision Making, and she has authored more than 30 articles, primarily in the areas of statistics education and women in science. Norean currently serves as Associate Editor for the journal Cases in Business, Industry, and Government Statistics. Her scholarship focuses on business forecasting, statistics education, and student learning. She is co-founder of the DOME Foundation, a nonprofit foundation that works to increase Diversity and Outreach in Mathematics and Engineering, and she currently serves on two other nonprofit boards in the Washington, D.C. area. Norean has been active in increasing the participation of women and underrepresented students in science and mathematics for several years and has two children of her own.
Richard D. De Veaux (Ph.D. Stanford University) is an internationally known educator, consultant, and lecturer. Dick has taught statistics at a business school (Wharton), an engineering school (Princeton), and a liberal arts college (Williams). While at Princeton, he won a Lifetime Award for Dedication and Excellence in Teaching. Since 1994, he has taught at Williams College, although he returned to Princeton for the academic year 2006–2007 as the William R. Kenan Jr. Visiting Professor of Distinguished Teaching. He is currently the C. Carlisle and Margaret Tippit Professor of Statistics at Williams College. Dick holds degrees from Princeton University in Civil Engineering and Mathematics and from Stanford University in Dance Education and Statistics, where he studied with Persi Diaconis. His research focuses on the analysis of large data sets and data mining in science and industry. Dick has won both the Wilcoxon and Shewell awards from the American Society for Quality. He is an elected member of the International Statistics Institute (ISI) and a Fellow of the American Statistical Association (ASA). He currently serves on the Board of Directors of the ASA. Dick is also well known in industry, having consulted for such Fortune 500 companies as American Express, Hewlett-Packard, Alcoa, DuPont, Pillsbury, General Electric, and Chemical Bank. He was named the “Statistician of the Year” for 2008 by the Boston Chapter of the American Statistical Association for his contributions to teaching, research, and consulting. In his spare time he is an avid cyclist and swimmer. He also is the founder and bass for the doo-wop group the Diminished Faculty and is a frequent singer and soloist with various local choirs, including the Choeur Vittoria of Paris, France. Dick is the father of four children.
Paul F. Velleman (Ph.D. Princeton University) has an international reputation for innovative statistics education. He designed the Data Desk® software package and is also the author and designer of the award-winning ActivStats® multimedia software, for which he received the EDUCOM Medal for innovative uses of computers in teaching statistics and the ICTCM Award for Innovation in Using Technology in College Mathematics. He is the founder and CEO of Data Description, Inc. (www.datadesk.com), which supports both of these programs. He also developed the Internet site Data and Story Library (DASL; lib.stat.cmu.edu/DASL/), which provides data sets for teaching Statistics. Paul coauthored (with David Hoaglin) the book ABCs of Exploratory Data Analysis. Paul teaches Statistics at Cornell University in the Department of Statistical Sciences and in the School of Industrial and Labor Relations, for which he has been awarded the MacIntyre Prize for Exemplary Teaching. His research often focuses on statistical graphics and data analysis methods. Paul is a Fellow of the American Statistical Association and of the American Association for the Advancement of Science. Paul’s experience as a professor, entrepreneur, and business leader brings a unique perspective to the book.

Richard De Veaux and Paul Velleman have authored successful books in the introductory college and AP High School market with David Bock, including Intro Stats, Fourth Edition (Pearson, 2014); Stats: Modeling the World, Fourth Edition (Pearson, 2015); and Stats: Data and Models, Third Edition (Pearson, 2012).
Contents

Preface 11
Index of Applications 22

Part I Exploring and Collecting Data

Chapter 1 Data and Decisions (E-Commerce) 29
1.1 What Are Data? 30 • 1.2 Variable Types, 34 • 1.3 Data Sources: Where, How, and When, 37
Ethics in Action 39
Technology Help: Data 41
Brief Case: Credit Card Bank 42

Chapter 2 Displaying and Describing Categorical Data (Keen, Inc.) 47
2.1 Summarizing a Categorical Variable, 48 • 2.2 Displaying a Categorical Variable, 49 • 2.3 Exploring Two Categorical Variables: Contingency Tables, 53 • 2.4 Segmented Bar Charts and Mosaic Plots, 57 • 2.5 Simpson’s Paradox, 61
Ethics in Action 64
Technology Help: Displaying Categorical Data 65
Brief Case: Credit Card Bank 67

Chapter 3 Displaying and Describing Quantitative Data (AIG) 77
3.1 Displaying Quantitative Variables, 78 • 3.2 Shape, 81 • 3.3 Center, 84 • 3.4 Spread of the Distribution, 86 • 3.5 Shape, Center, and Spread—A Summary, 88 • 3.6 Standardizing Variables, 88 • 3.7 Five-Number Summary and Boxplots, 90 • 3.8 Comparing Groups, 93 • 3.9 Identifying Outliers, 95 • 3.10 Time Series Plots, 97 • *3.11 Transforming Skewed Data, 100
Ethics in Action 105
Technology Help: Displaying and Summarizing Quantitative Variables 108
Brief Cases: Detecting the Housing Bubble and Socio-Economic Data on States 110

Chapter 4 Correlation and Linear Regression (Amazon.com) 125
4.1 Looking at Scatterplots, 126 • 4.2 Assigning Roles to Variables in Scatterplots, 129 • 4.3 Understanding Correlation, 130 • 4.4 Lurking Variables and Causation, 134 • 4.5 The Linear Model, 136 • 4.6 Correlation and the Line, 137 • 4.7 Regression to the Mean, 140 • 4.8 Checking the Model, 142 • 4.9 Variation in the Model and R², 145 • 4.10 Reality Check: Is the Regression Reasonable? 147 • 4.11 Nonlinear Relationships, 149
Ethics in Action 154
Technology Help: Correlation and Regression 157
Brief Cases: Fuel Efficiency, Cost of Living, and Mutual Funds 159
Case Study I: Paralyzed Veterans of America 172

Part II Modeling with Probability

Chapter 5 Randomness and Probability (Credit Reports and the Fair Isaacs Corporation) 175
5.1 Random Phenomena and Probability, 176 • 5.2 The Nonexistent Law of Averages, 178 • 5.3 Different Types of Probability, 179 • 5.4 Probability Rules, 181 • 5.5 Joint Probability and Contingency Tables, 186 • 5.6 Conditional Probability, 187 • 5.7 Constructing Contingency Tables, 190 • 5.8 Probability Trees, 191 • *5.9 Reversing the Conditioning: Bayes’ Rule, 193
Ethics in Action 195
Technology Help: Generating Random Numbers 197
Brief Case: Global Markets 198

Chapter 6 Random Variables and Probability Models (Metropolitan Life Insurance Company) 209
6.1 Expected Value of a Random Variable, 210 • 6.2 Standard Deviation of a Random Variable, 212 • 6.3 Properties of Expected Values and Variances, 215 • 6.4 Bernoulli Trials, 219 • 6.5 Discrete Probability Models, 220
Ethics in Action 227
Technology Help: Random Variables and Probability Models 229
Brief Case: Investment Options 230

Chapter 7 The Normal and Other Continuous Distributions (The NYSE) 237
7.1 The Standard Deviation as a Ruler, 238 • 7.2 The Normal Distribution, 240 • 7.3 Normal Probability Plots, 248 • 7.4 The Distribution of Sums of Normals, 249 • 7.5 The Normal Approximation for the Binomial, 253 • 7.6 Other Continuous Random Variables, 255
Ethics in Action 259
Technology Help: Probability Calculations and Plots 260
Brief Case: Price/Earnings and Stock Value 261

Chapter 8 Surveys and Sampling (Roper Polls) 271
8.1 Three Ideas of Sampling, 272 • 8.2 Populations and Parameters, 276 • 8.3 Common Sampling Designs, 276 • 8.4 The Valid Survey, 282 • 8.5 How to Sample Badly, 284
Ethics in Action 287
Technology Help: Random Sampling 289
Brief Cases: Market Survey Research and The GfK Roper Reports Worldwide Survey 290

Chapter 9 Sampling Distributions and Confidence Intervals for Proportions (Marketing Credit Cards: The MBNA Story) 299
9.1 The Distribution of Sample Proportions, 300 • 9.2 A Confidence Interval for a Proportion, 305 • 9.3 Margin of Error: Certainty vs. Precision, 310 • 9.4 Choosing the Sample Size, 314
Ethics in Action 319
Technology Help: Confidence Intervals for Proportions 321
Brief Cases: Has Gold Lost Its Luster? and Forecasting Demand 322
Case Study II: Real Estate Simulation 332

Part III Inference for Decision Making

Chapter 10 Testing Hypotheses about Proportions (Dow Jones Industrial Average) 333
10.1 Hypotheses, 334 • 10.2 A Trial as a Hypothesis Test, 336 • 10.3 P-Values, 337 • 10.4 The Reasoning of Hypothesis Testing, 339 • 10.5 Alternative Hypotheses, 341 • 10.6 P-Values and Decisions: What to Tell about a Hypothesis Test, 344
Ethics in Action 348
Technology Help: Hypothesis Tests 350
Brief Cases: Metal Production and Loyalty Program 351

Chapter 11 Confidence Intervals and Hypothesis Tests for Means (Guinness & Co.) 359
11.1 The Central Limit Theorem, 360 • 11.2 The Sampling Distribution of the Mean, 363 • 11.3 How Sampling Distribution Models Work, 365 • 11.4 Gosset and the t-Distribution, 366 • 11.5 A Confidence Interval for Means, 368 • 11.6 Assumptions and Conditions, 370 • 11.7 Testing Hypotheses about Means—the One-Sample t-Test, 376
Ethics in Action 381
Technology Help: Inference for Means 383
Brief Cases: Real Estate and Donor Profiles 385

Chapter 12 More about Tests and Intervals (Traveler’s Insurance) 395
12.1 How to Think about P-Values, 397 • 12.2 Alpha Levels and Significance, 402 • 12.3 Critical Values, 404 • 12.4 Confidence Intervals and Hypothesis Tests, 405 • 12.5 Two Types of Errors, 408 • 12.6 Power, 410
Ethics in Action 414
Brief Case: Confidence Intervals and Hypothesis Tests 415

Chapter 13 Comparing Two Means (Visa Global Organization) 423
13.1 Comparing Two Means, 424 • 13.2 The Two-Sample t-Test, 427 • 13.3 Assumptions and Conditions, 427 • 13.4 A Confidence Interval for the Difference Between Two Means, 431 • 13.5 The Pooled t-Test, 434 • 13.6 Paired Data, 439 • 13.7 Paired t-Methods, 440
Ethics in Action 446
Technology Help: Comparing Two Groups 448
Brief Cases: Real Estate and Consumer Spending Patterns (Data Analysis) 451

Chapter 14 Inference for Counts: Chi-Square Tests (SAC Capital) 469
14.1 Goodness-of-Fit Tests, 471 • 14.2 Interpreting Chi-Square Values, 476 • 14.3 Examining the Residuals, 477 • 14.4 The Chi-Square Test of Homogeneity, 478 • 14.5 Comparing Two Proportions, 482 • 14.6 Chi-Square Test of Independence, 484
Ethics in Action 490
Technology Help: Chi-Square 492
Brief Cases: Health Insurance and Loyalty Program 494
Case Study III: Investment Strategy Segmentation 506

Part IV Models for Decision Making

Chapter 15 Inference for Regression (Nambé Mills) 507
15.1 A Hypothesis Test and Confidence Interval for the Slope, 508 • 15.2 Assumptions and Conditions, 512 • 15.3 Standard Errors for Predicted Values, 518 • 15.4 Using Confidence and Prediction Intervals, 520
Ethics in Action 523
Technology Help: Regression Analysis 525
Brief Cases: Frozen Pizza and Global Warming? 526

Chapter 16 Understanding Residuals (Kellogg’s) 541
16.1 Examining Residuals for Groups, 542 • 16.2 Extrapolation and Prediction, 545 • 16.3 Unusual and Extraordinary Observations, 548 • 16.4 Working with Summary Values, 551 • 16.5 Autocorrelation, 552 • 16.6 Transforming (Re-expressing) Data, 554 • 16.7 The Ladder of Powers, 558
Ethics in Action 565
Technology Help: Examining Residuals 566
Brief Cases: Gross Domestic Product and Energy Sources 567

Chapter 17 Multiple Regression (Zillow.com) 583
17.1 The Multiple Regression Model, 585 • 17.2 Interpreting Multiple Regression Coefficients, 587 • 17.3 Assumptions and Conditions for the Multiple Regression Model, 589 • 17.4 Testing the Multiple Regression Model, 597 • 17.5 Adjusted R² and the F-statistic, 599 • *17.6 The Logistic Regression Model, 601
Ethics in Action 608
Technology Help: Regression Analysis 610
Brief Case: Golf Success 612

Chapter 18 Building Multiple Regression Models (Bolliger and Mabillard) 625
18.1 Indicator (or Dummy) Variables, 628 • 18.2 Adjusting for Different Slopes—Interaction Terms, 632 • 18.3 Multiple Regression Diagnostics, 634 • 18.4 Building Regression Models, 640 • 18.5 Collinearity, 648 • 18.6 Quadratic Terms, 651
Ethics in Action 657
Technology Help: Building Multiple Regression Models 659
Brief Case: Building Models 660

Chapter 19 Time Series Analysis (Whole Foods Market®) 671
19.1 What Is a Time Series? 672 • 19.2 Components of a Time Series, 673 • 19.3 Smoothing Methods, 676 • 19.4 Summarizing Forecast Error, 681 • 19.5 Autoregressive Models, 683 • 19.6 Multiple Regression–based Models, 689 • 19.7 Choosing a Time Series Forecasting Method, 700 • 19.8 Interpreting Time Series Models: The Whole Foods Data Revisited, 701
Ethics in Action 702
Technology Help: Time Series 705
Brief Cases: U.S. Trade with the European Union and Tiffany & Co. 705
Case Study IV: Health Care Costs 718

Part V Selected Topics in Decision Making

Chapter 20 Design and Analysis of Experiments and Observational Studies (Capital One) 721
20.1 Observational Studies, 722 • 20.2 Randomized, Comparative Experiments, 724 • 20.3 The Four Principles of Experimental Design, 725 • 20.4 Experimental Designs, 727 • 20.5 Issues in Experimental Design, 731 • 20.6 Analyzing a Design in One Factor—The One-Way Analysis of Variance, 733 • 20.7 Assumptions and Conditions for ANOVA, 737 • *20.8 Multiple Comparisons, 740 • 20.9 ANOVA on Observational Data, 742 • 20.10 Analysis of Multifactor Designs, 743
Ethics in Action 753
Technology Help: Analysis of Variance 757
Brief Case: Design a Multifactor Experiment 758

Chapter 21 Quality Control (Sony) 771
21.1 A Short History of Quality Control, 772 • 21.2 Control Charts for Individual Observations (Run Charts), 776 • 21.3 Control Charts for Measurements: X and R Charts, 780 • 21.4 Actions for Out-of-Control Processes, 786 • 21.5 Control Charts for Attributes: p Charts and c Charts, 792 • 21.6 Philosophies of Quality Control, 795
Ethics in Action 796
Technology Help: Quality Control Charts 798
Brief Case: Laptop Touchpad Quality 799

Chapter 22 Nonparametric Methods (i4cp) 807
22.1 Ranks, 808 • 22.2 The Wilcoxon Rank-Sum/Mann-Whitney Statistic, 809 • 22.3 Kruskal-Wallis Test, 813 • 22.4 Paired Data: The Wilcoxon Signed-Rank Test, 816 • *22.5 Friedman Test for a Randomized Block Design, 819 • 22.6 Kendall’s Tau: Measuring Monotonicity, 820 • 22.7 Spearman’s Rho, 821 • 22.8 When Should You Use Nonparametric Methods? 822
Ethics in Action 823
Technology Help: Nonparametric Methods 825
Brief Case: Real Estate Reconsidered 826

Chapter 23 Decision Making and Risk (Data Description, Inc.) 835
23.1 Actions, States of Nature, and Outcomes, 836 • 23.2 Payoff Tables and Decision Trees, 837 • 23.3 Minimizing Loss and Maximizing Gain, 838 • 23.4 The Expected Value of an Action, 839 • 23.5 Expected Value with Perfect Information, 840 • 23.6 Decisions Made with Sample Information, 841 • 23.7 Estimating Variation, 843 • 23.8 Sensitivity, 845 • 23.9 Simulation, 846 • 23.10 More Complex Decisions, 848
Ethics in Action 848
Brief Cases: Texaco-Pennzoil and Insurance Services, Revisited 850

Chapter 24 Introduction to Data Mining (Paralyzed Veterans of America) 857
24.1 The Big Data Revolution, 858 • 24.2 Direct Marketing, 859 • 24.3 The Goals of Data Mining, 861 • 24.4 Data Mining Myths, 862 • 24.5 Successful Data Mining, 863 • 24.6 Data Mining Problems, 865 • 24.7 Data Mining Algorithms, 865 • 24.8 The Data Mining Process, 869 • 24.9 Summary, 871
Ethics in Action 872
Case Study V: Marketing Experiment 874

Appendixes A-1
A. Answers A-1
B. Tables and Selected Formulas A-55
C. Photo Acknowledgments A-74
Subject Index I-1
Preface

The question that should motivate a business student’s study of Statistics should be “How can I make better decisions?”1 As entrepreneurs and consultants, we know that in today’s data-rich environment, knowledge of Statistics is essential to survive and thrive in the business world. But, as educators, we’ve seen a disconnect between the way business statistics is traditionally taught and the way it should be used in making business decisions. In Business Statistics, we try to narrow the gap between theory and practice by presenting relevant statistical methods that will empower business students to make effective, data-informed decisions.

Of course, students should come away from their statistics course knowing how to think statistically and how to apply statistical methods with modern technology. But they must also be able to communicate their analyses effectively to others. When asked about statistics education, a group of CEOs from Fortune 500 companies recently said that although they were satisfied with the technical competence of students who had studied statistics, they found the students’ ability to communicate their findings to be woefully inadequate.

Our Plan, Do, Report rubric provides a structure for solving business problems that mimics the correct application of statistics to real business problems. Unlike many other books, we emphasize the often neglected thinking (Plan) and communication (Report) steps in problem solving in addition to the methodology (Do). This approach requires up-to-date, real-world examples and data. So we constantly strive to illustrate our lessons with current business issues and examples.
What’s New in This Edition?
We’ve been delighted with the reaction to previous editions of Business Statistics. We’ve streamlined the third edition further to help students focus on the central material. And, of course, we continue to update examples and exercises so that the story we tell is always tied to the ways Statistics informs modern business practice.

• Recent data. We teach with real data whenever possible, so we’ve updated data throughout the book. New examples reflect current stories in the news and recent economic and business events. The brief cases have been updated with new data and new contexts.
• Improved organization. We have retained our “data first” presentation of topics because we find that it provides students with both motivation and a foundation in real business decisions on which to build an understanding.
  • Chapters 1–4 have been streamlined to cover collecting, displaying, summarizing, and understanding data in four chapters. We find that this provides students with a solid foundation to launch their study of probability and statistics.
  • Chapters 5–9 introduce students to randomness and probability models. They then apply these new concepts to sampling. This provides a gateway to the core material on statistical inference. We’ve moved the discussion of probability trees and Bayes’ rule into these chapters.
  • Chapters 10–14 cover inference for both proportions and means. We introduce inference by discussing proportions because most students are better acquainted with proportions reported in surveys and news stories. However, this edition ties in the discussion of means immediately so students can appreciate that the reasoning of inference is the same in a variety of contexts.
  • Chapters 15–19 cover regression-based models for decision making.
  • Chapters 20–24 discuss special topics that can be selected according to the needs of the course and the preferences of the instructor.

1 Unfortunately, not the question most students are asking themselves on the first day of the course.
• Streamlined design. Our goal has always been an accessible text. This edition sports a new design that clarifies the purpose of each text element. The major theme of each chapter is more linear and easier to follow without distraction. Supporting material is clearly boxed and shaded, so students know where to focus their study efforts.
• Enhanced Technology Help with expanded Excel 2013 coverage. We’ve updated Technology Help and added detailed instructions for Excel 2013 to almost every chapter.
• Updated Ethics in Action features. We’ve updated more than half of our Ethics in Action features. Ethically and statistically sound alternative approaches to the questions raised in these features and a link to the American Statistical Association’s Ethical Guidelines are now presented in the Instructor’s Solutions Manual, making the Ethics features suitable for assignment or class discussion.
• Updated examples to reflect the changing world. The time since our last revision has seen marked changes in the U.S. and world economies. This has required us to update many of our examples. Our chapter on time series was particularly affected. We’ve reworked those examples and discussed the real-world challenges of modeling economic and business data in a changing world. The result is a chapter that is more realistic and useful.
• Increased focus on core material. Statistics in practice means making smart decisions based on data. Students need to know the methods, how to apply them, and the assumptions and conditions that make them work. We’ve tightened our discussions to get students there as quickly as possible, focusing increasingly on the central ideas and core material.
Our Approach
Statistical Thinking

For all of our improvements, examples, and updates in this edition of Business Statistics, we haven’t lost sight of our original mission: writing a modern business statistics text that addresses the importance of statistical thinking in making business decisions and that acknowledges how Statistics is actually used in business. Statistics is practiced with technology, and this insight informs everything from our choice of forms for equations (favoring intuitive forms over calculation forms) to our extensive use of real data. But most important, understanding the value of technology allows us to focus on teaching statistical thinking rather than calculation. The questions that motivate each of our hundreds of examples are not “How do you find the answer?” but “How do you think about the answer?”, “How does it help you make a better decision?”, and “How can you best communicate your decision?”

Our focus on statistical thinking ties the chapters of the book together. An introductory Business Statistics course covers an overwhelming number of new terms, concepts, and methods, and it is vital that students see their central core: how we can understand more about the world and make better decisions by understanding what the data tell us. From this perspective, it is easy to see that the patterns we look for in graphs are the same as those we think about when we prepare to make inferences. And it is easy to see that the many ways to draw inferences from data are several applications of the same core concepts. And it follows naturally that when we extend these basic ideas into more complex (and even more realistic) situations, the same basic reasoning is still at the core of our analyses.
Our Goal: Read This Book!

The best textbook in the world is of little value if it isn’t read. Here are some of the ways we made Business Statistics more approachable:

• Readability. We strive for a conversational, approachable style, and we introduce anecdotes to maintain interest. Instructors report (to their amazement) that their students read ahead of their assignments voluntarily. Students tell us (to their amazement) that they actually enjoy the book. In this edition, we’ve tightened our discussions to be more focused on the central ideas we want to convey.
• Focus on assumptions and conditions. More than any other textbook, Business Statistics emphasizes the need to verify assumptions when using statistical procedures. We reiterate this focus throughout the examples and exercises. We make every effort to provide templates that reinforce the practice of checking these assumptions and conditions, rather than rushing through the computations. Business decisions have consequences. Blind calculations open the door to errors that could easily be avoided by taking the time to graph the data, check assumptions and conditions, and then check again that the results and residuals make sense.
• Emphasis on graphing and exploring data. Our consistent emphasis on the importance of displaying data is evident from the first chapters on understanding data to the sophisticated model-building chapters at the end. Examples often illustrate the value of examining data graphically, and the Exercises reinforce this. Good graphics reveal structures, patterns, and occasional anomalies that could otherwise go unnoticed. These patterns often raise new questions and inform both the path of a resulting statistical analysis and the business decisions. Hundreds of new graphics found throughout the book demonstrate that the simple structures that underlie even the most sophisticated statistical inferences are the same ones we look for in the simplest examples. This helps tie the concepts of the book together to tell a coherent story.
• Consistency. We work hard to avoid the “do what we say, not what we do” trap. Having taught the importance of plotting data and checking assumptions and conditions, we are careful to model that behavior throughout the book. (Check the Exercises in the chapters on multiple regression or time series and you’ll find us still requiring and demonstrating the plots and checks that were introduced in the early chapters.) This consistency helps reinforce these fundamental principles and provides a familiar foundation for the more sophisticated topics.
• The need to read. In this book, important concepts, definitions, and sample solutions are not always set aside in boxes. The book needs to be read, so we’ve tried to make the reading experience enjoyable. The common approach of skimming for definitions or starting with the exercises and looking up examples just won’t work here. (It never did work as a way to learn about and understand Statistics.)

Coverage

The topics covered in a Business Statistics course are generally mandated by our students’ needs in their studies and in their future professions. But the order of these topics and the relative emphasis given to each is not well established. Business Statistics presents some topics sooner or later than other texts. Although many chapters can be taught in a different order, we urge you to consider the order we have chosen. We’ve been guided in the order of topics by the fundamental goal of designing a coherent course in which concepts and methods fit together to provide a new understanding of
how reasoning with data can uncover new and important truths. Each new topic should fit into the growing structure of understanding that students develop throughout the course. For example, we teach inference concepts with proportions first and then with means. Most people have a wider experience with proportions, seeing them in polls and advertising. And by starting with proportions, we can teach inference with the Normal model and then introduce inference for means with the Student’s t distribution. We introduce the concepts of association, correlation, and regression early in Business Statistics. Our experience in the classroom shows that introducing these fundamental ideas early makes Statistics useful and relevant even at the beginning of the course. By Chapter 4, students can discuss relationships among variables in a meaningful way. Later in the semester, when we discuss inference, it is natural and relatively easy to build on the fundamental concepts learned earlier and enhance them with inferential methods.
GAISE Report
We’ve been guided in our choice of what to emphasize by the GAISE (Guidelines for Assessment and Instruction in Statistics Education) Report, which emerged from extensive studies of how students best learn Statistics (www.amstat.org/education/gaise/). Those recommendations, now officially adopted and recommended by the American Statistical Association, urge (among other detailed suggestions) that Statistics education should:
1. Emphasize statistical literacy and develop statistical thinking.
2. Use real data.
3. Stress conceptual understanding rather than mere knowledge of procedures.
4. Foster active learning.
5. Use technology for developing conceptual understanding and analyzing data.
6. Make assessment a part of the learning process.
In this sense, this book is thoroughly modern.
Syllabus Flexibility
But to be effective, a course must fit comfortably with the instructor’s preferences. The early chapters—Chapters 1–14—present core material that will be part of any introductory course. Chapters 15–20—multiple regression, time series, model building, and Analysis of Variance—may be included in an introductory course, but our organization provides flexibility in the order and choice of specific topics. Chapters 21–24 may be viewed as “special topics” and selected and sequenced to suit the instructor or the course requirements. Here are some specific notes: • Chapter 4, Correlation and Linear Regression, may be postponed until just before covering regression inference in Chapters 15 and 16. (But we urge you to teach it where it appears.) • Chapter 18, Building Multiple Regression Models, must follow the introductory material on multiple regression in Chapter 17. • Chapter 19, Time Series Analysis, requires material on multiple regression from Chapter 17. • Chapter 20, Design and Analysis of Experiments and Observational Studies, may be taught before the material on regression—at any point after Chapter 13.
The following topics can be introduced in any order (or omitted) after basic inference has been covered: • Chapter 14, Inference for Counts: Chi-Square Tests • Chapter 21, Quality Control • Chapter 22, Nonparametric Methods • Chapter 23, Decision Making and Risk • Chapter 24, Introduction to Data Mining
Continuing Features
A textbook isn’t just words on a page. A textbook is many elements that come together to form a big picture. The features in Business Statistics provide a real-world context for concepts, help students apply these concepts, promote problem solving, and integrate technology—all of which help students understand and see the big picture of Business Statistics. Providing Real-World Context Motivating Vignettes. Each chapter opens with a motivating vignette, often taken from the authors’ consulting experiences. Companies featured include Amazon.com, Zillow.com, Keen Inc., and Whole Foods Market. We analyze data from or about the companies in the motivating vignettes throughout the chapter. Brief Cases. Each chapter includes one or more Brief Cases that use real data and ask students to investigate a question or make a decision. Students define the objective, plan the process, complete the analysis, and report a conclusion. Data for the Brief Cases are available on the companion website, formatted for various technologies. Case Studies. Each of the five parts of the book ends with a Case Study. Students are given realistically large data sets and challenged to respond to open-ended business questions using the data. Students can bring together methods they have learned throughout the book to address the issues raised. Students will have to use a computer to work with the large data sets that accompany these Case Studies. What Can Go Wrong? In each chapter, What Can Go Wrong? highlights the most common statistical errors and the misconceptions about Statistics. The most common mistakes for the new user of Statistics often involve misusing a method—not miscalculating a statistic. One of our goals is to arm students with the tools to detect statistical errors and to offer practice in debunking misuses of Statistics, whether intentional or not. Applying Concepts For Examples. 
Almost every section of every chapter includes a focused example that illustrates and applies the concepts or methods of that section to a real-world business context. Step-by-Step Guided Examples. The answer to a statistical question is almost never just a number. Statistics is about understanding the world and making better decisions with data. Guided Examples model a thorough solution in the right column with commentary in the left column. The overall analysis follows our innovative Plan, Do, Report template. Each analysis begins with a clear question about a business decision and an examination of the data (Plan), moves to calculating the selected statistics (Do), and finally concludes with a Report that specifically addresses the question. To emphasize that our goal is to address
the motivating question, we present the Report step as a business memo that summarizes the results in the context of the example and states a recommendation if the data are able to support one. To preserve the realism of the example, whenever it is appropriate, we include limitations of the analysis or models in the concluding memo, as one should in making such a report. By Hand. Even though we encourage the use of technology to calculate statistical quantities, we recognize the pedagogical benefits of occasionally doing a calculation by hand. The By Hand boxes break apart the calculation of some of the simpler formulas and help the student through the calculation of a worked example. Reality Check. We regularly offer reminders that Statistics is about understanding the world and making decisions with data. Results that make no sense are probably wrong, no matter how carefully we think we did the calculations. Mistakes are often easy to spot with a little thought, so we ask students to stop for a reality check before interpreting results. Notation Alert. Throughout this book, we emphasize the importance of clear communication. Proper notation is part of the vocabulary of Statistics, but it can be daunting. We’ve found that it helps students when we are clear about the letters and symbols statisticians use to mean very specific things, so we’ve included Notation Alerts whenever we introduce a special notation that students will see again. Math Boxes. In many chapters, we present the mathematical underpinnings of the statistical methods and concepts. We set proofs, derivations, and justifications apart from the narrative, so the underlying mathematics is there for those who want greater depth, but the text itself presents the logical development of the topic at hand without distractions. What Have We Learned? Each chapter ends with a What Have We Learned? summary that includes learning objectives and definitions of terms introduced in the chapter. 
Students can think of these as study guides. Promoting Problem Solving Just Checking. Throughout each chapter we pose short questions to help students check their understanding. The answers are at the end of the exercise sets in each chapter to make them easy to check. The questions can also be used to motivate class discussion. Ethics in Action. Statistics is not just plugging numbers into formulas; most statistical analyses require a fair amount of judgment. Ethics in Action vignettes—updated for this edition—in each chapter provide a context for some of the judgments needed in statistical analyses. Possible errors, a link to the American Statistical Association’s Ethical Guidelines, and ethically and statistically sound alternative approaches are presented in the Instructor’s Solutions Manual. Section Exercises. The exercises for each chapter begin with straightforward exercises targeted at the topics in each section. These are designed to check understanding of specific topics. Because they are labeled by section, it is easy to turn back to the chapter to clarify a concept or review a method. Chapter Exercises. These exercises are designed to be more realistic than Section Exercises and to lead to conclusions about the real world. They may combine concepts and methods from different sections, and they contain relevant, modern, and real-world
questions. Many come from news stories; some come from recent research articles. The exercises marked with a T indicate that the data are provided at the book’s companion website, www.pearsonglobaleditions.com/sharpe in a variety of formats. We pair the exercises so that each odd-numbered exercise (with the answer at the back of the book) is followed by an even-numbered exercise on the same Statistics topic. Exercises are roughly ordered within each chapter by both topic and by level of difficulty. Integrating Technology Data and Sources. Most of the data used in examples and exercises are from real-world sources and whenever we can, we include URLs for Internet data sources. The data we use are usually on the companion website, www.pearsonglobaleditions.com/sharpe. Videos with Optional Captioning. Videos, featuring the Business Statistics authors, review the high points of each chapter. The presentations feature the same student-friendly style and emphasis on critical thinking as the textbook. In addition, 10 Business Insight Videos feature Deckers, Southwest Airlines, Starwood, and other companies and focus on statistical concepts as they pertain to the real world. Videos are available with captioning. They can also be viewed from within the online MyStatLab course. Technology Help. In business, Statistics is practiced with computers using a variety of statistics packages. In Business-school Statistics classes, however, Excel is the software most often used. In Technology Help at the end of each chapter, we summarize what students can find in the most common software, often with annotated output. Updated for this edition, we offer extended guidance for Excel 2013, and start-up pointers for Minitab, SPSS, and JMP, formatted in easy-to-read bulleted lists. This advice is not intended to replace the documentation for any of the software, but rather to point the way and provide start-up assistance.
Supplements Student Supplements Business Statistics, for-sale student edition. Study Cards for Business Statistics Software: This series of study cards, available for Excel 2013 with XLSTAT, Excel 2013 with Data Analysis Toolpak, Minitab, JMP, SPSS, and StatCrunch, provides students with easy step-by-step guides to the most common business statistics software.
Instructor Supplements Instructor’s Resource Guide (download only), written by the authors, contains chapter-by-chapter comments on the major concepts, tips on presenting topics (and what to avoid), teaching examples, suggested assignments, basic exercises, and web links and lists of other resources. Available within MyStatLab or at www.pearsonglobaleditions.com/sharpe. Online Test Bank (download only), by Linda Dawson, University of Washington, and Rose Sebastianelli, University of Scranton, includes chapter quizzes and part level tests. The Test Bank is available at www.pearsonglobaleditions.com/sharpe. Instructor’s Solutions Manual (download only), by Linda Dawson, University of Washington, and Rose Sebastianelli, University of Scranton, contains detailed solutions to all of the exercises. The Instructor’s Solutions Manual is available at www.pearsonglobaleditions.com/sharpe.
Technology Resources MyStatLab™ Online Course (access code required) MyStatLab from Pearson is the world’s leading online resource in statistics, integrating interactive homework, assessment, and media in a flexible, easy-to-use format. MyStatLab is a course management system that delivers proven results in helping individual students succeed. MyStatLab can be implemented successfully in any environment— lab-based, hybrid, fully online, traditional—and demonstrates the quantifiable difference that integrated usage has on student retention, subsequent success, and overall achievement. MyStatLab’s comprehensive online gradebook automatically tracks students’ results on tests, quizzes, homework, and in the study plan. Instructors can use the gradebook to provide positive feedback or intervene if students have trouble. Gradebook data can be easily exported to a variety of spreadsheet programs, such as Microsoft Excel. You can determine which points of data
you want to export, and then analyze the results to determine success. MyStatLab provides engaging experiences that personalize, stimulate, and measure learning for each student. In addition to the resources below, each course includes a full interactive online version of the accompanying textbook. • Tutorial Exercises with Multimedia Learning Aids: The homework and practice exercises in MyStatLab align with the exercises in the textbook, and most regenerate algorithmically to give students unlimited opportunity for practice and mastery. Exercises offer immediate helpful feedback, including guided solutions, sample problems, animations, and videos. • Adaptive Study Plan: Pearson now offers an optional focus on adaptive learning in the study plan to allow students to work on just what they need to learn when it makes the most sense to learn it. The adaptive study plan maximizes students’ potential for understanding and success. • Additional Question Libraries: In addition to algorithmically regenerated questions that are aligned with your textbook, MyStatLab courses come with two additional question libraries. 450 Getting Ready for Statistics questions cover the developmental math topics students need for the course. These can be assigned as a prerequisite to other assignments, if desired. The 1000 Conceptual Question Library requires students to apply their statistical understanding. • StatCrunch®: MyStatLab includes web-based statistical software, StatCrunch, within the online assessment platform so that students can analyze data sets from exercises and the text. In addition, MyStatLab includes access to www.StatCrunch.com, a web site where users can access tens of thousands of shared data sets, conduct online surveys, perform complex analyses using the powerful statistical software, and generate compelling reports. 
• Integration of Statistical Software: We make it easy to copy our data sets, both from the ebook and the MyStatLab questions, into software such as StatCrunch, Minitab, Excel, and more. Students have access to a variety of support tools—Technology Instruction Videos, Technology Study Cards, and Manuals for select titles—to learn how to use statistical software. • Business Insight Videos: Ten engaging videos show managers at top companies using statistics in their everyday work. Assignable questions encourage debate and discussion. • StatTalk Videos: Fun-loving statistician Andrew Vickers takes to the streets of Brooklyn, New York, to demonstrate important statistical concepts through interesting stories and real-life events. This series of 24 videos includes available assessment questions and an instructor’s guide.
StatCrunch® StatCrunch is powerful web-based statistical software that allows users to perform complex analyses, share data sets, and generate compelling reports of their data. The vibrant online community offers tens of thousands of data sets for students to analyze. • Collect. Users can upload their own data to StatCrunch or search a large library of publicly shared data sets, spanning almost any topic of interest. An online survey tool allows users to collect data via web-based surveys. • Crunch. A full range of numerical and graphical methods allows users to analyze and gain insights from any data set. Interactive graphics help users understand statistical concepts, and are available for export to enrich reports with visual representations of data. • Communicate. Reporting options help users create a wide variety of visually appealing representations of their data. Full access to StatCrunch is available with a MyStatLab kit, and StatCrunch is available by itself to qualified adopters. For more information, visit our website at www.StatCrunch.com, or contact your Pearson representative.
TestGen® TestGen® (www.pearsoned.com/testgen) enables instructors to build, edit, print, and administer tests using a computerized bank
of questions developed to cover all the objectives of the text. TestGen is algorithmically based, so instructors can create multiple but equivalent versions of the same question or test with the click of a button. Instructors can also modify test bank questions or add new questions. The software and testbank are available for download from Pearson Education’s online catalog.
PowerPoint® Lecture Slides PowerPoint® Lecture Slides provide an outline to use in a lecture setting, presenting definitions, key concepts, and figures from the text. These slides are available within MyStatLab and in the Instructor Resource Center at www.pearsonglobaleditions.com/sharpe.
XLStat for Pearson XLStat for Pearson is an Excel® add-in that offers a wide variety of functions to enhance the analytical capabilities of Microsoft Excel, making it the ideal tool for your everyday data analysis and statistics requirements. Developed in 1993, XLStat is used by leading businesses and universities around the world. XLStat is compatible with all Excel versions from version 97 to version 2013 (except Mac 2008) including PowerPC and Intel-based Mac systems. For more information, visit www.pearsonhighered.com/xlstat/.
Acknowledgments
This book would not have been possible without many contributions from David Bock, our coauthor on several other texts. Many of the explanations and exercises in this book benefit from Dave’s pedagogical flair and expertise. We are honored to have him as a colleague and friend. Many people have contributed to this book from the first day of its conception to its publication. Business Statistics would have never seen the light of day without the assistance of the incredible team at Pearson. Our Editor in Chief, Deirdre Lynch, was central to the support, development, and realization of the book from day one. Chere Bemelmans, Senior Content Editor, kept us on task as much as humanly possible. Peggy McMahon, Senior Production Project Manager, and Nancy Kincade, Project Manager at PreMediaGlobal, worked miracles to get the book out the door. We are indebted to them. Sonia Ashraf, Assistant Editor; Erin Lane, Senior Marketing Manager; Kathleen DeChavez, Marketing Associate; and Dona Kenly, Senior Market Development Manager, were essential in managing all of the behind-the-scenes work that needed to be done. Aimee Thorne, Media Producer, put together a top-notch media package for this book. Barbara Atkinson, Senior Designer, and Studio Montage are responsible for the wonderful way the book looks. Procurement Specialist Debbie Rossi worked miracles to get this book in your hands, and Greg Tobin, President, was supportive and good-humored throughout all aspects of the project. We’d also like to thank our accuracy checkers whose monumental task was to make sure we said what we thought we were saying: James Lapp; Joan Saniuk, Wentworth Institute of Technology; Sarah Streett; and Dirk Tempelaar, Maastricht University. We also thank those who provided feedback through focus groups, class tests, and reviews:
Hope M. Baker, Kennesaw State University
John F. Beyers, University of Maryland—University College
Scott Callan, Bentley College
Laurel Chiappetta, University of Pittsburgh
Anne Davey, Northeastern State University
Joan Donohue, The University of South Carolina
Robert Emrich, Pepperdine University
Michael Ernst, St. Cloud State
Mark Gebert, University of Kentucky
Kim Gilbert, University of Georgia
Nicholas Gorgievski, Nichols College
Clifford Hawley, West Virginia University
Kathleen Iacocca, University of Scranton
Chun Jin, Central Connecticut State University
Austin Lampros, Colorado State University
Roger Lee, Salt Lake Community College
Monnie McGee, Southern Methodist University
Richard McGowan, Boston College
Mihail Motzev, Walla Walla University
Robert Potter, University of Central Florida
Eugene Round, Embry-Riddle Aeronautical University
Sunil Sapra, California State University—Los Angeles
Dmitry Shishkin, Georgia Gwinnett College
Courtenay Stone, Ball State University
Gordon Stringer, University of Colorado—Colorado Springs
Arnold J. Stromberg, University of Kentucky
Joe H. Sullivan, Mississippi State University
Timothy Sullivan, Towson University
Minghe Sun, University of Texas—San Antonio
Patrick Thompson, University of Florida
Jackie Wroughton, Northern Kentucky University
Ye Zhang, Indiana University—Purdue Indianapolis
Finally, we want to thank our families. This has been a long project, and it has required many nights and weekends. Our families have sacrificed so that we could write the book we envisioned.
Norean Sharpe
Richard De Veaux
Paul Velleman
Pearson would like to thank and acknowledge the following people for their work on the Global Edition:
Contributors
Dirk Tempelaar, Maastricht University
Hend Ghazzai, Qatar University
Walid Alwagfi, Gulf University of Science and Technology
Reviewers
Ghassan H. Mardini, Qatar University
Rajnish K. Mishra, Avaquant
Index of Applications BE = Boxed Example; E = Exercises; EIA = Ethics in Action; GE = Guided Example; IE = In-Text Example; JC = Just Checking; P = Project; TH = Technology Help Accounting Administrative and Training Costs (E), 72, 454–455 Annual Reports (E), 70 Audits and Tax Returns (E), 202, 330, 392 Bookkeeping (E), 296; (IE), 32 Budgets (E), 390 Company Assets, Profit, and Revenue (BE), 151, 632, 723; (E), 69, 71–72, 231, 532, 535, 618, 620, 664, 708–709; (GE), 818–819; (IE), 30, 35, 125, 300, 424, 557, 626 Cost Cutting (E), 499, 502 Expenses (E), 575; (IE), 32, 36 Financial Close Process (E), 459 Probability Calculations and Plots (TH), 260–261 Purchase Records (E), 77; (IE), 32 Random numbers, generating (TH), 197 Random Variables and Probability Models (TH), 229
Advertising Ads (E), 354, 356–357, 460–462, 617 Advertising in Business (BE), 338; (E), 71–72, 75–76, 461, 466–467, 617, 854–855; (EIA), 658; (GE), 184–186; (IE), 30, 34 Branding (E), 461; (IE), 732 Coupons (EIA), 414; (IE), 728, 734–736, 819 Free Products (IE), 340, 379, 420, 733, 735–736, 741 International Advertising (E), 205 Jingles (IE), 462 Predicting Sales (E), 168–169 Product Claims (BE), 425; (E), 266, 462, 465, 467, 498, 500, 760; (EIA), 154–155 Target Audience (E), 205, 234, 457–458; (EIA), 872; (JC), 367 Truth in Advertising (E), 356
Agriculture Agricultural Discharge (EIA), 287 Beef and Livestock (E), 388 Drought and Crop Losses (E), 463 Farmers’ Markets (E), 233 Fruit Growers (E), 581 Lawn Equipment (E), 854–855 Lobster Fishing Industry (E), 578–579, 582, 619–620 Lumber (E), 580 Seeds (E), 327, 356
Banking Annual Percentage Rate (IE), 732; (P), 230 ATMs (E), 198; (IE), 423 Bank Tellers (E), 762 Certificates of Deposit (CDs) (P), 230 Credit Card Bank (P), 67 Credit Card Charges (E), 111, 330–331, 389, 539; (GE), 92–93, 342–343, 441–444; (IE), 300, 548–549 Credit Card Companies (BE), 316; (E), 325, 330–331, 352, 389, 420; (GE), 37, 132–133, 176, 299–300, 316, 342–343, 423–425, 429–434, 548–549, 721–723, 857–859; (JC), 403, 406; (P), 42 Credit Card Customers (BE), 316; (E), 234, 330–331, 352, 389, 502; (GE), 92–93, 342–343, 429–431, 441–444; (IE), 299–300, 302, 316, 423–424, 548–549, 721–723; (JC), 406
Credit Card Debt (E), 461; (JC), 406 Credit Card Offers (BE), 316; (E), 330–331; (GE), 342–343, 429–434, 729–730, 748–751; (IE), 37, 176–177, 300, 316, 424–425, 548–549, 724, 732, 743–744; (P), 42, 874 Credit Scores (IE), 175–176 Credit Unions (EIA), 319 Federal Reserve Board (BE), 675 Interest Rates (E), 163, 200, 576–577, 713, 834; (IE), 300, 728; (P), 230 Investment Banks (E), 854–855 Liquid Assets (E), 709 Maryland Bank National Association (IE), 299–300 Mortgages (E), 45, 163, 834; (GE), 304–305 Subprime Loans (IE), 37, 445 World Bank (E), 122, 166
Business (General) Attracting New Business (E), 391 Best Places to Work (E), 504, 536 Bossnapping (E), 323; (GE), 312–313 Business Planning (IE), 125, 409 Chief Executives (E), 120–121, 207, 267, 389, 502; (IE), 100–101, 371–372 Company Case Reports and Lawyers (GE), 304–305 Company Databases (IE), 35, 37 Contract Bids (E), 232–233, 203 Elder Care Business (EIA), 523 Enterprise Resource Planning (E), 459, 504, 831 Entrepreneurial Skills (E), 502 Forbes 500 Companies (E), 123, 389–390 Fortune 500 Companies (E), 324, 532, 721 Franchises (BE), 632; (EIA), 154–155, 523 Industry Sector (E), 503–504 International Business (E), 68, 76, 292–293, 329; (IE), 272; (P), 290 Job Growth (E), 504, 536 Organisation for Economic Cooperation and Development (OECD) (E), 116, 580 Outside Consultants (IE), 63 Outsourcing (E), 503 Real Estate (P), 826–827 Research and Development (E), 72; (IE), 125–126; (JC), 441 Small Business (E), 70–71, 164, 202, 232, 391, 502, 575, 617, 854–855; (IE), 30, 836–837 Start-Up Companies (E), 43, 331, 853–854 Trade Secrets (IE), 508 Women-Led Businesses (E), 231, 356
Company Names Adair Vineyards (E), 111 AIG (GE), 94–95; (IE), 77–78, 80, 86 Allied Signal (IE), 796 Alpine Medical Systems, Inc. (EIA), 609 Amazon.com (IE), 30, 125 American Express (IE), 423 Amtrak (BE), 723 Arby’s (E), 43 Bank of America (IE), 299, 423
Bell Telephone Laboratories (IE), 773 BMW (E), 169 Bolliger & Mabillard Consulting Engineers, Inc. (B&M) (IE), 626–627 Buick (E), 165 Burger King (BE), 632; (E), 622; (IE), 632–633 Capital One (IE), 37, 31, 721–722 Chevy (E), 461 Circuit City (E), 386 Cisco Systems (E), 70 Coca-Cola (E), 69 CompUSA (E), 386 Cypress (JC), 132 Data Description (IE), 835–837, 840–841, 843–844 Deliberately Different (EIA), 491 Desert Inn Resort (E), 201 Diners Club (IE), 423 Eastman Kodak (E), 800 eBay (E), 234 Expedia.com (IE), 584 Fair Isaac Corporation (IE), 175–176 Fisher-Price (E), 70 Ford (E), 165, 461; (IE), 283 General Electric (IE), 333, 773, 796 General Motors Corp. (BE), 696 GfK Roper (E), 71–72, 292, 329, 500–501; (GE), 59–60; (IE), 53, 59, 272–273, 275, 478–479; (P), 290 Google (E), 71–72, 504, 710; (IE), 48–53, 220–222 Guinness & Co. (BE), 224; (IE), 359–361 Holes-R-Us (E), 121 The Home Depot (E), 578; (GE), 686–689, 697–700; (IE), 689–690, 692–693 Honda (E), 165 Hostess (IE), 275 IBM (IE), 807 i4cp (IE), 807 Intel (JC), 132 J.Crew (JC), 685 Jeep (E), 206 KEEN (IE), 47–48 Kellogg’s (IE), 541–542 Kelly’s BlueBook (E), 206 KomTek Technologies (GE), 788–791 Kraft Foods, Inc. (P), 526 L.L. Bean (E), 44 Lycos (E), 292 Mattel (E), 70 Mellon Financial Corporation (E), 709 Metropolitan Life (MetLife) (IE), 209–210 Microsoft (E), 70; (IE), 51–52 M&M/Mars (E), 202, 326, 354, 763; (GE), 184–186 Motorola (IE), 796 Nambé Mills, Inc. (GE), 514–531; (IE), 507–508, 518–521 National Beverage (E), 69 Netflix (BE), 724; (IE), 31 Nissan (IE), 248 PepsiCo (E), 69, 201, 416
Pew Research (E), 199, 204, 457, 504, 763; (IE), 180, 274 Pillsbury (BE), 632 Pontiac (E), 165 Roper Worldwide (JC), 225 Sara Lee Corp. (E), 709 SmartWool (BE), 404, 405, 408 Sony Corporation (IE), 771–772, 776 Starbucks (IE), 36 Suzuki (E), 622 Systemax (E), 386 Target Corp. (E), 709 Texaco-Pennzoil (P), 850–852 Tiffany & Co. (P), 706 Time-Warner (BE), 302–303 Toyota (BE), 696; (E), 165, 532, 709 Trax (EIA), 796 UPS (IE), 863 Visa (IE), 423–424 Wal-Mart (E), 466, 618, 620, 664, 713 Western Electric (IE), 779 Whole Foods Market (BE), 691; (IE), 671–674, 690, 701 WinCo Foods (E), 466–467 Yahoo (E), 710; (IE), 50–51 Zenna’s Café (EIA), 105 Zillow.com (IE), 583–584
Consumers Categorizing Consumers (E), 499, 502, 762; (IE), 34–35, 276–277 Consumer Confidence Index (CCI) (IE), 305 Consumer Groups (E), 356, 392, 461 Consumer Loyalty (E), 353; (IE), 30, 542; (JC), 406; (P), 352, 494 Consumer Perceptions About a Product (E), 499; (IE), 626–627 Consumer Price Index (CPI) (E), 263, 618, 620, 664, 706–707, 712 Consumer Research (IE), 125–126, 283, 820–821 Consumer Spending (E), 168; (GE), 92–93, 132–133, 429–434; (IE), 432; (P), 494 Customer Databases (E), 44, 120, 266, 292; (IE), 30–40, 49–50, 176–177, 859, 864; (JC), 57; (P), 43, 351–352 Customer Satisfaction (E), 235–236, 356, 663; (EIA), 39, 657 Customer Service (E), 296; (EIA), 39, 287; (IE), 30 Detecting the Housing Bubble (P), 110 Restaurant Patrons (JC), 276 Shopping Patterns (E), 110–111
Demographics Age (E), 387, 571–573; (GE), 485–487; (IE), 484–489 Average Height (E), 262; (JC), 248 Birth and Death Rates (E), 170, 459, 531 Income (E), 74–75, 622, 710–711, 834; (IE), 857, 859, 866–867; (JC), 88, 124; (P), 567–568 Lefties (E), 235 Life Expectancy (E), 580, 622, 666–667; (IE), 135, 149 Marital Status (E), 572, 576–577 Murder Rate (E), 622 Paralyzed Veterans dataset (P), 415 Population (JC), 563; (P), 567 Race/Ethnicity (E), 498, 830 U.S. Census Bureau (E), 75, 231, 266, 498; (EIA), 657; (IE), 37, 275, 859; (JC), 88, 276; (P), 567 Using Demographics in Business Analysis (EIA), 872; (IE), 630, 859; (P), 660
Distribution and Operations Management Construction (E), 765–766 Delivery Services and Times (E), 76, 353, 460, 504 International Distribution (E), 75 Inventory (E), 203, 500; (GE), 213–215 Mail Order (E), 44 Maintenance Costs (E), 356 Overhead Costs (E), 70 Packaging (E), 165, 234; (GE), 245–247, 250–252 Product Distribution (E), 69–70, 75, 325, 353, 460 Productivity and Efficiency (E), 70, 765 Sales Order Backlog (E), 70 Shipping (BE), 364; (E), 231; (GE), 213–214, 250–252 Storage and Retrieval Systems (E), 766 Tracking (BE), 364; (E), 76; (IE), 35, 863 Waiting Lines (E), 295, 762; (IE), 256–257, 626; (JC), 219
E-Commerce Advertising and Revenue (E), 161 Internet and Globalization (E), 539 Internet Sales (E), 121, 354, 497, 502, 529, 716, 763 Online Businesses (BE), 404–405, 408; (E), 168, 201–202, 232, 327–328, 353, 500, 502, 529, 709, 763; (EIA), 347, 319, 490; (IE), 35–36, 47–48, 125–126, 337 Online Sales and Blizzards, 161 Product Showcase Websites (IE), 48–53 Search Engine Research (IE), 49–53 Security of Online Business Transactions (E), 202–203, 502, 762 Special Offers via Websites (EIA), 414; (IE), 34–36; (P), 351–352 Tracking Website Hits (E), 232, 235, 268, 351–352, 760; (IE), 49–53 Web Design, Management, and Sales (E), 202, 353, 760, 855–856; (IE), 338, 402
Economics Cost of Living (E), 169, 536; (P), 159 Dow Jones Industrial Average (GE), 474–476; (IE), 333–335, 470 Forecasting (E), 200; (IE), 305 Gross Domestic Product (E), 166–167, 169, 505–506, 572, 580–581, 617, 663–664, 833; (EIA), 657, 697; (IE), 504; (P), 567 Growth Rates of Countries (E), 504–505 Human Development Index (E), 572, 581 Inflation Rates (BE), 481–482, 484; (E), 166, 501 Organization for Economic Cooperation and Development (E), 580, 617 Personal Consumption Expenditures (EIA), 657 U.S. Bureau of Economic Analysis (E), 504–505; (EIA), 657 Views on the Economy (E), 69–70, 329, 353, 355; (IE), 305–307
Education Academic Research and Data (E), 497 ACT, Inc. (E), 327 Admissions, College (BE), 61; (E), 43, 73, 76, 169, 533–534 College Choice and Birth Order (E), 499 College Courses (E), 763 College Social Life (JC), 489 College Tuition (E), 121, 124, 621; (IE), 104
Core Plus Mathematics Project (E), 455 Cornell University (IE), 104 Education and Quality of Life (IE), 149 Education Levels (E), 497, 761, 763 Enriched Early Education (IE), 30 Entrance Exams (BE), 243–245; (E), 265, 327–328; (JC), 365 Freshman 15 Weight Gain (E), 831–832 GPA (E), 43, 169 Graduates and Graduation Rates (E), 112, 331, 622 High School Dropout Rates (E), 325 Ithaca Times (IE), 104 Learning Disabilities (EIA), 39 Literacy and Illiteracy Rates (E), 169, 622 MBAs (E), 43, 73, 353, 357 Online Education (EIA), 446 Rankings of Business Schools (E), 169 Reading Ability and Height (IE), 134 Stanford University (IE), 220 Statistics Grades (IE), 477 Test Scores (BE), 243–245; (E), 43, 119, 198, 265, 355, 461, 533–534, 537, 761, 829; (JC), 236, 241 Traditional Curriculums (E), 455 University at California Berkeley (BE), 61; (E), 111, 76
Energy Batteries (E), 232–233, 391, 533 Energy Use (E), 538–539; (P), 322 Fuel Economy (E), 44, 118, 163, 296, 393, 461, 498, 533, 534, 537, 575–576, 761, 769; (IE), 248, 424, 554–556; (JC), 88, 124; (P), 159 Gas Prices and Consumption (E), 114–118, 122, 388, 457, 498, 711–713, 715–716; (IE), 545 Heat for Homes (GE), 644–648 Oil (E), 70, 716–717, 853–854; (IE), 545–547 Renewable Energy Sources (P), 568 Wind Energy (E), 392, 464; (IE), 551–552; (P), 568
Environment Atmospheric Levels of Carbon Dioxide (E), 529 Clean Air Emissions Standards (E), 331, 419 Conservation Projects (EIA), 287 El Niño (E), 170 Emissions/Carbon Footprint of Cars (E), 165–166, 356, 833–834 Environmental Causes of Disease (E), 459 Environmental Defense Fund (BE), 370 Environmental Groups (E), 329 Environmental Protection Agency (BE), 370; (E), 44, 166, 262, 294, 534, 833 Environmental Sustainability (E), 538 Global Warming (E), 198–199, 294, 355–356, 457; (P), 527 Greenhouse Gases (E), 170, 527 Hurricanes (E), 121, 457–458, 574 Ozone Levels (E), 118, 534–535 Pollution Control (E), 205, 331, 356, 391, 617, 766 Toxic Waste (E), 294
Ethics Bias in Company Research and Surveys (E), 291–297; (EIA), 287; (IE), 282–285 Bossnapping (E), 323; (GE), 312–313; (JC), 314 Business Ethics (E), 329, 357
Employee Discrimination (E), 498–499, 764; (EIA), 608, 753–754 False Claims (EIA), 227 Housing Discrimination (E), 294, 503 Misleading Research (EIA), 39 Sweatshop Labor (IE), 286
Famous People Armstrong, Lance (IE), 556 Bernoulli, Daniel (IE), 219–220 Bonferroni, Carlo, 740 Box, George (IE), 240 Castle, Mike (IE), 299 Cohen, Steven A. (IE), 469–470 Deming, W. Edward (IE), 772–773, 795–796 De Moivre, Abraham (IE), 239 Descartes, Rene (IE), 129 Dow, Charles (IE), 333 Edgerton, David (BE), 632 Fairbank, Richard (IE), 721 Fisher, Sir Ronald (IE), 153, 367, 402 Galton, Sir Francis (BE), 140 Gates, Bill (IE), 83 Gosset, William S. (BE), 224; (IE), 359–360, 366–370 Gretzky, Wayne (E), 115 Howe, Gordie (E), 115 Ibuka, Masaru (IE), 771 Jones, Edward (IE), 333 Juran, Joseph (IE), 772 Kellogg, John Harvey and Will Keith (IE), 541 Kendall, Maurice (BE), 820 Laplace, Pierre-Simon (IE), 362 Legendre, Adrien-Marie (BE), 137, 141 Likert, Rensis (IE), 807 Lockhart, Denis (BE), 675 Lowell, James Russell (IE), 341 MacArthur, Douglas (IE), 772 MacDonald, Dick and Mac (BE), 632 Mann, H. B. (BE), 810 Martinez, Pedro (E), 664 McGwire, Mark (E), 115 McLamore, James (BE), 632 Morita, Akio (IE), 771 Morris, Nigel (IE), 721 Obama, Michelle (JC), 685 Pepys, Samuel (IE), 773 Sagan, Carl (IE), 406 Sammis, John (IE), 837–838 Sarasohn, Homer (IE), 772, 773 Savage, Sam (IE), 220 Shewhart, Walter A. (IE), 773, 774, 796–797 Spearman, Charles (IE), 150, 822 Starr, Cornelius Vander (IE), 77 Street, Picabo (IE), 651–652, 654 Taguchi, Genichi, 152 Tukey, John W. (IE), 91 Tully, Beth (EIA), 105 Twain, Mark (IE), 470 Whitney, D. R. (BE), 810 Wilcoxon, Frank (BE), 809 Zabriskie, Dave (IE), 556
Finance and Investments Annuities (E), 501 Assessing Risk (E), 69, 417, 501; (IE), 175–176, 315
Blue Chip Stocks (E), 856 Bonds (E), 501; (IE), 333–334 Brokerage Firms (E), 497, 501; (EIA), 39 CAPE10 (BE), 249; (IE), 238; (P), 261 Currency (BE), 679–680, 682, 685; (E), 264–265, 328; (IE), 34–35 Dow Jones Industrial Average (BE), 240; (E), 166; (GE), 475; (IE), 333–335, 341, 470–471 Financial Planning (E), 43–45 Gold Prices (IE), 180 Growth and Value Stocks (P), 230 Hedge Funds (IE), 469–470 Investment Analysts and Strategies (BE), 217–218; (E), 501; (GE), 304–305; (P), 322 London Stock Exchange (IE), 359 Market Sector (IE), 556 Moving Averages (BE), 678–680; (E), 708; (IE), 677–679 Mutual Funds (E), 44, 114, 119, 121, 162, 168, 264–266, 354, 462–463, 531, 856; (IE), 30, 34; (P), 160, 230 NASDAQ (BE), 96 NYSE (IE), 96, 98, 237–238 Portfolio Managers (E), 78, 357 Price/Earnings and Stock Value (P), 261 Public vs. Private Company (BE), 632; (IE), 359–360 Stock Market and Prices (E), 44, 72, 200, 264–265, 267, 323, 357, 421, 708–711; (GE), 94–95; (IE), 34, 78–82, 84–85, 93–94, 98–101, 103, 133, 178, 181, 333–334, 677–678; (JC), 179, 441; (P), 160 Stock Returns (E), 266, 357, 462–463, 504, 764; (IE), 470 Stock Volatility (IE), 78–79, 96 Student Investors (E), 326, 327, 355 Trading Patterns (E), 497; (GE), 474–476; (IE), 85, 98–99, 470, 478 Venture Capital (BE), 225 Wall Street (IE), 469 Wells Fargo/Gallup Small Business Index (E), 70
Food/Drink Alcoholic Beverages (E), 323 Apples (E), 327–328 Baby Food (IE), 772 Bananas (E), 709 Candy (BE), 776, 780, 785–788, 794–795 Carbonated Drinks (E), 69, 416 Cereal (BE), 425; (E), 456, 667, 761, 767–768, 830; (GE), 245–247; (IE), 253, 542–544 Coffee (E), 163–164, 711; (EIA), 105; (JC), 302 Company Cafeterias and Food Stations (E), 388; (JC), 428 Farmed Salmon (BE), 370, 380 Fast Food (E), 294, 500–501, 622; (IE), 632–633; (P), 291 Food Consumption and Storage (E), 122; (GE), 59–60; (JC), 428 Food Prices (E), 709, 711 Hot Dogs (E), 455 Ice Cream Cones (E), 163 Irradiated Food (E), 329 Milk (E), 800; (IE), 772; (JC), 428 Nuts (E), 497–498 Opinions About Food (E), 500–501; (GE), 59–60; (JC), 489; (P), 291 Oranges (E), 581 Organic Food (E), 455, 829; (EIA), 287, 348 Pet Food (E), 75; (IE), 772 Pizza (E), 115–116, 164, 461, 663; (IE), 552–553; (P), 526–527 Popcorn (E), 421 Potatoes (E), 233 Salsa (E), 296
Seafood (E), 169–170, 296, 500 Wine (E), 111, 115, 616–617, 761–762; (EIA), 657 Yogurt (E), 457, 766
Games Cards (E), 202–203; (IE), 178–179 Casinos (E), 202–203, 233, 352, 391 Computer Games (E), 575 Dice (E), 497; (IE), 360–361 Gambling (E), 391, 801; (P), 527 Jigsaw Puzzles (GE), 280–281 Keno (IE), 178–179 Lottery (BE), 179, 212; (E), 198, 498, 801; (IE), 180 Odds of Winning (E), 202, 391, 498 Roulette (E), 200
Government, Labor, and Law AFL-CIO (E), 618 City Council (E), 329 European Union (IE), 37 Fair and Accurate Credit Transaction Act (IE), 176 Food and Agriculture Organization of the United Nations (E), 122 Government Agencies (E), 574, 834; (IE), 37, 77 Immigration Reform (E), 501 IRS (E), 202, 330, 392 Jury Trials (BE), 338; (E), 356; (IE), 336–338, 403, 409 Labor Productivity and Costs (E), 534 Minimum Wage (E), 74–75 National Center for Productivity (E), 121 Protecting Workers from Hazardous Conditions (E), 761 Settlements (P), 850–851 Social Security (E), 198 Unemployment (E), 117, 122–123, 531–532, 538, 717 United Nations (BE), 820; (E), 531, 538–539, 573, 762, 833 U.S. Bureau of Labor Statistics (E), 534, 614, 710, 762, 764 U.S. Department of Labor (E), 74 U.S. Fish and Wildlife Service (E), 293 U.S. Food and Drug Administration (E), 800 U.S. Geological Survey (BE), 553 U.S. Securities and Exchange Commission (IE), 469; (P), 160 Zoning Laws (IE), 308
Human Resource Management/Personnel Assembly Line Workers (E), 458 Employee Athletes (E), 465 Flexible Work Week (BE), 817 Hiring and Recruiting (E), 70, 296, 325, 331; (IE), 541 Human Resource Accounting (IE), 807 Human Resource Data (E), 202, 292, 503; (IE), 807 Job Interviews (E), 231 Job Performance (E), 162; (IE), 61, 286 Job Satisfaction (E), 234, 235, 266, 296, 459, 503, 831 Mentoring (E), 502 Promotions (E), 234 Ranking by Seniority (IE), 36 Rating Employees (JC), 441 Relocation (E), 207 Shifts (E), 765 Staff Cutbacks (IE), 283 Testing Job Applicants (E), 416, 458 Training (E), 262, 763 Worker Productivity (E), 121, 465, 765
Insurance Auto Insurance and Warranties (E), 201, 329, 497 Fire Insurance (E), 201 Health Insurance (E), 76, 292, 330; (IE), 860; (JC), 588; (P), 494 Hurricane Insurance (E), 235 Insurance Company Databases (BE), 136, 403; (E), 76; (IE), 103; (JC), 38, 45 Insurance Costs (BE), 403; (E), 69–70; (IE), 212–217 Insurance Profits (E), 117; (GE), 373–375, 377–378; (IE), 77, 215, 375–376 Life Insurance (E), 580, 666; (IE), 209–217 Medicare (E), 355 National Insurance Crime Bureau (E), 165 Online Insurance Companies (E), 463–464, 832 Property Insurance (GE), 373–375, 377–378; (JC), 38 Sales Reps for Insurance Companies (BE), 376; (E), 168; (GE), 374–375, 377–378; (IE), 375, 376 Tracking Insurance Claims (E), 165; (P), 851–852 Travel Insurance (GE), 846–847; (P), 851–852
Management Data Management (IE), 30, 37–38 Employee Management (IE), 61 Hotel Management (BE), 632 Management Consulting (E), 201 Management Styles (E), 503–504 Marketing Managers (E), 205, 536, 762, 764; (P), 352 Middle Managers (E), 764; (JC), 489 Production Managers (E), 419 Product Managers (P), 526 Project Management (E), 296 Restaurant Manager (JC), 489 Sales Managers (E), 536, 764
Manufacturing Adhesive Compounds, 800 Appliance Manufacturers (E), 356, 616 Assembly Line Production (BE), 632 Camera Makers (E), 204 Car Manufacturers (E), 328, 353, 497, 761; (IE), 774 Ceramics (E), 163 Computer and Computer Chip Manufacturers (E), 234, 356–357, 804; (IE), 191, 776–777 Cooking and Tableware Manufacturers (IE), 507–508 Drug Manufacturers (E), 205, 266, 417 Exercise Equipment (E), 465 Injection Molding (E), 764 Manufacturing Companies and Firms (E), 504, 765 Metal Manufacturers (GE), 514; (IE), 507–508; (P), 351–352 Product Registration (IE), 276, 283 Prosthetic Devices (GE), 788–791 Silicon Wafer (IE), 367, 776–777, 781, 792–794 Stereo Manufacturers (GE), 250–252 Tire Manufacturers (E), 201, 266, 465–466 Toy Manufacturers (E), 70, 201; (IE), 772 Vacuum Tubes (IE), 772
Marketing Chamber of Commerce (IE), 308 Direct Mail (BE), 316; (E), 329; (EIA), 872; (GE), 729–730, 748–751; (IE), 316, 724, 743–745, 857, 859; (P), 352 Global Markets (P), 198 International Marketing (E), 69–70, 75, 205, 293; (GE), 59–60, 184–186
Market Demand (E), 72, 205, 295; (GE), 280–281; (IE), 315–316; (P), 322 Marketing Costs (E), 72 Marketing New Products (E), 353–354, 356–357, 457; (GE), 184–186; (IE), 315 Marketing Slogans (E), 462 Marketing Strategies (E), 205, 231; (GE), 729–730; (IE), 484–487, 544, 724 Market Research (E), 69–70, 292–293, 354, 388, 460; (GE), 59–60, 429–431; (IE), 271–272, 274, 279, 724; (P), 290 Market Share (E), 69 Online Marketing (IE), 724 Researching Buying Trends (E), 205; (GE), 435–438; (IE), 35, 48, 186–187, 277, 314–316, 434, 723; (JC), 191, 225; (P), 234 Researching New Store Locations (E), 278; (JC), 225 Web-Based Loyalty Program (P), 352
Media and Entertainment British Medical Journal (E), 499, 833 Broadway and Theater (E), 45, 73, 614–616 Business Week (E), 43, 68, 113; (IE), 125 Cartoons (IE), 61, 219, 282, 286, 311 Chance (E), 833 Chicago Tribune (IE), 271 CNN Money, 218 Concertgoers (E), 323 Consumer Reports (E), 44, 353, 455, 499, 533 Cosmopolitan (BE), 480 The Economist (BE), 480 Errors in Media Reporting (IE), 271 Financial Times (E), 43, 709 Forbes (E), 123; (IE), 556 Fortune (BE), 102; (E), 43, 68, 503, 536; (IE), 271; (P), 850 Globe & Mail (GE), 818 The Guardian (E), 323 Journal of Applied Psychology (E), 460 Lancet (E), 500 Le Parisien (GE), 312 Magazines (BE), 480; (E), 43, 68–69, 76, 292, 328, 356; (IE), 125 Medical Science in Sports and Exercise (E), 465 Moneyball, 31 Movies (E), 73–75, 530, 714 Newspapers (E), 43, 68–69, 72, 323; (EIA), 565; (GE), 312 Paris Match (GE), 313 Science (E), 76; (P), 527 Sports Illustrated (BE), 480 Television (E), 235, 294; (IE), 135–136, 283–284 Theme Parks (E), 44, 295 Variety (E), 614 The Wall Street Journal (E), 43, 68, 113, 326 WebZine (E), 356
Pharmaceuticals, Medicine, and Health Accidental Death (E), 69 AIDS (IE), 30 Aspirin (JC), 338 Binge Drinking (E), 327 Blood Pressure (E), 69, 205, 575; (IE), 133 Blood Type (E), 202, 234; (GE), 223 Body Fat Percentages (E), 575; (JC), 588 Cancer (E), 69, 500; (IE), 153 Centers for Disease Control and Prevention (E), 69, 572; (IE), 62 Cholesterol (E), 205, 266, 386 Colorblindness (E), 577 Cranberry Juice and Urinary Tract Infections (E), 499
Drinking and Driving (E), 294 Drug Tests and Treatments (E), 417, 458, 832; (IE), 30, 424; (JC), 338 Freshman 15 Weight Gain (E), 831 Genetic Defects (E), 328 Health and Education Levels (E), 761 Health Benefits of Fish (E), 500 Hearing Aids (E), 762 Heart Disease (E), 69; (IE), 83 Hepatitis C (E), 74 Herbal Compounds (E), 43 Hormones (GE), 818–819 Hospital Charges and Discharges (E), 76 Lifestyle and Weight (IE), 248 Medical Tests and Equipment (EIA), 608; (IE), 408; (JC), 588 Number of Doctors (IE), 135 Nutrition Labels (E), 542, 667; (IE), 543–544, 632–633 Orthodontist Costs (E), 231 Patient Complaints (E), 805 Pharmaceutical Companies (E), 330, 354, 575 Placebo Effect (E), 352, 416; (IE), 731 Public Health Research (IE), 722 Respiratory Diseases (E), 69 Side Effects of a Drug (E), 234 Smoking (E), 327 Teenagers and Dangerous Behaviors (E), 417, 833; (IE), 62 Vaccinations (E), 324 Vision (E), 327 Vitamins (E), 268, 330; (IE), 30 World Health Organization (IE), 37
Politics and Popular Culture 2008 Elections (E), 424 Attitudes on Appearance (GE), 485–487; (IE), 484–485 Candidates (BE), 338 Cosmetics (IE), 484–488 Election Polls (E), 294, 329; (IE), 271, 274, 283, 307–308 Fashion (BE), 187, 190; (EIA), 287; (IE), 469; (JC), 685; (P), 706 Governor Approval Ratings (E), 331 Pets (E), 75; (IE), 722–723; (JC), 723, 726, 732, 740 Playgrounds (E), 295 Political Parties (E), 198, 234; (EIA), 872 Readiness for a Woman President (E), 330, 534, 712 Religion (E), 293 Roller Coasters in Theme Parks (IE), 625–630, 635–639 Tattoos (E), 74 Titanic, sinking of (E), 204, 498, 503 Truman vs. Dewey (IE), 271, 283
Quality Control Cuckoo Birds (E), 832 Food Inspection and Safety (E), 294, 296, 329; (GE), 59–60; (IE), 276 Product Defects (E), 165, 232, 234, 328, 331, 353, 497, 802–803; (IE), 802; (P), 351–352 Product Inspections and Testing (E), 119, 203, 234–235, 264, 269, 293, 326, 329, 457, 465, 497, 760, 766–767, 829; (IE), 152, 359–360, 625, 774; (P), 758 Product Ratings and Evaluations (E), 44, 200, 353–354, 458, 663–665; (IE), 626 Product Recalls (E), 234 Product Reliability (E), 207, 499, 802; (IE), 626 Repair Calls (E), 233 Six Sigma (IE), 796 Taste Tests (E), 663–665, 761; (IE), 731 Warranty on a Product (E), 204
Real Estate Commercial Properties (BE), 812, 814, 822; (GE), 561–564 Comparative Market Analyses (E), 117, 331, 465 Fair Housing Act of 1968 (E), 503 Foreclosures (E), 327, 389, 830; (GE), 304–305 Home Buyers (IE), 285, 583 Home Ownership (E), 356 Home Sales and Prices (BE), 90, 96–97, 649; (E), 72, 120, 122, 163, 198, 201, 206, 234, 392, 465, 530, 577, 616–618; (GE), 593–597, 604–606, 644–648; (IE), 190, 583–591, 597–601, 607, 643, 649–650; (P), 332, 385, 451, 826–827 Home Size and Price (E), 166; (GE), 147–149 Home Values (E), 323, 389, 463, 465, 532; (GE), 593–597; (IE), 583–584 Housing Development Projects (EIA), 848–849 Housing Industry (E), 392–393, 463; (EIA), 848–849 Housing Inventory and Time on Market (E), 121, 422; (GE), 604–606 MLS (E), 121 Real Estate Websites (IE), 583–584 Renting (E), 503 Standard and Poor’s Case-Shiller Home Price Index (E), 122 Zillow.com real estate research site (GE), 593; (IE), 583–584
Salary and Benefits Assigned Parking Spaces (JC), 489 Companionship and Non-medical Home Services (EIA), 523 Day Care (E), 324 Employee Benefits (E), 330, 356 Executive Compensation (E), 162–163, 389; (IE), 100–102, 258, 371–372 Hourly Wages (E), 536, 764 Pensions (IE), 210 Raises and Bonuses (E), 45, 764; (IE), 34–35 Salaries (BE), 102; (E), 161, 163–165, 614–616, 618, 762; (EIA), 608; (IE), 477, 607 Training and Mentorship Programs (EIA), 523
Sales and Retail Air Conditioner Sales (E), 163 American Girl Sales (E), 70 Book Sales and Stores (E), 160–162, 497, 500 Campus Calendar Sales (E), 72 Car Sales (E), 114, 121, 164 Catalog Sales (BE), 604; (E), 44, 328; (IE), 733, 735–736; (JC), 685 Closing (E), 234 Clothing Stores (BE), 404, 405, 408, 586–587; (E), 187 Coffee Shop (E), 163–164, 292, 497; (EIA), 105; (JC), 302 Comparing Sales Across Different Stores (E), 456 Computer Outlet Chain (JC), 138 Department Store (E), 75 Food Store Sales (BE), 112, 117, 329, 455; (EIA), 287, 348 Friendship and Sales (GE), 435–438, 812–813; (IE), 434 Gemstones (BE), 544, 547, 550, 553–554, 558, 559, 630–634, 639–640, 643–644, 654–655, 821 International Sales (JC), 563 Monthly Sales (E), 708 Music Stores (E), 110; (IE), 733, 735–736, 738–739 New Product Sales (IE), 424 Number of Employees (JC), 138, 146 Optometry Shop (JC), 57 Paper Sales (IE), 61 Predicted vs. Realized Sales (E), 45; (EIA), 608 Promotional Sales (E), 201, 354–355; (GE), 429–433; (IE), 186–187, 424–425, 431 Quarterly Sales and Forecasts (BE), 691; (E), 708, 711, 713; (GE), 686–689, 697–699; (IE), 672, 689–697; (P), 706 Regional Sales (BE), 471; (E), 165 Retail and Wholesale Sales (E), 296 Retail Price (GE), 514–516; (IE), 508, 518–522; (JC), 563; (P), 526–527 Sales Costs and Growth (E), 72 Sales Representatives (E), 498; (EIA), 608 Seasonal Spending (E), 539, 711, 713; (GE), 441–444; (IE), 673–674, 691–692 Secret Sales (E), 201 Shelf Location and Sales (E), 457, 767–768; (IE), 30, 544 Shopping Malls (IE), 285; (JC), 302 U.S. Retail Sales and Food Index (E), 618, 620–621 Weekly Sales (E), 455, 461; (IE), 552–553 Yearly Sales (E), 577–578, 709
Science Aerodynamics (IE), 276 Biotechnology Firms (E), 328–329 Chemical Industry (E), 329 Chemicals and Congenital Abnormalities (E), 355 Cloning (E), 328 Cloud Seeding (E), 463 Contaminants and Fish (BE), 370; (E), 296 Gemini Observatories (E), 801 IQ Tests (E), 262, 264, 265, 760; (IE), 153 Metal Alloys (IE), 507 Psychology Experiments (BE), 731; (E), 760–761 Research Grant Money (EIA), 39 Soil Samples (E), 294 Space Exploration (E), 295 Temperatures (E), 170; (IE), 753 Testing Food and Water (E), 294, 457, 531 Units of Measurement (E), 296; (IE), 36n, 555, 640
Service Industries and Social Issues American Association of Retired People (E), 331 American Heart Association (IE), 541 American Red Cross (E), 202; (GE), 223–224 Charities (E), 331; (IE), 409; (P), 385–386 Firefighters (IE), 154 Fundraising (E), 325–326 Nonprofit and Philanthropic Organizations (E), 120, 266, 325–326, 331, 420; (GE), 223–224; (IE), 37, 47–48, 273, 807, 857; (P), 385–386, 660 Paralyzed Veterans of America (IE), 273, 857; (P), 660 Police (E), 391–392, 498–499, 614–616 Service Firms (E), 504 Volunteering (EIA), 105
Sports Baseball (E), 45, 115, 164–165, 167–168, 293, 458, 463, 664–665, 800, 805–806, 830–831, 833; (GE), 345–346; (IE), 31, 178; (JC), 219 Basketball (E), 801 Cycling (E), 203, 233, 708, 856; (EIA), 796; (IE), 556 Exercise (general) (E), 266, 459, 465 Fishing (E), 232, 386 Football (E), 164, 534; (IE), 48–49 Golf (E), 116–117, 393, 462; (P), 612 Hockey (E), 115
Indianapolis 500 (E), 44 Kentucky Derby (E), 44–45, 118 NASCAR (E), 263 Olympics (E), 70–71, 461–462, 760; (IE), 654–656 Pole Vaulting (E), 800 Running (E), 461–462; (IE), 556 Sailing (E), 830 Skiing (E), 421, 708; (IE), 654–656 Super Bowl (IE), 48–49 Swimming (E), 263, 462, 760 Tennis (E), 268
Surveys and Opinion Polls Company Surveys (E), 292, 330, 503–504 Consumer Polls (E), 198, 203–204, 234, 292–296, 329–331, 356, 499–502, 762; (EIA), 287, 490; (GE), 485; (IE), 272, 277, 281–285, 808; (JC), 191, 225, 276, 280, 302; (P), 198, 290–291 Cornell National Social Survey (CNSS) (BE), 518 Gallup Polls (BE), 481–482, 484, 488; (E), 203, 294–296, 329, 355, 386, 534, 712; (IE), 271–272, 306–307; (P), 322 International Polls (E), 329–330, 500; (IE), 484–488 Internet and Email Polls (E), 43–44, 111, 112, 294, 324, 325–326, 331, 763; (GE), 184; (JC), 276; (P), 290–291 Mailed Surveys (BE), 316; (E), 292, 328; (GE), 184; (IE), 283, 316 Market Research Surveys (E), 72, 75, 201–202, 292–296, 325, 354, 855; (GE), 59–60, 184, 280; (IE), 53, 272–275, 278, 285; (P), 290–291 Newspaper Polls (E), 294–295, 331 Public Opinion Polls (BE), 316–317; (E), 69–70, 75, 201–205, 294–296, 327, 503–504; (GE), 59–60, 312–313; (IE), 53–55, 271–272, 275, 277–278, 281–282, 306–308; (JC), 314, 441 Student Surveys (E), 43–44, 329, 499; (GE), 280–281; (IE), 30, 35, 39; (JC), 441, 489 Telephone Surveys (E), 201–202, 205, 235, 323, 324, 355, 518; (GE), 186; (IE), 180, 271–272, 274, 283–285, 317; (JC), 225
Technology Cell Phones (E), 163, 235, 264, 269, 295, 323, 391, 762; (IE), 149–150, 285, 340 Compact Discs (E), 331 Computers (BE), 35; (E), 203, 207, 294, 323, 391, 457, 539; (GE), 213–214; (IE), 274, 277; (P), 799; (TH), 65–67, 289 Digital music (E), 330 Downloading Movies or Music (BE), 79–80, 95, 99; (E), 110, 329, 331; (JC), 379 DVDs (E), 801; (IE), 835–836 E-Mail (E), 202, 328 Flash Drives (IE), 61 Hard Drives (E), 160, 161 Help Desk (IE), 836–837, 843–845 Impact of the Internet and Technology on Daily Life (E), 504 Information Technology (E), 354, 456, 500, 503; (P), 851–852 Internet Access (BE), 724; (E), 295, 461, 763 iPods and MP3 Players (E), 117, 323, 804; (JC), 88 LCD screens (BE), 255; (E), 234 Managing Spreadsheet Data (TH), 42 Multimedia Products (IE), 835–836 Online Journals or Blogs (E), 504 Personal Data Assistant (PDA) (E), 853 Personal Electronic Devices (IE), 191 Product Instruction Manuals (E), 355; (IE), 278 Software (E), 71, 232, 293; (IE), 38, 39, 103, 357, 420 Technical Support (IE), 836, 848 Telecommunications (BE), 472, 478; (E), 201; (IE), 836
Transportation Air Travel (BE), 488–489, 837, 839; (E), 200, 231, 263, 296, 323, 354, 390, 574, 577, 715–716, 855; (EIA), 657; (IE), 284, 300; (JC), 280, 406; (P), 351, 495 Border Crossings (BE), 675–676, 692, 695–696 Cars (BE), 426, 696; (E), 45, 164, 232, 328, 354, 499, 532, 537, 761, 768–769; (EIA), 872; (GE), 814–816; (IE), 542, 642–643, 650, 808, 813–814
Commuting to Work (E), 232; (JC), 248 Motorcycles (E), 45, 621–622, 668 National Highway Transportation Safety Administration (BE), 128; (E), 768–769 Seatbelt Use (E), 268, 352 Texas Transportation Institute (IE), 126; (E), 663 Traffic Accidents (BE), 128, 131, 136, 147 Traffic and Parking (E), 234, 324, 389, 391–392, 663–666
Traffic Congestion and Speed (E), 262, 267, 327; (IE), 127 Travel and Tourism (E), 231, 323, 714; (EIA), 657 U.S. Bureau of Transportation Statistics (E), 390 U.S. Department of Transportation (BE), 675; (E), 390
1 Data and Decisions
E-Commerce

E-Commerce and mobile commerce have dramatically changed the way the world shops. Online shoppers can buy clothes, food, even cars with the click of a mouse and a digital swipe of their credit card—24 hours a day, 7 days a week. Companies now reach their customers in ways no one could even imagine just a generation ago. Online sales in some sectors, such as clothing and electronics, already account for over 15% of total sales, which is about double what it was five years ago. U.S. adults, on average, currently spend about $1200 a year online, but some projections put that number at nearly $2000 a year by 2016. The trend in online shopping is worldwide. The amount Australians spend online is expected to grow by $10B in the next five years. The research firm Forrester estimates that global digital retailing is headed toward 15 to 20% of total sales worldwide in the near future.

A few generations ago, many store owners knew their customers well. With that knowledge, they could personalize their suggestions, guessing which items that particular customer might like. Online marketers rely on similar information about customers and potential customers to make decisions. But in today’s digital age, retailers never meet their customers, so that information has to be obtained in other ways. How do today’s companies know which ads to place on your browser or what order to list the websites from your search? How do marketers know what to advertise and to whom? The answer is … Data.
1.1 What Are Data?

“Data is king at Amazon. Clickstream and purchase data are the crown jewels at Amazon. They help us build features to personalize the website experience.” —Ronny Kohavi, Former Director of Data Mining and Personalization, Amazon.com

“It is the mark of a truly intelligent person to be moved by statistics.” —George Bernard Shaw

Q: What is Statistics?
A: Statistics is a way of reasoning, along with a collection of tools and methods, designed to help us understand the world.
Q: What are statistics?
A: Statistics (plural) are quantities calculated from data.
Q: So what is data?
A: You mean, “what are data?” Data is the plural form. The singular is datum.
Q: So, what are data?
A: Data are values along with their context.
Businesses have always relied on data for planning and to improve efficiency and quality. Now, more than ever before, businesses rely on the information in data to compete in the global marketplace. Every time you make an online purchase, much more information is actually captured than just the details of the purchase itself. What pages did you search in order to get to your purchase? How much time did you spend looking at each? Companies use this information to make decisions about virtually all phases of their business, from inventory to advertising to website design.

These data are recorded and stored electronically, in vast digital repositories called data warehouses. In the past few decades these data warehouses have grown enormously in size, but with the use of powerful computers, the information contained in them is accessible and used to help make decisions. The huge capacity of these warehouses has given rise to the term Big Data to describe data sets so large that traditional methods of storage and analysis are inadequate.

Even though the data amounts are huge, some decisions can be made quickly. When you pay with your credit card, for example, the information about the transaction is transmitted to a central computer where it is processed and analyzed. A decision whether to approve or deny your purchase is made and transmitted back to the point of sale, all within a few seconds.

But data alone can’t help you make better business decisions. You must be able to summarize, model, and understand what the data can tell you. That collection of tools and its associated reasoning is what we call “Statistics.”

Statistics plays a role in making sense of our complex world in an astonishing number of ways. Statisticians assess the risk of genetically engineered foods or of a new drug being considered by the Food and Drug Administration (FDA). Statisticians predict the number of new cases of AIDS by regions of the country or the number of customers likely to respond to a sale at the supermarket. And statisticians help scientists, social scientists, and business leaders understand how unemployment is related to environmental controls, whether enriched early education affects the later performance of school children, and whether vitamin C really prevents illness. Whenever you have data and a need to understand the world or make an informed decision, you need Statistics.

If we want to analyze student perceptions of business ethics (a question we’ll come back to in a later chapter), should we administer a survey to every single university student in the United States—or, for that matter, in the world? Well, that wouldn’t be very practical or cost-effective. Instead, we can try to obtain survey responses from a smaller, representative group of students. Statistics can help us make the leap from a smaller sample of data we have at hand to an understanding of the world at large. We talk about the specifics of sampling in Chapter 8, and the theme of generalizing from the specific to the general is one that we revisit throughout this book.

We hope this text will empower you to draw conclusions from data and make valid business decisions in response to such questions as:
• Will the new design of our website increase click-through rates and result in more sales?
• What is the effect of advertising on sales?
• Do aggressive, “high-growth” mutual funds really have higher returns than more conservative funds?
• Is there a seasonal cycle in your firm’s revenues and profits?
• What is the relationship between shelf location and cereal sales?
• Do students around the world perceive issues in business ethics differently?
• Are there common characteristics about your customers and why they choose your products?—and, more importantly, are those characteristics the same among those who aren’t your customers?
Our ability to answer questions such as these and make sound business decisions with data depends largely on our ability to understand variation. That may not be the
term you expected to find at the end of that sentence, but it is the essence of Statistics. The key to learning from data is understanding the variation that is all around us. Data vary. People are different. So are economic conditions from month to month. We can’t see everything, let alone measure it all. And even what we do measure, we measure imperfectly. So the data we wind up looking at and basing our decisions on provide, at best, an imperfect picture of the world. Variation lies at the heart of what Statistics is all about. How to make sense of it is the central challenge of Statistics.
Companies use data to make decisions about nearly every aspect of their business. By studying the past behavior of customers and predicting their responses, they hope to better serve their customers and to compete more effectively. This process of using data, especially transactional data (data collected for recording the companies’ transactions), to make decisions and predictions is sometimes called data mining or predictive analytics. The more general term business analytics (or sometimes simply analytics) describes any use of data and statistical analysis to drive business decisions, whether the purpose is predictive or simply descriptive.
Leading companies are embracing business analytics. Reed Hastings, a former computer science major, is the founder and CEO of Netflix. Netflix uses analytics on customer information both to recommend new movies and to adapt the website that customers see to individual tastes. Netflix offered a $1 million prize to anyone who could improve on the accuracy of their recommendations by more than 10%. That prize was won in 2009 by a team of statisticians and computer scientists using data-mining techniques. The Oakland Athletics use analytics to judge players instead of the traditional methods used by scouts and baseball experts for over a hundred years.
The book and movie Moneyball document how business analytics enabled them to put together a team that could compete against the richer teams in spite of the severely limited resources available to the front office. eBay used analytics to examine its own use of computer resources. Although not obvious to their own technical people, once they crunched the data they found huge inefficiencies. According to Forbes, they were able to “save millions in capital expenditures within the first year.” To begin to make sense of data, we first need to understand its context. Whether the data are numerical (consisting only of numbers), alphabetic (consisting only of letters), or alphanumerical (mixed numbers and letters), they are useless unless we know what they represent. Newspaper journalists know that the lead paragraph of a good story should establish the “Five W’s”: who, what, when, where, and (if possible) why. Often, we add how to the list as well. Answering these questions can provide a context for data values and make them meaningful. The answers to the first two questions are essential. If you can’t answer who and what, you don’t have data, and you don’t have any useful information. We can make the meaning clear if we add the context of who the data are about and what was measured and organize the values into a data table such as this one. Table 1.1 shows purchase records from an online music retailer. Each row represents a purchase of a music album. In general, rows of a data table correspond to individual cases about which we’ve recorded some characteristics called variables.
The W’s: Who What When Where Why
Order Number | Name | State/Country | Price | Area Code | Album Download | Gift? | Stock ID | Artist
105-2686834-3759466 | Katherine H. | Ohio | 5.99 | 440 | Identity | N | B00000I5Y6 | James Fortune & Flya
105-9318443-4200264 | Samuel P. | Illinois | 9.99 | 312 | Port of Morrow | Y | B000002BK9 | The Shins
105-1872500-0198646 | Chris G. | Massachusetts | 9.99 | 413 | Up All Night | N | B000068ZVQ | Syco Music UK
103-2628345-9238664 | Monique D. | Canada | 10.99 | 902 | Fallen Empires | N | B000001OAA | Snow Patrol
002-1663369-6638649 | Katherine H. | Ohio | 11.99 | 440 | Sees the Light | N | B002MXA7Q0 | La Sera
Table 1.1 Example of a data table. The variable names are in the top row. Typically, the Who of the table are found in the leftmost column.
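One way to check what the cases in a table like Table 1.1 are is to ask which column takes a different value in every row. The sketch below is only an illustration: it copies three columns from Table 1.1 into a list of dictionaries, and the field names (`order_number`, `name`, `price`) and the helper `is_identifier` are hypothetical choices, not part of any particular statistics package.

```python
# Three columns copied from Table 1.1; each dictionary is one row (one case).
# The field names are hypothetical labels chosen for this sketch.
orders = [
    {"order_number": "105-2686834-3759466", "name": "Katherine H.", "price": 5.99},
    {"order_number": "105-9318443-4200264", "name": "Samuel P.", "price": 9.99},
    {"order_number": "105-1872500-0198646", "name": "Chris G.", "price": 9.99},
    {"order_number": "103-2628345-9238664", "name": "Monique D.", "price": 10.99},
    {"order_number": "002-1663369-6638649", "name": "Katherine H.", "price": 11.99},
]


def is_identifier(rows, column):
    """Return True if every row has a distinct value in `column`."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)


print(is_identifier(orders, "order_number"))  # True: one row per order
print(is_identifier(orders, "name"))          # False: Katherine H. appears twice
```

Because the order number is unique in every row and the customer name is not, the rows of this table describe orders rather than people.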
Cases go by different names, depending on the situation. Individuals who answer a survey are referred to as respondents. People on whom we experiment are subjects or (in an attempt to acknowledge the importance of their role in the experiment) participants, but animals, plants, websites, and other inanimate subjects are often called experimental units. Often we call cases just what they are: for example, customers, economic quarters, or companies. In a database, rows are called records; in this example, purchase records. Perhaps the most generic term is cases. In Table 1.1, the cases are the individual orders. The column titles (variable names) tell what has been recorded.
What does a row of Table 1.1 represent? Be careful. Even if people are involved, the cases may not correspond to people. For example, in Table 1.1, each row represents a different order and not the customer who made the purchases (notice that the same person made two different orders). A common place to find the who of the table is the leftmost column. It’s often an identifying variable for the cases; in this example, it is the order number.
If you collect the data yourself, you’ll know what the cases are and how the variables are defined. But, often, you’ll be looking at data that someone else collected. The information about the data, called the metadata, might have to come from the company’s database administrator or from the information technology department of a company. Metadata typically contains information about how, when, and where (and possibly why) the data were collected; who each case represents; and the definitions of all the variables.
A general term for a data table like the one shown in Table 1.1 is a spreadsheet, a name that comes from bookkeeping ledgers of financial information. The data were typically spread across facing pages of a bound ledger, the book used by an accountant for keeping records of expenditures and sources of income.
For the accountant, the columns were the types of expenses and income, and the cases were transactions, typically invoices or receipts. These days, it is common to keep modest-size datasets in a spreadsheet even if no accounting is involved. It is usually easy to move a data table from a spreadsheet program to a program designed for statistical graphics and analysis, either directly or by copying the data table and pasting it into the statistics program. Although data tables and spreadsheets are great for relatively small data sets, they are cumbersome for the complex data sets that companies must maintain on a day-to-day basis. Try to imagine a spreadsheet from a company the size of Amazon with customers in the rows and products in the columns. Amazon has tens of millions of customers and millions of products. But very few customers have purchased more than a few dozen items, so almost all the entries would be blank––not a very efficient way to store information. For that reason, various other database architectures are used to store data. The most common is a relational database. In a relational database, two or more separate data tables are linked together so that information can be merged across them. Each data table is a relation because it is about a specific set of cases with information about each of these cases for all (or at least most) of the variables (“fields” in database terminology). For example, a table of customers, along with demographic information on each, is such a relation. A data table of all the items sold by the company, including information on price, inventory, and past history, is another relation. Transactions may be held in a third “relation” that references each of the other two relations. Table 1.2 shows a small example. In statistics, analyses are typically performed on a single relation because all variables must refer to the same cases. But often the data must be retrieved from a relational database. 
Retrieving data from these databases may require specific expertise with that software. In the rest of the book, we’ll assume that the data have been retrieved and placed in a data table or spreadsheet with variables listed as columns and cases as the rows.
Customers
Customer Number | Name | City | State | Zip Code | Customer since | Gold Member?
473859 | R. De Veaux | Williamstown | MA | 01267 | 2007 | No
127389 | N. Sharpe | Washington | DC | 20052 | 2000 | Yes
335682 | P. Velleman | Ithaca | NY | 14580 | 2003 | No
…

Items
Product ID | Name | Price | Currently in Stock?
SC5662 | Silver Cane | 43.50 | Yes
TH2839 | Top Hat | 29.99 | No
RS3883 | Red Sequined Shoes | 35.00 | Yes
…

Transactions
Transaction Number | Date | Customer Number | Product ID | Quantity | Shipping Method | Free Ship?
T23478923 | 9/15/13 | 473859 | SC5662 | 1 | UPS 2nd Day | N
T23478924 | 9/15/13 | 473859 | TH2839 | 1 | UPS 2nd Day | N
T63928934 | 10/20/13 | 335682 | TH2839 | 3 | UPS Ground | N
T72348299 | 12/22/13 | 127389 | RS3883 | 1 | Fed Ex Ovnt | Y
Table 1.2 A relational database shows all the relevant information for three separate relations linked together by customer and product numbers.
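To see how the three relations in Table 1.2 can be merged into the single data table that a statistical analysis expects, here is a minimal sketch in plain Python. It assumes the relations are small enough to hold in dictionaries keyed by their identifiers; the function name `flatten` and the field names are hypothetical, and a real company database would typically be queried with database software instead.

```python
# Two relations from Table 1.2, keyed by their identifier variables.
customers = {
    "473859": {"name": "R. De Veaux", "city": "Williamstown"},
    "127389": {"name": "N. Sharpe", "city": "Washington"},
    "335682": {"name": "P. Velleman", "city": "Ithaca"},
}
items = {
    "SC5662": {"name": "Silver Cane", "price": 43.50},
    "TH2839": {"name": "Top Hat", "price": 29.99},
    "RS3883": {"name": "Red Sequined Shoes", "price": 35.00},
}
# The third relation references the other two by Customer Number and Product ID.
transactions = [
    {"transaction": "T23478923", "customer": "473859", "product": "SC5662", "quantity": 1},
    {"transaction": "T23478924", "customer": "473859", "product": "TH2839", "quantity": 1},
    {"transaction": "T63928934", "customer": "335682", "product": "TH2839", "quantity": 3},
    {"transaction": "T72348299", "customer": "127389", "product": "RS3883", "quantity": 1},
]


def flatten(transactions, customers, items):
    """Build one row per transaction by looking up the linked customer and item."""
    rows = []
    for t in transactions:
        customer = customers[t["customer"]]
        item = items[t["product"]]
        rows.append({
            "transaction": t["transaction"],
            "customer_name": customer["name"],
            "item_name": item["name"],
            "quantity": t["quantity"],
            "price": item["price"],
        })
    return rows


flat = flatten(transactions, customers, items)
for row in flat:
    print(row)
```

In the merged table every row is a transaction, so all of the variables refer to the same cases.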
For Example
Identifying variables and the W’s
Carly, a marketing manager at a credit card bank, wants to know if an offer mailed 3 months ago has affected customers’ use of their cards. To answer that, she asks the information technology department to assemble the following information for each customer: total spending on the card during the 3 months before the offer (Pre Spending); total spending for 3 months after the offer (Post Spending); the customer’s Age (by category); what kind of expenditure they made (Segment); if customers are enrolled in the website (Enroll?); what offer they were sent (Offer); and the amount each customer spent on the card in their segment (Segment Spend). She gets a spreadsheet whose first six rows look like this:
Account ID | Pre Spending | Post Spending | Age | Segment | Enroll? | Offer | Segment Spend
393371 | $2,698.12 | $6,261.40 | 25–34 | Travel/Ent | NO | None | $887.36
462715 | $2,707.92 | $3,397.22 | 45–54 | Retail | NO | Gift Card | $5,062.55
433469 | $800.51 | $4,196.77 | 65+ | Retail | NO | None | $673.80
462716 | $3,459.52 | $3,335.00 | 25–34 | Services | YES | Double Miles | $800.75
420605 | $2,106.48 | $5,576.83 | 35–44 | Leisure | YES | Double Miles | $3,064.81
473703 | $2,603.92 | $7,397.50 | <25 | Travel/Ent | YES | Double Miles | $491.29
Question Identify the cases and the variables. Describe as many of the W’s as you can for this data set.
Answer The cases are individual customers of the credit card bank. The data are from the internal records of the credit card bank for the past 6 months (3 months before and 3 months after an offer was sent to the customers). The variables include the account ID of the customer (Account ID) and the amount charged on the card before (Pre Spending) and after (Post Spending) the offer was sent out. Also included are the customer’s Age, marketing Segment, whether they enrolled on the website (Enroll?), what offer they were sent (Offer), and how much they charged on the card in their marketing segment (Segment Spend).
1.2 Variable Types
Categorical, or Quantitative?
When area codes were first introduced, all phones had dials. To reduce wear and tear on the dials and to speed calls, the lowest-digit codes (the fastest to dial, those for which the dial spun the least) were assigned to the largest cities. So, New York City was given 212, Chicago 312, LA 213, and Philadelphia 215, but rural upstate New York was 607, Joliet was 815, and San Diego 619. Back then, the numerical value of an area code could be used to guess something about the population of its region. But after dials gave way to push buttons, new area codes were assigned without regard to population, and area codes are now just categories.
When the values of a variable are simply the names of categories, we call it a categorical, or qualitative, variable. When the values of a variable are measured numerical quantities with units, we call it a quantitative variable. Descriptive responses to questions are often categories. For example, the responses to the questions “What type of mutual fund do you invest in?” or “What kind of advertising does your firm use?” yield categorical values. An important special case of categorical variables is one that has only two possible responses (usually “yes” or “no”), which arise naturally from questions like “Do you invest in the stock market?” or “Do you make online purchases from this website?”
Question | Categories or Responses
Do you invest in the stock market? | __ Yes __ No
What kind of advertising do you use? | __ Newspapers __ Internet __ Direct mailings
What is your class at school? | __ Freshman __ Sophomore __ Junior __ Senior
I would recommend this course to another student. | __ Strongly Disagree __ Slightly Disagree __ Slightly Agree __ Strongly Agree
How satisfied are you with this product? | __ Very Unsatisfied __ Unsatisfied __ Satisfied __ Very Satisfied
Table 1.3 Some examples of categorical variables.
Variable Names that Make Sense
A tradition still hangs on in some places to name variables with cryptic abbreviations in uppercase letters. This can be traced back to the 1960s, when computer programs were controlled with instructions punched on cards. The earliest punch card equipment used only uppercase letters, and statistics programs limited variable names to six or eight characters, so variables had names like PRSRF3. Modern programs don’t have such restrictive limits, so there is no reason not to use names that make sense.
Many measurements are quantitative. In a purchase record, price, quantity, and time spent on the website are all quantitative values with units (dollars, count, and seconds). For quantitative variables, the units tell how each value has been measured. Even more important, units such as yen, cubits, carats, angstroms, nanoseconds, miles per hour, or degrees Celsius tell us the scale of measurement, so we know how far apart two values are. Without units, the values of a measured variable have no meaning. It does little good to be promised a raise of 5000 a year if you don’t know whether it will be paid in euros, dollars, yen, or Estonian krooni. An essential part of a quantitative variable is its units.
The distinction between categorical and quantitative variables seems clear, but there are reasons to be careful. First, some variables can be considered as either categorical or quantitative, depending on the kind of questions we ask about them. For example, the variable Age would be considered quantitative if the responses were numerical and they had units. A doctor would certainly consider Age to be quantitative. The units could be years, or for infants, the doctor would want even more precise units, like months, or even days. On the other hand, a retailer might lump together the values into categories like “Child (12 years or less),” “Teen (13 to 19),” “Adult (20 to 64),” or “Senior (65 or over).” For many purposes, like knowing which song download coupon to send you, that might be all the information needed. Then Age would be a categorical variable.
How to classify some variables as categorical or quantitative may seem obvious. But be careful. Area codes may look quantitative, but are really categories. What about ZIP codes? They are categories too, but the numbers do contain information. If you look at a map of the United States with ZIP codes, you’ll see that as you move West, the first digit of ZIP codes increases, so treating them as quantitative might make sense for some questions.
Another reason to be careful about classifying variables comes from the analysis of Big Data. When analysts want to decide what advertisement to send to the web page you’re looking at, or what the probability is that you’ll renew your phone contract, they use automatic methods involving dozens or even hundreds of variables. Usually the software used to do the analysis has to guess the type of variable from its values. When the variable contains symbols other than numbers, the software will correctly type the variable as categorical, but just because a variable has numbers doesn’t mean it is quantitative. We’ve seen examples (area code, order number) where that’s just not the case. Data miners spend much of their time going back through data sets to correctly retype variables as categorical or quantitative to avoid silly mistakes of misuse.
Chapter 2 discusses summaries and displays of categorical variables more fully. Chapter 3 discusses quantitative variables, which require different summaries and displays.
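The warning about software guessing variable types can be made concrete with a small sketch. The rule in `naive_guess` below is a deliberately simple assumption about how such guessing might work (any column whose values all parse as numbers is called quantitative); it is not the rule used by any specific package. The set `known_categorical` and the function `variable_type` are hypothetical names for the manual retyping step described above.

```python
def naive_guess(values):
    """Guess 'quantitative' if every value parses as a number, else 'categorical'."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "categorical"
    return "quantitative"


# Three columns from Table 1.1, stored as text the way a raw file would hold them.
columns = {
    "Price": ["5.99", "9.99", "9.99", "10.99", "11.99"],
    "Area Code": ["440", "312", "413", "902", "440"],
    "Gift?": ["N", "Y", "N", "N", "N"],
}

# Variables the analyst knows are categories despite looking numeric (an assumption
# supplied by a person who understands the context, not something the data reveal).
known_categorical = {"Area Code"}


def variable_type(name, values):
    """Apply the analyst's override first, then fall back to the naive guess."""
    if name in known_categorical:
        return "categorical"
    return naive_guess(values)


for name, values in columns.items():
    print(name, "| naive:", naive_guess(values), "| corrected:", variable_type(name, values))
```

The naive rule labels Area Code as quantitative because its values are digits; only the context-based override types it correctly.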
Identifiers
A special kind of categorical variable is worth mentioning. Identifier variables are categorical variables whose only purpose is to assign a unique identifier code to each individual in the data set. Your student ID number, social security number, and phone number are all identifiers. Identifier variables are crucial in this era of Big Data because by uniquely identifying the cases, they make it possible to combine data from different sources, protect confidentiality, and provide unique labels. Your school’s grade transcripts are likely in a different relation than your bursar bill records. Your student ID is what links them. Most companies keep such relational databases. The identifier is crucial to linking one data table to another in a relational database. The identifiers in Table 1.2 are the Customer Number, Product ID, and Transaction Number. Variables like UPS Tracking Number and Social Security Number are other examples of identifiers.
Other Data Types
Many companies follow up with customers after a service call or sale with an online questionnaire. They might ask: “How satisfied were you with the service you received?” 1) Not satisfied; 2) Somewhat satisfied; 3) Moderately satisfied; or 4) Extremely satisfied. Is this variable categorical or quantitative? There is certainly an order of perceived worth; higher numbers indicate higher perceived worth. An employee whose customer responses average around 4 seems to be doing a better job than one whose average is around 2, but are they twice as good? These values are not strictly numbers, so we can’t really answer that question. When the values of a categorical variable have an intrinsic order, we can say that the variable is ordinal. By contrast, a categorical variable with unordered categories is sometimes called nominal. Values can be individually ordered (e.g., the ranks of employees based on the number of days they’ve worked for the company) or ordered in classes (e.g., Freshman, Sophomore, Junior, Senior). Ordering is not absolute; how the values are ordered depends on the purpose of the ordering. For example, are the categories Infant, Youth, Teen, Adult, and Senior ordinal? Well, if we are ordering on age, they surely are and how to order the categories is clear. But if we are ordering on purchase volume, it is likely that either Teen or Adult will be the top group.1

Year | Total Revenue (in $M)
2002 | 3288.9
2003 | 4075.5
2004 | 5294.2
2005 | 6369.3
2006 | 7786.9
2007 | 9441.5
2008 | 10,383.0
2009 | 9774.6
2010 | 10,707
2011 | 11,700
2012 | 13,300
Table 1.4 Starbucks’s total revenue (in $M) for the years 2002 to 2012.
Cross-Sectional and Time Series Data
The quantitative variable Total Revenue in Table 1.4 is an example of a time series. A time series is an ordered sequence of values of a single quantitative variable measured at regular intervals over time. Time series are common in business. Typical measuring points are months, quarters, or years, but virtually any consistently spaced time interval is possible. Variables collected over time hold special challenges for statistical analysis, and Chapter 19 discusses these in more detail. By contrast, most of the methods in this book are better suited for cross-sectional data, where several variables are measured at the same time point. If we collect data on sales revenue, number of customers, and expenses for last month at each Starbucks (more than 20,000 locations as of 2012) at one point in time, this would be cross-sectional data. Cross-sectional data may contain some time information (such as dates), but it isn’t a time series because it isn’t measured at regular intervals. Because different methods are used to analyze these different types of data, it is important to be able to identify both time series and cross-sectional data sets.
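The defining feature of a time series, values of one variable at regular intervals, can be checked directly. The following sketch uses the Total Revenue values from Table 1.4; the variable names (`revenue`, `regularly_spaced`, `changes`, `declines`) are hypothetical, and the year-to-year differences are just one simple summary that relies on the ordering in time.

```python
# Total Revenue (in $M) by year, copied from Table 1.4.
revenue = {
    2002: 3288.9, 2003: 4075.5, 2004: 5294.2, 2005: 6369.3,
    2006: 7786.9, 2007: 9441.5, 2008: 10383.0, 2009: 9774.6,
    2010: 10707.0, 2011: 11700.0, 2012: 13300.0,
}

years = sorted(revenue)

# A time series is measured at regular intervals: here every gap should be one year.
gaps = {later - earlier for earlier, later in zip(years, years[1:])}
regularly_spaced = gaps == {1}

# Because the values are ordered in time, year-over-year changes are meaningful.
changes = {year: revenue[year] - revenue[year - 1] for year in years[1:]}
declines = [year for year in years[1:] if changes[year] < 0]

print("Regularly spaced:", regularly_spaced)
print("Years with a decline in revenue:", declines)
```

A cross-sectional table, such as one month of figures for every store, has no such ordering, so differences between consecutive rows would not mean anything.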
For Example
Identifying the types of variables
Question Before she can continue with her analysis, Carly (from the example on page 33) must classify each variable as being quantitative or categorical (or possibly both), and whether the data are a time series or cross-sectional. For quantitative variables, what are the units? For categorical variables, are they nominal or ordinal?
Answer
Account ID – categorical (nominal, identifier)
Pre Spending – quantitative (units $)
Post Spending – quantitative (units $)
Age – categorical (ordinal). Could be quantitative if we had more precise information
Segment – categorical (nominal)
Enroll? – categorical (nominal)
Offer – categorical (nominal)
Segment Spend – quantitative (units $)
The data are cross-sectional. We do not have successive values over time.
1 Some people differentiate quantitative variables according to whether their measured values have a defined value for zero. This is a technical distinction and usually not one we’ll need to make. (For example, it isn’t correct to say that a temperature of 80°F is twice as hot as 40°F because 0° is an arbitrary value. On the Celsius scale those temperatures are 26.67°C and 4.44°C—a ratio of 6.) The term interval scale is sometimes applied to quantitative variables that lack a defined zero, and the term ratio scale is applied to measurements for which such ratios are appropriate.
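The temperature arithmetic in this footnote can be verified with a few lines of code. This is only a check of the numbers quoted above, using the standard Fahrenheit-to-Celsius conversion; the function and variable names are hypothetical.

```python
import math


def fahrenheit_to_celsius(degrees_f):
    """Standard conversion: subtract 32, then multiply by 5/9."""
    return (degrees_f - 32) * 5 / 9


warm_f, cool_f = 80, 40
warm_c = fahrenheit_to_celsius(warm_f)
cool_c = fahrenheit_to_celsius(cool_f)

ratio_f = warm_f / cool_f   # ratio of the Fahrenheit readings
ratio_c = warm_c / cool_c   # ratio of the same two temperatures in Celsius

print(round(warm_c, 2), round(cool_c, 2))   # 26.67 4.44
print(ratio_f, round(ratio_c, 2))           # 2.0 6.0
```

The same two temperatures give a ratio of 2 on one scale and 6 on the other, which is why ratios are not meaningful when the zero point is arbitrary.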
1.3 Data Sources: Where, How, and When
We must know who, what, and why to analyze data. Without knowing these three, we don’t have enough to start. Of course, we’d always like to know more because the more we know, the more we’ll understand and the better our decisions will be. If possible, we’d like to know the where, how, and when of data as well. Values recorded in 1947 may mean something different than similar values recorded last year. Values measured in Abu Dhabi may differ in meaning from similar measurements made in Mexico.
How the data are collected can make the difference between insight and nonsense. As we’ll see later, data that come from a voluntary survey on the Internet are almost always worthless. In a recent Internet poll, 84% of respondents said “no” to the question of whether subprime borrowers should be bailed out. While it may be true that 84% of those 23,418 respondents did say that, it’s dangerous to assume that that group is representative of any larger group. To make inferences from the data you have at hand to the world at large, you need to ensure that the data you have are representative of the larger group. Chapter 8 discusses sound methods for designing a survey or poll to help ensure that the inferences you make are valid.
Another way to collect valid data is by performing an experiment in which you actively manipulate variables (called factors) to see what happens. Most of the “junk mail” credit card offers that you receive are actually experiments done by marketing groups in those companies. They may make different versions of an offer to selected groups of customers to see which one works best before rolling out the winning idea to the entire customer base. Chapter 20 discusses both the design and the analysis of experiments like these.
Sometimes, the answer to a question you may have can be found in data that someone or some organization has already collected.
Internally, companies may analyze data from their own databases or data warehouse. They may also supplement or rely entirely on data collected by others. Many companies, nonprofit organizations, and government agencies collect vast amounts of data via the Internet. Some organizations may charge you a fee for accessing or downloading their data. The U.S. government collects information on nearly every aspect of life in the United States, both social and economic (see, for example, www.census.gov or, more generally, www.usa.gov), as the European Union does for Europe (see ec.europa.eu/eurostat). International organizations such as the World Health Organization (www.who.org) and polling agencies such as Pew Research (www.pewresearch.org) offer information on a variety of current social and demographic trends. Data like these are usually collected for different purposes than to answer your particular business question. So you should be cautious when generalizing from data like these. Unless the data were collected in a way that ensures that they are representative of the population in which you are interested, you may be misled. Chapter 24 discusses data mining, which attempts to use Big Data to make hypotheses and draw insights.
There’s a World of Data on the Internet
These days, one of the richest sources of data is the Internet. With a bit of practice, you can learn to find data on almost any subject. We found many of the data sets used in this book by searching on the Internet. The Internet has both advantages and disadvantages as a source of data. Among the advantages is the fact that often you’ll be able to find even more current data than we present. One disadvantage is that references to Internet addresses can “break” as sites evolve, move, and die. Another disadvantage is that important metadata (information about the collection, quality, and intent of the data) may be missing.
Our solution to these challenges is to offer the best advice we can to help you search for the data, wherever they may be residing. We usually point you to a website. We’ll sometimes suggest search terms and offer other guidance.
Some words of caution, though: Data found on Internet sites may not be formatted in the best way for use in statistics software. Although you may see a data table in standard form, an attempt to copy the data may leave you with a single column of values. You may have to work in your favorite statistics or spreadsheet program to reformat the data into variables. You will also probably want to remove commas from large numbers and such extra symbols as money indicators ($, ¥, £, €); few statistics packages can handle these.
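The cleanup step described above, removing commas and money symbols before analysis, might look like the sketch below. It assumes that commas are thousands separators and periods are decimal points (the convention used in the tables in this chapter), which is not true of data from every country; the function name `to_number` and the list of symbols are hypothetical choices for illustration.

```python
# Symbols to strip before converting text to a number. This list is an assumption
# made for this sketch; extend it to match the data you actually have.
SYMBOLS_TO_REMOVE = ("$", "¥", "£", "€", ",")


def to_number(text):
    """Convert a value such as '$2,698.12' to the number 2698.12.

    Assumes commas are thousands separators and the period is the decimal point.
    """
    cleaned = text.strip()
    for symbol in SYMBOLS_TO_REMOVE:
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned)


raw_values = ["$2,698.12", "$800.51", "10,383.0"]
clean_values = [to_number(value) for value in raw_values]
print(clean_values)  # [2698.12, 800.51, 10383.0]
```

Keeping the original text column alongside the cleaned one makes it easier to spot values that were converted incorrectly.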
Throughout this book, we often provide a margin note for a new dataset listing some of the W’s of the data. When we can, we also offer a reference for the source of the data. It’s a habit we recommend. The first step of any data analysis is to know why you are examining the data (what you want to know), whom each row of your data table refers to, and what the variables (the columns of the table) record. These are the Why, the Who, and the What. Identifying them is a key part of the Plan step of any analysis. Make sure you know all three before you spend time analyzing the data.
For Example
Identifying data sources
On the basis of her initial analysis, Carly asks her colleague Ying Mei to e-mail a sample of customers from the Travel and Entertainment segment and ask about their card use and household demographics. Carly asks another colleague, Gregg, to design a study about their double miles offer. In this study, a random sample of customers receives one of three offers: the standard double miles offer; a double miles offer good on any airline; or no offer.
Question For each of the three data sets (Carly’s original data set and Ying Mei’s and Gregg’s sets), state whether they come from a designed survey or a designed experiment or are collected in another way.
Answer Carly’s data set was derived from transactional data, not part of a survey or experiment. Ying Mei’s data come from a designed survey, and Gregg’s data come from a designed experiment.
Just Checking
An insurance company that specializes in commercial property insurance has a separate database for their policies that involve churches and schools. Here is a small portion of that database.
Policy Number | Years Claim Free | Net Property Premium ($) | Net Liability Premium ($) | Total Property Value ($000) | Median Age in ZIP Code | School? | Territory | Coverage
4000174699 | 1 | 3107 | 503 | 1036 | 40 | FALSE | AL580 | BLANKET
8000571997 | 2 | 1036 | 261 | 748 | 42 | FALSE | PA192 | SPECIFIC
8000623296 | 1 | 438 | 353 | 344 | 30 | FALSE | ID60 | BLANKET
3000495296 | 1 | 582 | 339 | 270 | 35 | TRUE | NC340 | BLANKET
5000291199 | 4 | 993 | 357 | 218 | 43 | FALSE | OK590 | BLANKET
8000470297 | 2 | 433 | 622 | 108 | 31 | FALSE | NV140 | BLANKET
1000042399 | 4 | 2461 | 1016 | 1544 | 41 | TRUE | NJ20 | BLANKET
4000554596 | 0 | 7340 | 1782 | 5121 | 44 | FALSE | FL530 | BLANKET
3000260397 | 0 | 1458 | 261 | 1037 | 42 | FALSE | NC560 | BLANKET
8000333297 | 2 | 392 | 351 | 177 | 40 | FALSE | OR190 | BLANKET
4000174699 | 1 | 3107 | 503 | 1036 | 40 | FALSE | AL580 | BLANKET
1 List as many of the W’s as you can for this data set.
2 Classify each variable as to whether you think it should be treated as categorical or quantitative (or both); if quantitative, identify the units.
What Can Go Wrong?
• Don’t label a variable as categorical or quantitative without thinking about the data and what they represent. The same variable can sometimes take on different roles.
• Don’t assume that a variable is quantitative just because its values are numbers. Categories are often given numerical labels. Don’t let that fool you into thinking they have quantitative meaning. Look at the context.
• Always be skeptical. One reason to analyze data is to discover the truth. Even when you are told a context for the data, it may turn out that the truth is a bit (or even a lot) different. The context colors our interpretation of the data, so those who want to influence what you think may slant the context. A survey that seems to be about all students may in fact report just the opinions of those who visited a fan website. The question that respondents answered may be posed in a way that influences responses.
Ethics in Action
Sarah Potterman, a doctoral student in educational psychology, is researching the effectiveness of various interventions recommended to help children with learning disabilities improve their reading skills. One particularly intriguing approach is an interactive software system that uses analogy-based phonics. Sarah contacted the company that developed this software, RSPT Inc., to obtain the system free of charge for use in her research. RSPT Inc. expressed interest in having her compare its product with other intervention strategies and was quite confident that its approach would be the most effective. Not only did the company provide Sarah with free software, but RSPT Inc. also generously offered to fund her research with a grant to cover her data collection and analysis costs.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.

Jim Hopler is operations manager for a local office of a top-ranked full-service brokerage firm. With increasing competition from both discount and online brokers, Jim’s firm has redirected attention to attaining exceptional customer service through its client-facing staff, namely brokers. In particular, management wished to emphasize the excellent advisory services provided by its brokers. Results from surveying clients about the advice received from brokers at the local office revealed that 20% rated it poor, 5% rated it below average, 15% rated it average, 10% rated it above average, and 50% rated it outstanding. With corporate approval, Jim and his management team instituted several changes in an effort to provide the best possible advisory services at the local office. Their goal was to increase the percentage of clients who viewed their advisory services as outstanding. Surveys conducted after the changes were implemented showed the following results: 5% poor, 5% below average, 20% average, 40% above average, and 30% outstanding. In discussing these results, the management team expressed concern that the percentage of clients who considered their advisory services outstanding fell from 50% to 30%. One member of the team suggested an alternative way of summarizing the data. By coding the categories on a scale from 1 = poor to 5 = outstanding and computing the average, they found that the average rating increased from 3.65 to 3.85 as a result of the changes implemented. Jim was delighted to see that their changes were successful in improving the level of advisory services offered at the local office. In his report to corporate, he only included average ratings for the client surveys.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
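The two summaries discussed by Jim's team can be reproduced with a few lines of arithmetic. The sketch below is only an illustration in Python; the names `SCALE`, `before`, `after`, and `mean_rating` are ours, not part of the scenario. It codes the categories 1 through 5 and weights each code by the reported percentage.

```python
# Rating scale used in the scenario: 1 = poor ... 5 = outstanding.
SCALE = {"poor": 1, "below average": 2, "average": 3,
         "above average": 4, "outstanding": 5}

# Percentage of clients choosing each category, as reported in the scenario.
before = {"poor": 20, "below average": 5, "average": 15,
          "above average": 10, "outstanding": 50}
after = {"poor": 5, "below average": 5, "average": 20,
         "above average": 40, "outstanding": 30}


def mean_rating(percentages):
    """Weighted average of the 1-5 codes, weighting each code by its percentage."""
    total = sum(percentages.values())
    return sum(SCALE[c] * p for c, p in percentages.items()) / total


for label, dist in (("before", before), ("after", after)):
    print(f"{label}: mean = {mean_rating(dist):.2f}, "
          f"outstanding = {dist['outstanding']}%")
```

The same responses are treated here in two ways: as a categorical variable (the percentage in each category) and as a quantitative one (the average of the codes). The two treatments can point in different directions, which is why it matters which summary is reported.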
What Have We Learned?
Learning Objectives
Understand that data are values, whether numerical or labels, together with their context.
• The W’s (who, what, why, where, when, and how) help nail down the context of the data.
• We must know who, what, and why to be able to say anything useful based on the data. The who are the cases. The what are the variables. A variable gives information about each of the cases. The why helps us decide which way to treat the variables.
• Stop and identify the W’s whenever you have data, and be sure you can identify the cases and the variables.
Identify whether a variable is being used as categorical or quantitative.
• Categorical variables identify a category for each case. Usually we think about the counts of cases that fall in each category. (An exception is an identifier variable that just names each case.)
• Quantitative variables record measurements or amounts of something; they must have units.
• Sometimes we may treat the same variable as categorical or quantitative depending on what we want to learn from it, which means some variables can’t be pigeonholed as one type or the other.
Consider the source of your data and the reasons the data were collected. That can help you understand what you might be able to learn from the data.
Terms

Big Data
The collection and analysis of data sets so large and complex that traditional methods typically brought to bear on the problem would be overwhelmed.

Business analytics
The process of using statistical analysis and modeling to drive business decisions.

Case
A case is an individual about whom or which we have data.

Categorical (or qualitative) variable
A variable that names categories (whether with words or numerals) is called categorical or qualitative.

Context
The context ideally tells who was measured, what was measured, how the data were collected, where the data were collected, and when and why the study was performed.

Cross-sectional data
Data taken from situations that vary over time but measured at a single time instant is said to be a cross-section of the time series.

Data
Recorded values whether numbers or labels, together with their context.

Data mining
The process of using a variety of statistical tools to analyze large data bases or data warehouses.

Data table
An arrangement of data in which each row represents a case and each column represents a variable.

Data warehouse
A large data base of information collected by a company or other organization usually to record transactions that the organization makes, but also used for analysis via data mining.

Experimental unit
An individual in a study for which or for whom data values are recorded. Human experimental units are usually called subjects or participants.

Identifier variable
A categorical variable that records a unique value for each case, used to name or identify it.

Metadata
Auxiliary information about variables in a database, typically including how, when, and where (and possibly why) the data were collected; who each case represents; and the definitions of all the variables.

Nominal variable
The term “nominal” can be applied to a variable whose values are used only to name categories.

Ordinal variable
The term “ordinal” can be applied to a variable whose categorical values possess some kind of order.

Participant
A human experimental unit. Also called a subject.

Quantitative variable
A variable in which the numbers are values of measured quantities with units.

Record
Information about an individual in a database.

Relational database
A relational database stores and retrieves information. Within the database, information is kept in data tables that can be “related” to each other.

Respondent
Someone who answers, or responds to, a survey.

Spreadsheet
A spreadsheet is a layout designed for accounting that is often used to store and manage data tables. Excel is a common example of a spreadsheet program.

Subject
A human experimental unit. Also called a participant.

Time series
Data measured over time. Usually the time intervals are equally spaced or regularly spaced (e.g., every week, every quarter, or every year).

Transactional data
Data collected to record the individual transactions of a company or organization.

Units
A quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams.

Variable
A variable holds information about the same characteristic for many cases.
Technology Help: Data
Most often we find statistics on a computer using a program, or package, designed for that purpose. There are many different statistics packages, but they all do essentially the same things. If you understand what the computer needs to know to do what you want and what it needs to show you in return, you can figure out the specific details of most packages pretty easily.
For example, to get your data into a computer statistics package, you need to tell the computer:
• Where to find the data. This usually means directing the computer to a file stored on your computer’s disk or to data on a database. Or it might just mean that you have copied the data from a spreadsheet program or Internet site and it is currently on your computer’s clipboard. Usually, the data should be in the form of a data table. Most computer statistics packages prefer the delimiter that marks the division between elements of a data table to be a tab character and the delimiter that marks the end of a case to be a return character.
• Where to put the data. (Usually this is handled automatically.)
• What to call the variables. Some data tables have variable names as the first row of the data, and often statistics packages can take the variable names from the first row automatically.

Excel
To open a file containing data in Excel:
• Choose File + Open.
• Browse to find the file to open. Data files provided with this text are tab-delimited text files (.txt) or comma-delimited text files (.csv). Excel supports many other file formats.
• You can also copy tables of data from other sources, such as Internet sites, and paste them into an Excel spreadsheet. Excel can recognize the format of many tables copied this way, but this method may not work for some tables.
• When opening a data file, Excel may not recognize the format of the data. If data include dates or other special formats ($, :, ¥, etc.), identify the desired format. Select the cells or columns to reformat and choose Format + Cell. Often, the General format is the best option.
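The same three pieces of information (where the data are, where to put them, and what to call the variables) apply when a data table is read by a program rather than opened in a package. As a minimal sketch, assuming Python and its standard `csv` module, the example below reads a small tab-delimited table held in a string; the three rows reuse a few columns of the real estate table from Exercise 1, and the variable names `raw_text` and `cases` are ours.

```python
import csv
import io

# Tab-delimited text: tabs separate the values within a case,
# and a newline marks the end of each case.
raw_text = (
    "House_ID\tNeighborhood\tAcres\n"
    "413400536\tGreenfield Manor\t1.00\n"
    "4128001474\tFort Amherst\t0.09\n"
    "412800344\tDublin\t1.65\n"
)

# DictReader takes the variable names from the first row automatically.
reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
cases = list(reader)

print("Variables:", reader.fieldnames)
print("Number of cases:", len(cases))
for case in cases:
    # Values arrive as text; convert quantitative variables before arithmetic.
    print(case["House_ID"], case["Neighborhood"], float(case["Acres"]))
```

To read from a file on disk instead, the `io.StringIO(raw_text)` argument would be replaced by a file opened with `open(path, newline="")`, where `path` is wherever the data file happens to be stored.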
Brief Case
Credit Card Bank
Like all credit and charge card companies, this company makes money on each of its cardholders’ transactions. Thus, its profitability is directly linked to card usage. To increase customer spending on its cards, the company sends many different offers to its cardholders, and market researchers analyze the results to see which offers yield the largest increases in the average amount charged. On your disk (in the file Credit Card Bank) is part of a database like the one used by the researchers. For each customer, it contains several variables in a spreadsheet. Examine the data in the data file. List as many of the W’s as you can for these data and classify each variable as categorical or quantitative. If quantitative, identify the units.
Exercises

Section 1.1
1. A real estate major collected information on some recent local home sales. The first 6 lines of the database appear below. The columns correspond to the house identification number, the community name, the ZIP code, the number of acres of the property, the year the house was built, the market value, and the size of the living area (in square feet).

House_ID     Neighborhood      Mail_ZIP  Acres  Yr_Built  Full_Market_Value  Size
413400536    Greenfield Manor  12859     1.00   1967      $100,400           960
4128001474   Fort Amherst      12801     0.09   1961      $132,500           906
412800344    Dublin            12309     1.65   1993      $140,000           1620
4128001552   Granite Springs   10598     0.33   1969      $67,100            900
412800352    Arcady            10562     2.29   1955      $190,000           1224
413400322    Ormsbee           12859     9.13   1997      $126,900           1056

a) What does a row correspond to in this data table? How would you best describe its role: as a participant, subject, case, respondent, or experimental unit?
b) How many variables are measured on each row?

2. A local bookstore is keeping a database of its customers to find out more about their spending habits so that the store can start to make personal recommendations based on past purchases. Here are the first five rows of their database:

Transaction ID  Customer ID  Date        ISBN Number of Purchase  Price   Coupon?  Gift?  Quantity
29784320912     4J438        11/12/2009  345-23-2355              $29.95  N        N      1
26483589001     3K729        9/30/2009   983-83-2739              $16.99  N        N      1
26483589002     3K729        9/30/2009   102-65-2332              $9.95   Y        N      1
36429489305     3U034        12/5/2009   295-39-5884              $35.00  N        Y      1
36429489306     3U034        12/5/2009   183-38-2957              $79.95  N        Y      1

a) What does a row correspond to in this data table? How would you best describe its role: as a participant, subject, case, respondent, or experimental unit?
b) How many variables are measured on each row?
Section 1.2
3. Referring to the real estate data table of Exercise 1,
a) For each variable, would you describe it as primarily categorical, or quantitative? If quantitative, what are the units? If categorical, is it ordinal or simply nominal?
b) Are these data a time series, or are these cross-sectional? Explain briefly.

4. Referring to the bookstore data table of Exercise 2,
a) For each variable, would you describe it as primarily categorical, or quantitative? If quantitative, what are the units? If categorical, is it ordinal or simply nominal?
b) Are these data a time series, or are these cross-sectional? Explain briefly.
Section 1.3
5. For the real estate data of Exercise 1, do the data appear to have come from a designed survey or experiment? What concerns might you have about drawing conclusions from this data set?

6. A student finds data on an Internet site that contains financial information about selected companies. He plans to analyze the data and use the results to develop a stock investment strategy. What kind of data source is he using? What concerns might you have about drawing conclusions from this data set?
Chapter Exercises
For each description of data in Exercises 7 to 26, identify the W’s, name the variables, specify for each variable whether its use indicates it should be treated as categorical or quantitative, and for any quantitative variable identify the units in which it was measured (if they are not provided, give some possible units in which they might be measured). Specify whether the data come from a designed survey or experiment. Are the variables time series or cross-sectional? Report any concerns you have as well.

7. The news. Find a newspaper or magazine article in which some data are reported (e.g., see The Wall Street Journal, Financial Times, Business Week, or Fortune). For the data discussed in the article, answer the questions above. Include a copy of the article with your report.

8. The Internet. Find an Internet site on which some data are reported. For the data found on the site, answer as many of the questions above as you can. Include a copy of the URL with your report.

9. Survey. An automobile manufacturer wants to know what college students think about electric vehicles. They ask you to conduct a survey that asks students, “Do you think there will be more electric or gasoline powered vehicles on the road in 2025?” and “How likely are you to buy an electric vehicle in the next 10 years?” (scale of 1 = not at all likely to 5 = very likely).

10. Your survey. Think of a question that you’d like to know the answer to that might be answered by a survey. List all the questions included in the survey, identify all the variables, and answer all the questions above.

11. World databank. The World Bank provides economic data on most of the world’s countries at their website (databank.worldbank.org/data/home.aspx). Select 5 indicators that they provide and answer the questions above for these variables.

12. Diets R Us Menu. A local food service company, Diets R Us, specializes in providing diet meals for the public. It gives, for each meal it provides to the public, the number of calories, the fat content in grams, and the amount of proteins in grams. The data is intended to inform about the nutritional value of the different meals.

13. MBA admissions. A school in the northeastern United States is concerned with the recent drop in female students in its MBA program. It decides to collect data from the admissions office on each applicant, including: sex of each applicant, age of each applicant, whether or not they were accepted, whether or not they attended, and the reason for not attending (if they did not attend). The school hopes to find commonalities among the female accepted students who have decided not to attend the business program.

14. MBA admissions II. An internationally recognized MBA program outside of Paris intends to also track the GPA of the MBA students and compares MBA performance to standardized test scores over a six-year period (2009–2014).

15. Pharmaceutical firm. Scientists at a major pharmaceutical firm conducted an experiment to study the effectiveness of an herbal compound to treat the common cold. They exposed volunteers to a cold virus, then gave them either the herbal compound or a sugar solution known to have no effect on colds. Several days later they assessed each patient’s condition using a cold severity scale ranging from 0–5. They found no evidence of the benefits of the compound.

16. World Values Survey. The World Values Survey (www.worldvaluessurvey.org) has designed a cultural map of the world, with nine cultural regions, such as Confucian and Orthodox, instead of five continents, to study changing values and their impact on social and political life. Countries are also assigned numerical scores on two important cultural dimensions: self-expression values, and traditional values.

17. Olive Oil Growers. A local farmers association, interested in providing better services to its olive oil growers, sent out a questionnaire to a randomly selected sample of growers requesting information about gross sales, percent profit, unit price, varieties, age, locality, and average production per tree.

18. OECD Better Life Initiative. The Better Life Initiative is an attempt to compare well-being across OECD countries. Central to this initiative is an interactive tool called Your Better Life Index. This tool lets you rank the OECD countries based on criteria such as education and work-life balance. Each of these criteria is described by one or more indicators. For example, education is an average of scores that official statistics assign to educational attainment, reading skills, and years in education (www.oecdbetterlifeindex.org/).
19. EPA. The Environmental Protection Agency (EPA) tracks fuel economy of automobiles. Among the data EPA analysts collect from the manufacturer are the manufacturer (Ford, Toyota, etc.), vehicle type (car, SUV, etc.), weight, horsepower, and gas mileage (mpg) for city and highway driving.

20. Consumer Reports. In 2013, Consumer Reports published an article comparing smart phones. It listed 46 phones, giving brand, price, display size, operating system (Android, iOS, or Windows Phone), camera image size (megapixels), and whether it had a memory card slot.

21. Zagat. Zagat.com provides ratings from customer experiences on restaurants. For each restaurant, the percentage of customers that liked it, the average cost and ratings of the food, decor, and service (all on a 30-point scale) are reported.

22. L.L. Bean. L.L. Bean is a large U.S. retailer that depends heavily on its catalog sales. It collects data internally and tracks the number of catalogs mailed out, the number of square inches in each catalog, and the sales ($ thousands) in the four weeks following each mailing. The company is interested in learning more about the relationship (if any) among the timing and space of their catalogs and their sales.

23. Stock market. An online survey of students in a large MBA Statistics class at a business school in the northeastern United States asked them to report their total personal investment in the stock market ($), total number of different stocks currently held, total invested in mutual funds ($), and the name of each mutual fund in which they have invested. The data were used in the aggregate for classroom illustrations.

24. Theme park sites. A study on the potential for developing theme parks in various locations throughout Europe in 2013 collects the following information: the country where the proposed site is located, estimated cost to acquire site, size of population within a one-hour drive of the site, size of the site, and availability of mass transportation within five minutes of the site. The data will be used to present to prospective developers.

25. Indy 2009. The 2.5-mile Indianapolis Motor Speedway has been the home to a race on Memorial Day nearly every year since 1911. Even during the first race there were controversies. Ralph Mulford was given the checkered flag first but took three extra laps just to make sure he’d completed 500 miles. When he finished, another driver, Ray Harroun, was being presented with the winner’s trophy, and Mulford’s protests were ignored. Harroun averaged 74.6 mph for the 500 miles. Here are the data for the first few and six recent Indianapolis 500 races.

Year  Winner             Car              Time (hrs)  Speed (mph)  Car #
1911  Ray Harroun        Marmon Model 32  6.7022      74.602       32
1912  Joe Dawson         National         6.3517      78.719       8
1913  Jules Goux         Peugeot          6.5848      75.933       16
…
2007  Dario Franchitti   Dallara/Honda    3.2943      151.774      27
2008  Scott Dixon        Dallara          3.4826      143.567      9
2009  Hélio Castroneves  Dallara          3.3262      150.318      3
2010  Dario Franchitti   Dallara/Honda    3.0936      161.623      10
2011  Dan Wheldon        Dallara/Honda    2.9366      170.265      98
2012  Dario Franchitti   Dallara/Honda    2.9809      167.734      50

26. Kentucky Derby. The Kentucky Derby is a horse race that has been run every year since 1875 at Churchill Downs, Louisville, Kentucky. The race started as a 1.5-mile race, but in 1896 it was shortened to 1.25 miles because experts felt that three-year-old horses shouldn’t run such a long race that early in the season. (It has been run in May every year but one, 1901, when it took place on April 29.) The table below shows the data for the first few and a few recent races. http://www.kentuckyderby.ag/kentuckyderby-results.php and http://horseracing.about.com/od/history/l/blderbywin.htm

Date          Winner             Margin (lengths)  Jockey             Winner’s Payoff ($)  Duration (min:sec)  Track Condition
May 17, 1875  Aristides          2                 O. Lewis           2850                 2:37.75             Fast
May 15, 1876  Vagrant            2                 B. Swim            2950                 2:38.25             Fast
May 22, 1877  Baden-Baden        2                 W. Walker          3300                 2:38.00             Fast
May 21, 1878  Day Star           1                 J. Carter          4050                 2:37.25             Dusty
May 20, 1879  Lord Murphy        1                 C. Shauer          3550                 2:37.00             Fast
…
May 2, 2008   Big Brown          4 3/4             Kent Desormeaux    2,000,000.00         2:01.82             Fast
May 2, 2009   Mine That Bird     6 3/4             Calvin Borel       2,000,000.00         2:02.66             Fast
May 1, 2010   Super Saver        2 3/4             Calvin Borel       2,000,000.00         2:04.45             Fast
May 7, 2011   Animal Kingdom     2 1/2             John R. Velazquez  2,000,000.00         2:02.04             Fast
May 5, 2012   I’ll Have Another  1 1/2             Mario Gutierrez    2,000,000.00         2:01.83             Fast
When you organize data in a spreadsheet, it is important to lay it out as a data table. For each of these examples in Exercises 27 to 30, show how you would lay out these data. Indicate the headings of columns and what would be found in each row.

27. Mortgages. For a study of mortgage loan performance: amount of the loan, the name of the borrower.

28. Employee performance. Data collected to determine performance-based bonuses: employee ID, average contract closed (in $), supervisor’s rating (1–10), years with the company.

29. Education in Better Life. The 2013 OECD data file of the Better Life Initiative contains the following data related to education: country name, the country’s topic score for education, educational attainment score, reading skills score, and score for years in education.

30. Command performance. Data collected on investments in Broadway shows: number of investors, total invested, name of the show, profit/loss after one year.

For the following examples in Exercises 31 to 34, indicate whether the data are time-series or cross-sectional.

31. Car sales. Number of cars sold by each salesperson in a dealership in September.

32. Motorcycle sales. Number of motorcycles sold by a dealership in each month of 2014.

33. OECD well-being. Comparison of OECD 2013 well-being indicators for 36 different countries.

34. Developments in well-being. OECD Better Life Initiative data for Spain for the second wave in 2013 compared to the first wave of data in 2011.

Just Checking Answers
1 Who: policies on churches and schools
What: policy number, years claim free, net property premium ($), net liability premium ($), total property value ($000), median age in ZIP code, school?, territory, coverage
How: company records
When: not given
2 Policy number: identifier (categorical)
Years claim free: quantitative
Net property premium: quantitative ($)
Net liability premium: quantitative ($)
Total property value: quantitative ($)
Median age in ZIP code: quantitative
School?: categorical (true/false)
Territory: categorical
Coverage: categorical
CHAPTER 2 Displaying and Describing Categorical Data
KEEN, Inc.
KEEN, Inc. was started to create a sandal designed for a variety of water activities. The sandals quickly became popular due to their unique patented toe protection: a black bumper to protect the toes when adventuring out on rivers and trails. Today, the KEEN brand offers over 300 different outdoor performance and outdoor inspired casual footwear styles as well as bags and socks. Few companies experience the kind of growth that KEEN has in its first nine years. Amazingly, they’ve done this with relatively little advertising and by selling primarily to specialty footwear and outdoor stores, in addition to online outlets. After the 2004 Tsunami disaster, KEEN cut its advertising budget almost completely and donated over $1 million to help the victims and establish the KEEN Foundation to support environmental and social causes. Philanthropy and community projects continue to play an integral part in the KEEN brand values. In fact, KEEN has established a giving program with a philanthropic effort devoted to helping the environment, conservation, and social movements involving the outdoors.
WHO    Visits to the KEEN, Inc. website
WHAT   Source (search engine or other) that led to KEEN’s website
WHEN   February 2013
WHERE  Worldwide
HOW    Data compiled by KEEN
WHY    To understand customer use of the website and how they got there
KEEN, Inc., like most companies, collects data on visits to its website. Each visit to the site and each subsequent action the visitor takes (changing the page, entering data, etc.) is recorded in a file called a usage, or access weblog. These logs contain a lot of potentially worthwhile information, but they are not easy to use. Here’s one line from a log:

245.240.221.71 -- [1/Apr/2013:13:15:08-0800]” GET http://www.keenfootwear.com/us/en/product/shoes/men/cnx/clearwater%20cnx/forest%20night!rust “http://www.google.com/” “Mozilla/5.0WebTV/1.2 (compatible; MSIE 2.0)”

Unless the company has the analytic resources to deal with these files, it must rely on a third party to summarize the data. KEEN, like many other small and mid-sized companies, uses Google Analytics to collect and summarize its log data. Imagine a whole table of data like the one above, with a line corresponding to every visit. In February 2013, there were 226,925 visits to the KEEN site, which would be a table with as many rows. The problem with a file like this (and in fact even with data tables) is that we can’t see what’s going on. And seeing is exactly what we want to do. We need ways to show the data so that we can see patterns, relationships, trends, and exceptions.
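The log line quoted earlier includes the address of the page the visitor came from (http://www.google.com/), and a categorical variable can be built from that field. The Python sketch below is only an illustration of the idea: it assumes the referring URL has already been pulled out of the log line, the function name `classify_source` and the host-name rules are hypothetical, and it does not describe how Google Analytics actually classifies visits.

```python
from urllib.parse import urlparse

# Hypothetical rules: a fragment of the referring host name -> category label.
SOURCE_BY_HOST_FRAGMENT = {
    "google": "Google",
    "bing": "Bing",
    "yahoo": "Yahoo",
    "facebook": "Facebook",
}


def classify_source(referrer):
    """Return a category for one visit, given its referring URL.

    An empty referrer (or "-") is treated as a visitor who typed the
    address directly into the browser.
    """
    if not referrer or referrer == "-":
        return "Direct"
    host = urlparse(referrer).hostname or ""
    for fragment, label in SOURCE_BY_HOST_FRAGMENT.items():
        if fragment in host:
            return label
    return "Other"


print(classify_source("http://www.google.com/"))
print(classify_source(""))
print(classify_source("http://example.org/some-page"))
```

Applied to every line of the log, a rule like this turns hundreds of thousands of raw records into one categorical value per visit, which can then be counted and displayed.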
2.1 Summarizing a Categorical Variable
KEEN might be interested to know how people find their website. They might use the information to allocate their advertising revenue to various search engines, putting ads where they’ll be seen by the most potential customers. The variable Source records, for each visit to KEEN’s website, where the visit came from. The categories are all the search engines used, plus the label “Direct”, which indicates that the customer typed in KEEN’s web address (or URL) directly into the browser. To make sense of the 226,925 visits for which they have data, they’d like to summarize the variable and display the information in a way that can easily communicate the results to others.

Frequency Tables
A frequency table records the counts for each of the categories of the variable. Some tables report percentages, and many report both. For example, Table 2.1 shows the ways that customers found their way to the KEEN website.

Source     Visits    Visits by %
Google     130,158   57.36
Direct     52,969    23.34
E-mail     16,084    7.09
Bing       9,581     4.22
Yahoo      7,439     3.28
Facebook   2,253     0.99
Mobile     1,701     0.75
Other      6,740     2.97
Total      226,925   100.00

Table 2.1 A frequency table of the Source used by visitors to the KEEN, Inc. website. Notice the label “Other”. When the number of categories gets too large, we often lump together values of the variable into “Other”. When to do that is a judgment call, but it’s a good idea to have fewer than about a dozen categories. (Source: KEEN, Inc., personal communication.)
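The percentages in a relative frequency table are simply each count divided by the total. The following Python sketch recomputes the "Visits by %" column of Table 2.1 from the counts. The category labels "Google", "E-mail", and "Facebook" are our reading of the table based on the surrounding text and the labels in Figure 2.2, and the function name `relative_frequencies` is our own.

```python
# Visit counts by Source from Table 2.1 (February 2013).
# "Google", "E-mail", and "Facebook" are inferred labels (see the note above).
visits = {
    "Google": 130_158,
    "Direct": 52_969,
    "E-mail": 16_084,
    "Bing": 9_581,
    "Yahoo": 7_439,
    "Facebook": 2_253,
    "Mobile": 1_701,
    "Other": 6_740,
}


def relative_frequencies(counts):
    """Return each category's share of the total, as a percentage."""
    n = sum(counts.values())
    return {category: 100 * count / n for category, count in counts.items()}


total = sum(visits.values())
percents = relative_frequencies(visits)

print(f"{'Source':<10}{'Visits':>10}{'Visits by %':>13}")
for source, count in visits.items():
    print(f"{source:<10}{count:>10,}{percents[source]:>13.2f}")
print(f"{'Total':<10}{total:>10,}{sum(percents.values()):>13.2f}")
```

Keeping the unrounded percentages in `percents` and rounding only when printing avoids accumulating rounding error in the total.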
For Example
Making frequency and relative frequency tables

The Super Bowl, the championship game of the National Football League of the United States, is an important annual social event for Americans, with tens of millions of viewers. The ads that air during the game are expensive: a 30-second ad during the 2013 Super Bowl cost about $4M. The high price of these commercials makes them high-profile and much anticipated, and so the advertisers feel pressure to be innovative, entertaining, and often humorous. Some people, in fact, watch the Super Bowl mainly for the commercials. Polls often ask whether respondents are more interested in the game or the commercials. Here are 40 responses from one such poll. (NA/Don’t Know = No Answer or Don’t Know):

Won’t Watch, Game, Commercials, Game, Won’t Watch, Game, Won’t Watch, NA/Don’t Know
Game, Won’t Watch, Commercials, NA/Don’t Know, Game, Won’t Watch, Commercials, Won’t Watch
Commercials, Commercials, Game, Commercials, Game, Won’t Watch, Commercials, Game
Won’t Watch, Game, Won’t Watch, Game, Won’t Watch, Game, Game, Game
Game, Game, Commercials, Game, Game, Won’t Watch, Won’t Watch, Game
Question Make a frequency table for this variable. Include the percentages to display both a frequency and relative frequency table at the same time.

Answer There were four different responses to the question about watching the Super Bowl. Counting the number of participants who responded to each of these gives the following table:

Response               Counts  Percentage
Commercials            8       20.0
Game                   18      45.0
Won’t Watch            12      30.0
No Answer/Don’t Know   2       5.0
Total                  40      100.0

100.01%?
Sometimes if you carefully add the percentages of all categories, you will notice the total isn’t exactly 100.00% even though we know that that’s what the total has to be. The discrepancy is due to individual percentages being rounded. You’ll often see this in tables of percents, sometimes with explanatory footnotes.

2.2 Displaying a Categorical Variable

The Three Rules of Data Analysis
There are three things you should always do with data:
1. Make a picture. A display of your data will reveal things you are not likely to see in a table of numbers and will help you to plan your approach to the analysis and think clearly about the patterns and relationships that may be hiding in your data.
2. Make a picture. A well-designed display will do much of the work of analyzing your data. It can show the important features and patterns. A picture will also reveal things you did not expect to see: extraordinary (possibly wrong) data values or unexpected patterns.
3. Make a picture. The best way to report to others what you find in your data is with a well-chosen picture.
These are the three rules of data analysis. These days, technology makes drawing pictures of data easy, so there is no reason not to follow the three rules.

Some displays communicate information better than others. We’ll discuss some general principles for displaying information honestly in this chapter. Data visualization has become a special discipline in its own right. A well-designed display can show features of even a large, complex data set. Figure 2.1 is a specially designed visualization showing the connections between two categorical variables, College major and Career choice, for 15,600 alumni of Williams College. Innovative visualizations such as this one (many of them interactive or animated) are becoming more common as Big Data is mined for unanticipated patterns and relationships.
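As a small illustration of how little effort a picture takes, the sketch below tallies the 40 Super Bowl poll responses from the example earlier in the chapter and draws a rough text display next to the counts and percentages. It is a minimal Python sketch using the standard `collections.Counter`; the short names `G`, `C`, `W`, and `N` are just abbreviations introduced here to keep the list readable.

```python
from collections import Counter

# Abbreviations for the four possible responses.
G, C, W, N = "Game", "Commercials", "Won't Watch", "NA/Don't Know"

# The 40 responses, in the order they are printed in the example.
responses = [
    W, G, C, G, W, G, W, N,
    G, W, C, N, G, W, C, W,
    C, C, G, C, G, W, C, G,
    W, G, W, G, W, G, G, G,
    G, G, C, G, G, W, W, G,
]

counts = Counter(responses)   # frequency of each category
n = len(responses)            # number of cases

for response, count in counts.most_common():
    percent = 100 * count / n
    bar = "#" * count         # one mark per response, so length tracks the count
    print(f"{response:<14}{count:>3}{percent:>7.1f}%  {bar}")
```

Because every mark has the same width, the length of each row of marks is proportional to its count, so the display gives an honest first impression of the distribution.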
Copyright © 2012 CereusData LLC. All rights reserved.
Figure 2.1 Visualization of the link between major in college and career of Williams College alumni. Each individual is graphed as an arc connecting his or her major on the left with a career area on the right. Each major is assigned a color: Humanities in the blue range, Social Sciences in the reds and oranges, and Sciences in greens. It is easy to see the expected large arc connecting Biology and Health/Medicine and the spread of Math majors to many careers. Possibly less expected is that Economics majors choose a wide range of careers. Banking/Finance draws many from Economics, but also quite a few from History, Political Science, and the Humanities. (This image was created by Satyan Devadoss, Hayley Brooks, and Kaison Tanabe using the CIRCOS software; an interactive version of this graph can be found at http://cereusdata.com.)
The Area Principle
We can’t make just any display; a bad picture can distort our understanding rather than help it. For example, Figure 2.2 is a graph of the frequencies of Table 2.1. What impression do you get of the relative frequencies of visits from each source? You can easily see from both the table and the figure that the most popular source was Google. But the impression given by Figure 2.2 doesn’t seem to correspond well to the numbers in the table. Although it’s true that the majority of people came to KEEN’s website from Google, in Figure 2.2 it looks like nearly all did. That doesn’t seem right. What’s wrong? The lengths of the sandals do match the frequencies in the table. But our eyes tend to be more impressed by the area (or perhaps even the volume) than by other aspects of each sandal image, and it’s that aspect of the image that we notice. Since there were about two and a half times as many people who came from Google as those who typed in the URL directly, the sandal depicting the number of Google visitors is about two and a half times longer than the sandal below it, but it occupies more than six times the area. As you can see from the frequency table, that just isn’t a correct impression. The best data displays observe a fundamental principle of graphing data called the area principle, which says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.
Figure 2.2 Although the length of each sandal corresponds to the correct number, the impression we get is all wrong because we perceive the entire area of the sandal. In fact, only about 57% of all visitors used Google to get to the website. (Pictograph of visits by source; the axis runs from 0 to 150,000, and the category labels include Direct, E-mail, Bing, Yahoo, Other, Facebook, and Mobile.)
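The arithmetic behind the sandal example can be checked in a few lines. The sketch below assumes each sandal image is scaled uniformly, so that its width grows by the same factor as its length; the function name area_ratio is illustrative and does not come from the text.

```python
def area_ratio(length_ratio: float) -> float:
    """Ratio of areas when a picture is scaled uniformly in both dimensions.

    Assumption: the image keeps its proportions, so both its length and
    its width are multiplied by length_ratio.
    """
    return length_ratio ** 2


# Google drew about two and a half times as many visits as Direct.
ratio = 2.5
print(f"length ratio: {ratio}, area ratio: {area_ratio(ratio)}")
```

Under that assumption a sandal 2.5 times as long covers 6.25 times the area, which matches the "more than six times" described above. A bar chart avoids the problem because only one dimension of each bar changes.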
Bar Charts
Figure 2.3 gives us a chart that obeys the area principle. It’s not as visually entertaining as the sandals, but it does give a more accurate visual impression of the distribution. The height of each bar shows the count for its category. The bars are the same width, so their heights determine their areas, and the areas are proportional to the counts in each class. Now it’s easy to see that nearly half the site hits came from places other than Google. We can also see that there were about two and a half times as many visits that originated with a Google search as there were visits that came directly. Bar charts make these kinds of comparisons easy and natural.
Figure 2.3 Visits to the KEEN, Inc. website by Source. With the area principle satisfied, the true distribution is clear. (Vertical axis: Visits, 0 to 140,000; horizontal axis: Source.)
A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison. Bar charts should have small spaces between the bars to indicate that these are freestanding bars that could be rearranged into any order. The bars are lined up along a common base with labels for each category. The variable name is often used as a subtitle for the horizontal axis.
Bar charts are usually drawn vertically in columns, but sometimes they are drawn with horizontal bars, like this.1 (Two small example charts show Frequency, 0 to 1000, for Group I through Group IV, once with vertical bars and once with horizontal bars.)
1 Excel refers to this display as a column chart when the bars are vertical and a bar chart when they are horizontal, but that’s not standard statistics terminology.
Figure 2.4 The relative frequency bar chart looks the same as the bar chart (Figure 2.3) but shows the proportion of visits in each category rather than the counts. (Vertical axis: Percentage, 0 to 60; horizontal axis: Source.)
If we want to draw attention to the relative proportion of visits from each Source, we could replace the counts with percentages and use a relative frequency bar chart, like the one shown in Figure 2.4.
Pie Charts
Figure 2.5 A pie chart shows the proportion of visits by Source. (Legend entries include Other, Mobile, Facebook, Yahoo, Bing, and Direct.)
A pie chart shows how a whole group breaks into several categories. Pie charts show all the cases as a circle sliced into pieces whose areas are proportional to the fraction of cases in each category. Because we’re used to cutting up pies into 2, 4, or 8 pieces, pie charts are good for seeing relative frequencies near 1/2, 1/4, or 1/8. For example, in Figure 2.5, you can easily see that the slice representing Google is just a bit more than half the total. Unfortunately, other comparisons are harder to make with pie charts. For example, Figure 2.6 shows three pie charts that look pretty much alike along with bar charts of the same data. The bar charts show three distinctly different patterns, but it is almost impossible to see those in the pie charts. If you want to make a pie chart or relative frequency bar chart, you’ll need to also make sure that the categories don’t overlap, so that no individual is counted in two categories. If the categories do overlap, it’s misleading to make a pie chart, since the percentages won’t add up to 100%.
Figure 2.6 Patterns that are easy to see in the bar charts are often hard to see in the corresponding pie charts. (Panels A, B, and C each show a pie chart and a bar chart of the same five categories, labeled 1 through 5; the bar chart axes run from 0 to 25.)
For Example
Making a bar chart
Question Make a bar chart for the 40 Super Bowl responses of the example on page 49.
Answer Use the frequencies in the table in the example on page 49 to produce the heights of the bars:
(Bar chart of the four responses Commercials, Game, Won’t Watch, and NA/Don’t Know; vertical axis from 0 to 20.)
2.3
Exploring Two Categorical Variables: Contingency Tables
WHO Respondents in the Pew Research Worldwide Survey
WHAT Responses to question about social networking
WHEN 2012
WHERE Worldwide
HOW Data collected by Pew Research using a multistage design. For details see www.pewglobal.org/2012/12/12/survey-methods-43/
WHY To understand penetration of social networking worldwide
In 2012 Pew Research conducted surveys in countries across the world (www.pewglobal.org/2012/12/12/social-networking-popular-across-globe/). One question of interest to business decision makers is how common it is for citizens of different countries to use social networking and whether they have it available to them. Table 2.2 gives a table of responses for several of the surveyed countries. Note that N/A means “not available” because respondents lacked internet access, a situation that marketers planning for the future might expect to see change. The pie chart (Figure 2.7) shows clearly that fewer than half of respondents said that they had access to social networking and used it. But if we want to target our online customer relations with social networks differently in different countries, wouldn’t it be more interesting to know how social networking use varies from country to country?
Social Networking    Count    Relative frequency (%)
No                    1249     24.787
Yes                   2175     43.163
N/A                   1615     32.050

Table 2.2 A combined frequency and relative frequency table for the responses from 5 countries (Britain, Egypt, Germany, Russia, and the U.S.) to the question “Do you use social networking sites?” N/A means “Not Available”.
Figure 2.7 Responses to the question “Do you use social networking sites?” N/A means “No Internet Available.” (Pie chart with slices No, N/A, and Yes.)
         Britain   Egypt   Germany   Russia   U.S.   Total
No          336      70      460        90     293    1249
Yes         529     300      340       500     506    2175
N/A         153     630      200       420     212    1615
Total      1018    1000     1000      1010    1011    5039

Table 2.3 Contingency table of Social Networking and Country. The right margin “Totals” are the values that were in Table 2.2.
Percent of What? The English language can be tricky. If asked, “What percent of those answering ‘Yes’ were from Russia?” it’s pretty clear that you should focus only on the Yes row. The question itself seems to restrict the who in the question to that row, so you should look at the number of those in each country among the 2175 people who replied “Yes.” You’d find that in the row percentages. But if you’re asked, “What percent were Russians who replied ‘yes’?” you’d have a different question. Be careful. That question really means “what percent of the entire sample were both from Russia and replying ‘Yes’?”, so the who is all respondents. The denominator should be 5039, and the answer is the table percent. Finally, if you’re asked, “What percent of the Russians replied ‘yes’?” you’d have a third question. Now the who is Russians. So the denominator is the 1010 Russians, and the answer is the column percent.
To find out, we need to look at the two categorical variables Social Networking and Country together, which we do by arranging the data in a two-way table such as Table 2.3. Because they show how individuals are distributed along each variable depending on, or contingent on, the value of the other variable, tables like this are called contingency tables. The margins of a contingency table give totals. The totals in the right-hand column of Table 2.3 show the frequency distribution of the variable Social Networking. We can see, for example, that Internet access is certainly not yet universal. The totals in the bottom row of the table show the frequency distribution of the variable Country—how many respondents Pew obtained in each country. When presented like this, at the margins of a contingency table, the frequency distribution of either one of the variables is called its marginal distribution. The marginal distribution for a variable in a contingency table is the same as its frequency distribution. Each cell of a contingency table (any intersection of a row and column of the table) gives the count for a combination of values of the two variables. For example, in Table 2.3 we can see that 153 respondents did not have internet access in Britain. Looking across the Yes row, you can see that the largest number of responses in that row (529) is from Britain. Are Egyptians less likely to use social media than Britons? Questions like this are more naturally addressed using percentages. We know that 300 Egyptians report that they use social networking. We could display this count as a percentage, but as a percentage of what? The total number of people in the survey? (300 is 5.95% of the total.) The number of Egyptians surveyed? (300 is 30% of the 1000 Egyptians surveyed.) The number of respondents who use social networking? (300 is 13.8% of social networking users.) 
Most statistics programs offer a choice of total percent, row percent, or column percent for contingency tables. Unfortunately, they often put them all together with several numbers in each cell of the table. The resulting table (Table 2.4) holds lots of information but is hard to understand.
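The three candidate denominators can be made concrete with a short sketch using the counts from Table 2.3. The names counts, countries, and percentages are illustrative choices, not part of any particular statistics package.

```python
# Counts from Table 2.3: one list per response, one entry per country.
countries = ["Britain", "Egypt", "Germany", "Russia", "U.S."]
counts = {
    "No": [336, 70, 460, 90, 293],
    "Yes": [529, 300, 340, 500, 506],
    "N/A": [153, 630, 200, 420, 212],
}


def percentages(response: str, country: str) -> dict:
    """Return the row, column, and table percent for one cell."""
    col = countries.index(country)
    cell = counts[response][col]
    row_total = sum(counts[response])                      # everyone giving this response
    col_total = sum(row[col] for row in counts.values())   # everyone from this country
    table_total = sum(sum(row) for row in counts.values()) # all respondents
    return {
        "row": 100 * cell / row_total,
        "column": 100 * cell / col_total,
        "table": 100 * cell / table_total,
    }


egypt_yes = percentages("Yes", "Egypt")
for kind, value in egypt_yes.items():
    print(f"{kind} percent: {value:.2f}%")
```

The same count of 300 gives three different answers because each percentage answers a different question about who is being counted.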
                              Country
                              Britain   Egypt   Germany   Russia   U.S.    Total
No     Count                    336       70      460        90     293     1249
       % of Row Total          26.9      5.6     36.8       7.2    23.5      100
       % of Column Total       33.0      7.0     46.0       8.9    29.0     24.8
       % of Table Total         6.7      1.4      9.1       1.8     5.8     24.8
Yes    Count                    529      300      340       500     506     2175
       % of Row Total          24.3     13.8     15.6      23.0    23.3      100
       % of Column Total       52.0     30.0     34.0      49.5    50.0     43.2
       % of Table Total        10.5      6.0      6.8       9.9    10.0     43.2
N/A    Count                    153      630      200       420     212     1615
       % of Row Total           9.5     39.0     12.4      26.0    13.1      100
       % of Column Total       15.0     63.0     20.0      41.6    21.0     32.1
       % of Table Total         3.0     12.5      4.0       8.3     4.2     32.1
Total  Count                   1018     1000     1000      1010    1011     5039
       % of Row Total          20.2     19.8     19.8      20.0    20.1      100
       % of Column Total        100      100      100       100     100      100
       % of Table Total        20.2     19.8     19.8      20.0    20.1      100

Table 2.4 Another contingency table of Social Networking and Country showing the counts and the percentages these counts represent. For each count, there are three choices for the percentage: by row, by column, and by table total. There’s probably too much information here for this table to be useful.

Conditional Distributions
The more interesting questions are contingent on something. We’d like to know, for example, whether these countries are similar in use and availability of social networking. That’s the kind of information that could inform a business decision. Table 2.5 shows the distribution of social networking conditional on country. By comparing the frequencies conditional on Country, we can see interesting patterns. For example, Germany stands out as the country in which the largest percentage (46%) have Internet access but don’t use social networking (“No”). Russia and Egypt may have more respondents with no Internet access, but those who have access are very likely to use social networking. A distribution like this is called a conditional distribution because it shows the distribution of one variable for just those cases that satisfy a condition on another. In a contingency table, when the distribution of one variable is the same for all categories of another variable, we say that the two variables are independent. That tells us there’s no association between these variables. We’ll see a way to check for independence formally later in the book. For now, we’ll just compare the distributions.

                         Britain   Egypt   Germany   Russia   U.S.    Total
No     Count               336       70      460        90     293     1249
       % of Column        33.0      7.0     46.0       8.9    29.0     24.8
Yes    Count               529      300      340       500     506     2175
       % of Column        52.0     30.0     34.0      49.5    50.0     43.2
N/A    Count               153      630      200       420     212     1615
       % of Column        15.0     63.0     20.0      41.6    21.0     32.1
Total  Count              1018     1000     1000      1010    1011     5039
       % of Column         100      100      100       100     100      100

Table 2.5 The conditional distribution of Social Networking conditioned on the values of Country. This table shows the column percentages.
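As an informal illustration of "just comparing the distributions," the following sketch computes the conditional distribution of Social Networking for each country from the Table 2.3 counts and reports how far apart the "No" percentages are. The helper name conditional_on_country and the spread summary are assumptions made for this example; the formal check for independence comes later in the book.

```python
# Counts from Table 2.3: one list per response, one entry per country.
countries = ["Britain", "Egypt", "Germany", "Russia", "U.S."]
counts = {
    "No": [336, 70, 460, 90, 293],
    "Yes": [529, 300, 340, 500, 506],
    "N/A": [153, 630, 200, 420, 212],
}


def conditional_on_country(country: str) -> dict:
    """Distribution of Social Networking (in percent) within one country."""
    col = countries.index(country)
    total = sum(row[col] for row in counts.values())
    return {response: 100 * row[col] / total for response, row in counts.items()}


for country in countries:
    dist = conditional_on_country(country)
    summary = ", ".join(f"{r}: {p:.1f}%" for r, p in dist.items())
    print(f"{country:<8} {summary}")

# If the variables were independent, these percentages would be about equal.
no_percents = [conditional_on_country(c)["No"] for c in countries]
spread = max(no_percents) - min(no_percents)
print(f"Spread of 'No' percentages across countries: {spread:.1f} points")
```

A gap this large between the smallest and largest "No" percentages suggests the conditional distributions are far from identical, so the two variables do not look independent in these data.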
For Example
Contingency tables and side-by-side bar charts
Here is a contingency table of the responses for 1008 adult U.S. respondents to the question about watching the Super Bowl discussed in the previous For Example.

                  Sex
                  Female   Male   Total
Game                198     277     475
Commercials         154      79     233
Won’t Watch         160     132     292
NA/Don’t Know         4       4       8
Total               516     492    1008

Question Does it seem that there is an association between what viewers are interested in watching and their sex?
Answer First, find the conditional distributions of the four responses for each sex:
For Men:
Game = 277/492 = 56.3%
Commercials = 79/492 = 16.1%
Won’t Watch = 132/492 = 26.8%
NA/Don’t Know = 4/492 = 0.8%
For Women:
Game = 198/516 = 38.4%
Commercials = 154/516 = 29.8%
Won’t Watch = 160/516 = 31.0%
NA/Don’t Know = 4/516 = 0.8%
Now display the two distributions with side-by-side bar charts:
(Side-by-side bar chart titled “Super Bowl Poll”: Percent, 0 to 60, by Response, with one bar each for Men and Women labeled with the percentages above.)
Based on this poll it appears that women were only slightly less interested than men in watching the Super Bowl telecast: 31% of the women said they didn’t plan to watch, compared to just under 27% of men. Among those who planned to watch, however, there appears to be an association between the viewer’s sex and what the viewer is most looking forward to. While more women are interested in the game (38%) than the commercials (30%), the margin among men is much wider: 56% of men said they were looking forward to seeing the game, compared to only 16% who cited the commercials.
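The conditional distributions in this example can be reproduced, and roughly pictured, with a short sketch. It uses only the counts from the contingency table above; the text-based bars stand in for a real side-by-side bar chart, and the function names are illustrative.

```python
responses = ["Game", "Commercials", "Won't Watch", "NA/Don't Know"]
counts = {
    "Men": [277, 79, 132, 4],
    "Women": [198, 154, 160, 4],
}


def conditional_percents(group_counts: list) -> list:
    """Percent of one group's respondents giving each response."""
    total = sum(group_counts)
    return [100 * c / total for c in group_counts]


def text_bar(percent: float, scale: float = 2.0) -> str:
    """Draw one '#' for every `scale` percentage points."""
    return "#" * round(percent / scale)


percents = {sex: conditional_percents(c) for sex, c in counts.items()}

# Print the two groups next to each other for every response.
for i, response in enumerate(responses):
    for sex in counts:
        p = percents[sex][i]
        print(f"{response:<14} {sex:<6} {text_bar(p):<30} {p:5.1f}%")
```

Placing the two bars for each response next to each other is what makes the comparison between men and women immediate.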
Just Checking
So that they can balance their inventory, an optometry shop collects the following data for customers in the shop.

            Eye Condition
Sex         Nearsighted   Farsighted   Need Bifocals   Total
Males            6            20             6           32
Females          4            16            12           32
Total           10            36            18           64

1 What percent of females are farsighted?
2 What percent of nearsighted customers are female?
3 What percent of all customers are farsighted females?
4 What’s the distribution of Eye Condition?
5 What’s the conditional distribution of Eye Condition for males?
6 Compare the percent who are female among nearsighted customers to the percent of all customers who are female.
7 Does it seem that Eye Condition and Sex might be dependent? Explain.
2.4
Segmented Bar Charts and Mosaic Plots Everyone knows what happened in the North Atlantic on the night of April 14, 1912 as the Titanic, thought by many to be unsinkable, sank, leaving almost 1500 passengers and crew members on board to meet their icy fate. Women and children first was the rule for those commanding the lifeboats, but how did the class of ticket held enter into the order? Here is a contingency table of the 2201 people on board, categorized by Survival and Ticket Class.
                         Class
Survival                 First    Second    Third    Crew     Total
Alive   Count             203      118       178      212      711
        % of Column      62.5%    41.4%     25.2%    24.0%    32.3%
Dead    Count             122      167       528      673     1490
        % of Column      37.5%    58.6%     74.8%    76.0%    67.7%
Total   Count             325      285       706      885     2201
        % of Column       100%     100%      100%     100%     100%

Table 2.6 A contingency table of Class by Survival with only counts and column percentages. Each column represents the conditional distribution of Survival for a given category of ticket Class.
Looking at how the percentages change across each row, it sure looks like ticket class mattered in whether a passenger survived. To make it more vivid, we could display the percentages for surviving and not for each Class in a side-by-side bar chart such as the one in Figure 2.8. Now it’s easy to compare the risks. Among first-class passengers, 37.5% perished, compared to 58.6% for second-class ticket holders, 74.8% for those in third class, and 76.0% for crew members.
Figure 2.8 Side-by-side bar chart showing the conditional distribution of Survival for each category of ticket Class. (Vertical axis: Percent, 0 to 80; horizontal axis: Ticket Class, with First, Second, Third, and Crew; legend: Survival, Alive or Dead.)
We could also display the Titanic information by dividing up bars rather than circles (as we did for pie charts). The resulting segmented (or stacked) bar chart treats each bar as the “whole” and divides it proportionally into segments corresponding to the percentage in each group. We can clearly see that the distributions of ticket Class are different, indicating again that survival was not independent of ticket Class.
Figure 2.9 A segmented bar chart for Class by Survival. Notice that although the totals for survivors and nonsurvivors are quite different, the bars are the same height because we have converted the numbers to percentages. Compare this display with the bar chart in Figure 2.8. (Vertical axis: Percent, 0 to 100; one bar each for Alive and Dead; segments by Class: First, Second, Third, Crew.)
A variant of the segmented bar chart, a mosaic plot, looks like a segmented bar chart, but obeys the area principle better by making the bars proportional to the sizes of the groups. Now, each rectangle is proportional to the number of cases in the data set. Mosaic plots are increasingly popular for displaying contingency tables and are found in many software packages.
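To see why a mosaic plot obeys the area principle, it helps to compute the tile sizes directly. The sketch below lays out a unit square from the Table 2.6 counts, with one bar for Alive and one for Dead as in the segmented bar chart; the layout rule (width from the group size, height from the share within the group) is one straightforward way to build such a plot, and the function name is illustrative rather than taken from any plotting package.

```python
classes = ["First", "Second", "Third", "Crew"]
counts = {
    "Alive": [203, 118, 178, 212],
    "Dead": [122, 167, 528, 673],
}
grand_total = sum(sum(row) for row in counts.values())


def mosaic_rectangles() -> list:
    """Return (survival, class, width, height) for each tile of a unit square."""
    tiles = []
    for survival, row in counts.items():
        width = sum(row) / grand_total      # bar width follows the group size
        for ticket_class, n in zip(classes, row):
            height = n / sum(row)           # segment height is the share within the bar
            tiles.append((survival, ticket_class, width, height))
    return tiles


tiles = mosaic_rectangles()
for survival, ticket_class, width, height in tiles:
    area = width * height                   # equals the cell count / grand total
    print(f"{survival:<6}{ticket_class:<7} width={width:.3f} "
          f"height={height:.3f} area={area:.4f}")
```

Because each tile's area works out to its count divided by the total number of people on board, larger groups occupy visibly more of the plot, which a segmented bar chart with equal-width bars cannot show.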
Figure 2.10 A mosaic plot for Class by Survival. The plot is just like the segmented bar chart in Figure 2.9 except that the space has been taken out between the categories on the x-axis and the rectangles are proportional to the number of cases of the x variable as well. We can easily see that the number of survivors was far less than the nonsurvivors, something that we can’t see in the bar charts. (Vertical axis: Percent, 0.00 to 1.00; horizontal axis: Alive, Dead; tiles labeled 1, 2, 3, and Crew for ticket Class.)
Guided Example
Food Safety Food storage and food safety are major issues for multinational food companies. A client wants to know if people of all age groups have the same degree of concern so GfK Roper Consulting asked 1500 people in five countries whether they agree with the following statement: “I worry about how safe the food I buy is.” We would want to report to the client how concerns about food safety are related to age.
Plan
Setup
• State the objectives and goals of the study.
• Identify and define the variables.
• Provide the time frame of the data collection process.
Determine the appropriate analysis for data type.
The client wants to examine the distribution of responses to the food safety question and see whether they are related to the age of the respondent. GfK Roper Consulting collected data on this question in the fall of 2005 for their 2006 Worldwide report. We will use the data from that study. The variable is Food Safety. The responses are in nonoverlapping categories of agreement, from Agree Completely to Disagree Completely (and Don’t Know). There were originally 12 age groups, which we can combine into five:

Teen            13–19
Young Adult     20–29
Adult           30–39
Middle Aged     40–49
Mature          50 and older

Both variables, Food Safety and Age, are ordered categorical variables. To examine any differences in responses across age groups, it is appropriate to create a contingency table and a side-by-side bar chart. Here is a contingency table of “Food Safety” by “Age”.
Mechanics For a large data set like this, we rely on technology to make tables and displays.

               Food Safety
               Agree        Agree      Neither Disagree   Disagree   Disagree     Don’t
Age            Completely   Somewhat   Nor Agree          Somewhat   Completely   Know    Total
Teen             16.19       27.50        24.32            19.30       10.58      2.12    100%
Young Adult      20.55       32.68        23.81            14.94        6.98      1.04    100%
Adult            22.23       34.89        23.28            12.26        6.75      0.59    100%
Middle Aged      24.79       35.31        22.02            12.43        5.06      0.39    100%
Mature           26.60       33.85        21.21            11.89        5.82      0.63    100%

A side-by-side bar chart is particularly helpful when comparing multiple groups.
A side-by-side bar chart shows the percent of each response to the question by age group.
(Side-by-side bar chart: Percent Response, 0 to 40, by Age Group, with Teen, Young Adult, Adult, Middle Aged, and Mature; one bar per response category, from Agree Completely to Don’t Know.)
Report
Summary and Conclusions Summarize the charts and analysis in context. Make recommendations if possible and discuss further analysis that is needed.
Memo
Re: Food safety concerns by age
Our analysis of the GfK Roper Reports™ Worldwide survey data shows a weak pattern of concern about food safety that generally increases from youngest to oldest. Our analysis thus far has not considered whether this trend is consistent across countries. If it were of interest to your group, we could perform a similar analysis for each of the countries. The enclosed tables and plots provide support for these conclusions.
2.5
Simpson’s Paradox Here’s an example showing that combining percentages across very different values or groups can give confusing results. Suppose there are two sales representatives, Peter and Katrina. Peter argues that he’s the better salesperson, since he managed to close 83% of his last 120 prospects compared with Katrina’s 78%. But let’s look at the data a little more closely. Here (Table 2.7) are the results for each of their last 120 sales calls, broken down by the product they were selling.
Founded        1983
Employees      8536
Stock price    12.625
Average        3510.54
              Product
Sales Rep     Printer Paper       USB Flash Drive     Overall
Peter         90 out of 100       10 out of 20        100 out of 120
              90%                 50%                 83%
Katrina       19 out of 20        75 out of 100       94 out of 120
              95%                 75%                 78%

Table 2.7 Look at the percentages within each Product category. Who has a better success rate closing sales of paper? Who has the better success rate closing sales of Flash Drives? Who has the better performance overall?
Look at the sales of the two products separately. For printer paper sales, Katrina had a 95% success rate, and Peter only had a 90% rate. When selling flash drives, Katrina closed her sales 75% of the time, but Peter only 50%. So Peter has better “overall” performance, but Katrina is better selling each product. How can this be? This problem is known as Simpson’s Paradox, named for the statistician who described it in 1951. Although it is rare, there have been a few well-publicized cases of it. As we can see from the example, the problem results from inappropriately combining percentages of different groups. Katrina concentrates on selling flash drives, which is more difficult, so her overall percentage is heavily influenced by her flash drive average. Peter sells more printer paper, which appears to be easier to sell. With their different patterns of selling, taking an overall percentage is misleading. Their manager should be careful not to conclude rashly that Peter is the better salesperson. The lesson of Simpson’s Paradox is to be sure to combine only comparable measurements for comparable individuals. Be especially careful when combining across different levels of a second variable. It’s usually better to compare percentages within each level, rather than across levels.
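The reversal in Table 2.7 can be verified with a few lines of arithmetic. This sketch uses only the counts from the table; the dictionary layout and function names are illustrative.

```python
# (sales closed, sales calls) for each rep and product, from Table 2.7.
results = {
    "Peter": {"Printer Paper": (90, 100), "USB Flash Drive": (10, 20)},
    "Katrina": {"Printer Paper": (19, 20), "USB Flash Drive": (75, 100)},
}


def rate(closed: int, calls: int) -> float:
    """Success rate in percent."""
    return 100 * closed / calls


def overall_rate(rep: str) -> float:
    """Success rate after pooling a rep's calls across both products."""
    closed = sum(c for c, _ in results[rep].values())
    calls = sum(n for _, n in results[rep].values())
    return rate(closed, calls)


for product in ("Printer Paper", "USB Flash Drive"):
    for rep in results:
        print(f"{product:<16} {rep:<8} {rate(*results[rep][product]):5.1f}%")

for rep in results:
    print(f"{'Overall':<16} {rep:<8} {overall_rate(rep):5.1f}%")
```

Katrina has the higher rate within each product, yet Peter has the higher pooled rate, because 100 of his 120 calls were for the product that is easier to sell.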
Discrimination? One famous example of Simpson’s Paradox arose during an investigation of admission rates for men and women at the University of California at Berkeley’s graduate schools. As reported in an article in Science, about 45% of male applicants were admitted, but only about 30% of female applicants got in. It looked like a clear case of discrimination. However, when the data were broken down by school (Engineering, Law, Medicine, etc.), it turned out that within each school, the women were admitted at nearly the same or, in some cases, much higher rates than the men. How could this be? Women applied in large numbers to schools with very low admission rates. (Law and Medicine, for example, admitted fewer than 10%.) Men tended to apply to Engineering and Science. Those schools have admission rates above 50%. When the total applicant pool was combined and the percentages were computed, the women had a much lower overall rate, but the combined percentage didn’t really make sense.
What Can Go Wrong? • Don’t violate the area principle. This is probably the most common mistake in a graphical display. Violations of the area principle are often made for the sake of artistic presentation. Consider this pie chart of ways that respondents said they commute to work.
Would it surprise you to learn that the fraction who “Shared” rides to work is 33%, while the fraction who “Drive Alone” is 41%? This pie chart was made in Excel, but overuse of features that make it look interesting has hurt its ability to convey accurate information.
• Keep it honest. Here’s a pie chart that displays data on the percentage of high school students who engage in specified dangerous behaviors as reported by the Centers for Disease Control. What’s wrong with this plot?
(Pie chart, drawn on a slant, with slices labeled Use Marijuana, Use Alcohol, and Heavy Drinking and percentages of 26.7%, 50.0%, and 31.5%.)
Try adding up the percentages. Or look at the 50% slice. Does it look right? Then think: What are these percentages of? Is there a “whole” that has been sliced up? In a pie chart, the proportions shown by each slice of the pie must add up to 100%, and each individual must fall into only one category. Of course, showing the pie on a slant makes it even harder to detect the error. Here’s another one. This chart shows the average number of texts in various time periods by American cell phone customers in the period 1999 to 2013.
(Chart with a vertical axis from 0 to 400 and bars labeled 1999 through 2008, then March 2009 through March 2013.)
It may look as though text messaging decreased suddenly some time around 2009, which probably doesn’t seem right to you. In fact, this chart has several problems. First, it’s not a bar chart. Bar charts display counts of categories. This chart is a plot of a quantitative variable (average number of texts) against time, although to make it worse, some of the time periods are missing. Of course, the real problem is that starting in 2009, they reported the data for only one month instead of for the entire year.
• Don’t confuse percentages. Many percentages based on conditional and joint distributions sound similar, but are different (see Table 2.4):
  • The percentage of Russians who answered “Yes”: This is 500/1010 or 49.5%.
  • The percentage of those who answered “Yes” who were Russian: This is 500/2175 or 23%.
  • The percentage of those who were Russian and answered “Yes”: This is 500/5039 or 9.92%.
  In each instance, pay attention to the wording that makes a restriction to a smaller group (those who are Russian, those who answered “Yes,” and all respondents, respectively) before a percentage is found. This restricts the who of the problem and the associated denominator for the percentage. Your discussion of results must make these differences clear.
• Don’t forget to look at the variables separately, too. When you make a contingency table or display a conditional distribution, be sure to also examine the marginal distributions. It’s important to know how many cases are in each category.
• Be sure to use enough individuals. When you consider percentages, take care that they are based on a large enough number of individuals (or cases). Take care not to make a report such as this one:
  We found that 66.67% of the companies surveyed improved their performance by hiring outside consultants. The other company went bankrupt.
• Don’t overstate your case. Independence is an important concept, but it is rare for two variables to be entirely independent. We can’t conclude that one variable has no effect whatsoever on another. Usually, all we know is that little effect was observed in our study. Other studies of other groups under other circumstances could find different results.
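The advice under "Keep it honest" to add up the percentages is easy to automate before drawing a pie chart. The sketch below is a minimal check under one assumption: a small tolerance (half a percentage point here, chosen arbitrarily) allows for rounding. The function name is hypothetical. Passing the check is necessary but not sufficient, since the categories must also not overlap.

```python
import math


def can_be_pie_chart(percentages: list, tolerance: float = 0.5) -> bool:
    """True when the slices plausibly partition a whole (sum close to 100%).

    The tolerance is an assumed allowance for rounding in reported values.
    """
    return math.isclose(sum(percentages), 100.0, abs_tol=tolerance)


# Percentages shown on the dangerous-behaviors pie chart: they sum to 108.2.
risky_behaviors = [26.7, 50.0, 31.5]
# Relative frequencies from Table 2.2 (No, Yes, N/A).
social_networking = [24.787, 43.163, 32.050]

print("dangerous behaviors:", can_be_pie_chart(risky_behaviors))
print("social networking:  ", can_be_pie_chart(social_networking))
```

The dangerous-behaviors percentages fail because a student can fall into more than one category, so there is no single whole being sliced up.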
Ethics in Action
Mount Ashland Promotions Inc. is organizing one of its most popular events, the ZenNaturals Annual Trade Fest. At this trade show, producers, manufacturers, and distributors in the natural foods market display the latest trends in organic foods, herbal supplements, and natural body care products. The Trade Fest attracts a wide variety of participants, from large distributors who display a wide range of products to small, independent companies. As in previous years, Nina Li and her team at Mount Ashland are in charge of managing the event, which includes all advertising and publicity as well as arranging spots for exhibitors. The success of this event depends on Nina’s ability to attract large numbers of small independent retailers in the natural foods market who are looking to expand their product lines. She knows that these small retailers tend to be zealously committed to the principles of healthful lifestyle. Moreover, many are members of the Organic Trade Federation (OTF), an organization that advocates ethical consumerism. The OTF has been known to boycott trade shows that include too many products with controversial ingredients such as ginkgo biloba, hemp, or kava kava. Nina is aware that some herbal diet teas have been receiving lots of negative attention lately in trade publications and the popular press. These teas claim to be “thermogenic” or fat burning, and typically contain ma huang (or ephedra). Ephedra is particularly controversial, not only because it can be unsafe for people with certain existing health conditions, but because this fast-acting stimulant commonly found in diet and energy products is contrary to the OTF’s principles and values.
Worried that too many products at the ZenNaturals Trade Fest may be thermogenic teas, Nina decides to take a closer look at vendors already committed to participate in the event. Based on the data that her team pulled together, she finds that more than 33% of them do indeed include teas in their product lines. She was quite surprised to find that this percentage is so high. She decides to categorize the vendors into four groups: (1) those selling herbal supplements only; (2) those selling organic foods and herbal supplements; (3) those selling organic foods, herbal supplements, and natural body care products; and (4) all others. She finds that only 2% of groups 1, 2, and 4 include tea in their product lines, while 34% of the third group do. Even though group 3 contains most of the vendors, Nina instructs her team to use the average percentage 10% in its communications, especially with the OTF, about the upcoming ZenNaturals Annual Trade Fest.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
What Have We Learned?
Learning Objectives
Make and interpret a frequency table for a categorical variable.
• We can summarize categorical data by counting the number of cases in each category, sometimes expressing the resulting distribution as percentages.
Make and interpret a bar chart or pie chart.
• We display categorical data using the area principle in either a bar chart or a pie chart.
Make and interpret a contingency table.
• When we want to see how two categorical variables are related, we put the counts (and/or percentages) in a two-way table called a contingency table.
Make and interpret bar charts and pie charts of marginal distributions.
• We look at the marginal distribution of each variable (found in the margins of the table). We also look at the conditional distribution of a variable within each category of the other variable.
• Comparing conditional distributions of one variable across categories of another tells us about the association between variables. If the conditional distributions of one variable are (roughly) the same for every category of the other, the variables are independent.
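As a minimal sketch of the last point, the following Python snippet computes the conditional distribution of a response within each group and checks whether the distributions are roughly the same. The counts, the names `table` and `conditional_distribution`, and the 2-percentage-point tolerance are all hypothetical choices made for illustration; none of them come from the chapter.

```python
# Hypothetical counts (not from the text): rows are groups, columns are responses.
table = {
    "Group A": {"Yes": 30, "No": 70},
    "Group B": {"Yes": 45, "No": 105},
}


def conditional_distribution(row):
    """Return the proportion of a row's cases that fall in each category."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}


# Conditional distribution of the response within each group.
conditionals = {group: conditional_distribution(row) for group, row in table.items()}

for group, dist in conditionals.items():
    summary = ", ".join(f"{category}: {share:.1%}" for category, share in dist.items())
    print(f"{group}: {summary}")

# With only two response categories, comparing the "Yes" shares is enough.
tolerance = 0.02  # assumed cutoff for "roughly the same"; a judgment call
shares_yes = [dist["Yes"] for dist in conditionals.values()]
looks_independent = max(shares_yes) - min(shares_yes) <= tolerance
print("Looks independent:", looks_independent)
```

In this made-up table both groups answer "Yes" 30% of the time, so the conditional distributions match and the two variables look independent. With real data the distributions will rarely match exactly, which is why the comparison uses a tolerance.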
Terms
Area principle
In a statistical display, each data value is represented by the same amount of area.
Bar chart (relative frequency bar chart)
A chart that represents the count (or percentage) of each category in a categorical variable as a bar, allowing easy visual comparisons across categories.
Cell
Each location in a contingency table, representing the values of two categorical variables, is called a cell.
Column percent
The proportion of each column contained in the cell of a frequency table.
Conditional distribution
The distribution of a variable restricting the Who to consider only a smaller group of individuals.
Contingency table
A table displaying the frequencies (sometimes percentages) for each combination of two or more variables.
Distribution
The distribution of a variable is a list of:
• all the possible values of the variable
• the relative frequency of each value
Frequency table (relative frequency table)
A table that lists the categories in a categorical variable and gives the number (the percentage) of observations for each category.
Independent variables
Variables for which the conditional distribution of one variable is the same for each category of the other.
Marginal distribution
In a contingency table, the distribution of either variable alone. The counts or percentages are the totals found in the margins (usually the right-most column or bottom row) of the table.
Mosaic plot
A mosaic plot is a graphical representation of a (usually two-way) contingency table. The plot is divided into rectangles so that the area of each rectangle is proportional to the number of cases in the corresponding cell.
Pie chart
Pie charts show how a “whole” divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category.
Row percent
The proportion of each row contained in the cell of a frequency table.
Segmented bar chart
A segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable.
Simpson’s paradox
A phenomenon that arises when averages, or percentages, are taken across different groups, and these group averages appear to contradict the overall averages.
Total percent
The proportion of the total contained in the cell of a frequency table.
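The difference between row percent, column percent, and total percent is easiest to see on a small table. The Python sketch below computes all three for a hypothetical 2-by-2 table of counts; the numbers and variable names are illustrative assumptions, not data from the chapter.

```python
# Hypothetical 2-by-2 contingency table of counts (rows x columns).
counts = [
    [20, 30],  # row 1
    [10, 40],  # row 2
]

row_totals = [sum(row) for row in counts]           # marginal totals for rows
col_totals = [sum(col) for col in zip(*counts)]     # marginal totals for columns
grand_total = sum(row_totals)

# Row percent: each cell as a percentage of its row total.
row_percent = [
    [100 * cell / row_totals[i] for cell in row] for i, row in enumerate(counts)
]

# Column percent: each cell as a percentage of its column total.
col_percent = [
    [100 * cell / col_totals[j] for j, cell in enumerate(row)] for row in counts
]

# Total percent: each cell as a percentage of the grand total.
total_percent = [[100 * cell / grand_total for cell in row] for row in counts]

print("Row percent:   ", row_percent)
print("Column percent:", col_percent)
print("Total percent: ", total_percent)
```

The same cell count of 20 is 40% of its row, about 66.7% of its column, and 20% of the whole table, which is why a percentage in a contingency table should always state what it is a percentage of.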
Technology Help: Displaying Categorical Data
Although every package makes a slightly different bar chart, they all have similar features:
[Figure: annotated bar chart of counts (vertical axis from 0 to 1000) for the categories First, Second, Third, and Crew.]
• The chart may have a box around it or not.
• You may be able to add color later on in some programs.
• Counts or relative frequencies are shown on the vertical axis.
• Bar order may be arbitrary, alphabetical, or by first occurrence of the category.
• Bar charts should have spaces between the bars.
Sometimes the count or a percentage is printed above or on top of each bar to give some additional information. You may find that your statistics package sorts category names in annoying orders by default. For example, many packages sort categories alphabetically or by the order the categories are seen in the data set. Often, neither of these is the best choice.
Excel
Excel offers a versatile and powerful tool it calls a PivotTable. A pivot table can summarize, organize, and present data from an Excel spreadsheet. Pivot tables can be used to create frequency distributions and contingency tables. They provide a starting point for several kinds of displays. Pivot tables are linked to data in your Excel spreadsheet so they will update when you make changes to your data. They can also be linked directly to a PivotChart to display the data graphically.
CHAPTER 2 Displaying and Describing Categorical Data
In a pivot table, all types of data are summarized into a row-by-column table format. Pivot table cells can hold counts, percentages, and descriptive statistics.
To create a pivot table:
• Open a data file in Excel. At least one of the variables in the dataset should be categorical.
• Choose Insert > PivotTable or Data > PivotTable (Mac). If you are using a PC, choose to put the pivot table in a new worksheet. Macintosh users should choose the option to create a custom pivot table.
• The PivotTable builder has five boxes:
  • Field List (top): variables from the data set linked to the PivotTable. (The PivotTable tool calls the variables “fields.”) Fields can be selected using the checkbox or dragged and dropped into one of the areas below in the PivotTable builder.
  • Report Filter (middle left): Variables placed here filter the data in the pivot table. When selected, the filter variable name appears above the pivot table. Use the drop-down list to the right of the variable name to choose values to display.
  • Row Labels (bottom left): Values of variables placed here become row labels in the pivot table.
  • Column Labels (middle right): Values of variables placed here become column labels in the pivot table.
  • Values (bottom right): Variables placed here are summarized in the cells of the table. Change settings to display count, sum, minimum, maximum, average, and more or to display percentages and ranks.
To create a frequency distribution pivot table:
• Drag a categorical variable from the Field List into Row Labels.
• Choose another variable from the data set and drag it into Values. Use a unique identifier variable (e.g., subject number) if possible.
• To change which fact or statistic about the Values variable is displayed, click the arrow next to the variable in the Values box and open the Value Field Settings. For a frequency distribution, select count of [VARIABLE]. When changing Value Field Settings, note the tab Show Values As, which provides other display options (e.g., % of row, % of column).
The result will be a frequency table with a column for count.
To create a contingency table using PivotTable:
• Drag a categorical variable from the Field List into Row Labels.
• Drag a second categorical variable from the Field List into Column Labels.
• Choose another variable from the dataset and drag it into Values.
The resulting pivot table is a row-by-column contingency table.
NOTE: As with the frequency distribution, you can use the Value Field Settings to change the type of summary.
To create a chart from a pivot table frequency distribution or contingency table:
• Place the cursor anywhere on the pivot table.
• Click PivotTable Tools > PivotChart.
• Choose the type of chart: options include pie chart, bar chart, and segmented bar graph.
• Move the chart to a new worksheet by right-clicking the chart and selecting Move chart.
• In a bar chart created from a contingency table, by default, rows display on the x-axis and the columns are separate bars. To change this, place the cursor in the chart and choose PivotChart Tools > Design > Switch Row/Column.
• On Macs, choose the Charts tab and select your chart from the ribbon or choose a chart type from the Chart menu.
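For readers who want to see what a pivot table is doing with the raw data, the counting step can be sketched in a few lines of Python using only the standard library. The records below are hypothetical (one made-up education and tenure pair per employee), and the variable names are illustrative; this is a rough analogue of the pivot table summaries, not a description of how Excel works internally.

```python
from collections import Counter

# Hypothetical raw data: one (education, tenure) pair per employee.
records = [
    ("BA", "1-5 Years"),
    ("BA", "More Than 5 Years"),
    ("MA", "1-5 Years"),
    ("None", "More Than 5 Years"),
    ("BA", "1-5 Years"),
    ("None", "More Than 5 Years"),
]

# Frequency table: number of cases in each category of one variable.
frequency = Counter(education for education, _ in records)

# Relative frequency table: each count divided by the number of cases.
relative_frequency = {
    category: count / len(records) for category, count in frequency.items()
}

# Contingency table: number of cases for each combination of two variables.
contingency = Counter(records)

for category, count in sorted(frequency.items()):
    print(f"{category:5} {count:3} {relative_frequency[category]:.1%}")

for (education, tenure), count in sorted(contingency.items()):
    print(f"{education:5} {tenure:18} {count}")
```

A `Counter` returns 0 for combinations that never occur, so empty cells of the contingency table need no special handling.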
XLStat
To create a contingency table from unsummarized data:
• On the XLStat tab, choose Preparing data.
• From the menu, choose Create a contingency table.
• In the dialog box, enter your data range on the General tab. Your data should be in two columns, one of which is the row variable and the other is the column variable.
• On the Outputs tab, check the box next to Contingency table and optionally choose Percentages/Row or Column to see the conditional distributions.
JMP
JMP makes a bar chart and frequency table together.
• From the Analyze menu, choose Distribution.
• In the Distribution dialog, drag the name of the variable into the empty variable window beside the label Y, Columns; click OK.
To make a pie chart,
• Choose Chart > Graph menu.
• In the Chart dialog, select the variable name from the Columns list, click on the button labeled Statistics, and select N from the drop-down menu.
• Click the “Categories, X, Levels” button to assign the same variable name to the x-axis.
• Under Options, click on the second button, labeled “Bar Chart,” and select Pie Chart from the drop-down menu.
Minitab
To make a bar chart,
• Choose Bar Chart from the Graph menu.
• Then select a Simple, Cluster, or Stack chart from the options and click OK.
• To make a Simple bar chart, enter the name of the variable to graph in the dialog box.
• To make a relative frequency chart, click Chart Options, and choose Show Y as Percent.
• In the Chart dialog, enter the name of the variable that you wish to display in the box labeled “Categorical variables.”
• Click OK.
SPSS
To make a bar chart,
• Open the Chart Builder from the Graphs menu.
• Click the Gallery tab.
• Choose Bar Chart from the list of chart types.
• Drag the appropriate bar chart onto the canvas.
• Drag a categorical variable onto the x-axis drop zone.
• Click OK.
Comments
A similar path makes a pie chart by choosing Pie chart from the list of chart types.
Brief Case
Credit Card Bank
In Chapter 1, you identified the W’s for the data in the file Credit Card Bank. For the categorical variables in the data set, create frequency tables, bar charts, and pie charts using your software. What might the bank want to know about these variables? Which of the tables and charts do you find most useful for communicating information about the bank’s customers? Write a brief case report summarizing your analysis and results.
Exercises
Section 2.1
1. As part of the human resource group of your company you are asked to summarize the educational levels of the 512 employees in your division. From company records, you find that 164 have no college degree (None), 42 have an associate’s degree (AA), 225 have a bachelor’s degree (BA), 52 have a master’s degree (MA), and 29 have PhDs. For the educational level of your division: a) Make a frequency table. b) Make a relative frequency table.
2. As part of the marketing group at Pixar, you are asked to find out the age distribution of the audience of Pixar’s latest film. With the help of 10 of your colleagues, you conduct exit interviews by randomly selecting people to question at 20 different movie theaters. You ask them to tell you if they are younger than 6 years old, 6 to 9 years old, 10 to 14 years old, 15 to 21 years old, or older than 21. From 470 responses, you find out that 45 are younger than 6, 83 are 6 to 9 years old, 154 are 10 to 14, 18 are 15 to 21, and 170 are older than 21. For the age distribution: a) Make a frequency table. b) Make a relative frequency table.
Section 2.2
3. From the educational level data described in Exercise 1: a) Make a bar chart using counts on the y-axis. b) Make a relative frequency bar chart using percentages on the y-axis. c) Make a pie chart.
4. From the age distribution data described in Exercise 2: a) Make a bar chart using counts on the y-axis. b) Make a relative frequency bar chart using percentages on the y-axis. c) Make a pie chart.
5. For the educational levels described in Exercise 1: a) Write two to four sentences summarizing the distribution. b) What conclusions, if any, could you make about the educational level at other companies?
6. For the ages described in Exercise 2: a) Write two to four sentences summarizing the distribution. b) What possible problems do you see in concluding that the age distribution from these surveys accurately represents the ages of the national audience for this film?
Section 2.3
7. From Exercise 1, we also have data on how long each person has been with the company (tenure) categorized into three levels: less than 1 year, between 1 and 5 years, and more than 5 years. A table of the two variables together looks like:
                     None    AA    BA    MA   PhD
<1 Year                10     3    50    20    12
1–5 Years              42     9   112    27    15
More Than 5 Years     112    30    63     5     2
a) Find the marginal distribution of the tenure. (Hint: Find the row totals.) b) Verify that the marginal distribution of the education level is the same as that given in Exercise 1.
8. In addition to their age levels, the movie audiences in Exercise 2 were also asked if they had seen the movie before (Never, Once, More than Once). Here is a table showing the responses by age group:
                  Under 6   6–9   10–14   15–21   Over 21
Never                  39    60      84      16       151
Once                    3    20      38       2        15
More Than Once          3     3      32       0         4
a) Find the marginal distribution of their previous viewing of the movie. (Hint: Find the row totals.) b) Verify that the marginal distribution of the ages is the same as that given in Exercise 2.
Section 2.4
9. For the table in Exercise 7: a) Find the column percentages. b) Looking at the column percentages in part a, does the tenure distribution (how long the employee has been with the company) for each educational level look the same? Comment briefly. c) Make a segmented or stacked bar chart showing the tenure distribution for each educational level. d) Is it easier to see the differences in the distributions using the column percentages or the segmented bar chart? e) How would a mosaic plot help to accurately display these data?
10. For the table in Exercise 8: a) Find the column percentages. b) Looking at the column percentages in part a, does the distribution of how many times someone has seen the movie look the same for each age group? Comment briefly. c) Make a segmented bar chart, showing the distribution of viewings for each age level. d) Is it easier to see the differences in the distributions using the column percentages or the segmented bar chart? e) How would a mosaic plot represent these data more appropriately?
Chapter Exercises
11. Graphs in the news. Find a bar graph of categorical data from a business publication (e.g., Bloomberg Businessweek, Fortune, The Wall Street Journal, etc.). a) Is the graph clearly labeled? b) Does it violate the area principle? c) Does the accompanying article tell the W’s of the variable? d) Do you think the article correctly interprets the data? Explain.
12. Graphs in the news, part 2. Find a pie chart of categorical data from a business publication (e.g., Bloomberg Businessweek, Fortune, The Wall Street Journal, etc.). a) Is the graph clearly labeled? b) Does it violate the area principle? c) Does the accompanying article tell the W’s of the variable? d) Do you think the article correctly interprets the data? Explain.
13. Tables in the news. Find a frequency table of categorical data from a business publication (e.g., Bloomberg Businessweek, Fortune, The Wall Street Journal, etc.). a) Is it clearly labeled? b) Does it display percentages or counts? c) Does the accompanying article tell the W’s of the variable? d) Do you think the article correctly interprets the data? Explain.
14. Tables in the news, part 2. Find a contingency table of categorical data from a business publication (e.g., Bloomberg Businessweek, Fortune, The Wall Street Journal, etc.). a) Is it clearly labeled? b) Does it display percentages or counts? c) Does the accompanying article tell the W’s of the variable? d) Do you think the article correctly interprets the data? Explain.
15. Bottled water market share. A local survey company conducted a survey on the consumption of bottled water in its region of operations. The results of the survey were summarized in the following pie chart:
[Pie chart with slices labeled Brand 1, Brand 2, Brand 3, Brand 4, Brand 5, and Brand 6.]
a) Which brand of bottled water has the highest consumption? b) Is this an appropriate method to display this data?
16. World market share. The Wall Street Journal article described in Exercise 15 also indicated the market share of the leading brands of carbonated beverages. The following bar chart displays the values:
[Bar chart: Market Share (vertical axis from 0% to 60%) with bars labeled Coke, Diet Coke, Pepsi-Cola, Mt. Dew, Dr Pepper, and Other.]
a) Is this an appropriate display for these data? Explain. b) Which brand had the largest share of the beverage market? c) Which brand had the larger market share, Mountain Dew or Dr Pepper?
17. Market share again. Here’s a bar chart of the data in Exercise 15.
[Bar chart: Market Share (vertical axis from 0% to 50%) with bars labeled Coca-Cola, PepsiCo, Dr Pepper Snapple, Cott, and National Beverage.]
a) Compared to the pie chart in Exercise 15, which is better for displaying the relative portions of market share? Explain. b) What is missing from this display that might make it somewhat misleading?
18. World market share again. Here’s a pie chart of the data in Exercise 16.
[Pie chart with slices labeled Coke, Diet Coke, Pepsi-Cola, Mountain Dew, Dr Pepper, and Other.]
a) Which display of these data is best for comparing the market shares of these brands? Explain. b) Does Mountain Dew or Dr Pepper have a bigger market share? Is that comparison easier to make with the pie chart or the bar chart of Exercise 16?
19. Insurance company. An insurance company is updating its payouts and cost structure for their insurance policies. Of particular interest to them is the risk analysis for customers currently on heart or blood pressure medication. The Centers for Disease Control and Prevention (www.cdc.gov) lists causes of death in the United States during one year as follows.
Cause of Death                     Percent
Heart disease                         30.3
Cancer                                23.0
Circulatory diseases and stroke        8.4
Respiratory diseases                   7.9
Accidents                              4.1
a) Is it reasonable to conclude that heart or respiratory diseases were the cause of approximately 38% of U.S. deaths during this year? b) What percent of deaths were from causes not listed here? c) Create an appropriate display for these data.
20. College value? An international airline company asked 2023 of its first-class passengers and 1255 of its economy-class passengers whether they would “rate the on-board services offered to passengers while in flight” as Excellent, Good, Only Fair, or Poor.
Class          Excellent   Good   Only Fair   Poor   No Answer/Don’t Know
First Class         1200    800          20      3                      0
Economy              500    400         150     40                     65
a) Compare the distribution of opinions between first-class passengers and economy-class passengers. b) Is it reasonable to conclude that 10.00% of economy-class passengers think that the services offered are excellent?
21. Numeracy skills. According to the OECD’s Skills Outlook 2013 report, Japanese adults have the highest proficiency in numeracy. Less than 40% of Japanese adults score at the lowest two levels with 8.2% scoring at Level 1 or below and 28.1% at Level 2, which is unique for the OECD. More than 60% score at the highest levels with 43.7% scoring at Level 3 and 18.8% scoring at Level 4 or 5. Create an appropriate graphical display of this information and write a sentence or two that might appear in a newspaper article about numeracy skills in Japan.
22. Soft drinks. A domestic soft drink distributor reported that their market sales for the previous year were broken down as follows: 38.9% Pepsi, 48.2% Sprite, and the rest of their $35 million revenues were due to diet soft drinks. Create an appropriate graphical display of this data and describe the data representing the sales of the company.
23. Small business productivity. The Wells Fargo/Gallup Small Business Index asked 604 small business owners in October 2011 “how difficult or easy do you think it will be for your company to obtain credit when you need it?” 22% said “Very difficult,” 21% “Somewhat difficult,” 28% “About Average,” 11% “Somewhat easy,” and 11% “Very Easy.” a) What do you notice about the percentages listed? How could this be? b) Make a bar chart to display the results and label it clearly. c) Would a pie chart be an effective way of communicating this information? Why or why not? d) Write a couple of sentences on how small businesses felt about the difficulty of obtaining credit in late 2011.
24. Attack traffic. According to Akamai’s 2013 State of the Internet report (www.akamai.com/stateoftheinternet/), 35% of observed attack traffic originated from China, 20% from Indonesia, 11% from the US, 5.2% from Taiwan and 2.6% from Russia. a) What do you notice about the percentages listed? b) Make a bar chart to display the results and label it clearly. c) Would a pie chart be an effective way of communicating this information? Why or why not? d) Write a couple of sentences on where observed attack traffic originates.
25. Environmental hazard 2012. Data from the International Tanker Owners Pollution Federation Limited (www.itopf.com) give the cause of spillage for 455 large oil tanker accidents from 1970 to 2012. Here are the displays. Write a brief report interpreting what the displays show. Is a pie chart an appropriate display for these data? Why or why not?
[Bar chart (vertical axis from 0 to 160) and pie chart of the number of spills by cause: Collisions, Equipment Failure, Fires & Explosions, Groundings, Hull Failures, and Other/Unknown. The values printed on the pie chart are 41, 60, 134, 18, 149, and 53.]
26. Olympic medals. In the history of the modern Olympics, the United States has won more medals than any other country. But the United States has a large population. Perhaps a better measure of success is the number of medals won per capita, that is, the number of medals divided by the population. By that measure, the leading countries are Liechtenstein (255.42 medals/cap), Norway (95.271), Finland (86.514), and Sweden (66.455). The following table summarizes the medals/capita counts for the 100 countries with the most medals. a) Try to make a display of these data. What problems do you encounter? b) Can you find a way to organize the data so that the graph is more successful?
Medals/capita   # Countries   Medals/capita   # Countries
            0            72             130             0
           10            13             140             0
           20             3             150             0
           30             3             160             0
           40             2             170             0
           50             0             180             0
           60             1             190             0
           70             0             200             0
           80             1             210             0
           90             1             220             0
          100             0             230             0
          110             0             240             0
          120             0             250             1
27. Importance of wealth. GfK Roper Reports Worldwide surveyed people, asking them “How important is acquiring wealth to you?” The percent who responded that it was of more than average importance were: 71.9% China, 59.6% France, 76.1% India, 45.5% U.K., and 45.3% U.S. There were about 1500 respondents per country. A report showed the following bar chart of these percentages.
[Bar chart: percentages (vertical axis from 40% to 80%) with bars labeled China, France, India, U.K., and U.S.]
a) How much larger is the proportion of those who said acquiring wealth was important in India than in the United States? b) Is that the impression given by the display? Explain. c) How would you improve this display? d) Make an appropriate display for the percentages. e) Write a few sentences describing what you have learned about attitudes toward acquiring wealth.
28. Importance of power. In the same survey as that discussed in Exercise 27, GfK Roper Consulting also asked “How important is having control over people and resources to you?” The percent who responded that it was of more than average importance are given in the following table:
China     49.1%
France    44.1%
India     74.2%
U.K.      27.8%
U.S.      36.0%
Here’s a pie chart of the data:
[Pie chart with slices labeled China, France, India, U.K., and U.S.]
a) List the errors you see in this display. b) Make an appropriate display for the percentages. c) Write a few sentences describing what you have learned about attitudes toward acquiring power.
29. Google financials. Google Inc. derives revenue from three major sources: advertising revenue from their websites, advertising revenue from the thousands of third-party websites that comprise the Google Network, and licensing and miscellaneous revenue. The following table shows the percentage of all revenue derived from these sources for the period from 2008 to 2012.
                                    2008   2009   2010   2011   2012
Google Websites                      66%    67%    66%    69%    68%
Google Network Members’ Websites     31%    30%    30%    28%    27%
Other Revenues                        3%     3%     4%     3%     5%
a) Are these row or column percentages? b) Make an appropriate display of these data. c) Write a brief summary of this information.
30. Gender at work. According to the World Bank’s 2014 Gender at Work report, women are underrepresented in every type of employment, with greater gaps in developing countries. The following table shows percent distribution of employment type by country and gender.
                             Developing Countries    High-income Countries
Employment type              Women       Men         Women       Men
Business owners                11%       20%            8%       11%
Self-employed                  19%       24%           14%        7%
Employed for an employer       16%       30%           25%       45%
Unemployed                      5%        5%            5%        6%
Out of workforce               49%       21%           47%       31%
a) Are these column, row, or total percentages? How do you know? b) What percent of women from developing countries were self-employed? c) From this table, can you determine what percent of all self-employed women are from developing countries? d) Among women from developing countries, what percent were not working? e) Write a few sentences describing the association between employment type and country together with gender.
31. New product. A company started and managed by business students is selling campus calendars. The students have conducted a market survey with the various campus constituents to determine sales potential and identify which market segments should be targeted. (Should they advertise in the Alumni Magazine and/or the local newspaper?) The following table shows the results of the market survey.
                    Buying Likelihood
Campus Group        Unlikely   Moderately Likely   Very Likely   Total
Students                 197                 388           320     905
Faculty/Staff            103                 137            98     338
Alumni                    20                  18            18      56
Town Residents            13                  58            45     116
Total                    333                 601           481    1415
a) What percent of all these respondents are alumni? b) What percent of these respondents are very likely to buy the calendar? c) What percent of the respondents who are very likely to buy the calendar are alumni? d) Of the alumni, what percent are very likely to buy the calendar? e) What is the marginal distribution of the campus constituents? f) What is the conditional distribution of the campus constituents among those very likely to buy the calendar? g) Does this study present any evidence that this company should focus on selling to certain campus constituents?
32. Stock performance. The following table displays information for 470 of the S&P 500 stocks, on how their one-day change on October 24, 2011 (a day on which the S&P 500 index gained 1.23%) compared with their year to date change.
                       Year to Date
October 24, 2011       Positive Change   Negative Change
Positive Change                    164               233
Negative Change                     48                25
a) What percent of the companies reported a positive change in their stock price over the year to date? b) What percent of the companies reported a positive change in their stock price over both time periods? c) What percent of the companies reported a negative change in their stock price over both time periods? d) What percent of the companies reported a positive change in their stock price over one period and a negative change in the other period? e) Among those companies reporting a positive change in their stock price on October 24 over the prior day what percentage also reported a positive change over the year to date? f) Among those companies reporting a negative change in their stock price on October 24 over the prior day what percentage reported a positive change over the year to date? g) What relationship, if any, do you see between the performance of a stock on a single day and its year-to-date performance?
33. Real estate. The Greenville, South Carolina Real Estate Hub keeps track of home sales in their area. They reported that sales were down in 2010 by about 3.7% from the previous year. Here are the numbers of homes sold in Greenville for the last 5 months of 2009 and 2010:
        August   September   October   November   December
2010       475         466       502        423        495
2009       607         597       596        581        447
a) What percent of all homes in these 10 months were sold in October of 2009? b) What percent of all homes in these 10 months were sold in 2010? c) What percent of all homes in these 10 months were sold in December? d) How did the percent of homes sold in November change from 2009 to 2010?
34. Google financials, part 2. Google Inc. divides their total costs and expenses into five categories: Costs of Revenues, Research and Development, Sales and Marketing, General and Administrative, and Dept of Justice charges (amounts in $Millions).
                               2008      2009      2010      2011      2012
Cost of Revenues              $8612     $8844   $10,417   $13,188   $20,634
Research and Development      $2793     $2843     $3762     $5162     $6793
Sales and Marketing           $1946     $1984     $2799     $4589     $6143
General and Administrative    $1803     $1667     $1962     $2724     $3845
Dept of Justice                  $0        $0        $0      $500        $0
Total Costs and Expenses    $15,154   $15,338   $18,940   $26,163   $37,415
a) What percent of total costs and expenses were sales and marketing in 2008? In 2012? b) What percent of total costs and expenses were due to research and development in 2008? In 2012? c) Have general and administrative costs grown as a percentage of total costs and expenses over this time period?
Exercises 73
35. Movie ratings. The movie ratings system is a voluntary system operated jointly by the Motion Picture Association of America (MPAA) and the National Association of Theatre Owners (NATO). The ratings themselves are given by a board of parents who are members of the Classification and Ratings Administration (CARA). The board was created in response to outcries from parents in the 1960s for some kind of regulation of film content, and the first ratings were introduced in 1968. Here is information on the ratings of 279 movies that came out in 2011, also classified by their genre.

| Genre | G | PG | PG-13 | R or NC-17 | Total |
|---|---|---|---|---|---|
| Action/Adventure | 3 | 19 | 23 | 18 | 63 |
| Comedy | 1 | 8 | 13 | 34 | 56 |
| Drama | 2 | 13 | 36 | 65 | 116 |
| Thriller/Suspense | 0 | 1 | 20 | 23 | 44 |
| Total | 6 | 41 | 92 | 140 | 279 |

a) Find the conditional distribution (in percentages) of movie ratings for action/adventure films. b) Find the conditional distribution (in percentages) of movie ratings for thriller/suspense films. c) Create a graph comparing the ratings for the four genres. d) Are Genre and Rating independent? Write a brief summary of what these data show about movie ratings and the relationship to the genre of the film.

36. CyberShopping. It has become more common for shoppers to “comparison shop” using the Internet. Respondents to a Pew survey in 2013 who owned cell phones were asked whether they had, in the past 30 days, looked up the price of a product while they were in a store to see if they could get a better price somewhere else. Here is a table of their responses by income level.

| Response | <$30K | $30K–$49.9K | $50K–$74.9K | $75K+ |
|---|---|---|---|---|
| Yes | 207 | 115 | 134 | 204 |
| No | 625 | 406 | 260 | 417 |

(Source: www.pewinternet.org/Reports/2012/In-store-mobile-commerce.aspx)

a) Find the conditional distribution (in percentages) of income distribution for those who do not compare prices on the Internet. b) Find the conditional distribution (in percentages) of income distribution for shoppers who do compare prices. c) Create a graph comparing the income distributions of those who compare prices with those who don’t. d) Do you see any differences between the conditional distributions? Write a brief summary of what these data show about Internet use and its relationship to income.

37. MBAs. A survey of the entering MBA students at a university in the United States classified the country of origin of the students, as seen in the table.

| Origin | Two-Year MBA | Evening MBA | Total |
|---|---|---|---|
| Asia/Pacific Rim | 31 | 33 | 64 |
| Europe | 5 | 0 | 5 |
| Latin America | 20 | 1 | 21 |
| Middle East/Africa | 5 | 5 | 10 |
| North America | 103 | 65 | 168 |
| Total | 164 | 104 | 268 |

a) What percent of all MBA students were from North America? b) What percent of the Two-Year MBAs were from North America? c) What percent of the Evening MBAs were from North America? d) What is the marginal distribution of origin? e) Obtain the column percentages and show the conditional distributions of origin by MBA Program. f) Do you think that origin of the MBA student is independent of the MBA program? Explain.

38. MBAs, part 2. The same university as in Exercise 37 reported the following data on the gender of their students in their two MBA programs.

| Sex | Two-Year (Type of MBA Program) | Evening (Type of MBA Program) | Total |
|---|---|---|---|
| Men | 116 | 66 | 182 |
| Women | 48 | 38 | 86 |
| Total | 164 | 104 | 268 |

a) What percent of all MBA students are women? b) What percent of Two-Year MBAs are women? c) What percent of Evening MBAs are women? d) Do you see evidence of an association between the Type of MBA program and the percentage of women students? If so, why do you believe this might be true?

39. Top-producing movies. The following table shows the Motion Picture Association of America (MPA; www.mpaa.org) ratings for the top 20 grossing films in the United States for each of the 10 years from 2003 to 2012. (Data are number of films.)

| Year | G | PG | PG-13 | R/NC-17 | Total |
|---|---|---|---|---|---|
| 2012 | 0 | 6 | 12 | 2 | 20 |
| 2011 | 1 | 4 | 11 | 4 | 20 |
| 2010 | 1 | 9 | 9 | 1 | 20 |
| 2009 | 0 | 7 | 12 | 1 | 20 |
| 2008 | 2 | 4 | 10 | 4 | 20 |
| 2007 | 1 | 5 | 11 | 3 | 20 |
| 2006 | 1 | 4 | 13 | 2 | 20 |
| 2005 | 1 | 4 | 13 | 2 | 20 |
| 2004 | 1 | 6 | 10 | 3 | 20 |
| 2003 | 1 | 3 | 11 | 5 | 20 |
| Total | 9 | 52 | 112 | 27 | 200 |
a) What percent of all these top 20 films are G rated? b) What percent of all top 20 films in 2005 were G rated? c) What percent of all top 20 films were PG-13 and came out in 2010? d) What percent of all top 20 films produced in 2008 or later were PG-13? e) What percent of all top 20 films produced from 2003 to 2007 were rated PG-13 or R/NC-17? f) Compare the conditional distributions of the ratings for films produced in 2008 or later to those produced from 2003 to 2007. Write a couple of sentences summarizing what you see.

40. Movie admissions 2011. The following table shows attendance data collected by the Motion Picture Association of America during the period 2009 to 2011. Figures are the number (in millions) of frequent moviegoers in each age group.

Frequent Moviegoers (M), by Age

| Year | 2–11 | 12–17 | 18–24 | 25–39 | 40–49 | 50–59 | 60+ |
|---|---|---|---|---|---|---|---|
| 2011 | 2.5 | 5.7 | 6.6 | 9.7 | 3.3 | 3.1 | 4.1 |
| 2010 | 3.1 | 6.1 | 7.4 | 7.7 | 3.5 | 3.0 | 4.3 |
| 2009 | 2.8 | 5.7 | 6.3 | 6.3 | 4.5 | 2.9 | 3.4 |
a) What percent of all frequent moviegoers during this period were people between the ages of 12 and 24? b) What percent of the frequent moviegoers in 2011 were people between the ages of 12 and 39? c) What percent of all frequent moviegoers during this period were people between the ages of 18 and 24 who went to the movies in 2009? d) What percent of frequent moviegoers in 2010 were people 60 years old and older? e) What percent of all frequent moviegoers in this period were people 60 years old and older who went to the movies in 2010? f) Compare the conditional distributions of the age groups across years. Write a couple of sentences summarizing what you see.

41. Tattoos. A study by the University of Texas Southwestern Medical Center examined 626 people to see if there was an increased risk of contracting hepatitis C associated with having a tattoo. If the subject had a tattoo, researchers asked whether it had been done in a commercial tattoo parlor or elsewhere. Write a brief description of the association between tattooing and hepatitis C, including an appropriate graphical display.

| | Tattoo Done in Commercial Parlor | Tattoo Done Elsewhere | No Tattoo |
|---|---|---|---|
| Has Hepatitis C | 17 | 8 | 18 |
| No Hepatitis C | 35 | 53 | 495 |
42. Poverty and region 2012. In 2012, the following data were reported by the U.S. Census Bureau. The data show the number of people (in thousands) living above and below the poverty line in each of the four regions of the United States. Based on these data do you think there is an association between region and poverty? Explain.

| Region | Below Poverty Level | Above Poverty Level |
|---|---|---|
| Northeast | 12,728 | 86,932 |
| Midwest | 13,055 | 82,459 |
| South | 17,287 | 85,913 |
| West | 17,031 | 93,753 |
43. Being successful. In a random sample of U.S. adults surveyed in December 2011, Pew Research asked how important it is “to you personally” to be successful in a high-paying career or profession. Here is a table reporting the responses. (Percentages may not add to 100% due to rounding.) (Data from www.pewsocialtrends.org/files/2012/04/Women-in-the-Workplace.pdf)

| Response | Women, Age 18–34 | Women, Age 35–64 | Men, Age 18–34 | Men, Age 35–64 |
|---|---|---|---|---|
| One of the most important things | 18% | 7% | 11% | 9% |
| Very important, but not the most | 48% | 35% | 47% | 34% |
| Somewhat important | 26% | 34% | 31% | 37% |
| Not important | 8% | 24% | 10% | 20% |
| Total | 100% | 100% | 100% | 100% |
a) What percent of young women consider it very important or one of the most important things for them personally to be successful? b) How does that compare with young men? c) From this table, can you determine what percent of all women responding felt this way? Explain. d) Write a few sentences describing the association between the sex of young respondents and their attitudes toward the importance of financial or professional success. 44. Minimum wage workers. The U.S. Department of Labor (www.bls.gov) collects data on the number of U.S. workers who are employed at or below the minimum wage. Here is a table showing the number of hourly workers by Age and Sex and the number who were paid at or below the prevailing minimum wage:
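| Age | Hourly Workers (in thousands), Men | Hourly Workers (in thousands), Women | At or Below Minimum Wage (in thousands), Men | At or Below Minimum Wage (in thousands), Women |
|---|---|---|---|---|
| 16–24 | 7978 | 7701 | 384 | 738 |
| 25–34 | 9029 | 7864 | 150 | 332 |
| 35–44 | 7696 | 7783 | 71 | 170 |
| 45–54 | 7365 | 8260 | 68 | 134 |
| 55–64 | 4092 | 4895 | 35 | 72 |
| 65+ | 1174 | 1469 | 22 | 50 |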
a) What percent of the women were ages 16–24? b) Using side-by-side bar graphs, compare the proportions of the men and women who worked at or below minimum wage at each Age group. Write a couple of sentences summarizing what you see.

45. Moviegoers and ethnicity. The Motion Picture Association of America studies the ethnicity of moviegoers to understand changes in the demographics of moviegoers over time. Here are the numbers of moviegoers (in millions) classified as to whether they were Hispanic, African-American, Caucasian, and Other for the year 2010. Also included are the numbers for the general U.S. population and the number of tickets sold.

| | Caucasian | Hispanic | African-American | Other | Total |
|---|---|---|---|---|---|
| Population | 204.6 | 49.6 | 37.2 | 18.6 | 310 |
| Moviegoers | 88.8 | 26.8 | 16.9 | 8.5 | 141 |
| Tickets | 728 | 338 | 143 | 91 | 1300 |
| Total | 1021.4 | 414.4 | 197.1 | 118.1 | 1751 |
a) Compare the conditional distribution of Ethnicity for all three groups: the entire population, moviegoers, and ticket holders. b) Write a brief description of the association between population groups and Ethnicity. 46. Department store. A department store is planning its next advertising campaign. Since different publications are read by different market segments, they would like to know if they should be targeting specific age segments. The results of a marketing survey are summarized in the following table by Age and Shopping Frequency at their store.
| Shopping Frequency | Age Under 30 | Age 30–49 | Age 50 and Over | Total |
|---|---|---|---|---|
| Low | 27 | 37 | 31 | 95 |
| Moderate | 48 | 91 | 93 | 232 |
| High | 23 | 51 | 73 | 147 |
| Total | 98 | 179 | 197 | 474 |

a) Find the marginal distribution of Shopping Frequency. b) Find the conditional distribution of Shopping Frequency within each age group. c) Compare these distributions with a segmented bar graph. d) Write a brief description of the association between Age and Shopping Frequency among these respondents. e) Does this prove that customers ages 50 and over are more likely to shop at this department store? Explain.

47. Success II. Look back at the table in Exercise 43 concerning desires for success and a high-paying career. That table presented only the percentages, but Pew Research reported the numbers of respondents in the major categories:

| Sex | Age | Count |
|---|---|---|
| Women | 18–34 | 610 |
| Women | 35–64 | 605 |
| Men | 18–34 | 703 |
| Men | 35–64 | 571 |

With this additional information you should be able to answer these questions. (Note: Percentages were rounded to whole numbers, so estimated cell counts will have fractions. You need not round estimated cell counts to whole numbers for the purpose of answering these questions.) a) What percentage of 18–34 year olds (both male and female) reported that being successful in a high-paying career or profession was “one of the most important things” to them personally? b) What percentage of 18–34 year olds who said that such success was “one of the most important things” were women? c) Write a few sentences describing how the opinions of young women differ from those of older female respondents.

48. Labor market skills. OECD’s Skills Outlook 2013 distinguishes three crucial labor market skills: numeracy, literacy, and problem-solving. For all three skills, the OECD report scores the proficiency of 16- to 65-year-olds at five different levels, ranging from Level 1 (able to read short, simple texts) to Level 5 (searching and integrating information from multiple, dense texts). The following table contains scores from a subset of countries, including Japan (scoring highest) and Italy (scoring lowest).

Percent Distribution of Adults’ Literacy Skills

| Skills Category | Japan | Finland | Australia | United States | Italy |
|---|---|---|---|---|---|
| Below Level 1 | 1% | 3% | 3% | 4% | 6% |
| Level 1 | 4% | 8% | 9% | 14% | 22% |
| Level 2 | 23% | 27% | 29% | 33% | 42% |
| Level 3 | 49% | 41% | 39% | 34% | 26% |
| Level 4/5 | 23% | 22% | 17% | 12% | 3% |
| No information | 1% | 0% | 2% | 4% | 1% |

a) Would you expect the distribution of literacy skills to be roughly the same over different countries? Why or why not? b) The table shows the percentages of skill levels for each country. Are these row percentages, column percentages, or total percentages? c) Do the data support the OECD’s advice that some countries (amongst them several southern-European ones) would profit from school reforms? Explain.
49. Insurance company, part 2. An insurance company that provides medical insurance is concerned with recent data. They suspect that patients who undergo surgery at large hospitals have their discharges delayed for various reasons, which results in increased medical costs to the insurance company. The recent data for area hospitals and two types of surgery (major and minor) are shown in the following table.

Discharge Delayed

| Procedure | Large Hospital | Small Hospital |
|---|---|---|
| Major Surgery | 120 of 800 | 10 of 50 |
| Minor Surgery | 10 of 200 | 20 of 250 |
a) Overall, for what percent of patients was discharge delayed? b) Were the percentages different for major and minor surgery? c) Overall, what were the discharge delay rates at each hospital? d) What were the delay rates at each hospital for each kind of surgery? e) The insurance company is considering advising their clients to use large hospitals for surgery to avoid postsurgical complications. Do you think they should do this? f) Explain, in your own words, why this confusion occurs.

50. Delivery service. A company must decide which of two delivery services they will contract with. During a recent trial period, they shipped numerous packages with each service and have kept track of how often deliveries did not arrive on time. Here are the data.

| Delivery Service | Type of Service | Number of Deliveries | Number of Late Packages |
|---|---|---|---|
| Pack Rats | Regular | 400 | 12 |
| Pack Rats | Overnight | 100 | 16 |
| Boxes R Us | Regular | 100 | 2 |
| Boxes R Us | Overnight | 400 | 28 |

a) Compare the two services’ overall percentage of late deliveries. b) Based on the results in part a, the company has decided to hire Pack Rats. Do you agree they deliver on time more often? Why or why not? Be specific. c) The results here are an instance of what phenomenon?

51. Graduate admissions. A 1975 article in the magazine Science examined the graduate admissions process at Berkeley for evidence of gender bias. The following table shows the number of applicants accepted to each of four graduate programs.

| Program | Males Accepted (of Applicants) | Females Accepted (of Applicants) |
|---|---|---|
| 1 | 511 of 825 | 89 of 108 |
| 2 | 352 of 560 | 17 of 25 |
| 3 | 137 of 407 | 132 of 375 |
| 4 | 22 of 373 | 24 of 341 |
| Total | 1022 of 2165 | 262 of 849 |

a) What percent of total applicants were admitted? b) Overall, were a higher percentage of males or females admitted? c) Compare the percentage of males and females admitted in each program. d) Which of the comparisons you made do you consider to be the most valid? Why?

52. Simpson’s Paradox. Develop your own table of data that is a business example of Simpson’s Paradox. Explain the conflict between the conclusions made from the conditional and marginal distributions.

Just Checking Answers
1 50.0%
2 40.0%
3 25.0%
4 15.6% Nearsighted, 56.3% Farsighted, 18.8% Need Bifocals
5 18.8% Nearsighted, 62.5% Farsighted, 18.8% Need Bifocals
6 40% of the nearsighted customers are female, while 50% of customers are female.
7 Since nearsighted customers appear less likely to be female, it seems that they may not be independent. (But the numbers are small.)
CHAPTER 3 Displaying and Describing Quantitative Data
AIG
The American International Group (AIG) was once the 18th largest corporation in the world. AIG was founded nearly 100 years ago by Cornelius Vander Starr who opened an insurance agency in Shanghai, China. As the first Westerner to sell insurance to the Chinese, Starr grew his business rapidly until 1949 when Mao Zedong and the People’s Liberation Army took over Shanghai. Starr moved the company to New York City, where it continued to grow, expanding its markets worldwide. In 2004, AIG stock hit an all-time high of $76.77, putting its market value at nearly $300 billion. According to its own website, “By early 2007 AIG had assets of $1 trillion, $110 billion in revenues, 74 million customers and 116,000 employees in 130 countries and jurisdictions. Yet just 18 months later, AIG found itself on the brink of failure and in need of emergency government assistance.”

AIG was one of the largest beneficiaries of the U.S. government’s Troubled Asset Relief Program (TARP), established in 2008 during the financial crisis to purchase assets and equity from financial institutions. TARP was an attempt to strengthen the financial sector and avoid a repeat of a depression as severe as the 1930s. Many banks quickly repaid the government part or all of the money given to them under the TARP program, but AIG, which received $170 billion, took until the end of 2012 to repay the government completely. Even though AIG stock today is on solid financial footing, its stock price is only a fraction (adjusted for splits) of what it was before the 2008 crisis.

Between 2007 and 2009 AIG stock lost more than 99% of its value, hitting $0.35 in early March. That same month AIG became embroiled in controversy when it disclosed that it had paid $218 million in bonuses to employees of its financial services division. AIG’s drop in stock price represented a loss of nearly $300 billion for investors. Portfolio managers typically examine stock prices and volumes to determine stock volatility and to help them decide which stocks to buy and sell. Were there early warning signs in AIG’s data?
Table 3.1 gives the monthly average stock price (in dollars) for the six years leading up to the company’s crisis. Were there clues to warn of problems?

| Year | Jan. | Feb. | Mar. | Apr. | May | June | July | Aug. | Sept. | Oct. | Nov. | Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002 | 77.26 | 72.95 | 73.72 | 71.57 | 68.42 | 65.99 | 61.22 | 64.10 | 58.04 | 60.26 | 65.03 | 59.96 |
| 2003 | 59.74 | 49.57 | 49.41 | 54.38 | 56.52 | 57.88 | 59.80 | 61.51 | 59.39 | 60.93 | 58.73 | 62.37 |
| 2004 | 69.02 | 73.25 | 72.06 | 74.21 | 70.93 | 72.61 | 69.85 | 69.58 | 70.67 | 62.31 | 62.17 | 65.33 |
| 2005 | 66.74 | 68.96 | 61.55 | 51.77 | 53.81 | 55.66 | 60.27 | 60.86 | 60.54 | 62.64 | 67.06 | 66.72 |
| 2006 | 68.33 | 67.02 | 67.15 | 64.29 | 63.14 | 59.74 | 59.40 | 62.00 | 65.25 | 67.02 | 69.86 | 71.35 |
| 2007 | 70.45 | 68.99 | 68.14 | 68.25 | 71.78 | 71.75 | 68.64 | 65.21 | 66.02 | 66.12 | 56.86 | 58.13 |
Table 3.1 Monthly stock price in dollars of AIG stock for the period 2002 through 2007.
It’s hard to tell very much from tables of values like this. You might get a rough idea of how much the stock cost—usually somewhere around $60 or so, but that’s about it.
3.1 Displaying Quantitative Variables

WHO Months
WHAT Monthly average price for AIG’s stock (in dollars)
WHEN 2002 through 2007
WHERE New York Stock Exchange
WHY To examine AIG stock volatility

The first rule of data analysis is to make a picture. AIG’s stock price is a quantitative variable, whose units are dollars, so a bar chart or pie chart won’t work. For quantitative variables, there are no categories. Instead, we usually slice up all the possible values into bins and then count the number of cases that fall into each bin. The bins, together with these counts, give the distribution of the quantitative variable and provide the building blocks for the display of the distribution, called a histogram.
Histograms
Here are the monthly prices of AIG stock displayed in a histogram.

Figure 3.1 Monthly average prices of AIG stock. The histogram displays the distribution of prices by showing for each “bin” of prices, the number of months having prices in that bin. (Axes: Monthly Average Price, $45 to $80; # of Months, 5 to 25.)
A histogram plots the bin counts as the heights of bars. It counts the number of cases that fall into each bin, and displays that count as the height of the corresponding bar. In this histogram of monthly average prices, each bin has a width of $5, so, for example, the height of the tallest bar says that there were 24 months whose average price of AIG stock was between $65 and $70. In this way, the histogram displays the entire distribution of prices. Unlike a bar chart, which puts gaps between bars to separate the categories, there are no gaps between the bars of a histogram unless there are actual gaps in the data. Gaps indicate a region where there are no values. Gaps can be important features of the distribution so watch out for them and point them out. For categorical variables, each category got its own bar. The only choice was whether to combine categories for ease of display. For quantitative variables, we have to choose the width of the bins. It isn’t hard to make a histogram by hand, but we almost always use technology. Many statistics programs allow you to adjust the bin width yourself. From the histogram, we can see that in these months the AIG stock price was typically near $65 and usually between $55 and $75. Keep in mind that this histogram is a static picture. We have treated these prices simply as a collection of months, not as a time series, and shown their distribution. Later in the chapter we will discuss when this is appropriate and add time to the story. Does the distribution look as you expected? It’s often a good idea to imagine what the distribution might look like before making a display. That way you’re less likely to be fooled by errors either in your display or in the data themselves. The vertical axis of a histogram shows the number of cases falling in each bin. An alternative is to report the percentage of cases in each bin, creating a relative frequency histogram. 
The shape of the two histograms is the same; only the vertical axis and labels are different. A relative frequency histogram is faithful to the area principle by displaying the percentage of cases in each bin instead of the count.
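The step from counts to percentages can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the bin counts are read off the AIG histogram and the stem-and-leaf display of the same data (Figure 3.3), and the variable names are our own.

```python
# Bin counts for the AIG monthly prices, read off the histogram and the
# stem-and-leaf display of the same data. Keys are the lower edge of each
# $5-wide bin (an illustrative encoding, not from the text).
counts = {45: 2, 50: 3, 55: 13, 60: 16, 65: 24, 70: 13, 75: 1}

n = sum(counts.values())  # total number of months

# Relative frequency: each bin's share of all cases, as a percentage.
percents = {low: 100 * c / n for low, c in counts.items()}

for low in sorted(counts):
    print(f"${low}-${low + 5}: {counts[low]:2d} months, {percents[low]:5.1f}%")
```

Every count is divided by the same total, so the bars keep their relative heights; only the scale on the vertical axis changes.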
Figure 3.2 A relative frequency histogram looks just like a frequency histogram except that the y-axis now shows the percentage of months in each bin. (Axes: Monthly Average Price, $45 to $80; Percent, 10 to 30.)

For Example
Creating a histogram
As the chief financial officer of a music download site, you’ve just secured the rights to offer downloads of a new album. You’d like to see how well it’s selling, so you collect the number of downloads per hour for the past 24 hours:

| Hour | Downloads | Hour | Downloads |
|---|---|---|---|
| 12:00 a.m. | 36 | 12:00 p.m. | 25 |
| 1:00 a.m. | 28 | 1:00 p.m. | 22 |
| 2:00 a.m. | 19 | 2:00 p.m. | 17 |
| 3:00 a.m. | 10 | 3:00 p.m. | 18 |
| 4:00 a.m. | 5 | 4:00 p.m. | 20 |
| 5:00 a.m. | 3 | 5:00 p.m. | 23 |
| 6:00 a.m. | 2 | 6:00 p.m. | 21 |
| 7:00 a.m. | 6 | 7:00 p.m. | 18 |
| 8:00 a.m. | 12 | 8:00 p.m. | 24 |
| 9:00 a.m. | 14 | 9:00 p.m. | 30 |
| 10:00 a.m. | 20 | 10:00 p.m. | 27 |
| 11:00 a.m. | 18 | 11:00 p.m. | 30 |
Question Make a histogram for this variable.

Answer Create a frequency table of bins of width five from 0 to 40 and put values at the ends of bins into the right bin:

| Downloads | Number of Hours |
|---|---|
| 0–5 | 2 |
| 5–10 | 2 |
| 10–15 | 3 |
| 15–20 | 5 |
| 20–25 | 6 |
| 25–30 | 3 |
| 30–35 | 2 |
| 35–40 | 1 |
| Total | 24 |

The histogram looks like this:

(Histogram. Axes: Downloads per Hour, 0 to 40; # of Hours, 0 to 6.)
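The frequency table above can be reproduced with a short Python sketch. It assumes the same convention as the answer (a value on a bin boundary goes into the bin to its right); the names `downloads`, `BIN_WIDTH`, and `bin_counts` are our own, and the text bars are only a rough stand-in for a drawn histogram.

```python
from collections import Counter

# Downloads per hour, 12:00 a.m. through 11:00 p.m., from the table above.
downloads = [36, 28, 19, 10, 5, 3, 2, 6, 12, 14, 20, 18,
             25, 22, 17, 18, 20, 23, 21, 18, 24, 30, 27, 30]

BIN_WIDTH = 5

# Integer division maps each value to the lower edge of its bin, so a value
# on a boundary (such as 10) lands in the bin that starts there.
bin_counts = Counter((d // BIN_WIDTH) * BIN_WIDTH for d in downloads)

# Print one text bar per bin, including any empty bins.
for low in range(0, 40, BIN_WIDTH):
    count = bin_counts[low]
    print(f"{low:2d}-{low + BIN_WIDTH:<2d} {'#' * count} ({count})")
```

Looping over every bin edge, rather than only over the bins that received values, keeps empty bins visible, which matters because gaps are a feature of the distribution.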
*Stem-and-Leaf Displays
Stem-and-leaf displays are like histograms, but they also show the individual values. They are easy to make by hand for data sets that aren’t too large, so they’re a great way to look at a small batch of values quickly.1 Here’s a stem-and-leaf display for the AIG stock data, alongside a histogram of the same data.

* Sections marked with an asterisk may be optional. Check with your Instructor.
1 The authors like to make stem-and-leaf displays whenever data are presented (without a suitable display) at committee meetings or working groups. The insights from just that quick look at the distribution are often quite valuable.
Figure 3.3 The AIG monthly average stock prices displayed both by a histogram (left) and stem-and-leaf display (right). Stem-and-leaf displays are typically made by hand, so we are most likely to use them for small data sets. For much larger data sets, we use a histogram. (Histogram axes: Monthly Average Price, $45 to $80; # of Months, 5 to 25.)

4 | 99
5 | 134
5 | 5667888999999
6 | 0000011122222344
6 | 555556666777788888889999
7 | 0001111222334
7 | 7
How Stem-And-Leaf Displays Work
A stem-and-leaf display breaks each number into two parts: the stem shown to the left of the solid line and the leaf, to the right. For the AIG data, each price, for example $67.02, is first truncated to two digits, $67. Then it is split into two components: 6 | 7. The line 5 | 134 displays the values $51, $53, and $54 and corresponds to the histogram bin from $50 to $55. The stem-and-leaf in Figure 3.3 uses a bin width of 5. Another choice would be to increase the bin size and put all the prices from $50 to $60 on one line:

5 | 1345667888999999

That would decrease the number of bins to 4, but makes the bin from $60 to $70 too crowded:

4 | 99
5 | 1345667888999999
6 | 0000011122222344555556666777788888889999
7 | 00011112223347

Before making a stem-and-leaf display, or a histogram, you should check the Quantitative Data Condition: The data must be values of a quantitative variable whose units are known. Although a bar chart and a histogram may look similar, they’re not the same display. You can’t display categorical data in a histogram or quantitative data in a bar chart. Always check the condition that confirms what type of data you have before making your display.
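The truncate-and-split procedure described above can be sketched in Python. This is a minimal illustration under our own assumptions: the function name `stem_and_leaf` and its `split` parameter are hypothetical, the prices are the 72 values of Table 3.1, and stems with no values are simply skipped (a hand-drawn display would show them as empty lines so that gaps stay visible).

```python
from collections import defaultdict

# Monthly average AIG prices from Table 3.1, one row per year (2002-2007).
prices = [
    77.26, 72.95, 73.72, 71.57, 68.42, 65.99, 61.22, 64.10, 58.04, 60.26, 65.03, 59.96,
    59.74, 49.57, 49.41, 54.38, 56.52, 57.88, 59.80, 61.51, 59.39, 60.93, 58.73, 62.37,
    69.02, 73.25, 72.06, 74.21, 70.93, 72.61, 69.85, 69.58, 70.67, 62.31, 62.17, 65.33,
    66.74, 68.96, 61.55, 51.77, 53.81, 55.66, 60.27, 60.86, 60.54, 62.64, 67.06, 66.72,
    68.33, 67.02, 67.15, 64.29, 63.14, 59.74, 59.40, 62.00, 65.25, 67.02, 69.86, 71.35,
    70.45, 68.99, 68.14, 68.25, 71.78, 71.75, 68.64, 65.21, 66.02, 66.12, 56.86, 58.13,
]


def stem_and_leaf(values, split=True):
    """Return the display as a list of text lines.

    Each value is truncated to a whole number, then split into a stem
    (tens digit) and a leaf (units digit). With split=True every stem
    gets two lines (leaves 0-4, then 5-9), i.e. a bin width of 5.
    """
    rows = defaultdict(list)
    for v in values:
        whole = int(v)                    # truncate: 67.02 -> 67
        stem, leaf = divmod(whole, 10)    # 67 -> stem 6, leaf 7
        half = leaf // 5 if split else 0  # which half of the stem
        rows[(stem, half)].append(leaf)
    return [f"{stem} | {''.join(str(leaf) for leaf in sorted(leaves))}"
            for (stem, half), leaves in sorted(rows.items())]


for line in stem_and_leaf(prices):
    print(line)
```

Calling `stem_and_leaf(prices, split=False)` gives the four-line version with one line per stem, which shows how crowded the $60 to $70 line becomes.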
3.2 Shape
Once you’ve displayed the distribution in a histogram or stem-and-leaf display, what can you say about it? When you describe a distribution, you should pay attention to three things: its shape, its center, and its spread. We describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values.
Mode
Does the histogram have a single, central hump (or peak) or several, separated humps? These humps are called modes. Formally, the mode is the single, most frequent value, but we rarely use the term that way.2 The AIG stock prices have a single mode around $65. We often use modes to describe the shape of the distribution. A distribution whose histogram has one main hump, such as the one for the AIG stock prices, is called unimodal; distributions whose histograms have two humps are bimodal, and those with three or more are called multimodal. For example, here’s a bimodal distribution.

Figure 3.4 A bimodal distribution has two apparent modes. (Axes: values from 70 to 150; Counts, 5 to 15.)

2 Technically, the mode is the value on the x-axis of the histogram below the highest peak, but when asked to identify the mode, most people would point to the peak itself.

Where’s the Mode?
The mode is typically defined as the single value that appears most often. That definition is fine for categorical variables because we only need to count the number of cases for each category. For quantitative variables, the meaning of mode is more ambiguous. For example, what’s the mode of the AIG data? No two prices were exactly the same, but 7 months had prices between $68 and $69. Should that be the mode? Probably not; that seems a little arbitrary. For quantitative data, it makes more sense to use the word mode in the more general sense of “peak in a histogram,” rather than as a single summary value.

Pie à la Mode
Is there a connection between pie and the mode of a distribution? Actually, there is! The mode of a distribution is a popular value near which a lot of data values gather. And à la mode means “in style,” not “with ice cream.” That just happened to be a popular way to have pie in Paris around 1900.
A bimodal histogram is often an indication that there are two groups in the data. It’s a good idea to investigate when you see bimodality. But don’t get overly excited by minor fluctuations in the histogram, which may just be artifacts of where the bin boundaries fall. To be a true mode, the hump should still be there when you display the histogram with slightly different bin widths. A distribution whose histogram doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform. (Chapter 5 gives a more formal definition.)

Figure 3.5 In a uniform distribution, bars are all about the same height. The histogram doesn’t appear to have a mode. (Axes: values from 0.0 to 1.0; Counts, 20 to 60.)
Symmetry
Could you fold the histogram along a vertical line through the middle and have the edges match pretty closely, as in Figure 3.6, or are more of the values on one side, as in the histograms in Figure 3.7? A distribution is symmetric if the halves on either side of the center look, at least approximately, like mirror images.

Figure 3.6 A symmetric histogram can fold in the middle so that the two sides almost match. (Two panels: “A symmetric histogram …” with an x-axis from –3.0 to 3.0, and “… can fold in the middle so that the two sides almost match,” folded along a dotted line.)

The (usually) thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the distribution is said to be skewed to the side of the longer tail.

Figure 3.7 Two skewed histograms showing the age (left) and hospital charges (right) for all female heart attack patients in New York State in one year. The histogram of Age (in blue) is skewed to the left, while the histogram of Charges (in purple) is skewed to the right. (Axes: Age (yr), 20 to 95, and Charge ($), 7500 to 37500, each against # of Female Cardiac Patients, 200 to 600.)

Skewed Right or Left?
Amounts of things (dollars, employees, waiting times) can’t be negative so they run up against zero. But they have no natural upper limit. So, they often have distributions that are skewed to the right. Grades on a test where most students do well are often skewed to the left with many scoring near the top, but a few straggling off to the low end.
Outliers
Do any values appear to stick out? Often such values tell us something interesting or exciting about the data. You should always point out any stragglers or outliers that stand off away from the body of the distribution. For example, if you’re studying the personal wealth of Americans and Bill Gates is in your sample, he would certainly be an outlier. Because his wealth would be so obviously atypical, you’d want to point it out as a special feature. Outliers can affect almost every method we discuss in this book, so we’ll always be on the lookout for them. An outlier can be the most informative part of your data, or it might just be an error. Either way, you shouldn’t throw it away without comment. Treat it specially and discuss it when you report your conclusions about your data. (Or find the error and fix it if you can.)

Using Your Judgment
How you characterize a distribution is often a judgment call. Do the two humps in the histogram really reveal two subgroups, or will the shape look different if you change the bin width slightly? Are those observations at the high end of the histogram truly unusual, or are they just the largest ones at the end of a long tail? These are matters of judgment on which different people can legitimately disagree. There’s no automatic calculation or rule of thumb that can make the decision for you. Understanding your data and how they arose can help. What should guide your decisions is an honest desire to understand what is happening in the data. That’s what you’ll need to make sound business decisions.

Viewing a histogram at several different bin widths can help you to see how persistent some of the features are. Some technologies offer ways to change the bin width interactively to get multiple views of the histogram. If the number of observations in each bin is small enough so that moving a couple of values to the next bin changes your assessment of how many modes there are, be careful. Be sure to think about the data, where they came from, and what kinds of questions you hope to answer from them.
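One way to look at the same data at several bin widths, without interactive software, is to tabulate the counts for each width. The sketch below reuses the downloads-per-hour data from the earlier example; the helper `bin_counts` and the particular widths are our own choices for illustration, and reading modes off the printed lists remains a judgment call.

```python
# Downloads per hour from the earlier "Creating a histogram" example.
downloads = [36, 28, 19, 10, 5, 3, 2, 6, 12, 14, 20, 18,
             25, 22, 17, 18, 20, 23, 21, 18, 24, 30, 27, 30]


def bin_counts(values, width, start=0, stop=40):
    """Count the values falling in [low, low + width) for each bin
    whose lower edge is start, start + width, ... below stop."""
    return [sum(1 for v in values if low <= v < low + width)
            for low in range(start, stop, width)]


# Compare the shape at several widths that divide the 0-40 range evenly.
for width in (4, 5, 8, 10):
    print(f"width {width:2d}: {bin_counts(downloads, width)}")
```

A hump that appears at one width but vanishes at the next is more likely an artifact of where the bin boundaries fall than a feature of the data.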
For Example
Describing the shape of a distribution
Question Describe the shape of the distribution of downloads from the example on page 79.
Answer It is symmetric and unimodal with no outliers.
3.3 Center
Look again at the AIG prices in Figure 3.1. If you had to pick one number to describe a typical price, what would you pick? When a histogram is unimodal and fairly symmetric, most people would point to the center of the distribution, where the histogram peaks. The typical price is around $65.00. If we want to be more precise and calculate a number, we can average the data. In the AIG example, the average monthly price is $64.48, about what we might expect from the histogram. You probably know how to average values, but this is a good place to introduce notation that we’ll use throughout the book. We’ll call the generic variable x, and use the Greek capital letter sigma, Σ, to mean “sum” (sigma is “S” in Greek), and write3:

$$\bar{x} = \frac{\text{Total}}{n} = \frac{\sum x}{n}.$$

Notation Alert A bar over any symbol indicates the mean of that quantity.
According to this formula, we add up all the values of the variable, x, and divide that sum (Total, or Σx) by the number of data values, n. We call the resulting value the mean of x.4 Although the mean is a natural summary for unimodal, symmetric distributions, it can be misleading for skewed data or for distributions with gaps or outliers.

The histogram of AIG monthly prices in Figure 3.1 is unimodal, and nearly symmetric, with a slight left skew. A look at the total volume of AIG stock sold each month for the same 6 years tells a very different story. Figure 3.8 shows a unimodal but strongly right-skewed distribution with two gaps. The mean monthly volume was 170.1 million shares. Locate that value on the histogram. Does it seem a little high as a summary of a typical month’s volume? In fact, more than two out of three months have volumes that are less than that value. It might be better to use the median, the value that splits the histogram into two equal areas. The median is commonly used for variables such as cost or income, which are likely to be skewed.

[Figure 3.8: histogram of Total Monthly Volume (horizontal axis) against # of Months (vertical axis), with the balancing point marked.]
Figure 3.8 The median splits the area of the histogram in half at 135.9 million shares. The mean is the point on which the histogram would balance. Because the distribution is skewed to the right, the mean, 170.1 million shares, is higher than the median. The points at the right have pulled the mean toward them, away from the median.
3 You may also see the variable called y and the equation written $\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$. We actually prefer to call a single variable y instead of x, because in the next chapter we’ll need x to name a variable that predicts another (which we’ll call y), but when you have only one variable either name is common. Most calculators call a single variable x.
4 Once you’ve averaged the data, you might logically expect the result to be called the average. But average is used too colloquially as in the “average” home buyer, where we don’t sum up anything. Even though average is sometimes used in the way we intend, as in the Dow Jones Industrial Average (which is actually a weighted average) or a batting average, we’ll usually use the term mean throughout the book.
That’s because the median is resistant to unusual observations and to the shape of the distribution. For the AIG monthly trading volumes, the median is 135.9 million shares, which seems like a more appropriate summary. Does it really make a difference whether we choose a mean or a median? The mean monthly price for the AIG stock is $64.48. Because the distribution of the prices is roughly symmetric, we’d expect the mean and median to be close. In fact, we compute the median to be $65.23. But for variables with skewed distributions, the story is quite different. For a right-skewed distribution like the monthly volumes in Figure 3.8, the mean is larger than the median: 170.1 compared to 135.9. The two give quite different summaries. The difference is due to the overall shape of the distributions.
By Hand
Finding the Median
Finding the median of a batch of n numbers is easy as long as you remember to order the values first. If n is odd, the median is the middle value. Counting in from the ends, we find this value in the (n + 1)/2 position. When n is even, there are two middle values. So, in this case, the median is the average of the two values in positions n/2 and n/2 + 1.
Here are two examples: Suppose the batch has the values 14.1, 3.2, 25.3, 2.8, -17.5, 13.9, and 45.8. First we order the values: -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, and 45.8. There are 7 values, so the median is the (7 + 1)/2 = 4th value counting from the top or bottom: 13.9.
Suppose we had the same batch with another value at 35.7. Then the ordered values are -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, and 45.8. The median is the average of the 8/2, or 4th, and the (8/2) + 1, or 5th, values. So the median is (13.9 + 14.1)/2 = 14.0.

The mean is the point at which the histogram would balance. A value far from the center has more leverage, pulling the mean in its direction. It’s hard to argue that a summary that’s been pulled aside by only a few outlying values or by a long tail is what we mean by the center of the distribution. That’s why the median is usually a better choice for skewed data. However, when the distribution is unimodal and symmetric, the mean offers better opportunities to calculate useful quantities and draw interesting conclusions. It will be the summary value we work with throughout the rest of the book.
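The by-hand rule for the median can also be written as a few lines of code. The sketch below is only an illustration in Python; the function name median_by_hand is our own choice and does not come from any particular package. It sorts the batch and then picks the middle position or positions, exactly as described in the By Hand box.

```python
def median_by_hand(values):
    """Median via the by-hand rule: order the values, then take the middle."""
    ordered = sorted(values)  # always order the values first
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty batch is undefined")
    if n % 2 == 1:
        # n odd: the value in position (n + 1)/2, counting from 1
        return ordered[(n - 1) // 2]
    # n even: average the values in positions n/2 and n/2 + 1, counting from 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


batch = [14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8]
print(median_by_hand(batch))                    # 13.9, the 4th of 7 ordered values
print(f"{median_by_hand(batch + [35.7]):.1f}")  # 14.0, the average of 13.9 and 14.1
```

Most statistics packages provide a median function already, so in practice a sketch like this is useful mainly for checking that you understand the rule.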
For Example
Finding the mean and median
Question From the data on page 80, what is a typical number of downloads per hour?
Answer The mean number is 18.7 downloads per hour. The median is 19.5 downloads per hour. Because the distribution is unimodal and roughly symmetric, we shouldn’t be surprised that the two are close. There are a few more hours (in the middle of the night) with small numbers of downloads that pull the mean lower than the median, but either one seems like a reasonable summary to report.
3.4 Spread of the Distribution
We know that the typical price of the AIG stock is around $65, but knowing the mean or median alone doesn’t tell us about the entire distribution. A stock whose price doesn’t move away from its center isn’t very interesting.5 The more the data vary, the less a measure of center can tell us. We need to know how spread out the data are as well.

One simple measure of spread is the range, defined as the difference between the extremes: Range = max - min. For the AIG price data, the range is $77.26 - $49.41 = $27.85. Notice that the range is a single number that describes the spread of the data, not an interval of values, as you might think from its use in common speech. If there are any unusual observations in the data, the range is not resistant and will be influenced by them.

Concentrating on the middle of the data avoids this problem. The lower quartile, Q1, is defined as the value for which one quarter of the data lie below it and the upper quartile, Q3, is the value for which one quarter of the data lie above it. In this way, the quartiles frame the middle 50% of the data. The interquartile range (IQR) summarizes the spread by focusing on the middle half of the data. It’s defined as the difference between the two quartiles: IQR = Q3 - Q1.
By Hand
Finding Quartiles
Quartiles are easy to find in theory, but more difficult in practice. The three quartiles, Q1 (lower quartile), Q2 (the median), and Q3 (the upper quartile) split the sorted data values into quarters. So, for example, 25% of the data values will lie at or below Q1. The problem lies in the fact that unless your sample size divides nicely by 4, there isn’t just one way to split the data into quarters. The statistical software package SAS offers at least five different ways to compute quartiles. The differences are usually small, but can be annoying. Here are two of the most common methods for finding quartiles by hand or with a calculator:
1. The Tukey Method Split the sorted data at the median. (If n is odd, include the median with each half.) Then find the median of each of these halves and use these as the quartiles. Example: The data set {14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8}. First we order the values: {-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8}. We found the median to be 13.9, so form two data sets: {-17.5, 2.8, 3.2, 13.9} and {13.9, 14.1, 25.3, 45.8}. The medians of these are 3.0 = (2.8 + 3.2)/2 and 19.7 = (14.1 + 25.3)/2. So we let Q1 = 3.0 and Q3 = 19.7.
2. The TI calculator method The same as the Tukey method, except we don’t include the median with each half. So for {14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8} we find the two data sets {-17.5, 2.8, 3.2} and {14.1, 25.3, 45.8} by not including the median in either. Now the medians of these are Q1 = 2.8 and Q3 = 25.3.
Notice the effect on the IQR. For Tukey: IQR = Q3 - Q1 = 19.7 - 3.0 = 16.7, but for TI, IQR = 25.3 - 2.8 = 22.5. For both of these methods, notice that the quartiles are either data values, or the average of two adjacent values. In Excel and other software, the quartiles are interpolated, so they may not be simple averages of two values. Be aware that there may be differences, but the idea is the same: the quartiles Q1, Q2, and Q3 split the data roughly into quarters.

5 And not much of an investment, either.
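To see how the two by-hand methods differ, it can help to write them side by side. The following Python sketch is one possible way to do that; the function quartiles and its include_median switch are names we made up for illustration, and the code handles only the two methods described in the By Hand box, not the interpolating methods used by Excel and other software.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values, include_median=True):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    include_median=True  : Tukey method (when n is odd, the median joins both halves)
    include_median=False : TI calculator method (the median joins neither half)
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = ordered[:half + 1], ordered[half:]
    elif n % 2 == 1:
        lower, upper = ordered[:half], ordered[half + 1:]
    else:
        # n even: the data split cleanly, so both methods agree
        lower, upper = ordered[:half], ordered[half:]
    return _median(lower), _median(upper)


batch = [14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8]
for label, flag in [("Tukey", True), ("TI", False)]:
    q1, q3 = quartiles(batch, include_median=flag)
    print(f"{label}: Q1 = {q1:.1f}, Q3 = {q3:.1f}, IQR = {q3 - q1:.1f}")
# Tukey: Q1 = 3.0, Q3 = 19.7, IQR = 16.7
# TI: Q1 = 2.8, Q3 = 25.3, IQR = 22.5
```

The only difference between the methods is whether the median is shared by the two halves, which is why they agree whenever n is even.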
Waiting in Line Why do banks favor a single line that feeds several teller windows rather than separate lines for each teller? It does make the average waiting time slightly shorter, but that improvement is very small. The real difference people notice is that the time you can expect to wait is less variable when there is a single line, and people prefer consistency.
For the AIG data, Q1 = $60.11, Q3 = $69.01.6 So the IQR = Q3 - Q1 = $69.01 - $60.11 = $8.90.

The IQR is a reasonable summary of spread, but because it uses only the two quartiles of the data, it ignores much of the information about how individual values vary. By contrast, the standard deviation takes into account how far each value is from the mean. Like the mean, the standard deviation is appropriate only for symmetric data and can be influenced by outlying observations.

As the name implies, the standard deviation uses the deviations of each data value from the mean. The average7 of the squared deviations is called the variance and is denoted by s^2:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}.$$

The variance plays an important role in statistics, but as a measure of spread, it has a problem. Whatever the units of the original data, the variance is in squared units. We want measures of spread to have the same units as the data, so we usually take the square root of the variance. That gives the standard deviation.

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}.$$

For the AIG stock prices, s = $6.12.
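The two formulas translate almost word for word into code. The Python sketch below is an illustration only: the function names are ours, and the input is the small batch from the by-hand median example, reused here simply because the AIG prices themselves are not listed in the text.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to divide by n - 1")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the variance, so the result is back in the data's units."""
    return math.sqrt(sample_variance(values))


# Illustrative input only: the batch from the by-hand median example
batch = [14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8]
print(f"variance = {sample_variance(batch):.2f} (squared units)")
print(f"standard deviation = {sample_sd(batch):.2f} (original units)")
```

Note that the divisor is n - 1, not n, matching the formulas above; software that divides by n instead will report a slightly smaller value.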
For Example
Describing the spread
Question For the data on page 80, describe the spread of the number of downloads per hour.
Answer The range of downloads is 36 - 2 = 34 downloads per hour. The quartiles are 13 and 24.5, so the IQR is 24.5 - 13 = 11.5 downloads per hour. The standard deviation is 8.94 downloads per hour.
6 In general, we use the Tukey method in this book unless stated otherwise.
7 For technical reasons, we divide by n - 1 instead of n to take this average. We’ll discuss this more in Chapter 11.
Just Checking
Thinking about Variation
1 The U.S. Census Bureau reports the median family income in its summary of census data. Why do you suppose they use the median instead of the mean? What might be the disadvantages of reporting the mean?
2 You’ve just bought a new car that claims to get a highway fuel efficiency of 31 miles per gallon (mpg). Of course, your mileage will “vary.” If you had to guess, would you expect the IQR of gas mileage attained by all cars like yours to be 30 mpg, 3 mpg, or 0.3 mpg? Why?
3 A company selling a new MP3 player advertises that the player has a mean lifetime of 5 years. If you were in charge of quality control at the factory, would you prefer that the standard deviation of life spans of the players you produce be 2 years or 2 months? Why?

3.5 Shape, Center, and Spread: A Summary
What should you report about a quantitative variable? Report the shape of its distribution, and include a center and a spread. But which measure of center and which measure of spread? The guidelines are pretty easy.
• If the shape is skewed, point that out and report the median and IQR. You may want to include the mean and standard deviation as well, explaining why the mean and median differ. The fact that the mean and median do not agree is a sign that the distribution may be skewed. A histogram will help you make the point.
• If the shape is unimodal and symmetric, report the mean and standard deviation and possibly the median and IQR as well. For unimodal symmetric data, the IQR is usually a bit larger than the standard deviation. If that’s not true for your data set, look again to make sure the distribution isn’t skewed or multimodal and that there are no outliers.
• If there are multiple modes, try to understand why. If you can identify a reason for separate modes, it may be a good idea to split the data into separate groups.
• If there are any clearly unusual observations, point them out. If you are reporting the mean and standard deviation, report them computed with and without the unusual observations. The differences may be revealing.
• Always pair the median with the IQR and the mean with the standard deviation. It’s not useful to report a measure of center without a corresponding measure of spread. Reporting a center without a spread can lead you to think you know more about the distribution than you do. Reporting only the spread omits important information.
For Example
Summarizing data
Question Report on the shape, center, and spread of the downloads data; see page 80.
Answer The distribution of downloads per hour over the past 24 hours is unimodal and roughly symmetric. The mean number of downloads per hour is 18.7 and the standard deviation is 8.94. There are several hours in the middle of the night with very few downloads, but none seem to be so unusual as to be considered outliers.
3.6 Standardizing Variables
A real estate agent in California covers two markets. One is near Stanford University, filled with old homes and tree-lined streets in an area called Old Palo Alto. The other is a newer neighborhood, in Foster City, created when part of the San Francisco Bay was filled in to create space for more housing. Here are summaries of the prices of a sample of houses in these two neighborhoods as found on Zillow.com:

[Figure 3.9: two histograms, Old Palo Alto (left) and Foster City (right), each showing Price ($000) on the horizontal axis against Frequency on the vertical axis.]
Figure 3.9 Prices of houses from samples from Old Palo Alto (left) and Foster City (right). Note that the horizontal scale is quite different for the two neighborhoods. Prices are in $1000’s.
The average house in Old Palo Alto was worth $1,930,436 with a standard deviation of $914,523, while the average Foster City house cost $711,400 with a standard deviation of $318,177. So, a $1,000,000 home in Foster City is on the expensive side, but for Old Palo Alto, that’s just over half the average cost. Which would be more unusual, a $2M home in Foster City, or a $3.5M home in Old Palo Alto? Using the standard deviation as a way to measure distance helps us answer this question. In Foster City, a $2M home is $1,288,600 over the average. In Old Palo Alto, a $3.5M home is $1,569,564 over its average. It might seem to be more unusual. But look at the standard deviations. That excess of $1,288,600 is over 4 standard deviations above the mean for Foster City. But in Old Palo Alto, the standard deviation is nearly $1M, so the $3.5M home is “only” 1.72 standard deviations above the mean. If you look at the histogram, you can see that $2M (2000) is off the right side for Foster City, but for Old Palo Alto, $3.5M (3500) is still in the histogram. Using the mean and standard deviation this way gives us a way to standardize values in different distributions to compare them.
How Does Standardizing Work?
We first need to find the mean and standard deviation of each variable for the prices in each town.

                 Mean ($000)   SD ($000)
Old Palo Alto    1930.4        914.5
Foster City      711.40        318.2

Next we measure how far each of our values is from the mean of its variable. We subtract the mean and then divide by the standard deviation:

$$z = \frac{x - \bar{x}}{s}$$
We call the resulting value a standardized value and denote it with the letter z. Usually, we just call it a z-score. The z-score tells us how many standard deviations a value is from its mean.
Let’s look at Old Palo Alto first. To compute the z-score for the $3.5M house, take its value (3500 in $000 units), subtract the mean (1930.4) and divide by the standard deviation, 914.5:

$$z = \frac{3500 - 1930.4}{914.5} = 1.72$$

So this house’s price is 1.72 standard deviations above the mean price of all the houses we sampled in Old Palo Alto. How about that $2M home in Foster City? Standardizing it for the mean and standard deviation of Foster City prices, we find

$$z = \frac{2000 - 711.4}{318.2} = 4.05$$

So this house’s price is over 4 standard deviations above the mean for its location! Standardizing enables us to compare values from different distributions to see which is more unusual in context.

Standardizing into z-Scores:
• Shifts the mean to 0.
• Changes the standard deviation to 1.
• Does not change the shape.
• Removes the units.
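The standardizing calculation is short enough to check in code. Here is a minimal Python sketch; the function name z_score is our own, and the means and standard deviations are the summary values (in $000) quoted in the text rather than values computed from the raw Zillow data.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Summary values in $000, as quoted in the text
palo_alto = z_score(3500, mean=1930.4, sd=914.5)
foster_city = z_score(2000, mean=711.4, sd=318.2)

print(f"Old Palo Alto, $3.5M home: z = {palo_alto:.2f}")   # z = 1.72
print(f"Foster City, $2M home:     z = {foster_city:.2f}")  # z = 4.05
```

Because a z-score has no units, the two results can be compared directly: the Foster City home is the more unusual of the two for its neighborhood.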
For Example
Comparing values by standardizing
Question A real estate analyst finds from data on 350 recent sales that the average price was $175,000 with a standard deviation of $55,000. The size of the houses (in square feet) averaged 2100 sq. ft. with a standard deviation of 650 sq. ft. Which is more unusual, a house in this town that costs $340,000, or a 5000 sq. ft. house?

Answer Compute the z-scores to compare. For the $340,000 house:

$$z = \frac{x - \bar{x}}{s} = \frac{340{,}000 - 175{,}000}{55{,}000} = 3.0$$

The house price is 3 standard deviations above the mean. For the 5000 sq. ft. house:

$$z = \frac{x - \bar{x}}{s} = \frac{5000 - 2100}{650} = 4.46$$

This house is 4.46 standard deviations above the mean in size. That’s more unusual than the house that costs $340,000.
3.7 Five-Number Summary and Boxplots
The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum). The five-number summary of the monthly trading volumes of AIG stock for the period 2002–2007 looks like this (in millions of shares).

Max      515.62
Q3       182.32
Median   135.87
Q1       121.04
Min      83.91

Table 3.2 The five-number summary of monthly trading volume of AIG shares (in millions of shares) for the period 2002–2007.
[Figure 3.10: boxplot with Monthly Volumes (in millions of shares) on the vertical axis; high outliers are marked individually.]
Figure 3.10 Boxplot of monthly volumes of AIG stock traded in the period 2002–2007 (in millions of shares).
The five-number summary provides a good overall look at the distribution. We can see that in half of the months the volume was between 121.04 and 182.32 million shares, and that it was never above 515.62 or below 83.91 million shares.

We can display the information from a five-number summary in a boxplot (see Figure 3.10). A boxplot highlights several features of the distribution of a variable. The central box shows the middle half of the data, between the quartiles. Because the top of the box is at the upper quartile (Q3) and the bottom is at Q1, the height of the box is equal to Q3 - Q1, which is the IQR. (For the AIG data, it’s 61.28.) The median is displayed as a horizontal line. If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If it is not centered, the distribution is skewed. In extreme cases, the median can coincide with one of the quartiles.

The whiskers reach out from the box to the most extreme values that are not considered outliers. The boxplot nominates points as outliers if they fall farther than 1.5 IQRs beyond either quartile (for the AIG data, 1.5 IQR = 1.5 × 61.28 = 91.92). Outliers are displayed individually, both to keep them out of the way for judging skewness and to encourage you to give them special attention. They may be mistakes or they may be the most interesting cases in your data. This rule is not a definition of what makes a point an outlier. It just nominates cases for special attention. And it is not a substitute for careful analysis and thought about whether an extreme value deserves to be treated specially.

Boxplots are especially useful for comparing several distributions side by side. From the shape of the box in Figure 3.10, we can see that the central part of the distribution of volume is skewed to the right (upward here) and the dissimilar length of the two whiskers shows that the skewness continues into the tails of the distribution. We also see several high-volume and some extremely high-volume months. Those months may warrant some inquiries into why trading volume was so high.
Why Use 1.5 IQRs for Nominating Outliers?
Nominate a point as a potential outlier if it lies farther than 1.5 IQRs beyond either the lower (Q1) or upper (Q3) quartile. Some boxplots also designate points as “far” outliers if they lie more than 3 IQRs from the quartiles (as in Figure 3.10). The prominent statistician John W. Tukey, the originator of the boxplot, was asked (by one of the authors) why the outlier nomination rule cut at 1.5 IQRs beyond each quartile. He answered that the reason was simple: 1 IQR would be too small and 2 IQRs would be too large.
For Example
The boxplot rule for nominating outliers
Question From the histogram on page 80, we saw that no download times seemed to be so far from the center as to be considered outliers. Use the 1.5 IQR rule to see if it nominates any points as outliers.

Answer The quartiles are 13 and 24.5 and the IQR is 11.5, and 1.5 × IQR = 17.25. A value would have to be larger than 24.5 + 17.25 = 41.75 downloads per hour or smaller than 13 - 17.25 = -4.25. The largest value was 36 downloads per hour and all values must be nonnegative, so there are no points nominated as outliers.
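The 1.5 IQR nomination rule can be sketched in a few lines of Python. The function names below are our own, and the quartiles are passed in rather than computed, because (as noted earlier) different methods give slightly different quartiles. Keep in mind that the code only nominates points; deciding what to do with them still takes judgment.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs lying k IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def nominate_outliers(values, q1, q3, k=1.5):
    """List the values that fall outside the fences."""
    lower, upper = outlier_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Downloads example: quartiles 13 and 24.5; the extremes were 2 and 36
print(outlier_fences(13, 24.5))               # (-4.25, 41.75)
print(nominate_outliers([2, 36], 13, 24.5))   # [] since neither extreme is nominated

# AIG monthly volumes: Q1 = 121.04, Q3 = 182.32; extremes 83.91 and 515.62
print(nominate_outliers([83.91, 515.62], 121.04, 182.32))  # [515.62]
```

Setting k=3 in the same functions gives the cutoffs some boxplots use for “far” outliers.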
Guided Example
Credit Card Bank Customers
To focus on the needs of particular customers, companies often segment their customers into groups with similar needs or spending patterns. A major credit card bank wanted to see how much a particular group of cardholders charged per month on their cards to understand the potential growth in their card use. The data for each customer was the amount he or she spent using the card during a recent three-month period. Boxplots are especially useful for one variable when combined with a histogram and numerical summaries. Let’s summarize the spending of this market segment.

Plan
Setup Identify the variable, the time frame of the data, and the objective of the analysis.
We want to summarize the average monthly charges (in dollars) made by 500 cardholders from a market segment of interest during a three-month period. The data are quantitative, so we’ll use histograms and boxplots, as well as numerical summaries.

Do
Mechanics Select an appropriate display based on the nature of the data and what you want to know about it.
It is always a good idea to anticipate the shape of the distribution so you can check whether the histogram is close to what you expected. Are these data reasonable amounts for customers to charge on their cards in a month? A typical value is a few hundred dollars. That seems like the right ballpark.
Note that outliers are often easier to see with boxplots than with histograms, but the histogram provides more details about the shape of the distribution. The computer program that made this boxplot “jitters” the outliers in the boxplot so they don’t lie on top of each other, making them easier to see.

[Figure: boxplot and histogram of Charges ($) on the horizontal axis, with # of Cardholders on the vertical axis of the histogram.]

Both graphs show a distribution that is highly skewed to the right with several outliers and an extreme outlier near $7000.

Summary of Monthly Charges
Count    500
Mean     544.749
Median   370.65
StdDev   661.244
IQR      624.125
Q1       114.54
Q3       738.665

The mean is much larger than the median. The data have a skewed distribution.
Report
Interpretation Describe the shape, center, and spread of the distribution. Be sure to report on the symmetry, number of modes, and any gaps or outliers.
Recommendation State a conclusion and any recommended actions or analysis.

Memo
Re: Report on segment spending
The distribution of charges for this segment during this time period is unimodal and skewed to the right. For that reason, we have summarized the data with the median and interquartile range (IQR). The median amount charged was $370.65. Half of the cardholders charged between $114.54 and $738.67. In addition, there are several high outliers, with one extreme value at $6745. There are also a few negative values. We suspect that these are people who returned more than they charged in a month, but because the values might be data errors, we suggest that they be checked. Future analyses should look at whether charges during these three months were similar to charges in the rest of the year. We would also like to investigate if there is a seasonal pattern and, if so, whether it can be explained by our advertising campaigns or by other factors.

3.8 Comparing Groups
Stock prices can sometimes reflect turmoil within a company. Could an investor have seen signs of trouble in the AIG stock prices? Figure 3.11 shows the daily closing prices for the first two years of our data, 2002 and 2003:

[Figure 3.11: two histograms on the same scale, Daily Closing Prices 2002 and Daily Closing Prices 2003 on the horizontal axes against # of Days on the vertical axes.]
Figure 3.11 Daily closing prices of AIG on the NYSE for the two years 2002 and 2003. How do the two distributions differ?
Prices were generally lower in 2003 than 2002. The price distribution for 2002 appears to be symmetric with a center in the high $60s while the 2003 distribution is left skewed with a center below $60. For comparison, we displayed the two histograms on the same scale. Histograms with very different centers and spreads can appear similar unless you do that. When we compare several groups, boxplots usually do a better job. Boxplots offer an ideal balance of information and simplicity, hiding the details while displaying the overall summary information. And we can plot them side by side, making it easy to compare multiple groups or categories.
When we place boxplots side by side, we can compare their centers and spreads. We can see past any outliers in making these comparisons because the outliers are displayed individually. We can also begin to look for trends in both the centers and the spreads.
Guided Example
AIG Stock Price
What really happened to the AIG stock price from the beginning of the period we’ve been studying through the financial crisis of 2008/2009? We will use the daily closing prices of AIG stock for these eight years.

Plan
Setup Identify the variables, report the time frame of the data, and state the objective.
We want to compare the daily prices of AIG shares traded on the NYSE from 2002 through 2009. The daily price is quantitative and measured in dollars. We can partition the values by year and use side-by-side boxplots to compare the daily prices across years.

Do
Mechanics Plot the side-by-side boxplots of the data.

[Figure: side-by-side boxplots of Daily Closing Prices (vertical axis) by Year, 2002 through 2009 (horizontal axis).]

Display any other plots suggested by the analysis so far.
What happened in 2008? We’d better look there with a finer partition. Here are boxplots by month for 2008.

[Figure: side-by-side boxplots of Daily Closing Prices (vertical axis) by Month, 1 through 12, for 2008 (horizontal axis).]

Report
Conclusion Report what you’ve learned about the data and any recommended action or analysis.
Memo Re: Research on price of AIG stock We have examined the daily closing prices of AIG stock on the NYSE for the period 2002 through 2009. Prices were relatively stable for the period 2002 through 2007. Prices were a bit lower in 2003 but recovered and stayed generally above $60 for 2004 through 2007. Then throughout the first 9 months of 2008, prices dropped dramatically, and remained low throughout 2009. A boxplot by month during 2008 shows that the decline in price was sharpest in September. Most analysts point to that month as the beginning of the financial meltdown, but clearly there were signs in the price of AIG that trouble had been brewing for much longer. By October, and for the rest of the year, the price was very low with almost no variation.
For Example
Comparing boxplots
Question For the data on page 80, compare the a.m. downloads to the p.m. downloads by displaying the two distributions side-by-side with boxplots.

[Figure: side-by-side boxplots of Downloads per hour (vertical axis) for the a.m. and p.m. hours.]

Answer There are generally more downloads in the afternoon than in the morning. The median number of afternoon downloads is around 22 as compared with 14 for the morning hours. The p.m. downloads are also much more consistent. The entire range of the p.m. hours, 15, is about the size of the IQR for a.m. hours. Both distributions appear to be fairly symmetric, although the a.m. hour distribution has some high points which seem to give some asymmetry.
3.9 Identifying Outliers
We’ve just seen that the price of AIG shares dropped precipitously during the year 2008. Figure 3.12 shows the daily sales volume by month.

[Figure 3.12: side-by-side boxplots of Daily Volume (in millions of shares) on the vertical axis by Month, 1 through 12, on the horizontal axis.]
Figure 3.12 In January, there was a high-volume day of 38 million shares that is nominated as an outlier for that month. In February there were three outliers with a maximum of over 100 million shares. In most months one or more high-volume days are identified as outliers for their month. But none of these high-volume days would have been considered unusual during September, when the median daily volume of AIG stock was 170 million shares. Days that may have seemed ordinary for September if placed in another month would have seemed extraordinary and vice versa. That high-volume day in January certainly wouldn’t stand out in September or even October or November, but for January it was remarkable.

Cases that stand out from the rest of the data deserve our attention. Boxplots have a rule for nominating extreme cases to display as outliers, but that’s just a rule of thumb, not a definition. The rule doesn’t tell you what to do with them. It’s never a substitute for careful thinking about the data and their context.

So, what should we do with outliers? The first thing to do is to try to understand them in the context of the data. Once you’ve identified likely outliers, you should always investigate them. Some outliers are not plausible and may simply be errors. A decimal point may have been misplaced, digits transposed, digits repeated or omitted, or the wrong value transcribed. Or, the units may be wrong. If you saw the number of AIG shares traded on the NYSE listed as 2 shares for a particular day, you’d know something was wrong. It could be that it was meant as 2 million shares, but you’d have to check to be sure. If you can identify the error, then you should certainly correct it.

Other outliers are not wrong; they’re just different. These are the cases that often repay your efforts to understand them. You may learn more from the extraordinary cases than from summaries of the overall dataset.

What about those two days in September that stand out as extreme even during that volatile month? Those were September 15 and 16, 2008. On the 15th, 740 million shares of AIG stock were traded. That was followed by an incredible volume of over 1 billion shares of stock from a single company traded the following day. Here’s how Barron’s described the trading of September 16:
Record Volume for NYSE Stocks, Nasdaq Trades Surge Beats Its July Record Yesterday’s record-setting volume of 8.14 billion shares traded of all stocks listed on the New York Stock Exchange was pushed aside today by 9.31 billion shares in NYSE Composite volume. The biggest among those trades was the buying and selling of American International Group, with 1.11 billion shares traded as of 4 p.m. today. The AIG trades were 12% of all NYSE Composite volume.
For Example
Identifying outliers and summarizing data
Question A real estate report lists the following prices for sales of single family homes in a small town in Virginia (rounded to the nearest thousand). Write a couple of sentences describing house prices in this town.
155,000 139,000 158,000 149,000
329,000 178,000 194,000 160,000
172,000 339,435,000 279,000 231,000
122,000 136,000 167,000 136,000
260,000 330,000 159,000 128,000
Answer A box plot shows an extreme outlier:
[Boxplot of Price in $M; vertical axis from 50 to 350.]
That extreme point is a home whose sale price is listed at $339.4 M. A check on the Internet shows that the most expensive homes ever sold are less than $200 M. This is clearly a mistake. Setting aside this point, we find the following histogram and summary statistics:
[Histogram of Price (horizontal axis, 0 to 300000) with Frequency on the vertical axis (2 to 8).]
The distribution of prices is strongly skewed to the right. The median price is $160,000. The minimum is $122,000 and the maximum (without the outlier) is $330,000. The middle 50% of house prices lie between $144,000 and $212,500 with an IQR of $68,500. (Calculated using the Tukey method.)
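The summary statistics in this answer can be checked with a few lines of code. What follows is a minimal sketch in Python (our choice of language; the text does not prescribe one). It assumes the Tukey convention in which, when the count is odd, the median is included in both halves before the quartiles are found. The helper name tukey_quartiles and the $200 M cutoff used to set aside the erroneous sale are illustrative choices, not part of the original example.

```python
from statistics import median

# Sale prices from the example, in dollars.
prices = [
    155_000, 139_000, 158_000, 149_000,
    329_000, 178_000, 194_000, 160_000,
    172_000, 339_435_000, 279_000, 231_000,
    122_000, 136_000, 167_000, 136_000,
    260_000, 330_000, 159_000, 128_000,
]


def tukey_quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When the count is odd, the overall median belongs to both halves
    (the convention assumed in this sketch).
    """
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2
    return median(ordered[:half]), median(ordered[-half:])


# Set aside the implausible sale: no home has sold for $200 M or more.
cleaned = [p for p in prices if p < 200_000_000]

q1, q3 = tukey_quartiles(cleaned)
summary = {
    "min": min(cleaned),
    "Q1": q1,
    "median": median(cleaned),
    "Q3": q3,
    "max": max(cleaned),
    "IQR": q3 - q1,
}
print(summary)
```

Other quartile conventions (for example, those used by default in some statistics packages) can give slightly different values for Q1 and Q3 on the same data, which is why the answer names the method it used.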
3.10 Time Series Plots
A histogram can provide information about the distribution of a variable, but it can’t show any pattern over time. Whenever we have time series data, it is a good idea to look for patterns by plotting the data in time order. The histogram we saw in the beginning of the chapter (Figure 3.1) was an appropriate display for the distribution of prices because during that period from 2002 to 2007 the monthly prices were fairly stable. When a time series has no strong trend or change in variability we say that it is stationary.8 A histogram can provide a useful summary of a stationary series.
When your data are measured over time, you should look for patterns by plotting the data in time order. For example, when we examine the daily prices for the year 2007 (Figure 3.13), we see that prices started to change during the last quarter. A display of values against time is called a time series plot. This plot reveals a pattern that we were unable to see in either a histogram or a boxplot. Now we can see that although the price rallied in the spring of 2007, after July there were already signs that the price might not stay above $60. By October, that pattern was clear.
8 Sometimes we separate the properties and say the series is stationary with respect to the mean (if there is no trend) or stationary with respect to the variance (if the spread doesn’t change). But unless otherwise noted, we’ll assume that all the statistical properties of a stationary series are constant over time.
Figure 3.13 A time series plot of daily closing Price of AIG stock for the year 2007 shows the overall pattern and changes in variation.
[Time series plot of Closing Price (vertical axis, 55 to 70) against date in 2007 (horizontal axis, 01/01 to 01/01 of the following year).]
Time series plots often show a great deal of point-to-point variation, as Figure 3.13 does, so you’ll often see time series plots drawn with all the points connected (as in Figure 3.14), especially in financial publications.
Figure 3.14 The Daily Prices of Figure 3.13, drawn with lines connecting all the points. Sometimes this can help us see an underlying pattern.
[Time series plot of Price (vertical axis, 50 to 70) against Date (horizontal axis, 01/2007 to 12/2007), with the points connected by lines.]
Often it is better to try to smooth out the local point-to-point variability. After all, we usually want to see past this variation to understand any underlying trend and think about how the values vary around that trend—the time series version of center and spread. There are many ways for computers to find a smooth trace through a time series plot. A smooth trace can highlight long-term patterns and help us see them through the more local variation. Figure 3.15 shows the daily prices of Figures 3.13 and 3.14 with a typical smoothing function, available in many statistics programs. With the smooth trace, it’s a bit easier to see a pattern. The trace helps our eye follow the main trend and alerts us to points that don’t fit the overall pattern. It is always tempting to try to extend what we see in a timeplot into the future. Sometimes that makes sense. Most likely, the NYSE volume follows some regular patterns throughout the year. It’s probably safe to predict more volume on triple witching days (when contracts expire) and less activity in the week between Christmas and New Year’s Day.
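To make the idea of a smooth trace concrete, here is a minimal sketch in Python. The text does not say which smoother produced Figure 3.15; a centered moving average is used here only as one simple stand-in, and statistics programs typically offer more refined methods. The function name moving_average, the window size, and the price values are all hypothetical and chosen for illustration; they are not the AIG data.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical daily closing prices, invented for illustration.
prices = [68.0, 69.5, 67.0, 70.0, 68.5, 66.0, 67.5, 64.0, 65.5, 62.0]
trace = moving_average(prices, window=3)

# Print each raw value next to its smoothed value.
for raw, smooth in zip(prices, trace):
    print(f"{raw:6.2f}  {smooth:6.2f}")
```

A wider window gives a smoother trace but responds more slowly to genuine changes in level, so the choice of window is a judgment call rather than a fixed rule.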
Figure 3.15 The 2007 Daily Prices of Figure 3.13, with a smooth trace added to help your eye see the long-term pattern.
[Time series plot of Price (vertical axis, 50 to 70) against Date (horizontal axis, 01/2007 to 12/2007), with a smooth trace.]
Other patterns are riskier to extend into the future. If a stock’s price has been rising, how long will it continue to go up? No stock has ever increased in value indefinitely, and no stock analyst has consistently been able to forecast when a stock’s value will turn around. Stock prices, unemployment rates, and other economic, social, or psychological measures are much harder to predict than physical quantities. The path a ball will follow when thrown from a certain height at a given speed and direction is well-understood. The path that interest rates will take is much less clear. Unless you have strong (nonstatistical) reasons for doing otherwise, you should resist the temptation to think that any trend you see will continue indefinitely. Statistical models often tempt those who use them to think beyond the data. We’ll pay close attention later in this book to understanding when, how, and how much we can justify doing that. Look at the prices in Figures 3.13 through 3.15 and try to guess what happened in the subsequent months. Was that drop from October to December a sign of trouble ahead, or was the increase in December back to around $60 where the stock had comfortably traded for several years a sign that stability had returned to AIG’s stock price? Perhaps those who picked up the stock for $51 in early November really got a bargain. Let’s look ahead to 2008:
Figure 3.16 A time series plot of daily AIG Price in 2008 shows a general decline followed by a sharp collapse in September.
[Time series plot of Price (vertical axis, 10 to 60) against Date (horizontal axis, 01/2008 to 12/2008).]
Even through the spring of 2008, although the price was gently falling, nothing prepared traders following only the time series plot for what was to follow. In September the stock lost nearly all of its value. But, by 2012, it was trading in the mid $30’s.
For Example
Plotting time series data
Question The download times from the example on page 79 are a time series. Plot the data by hour of the day and describe any patterns you see.
Answer For this day, downloads were highest at midnight with about 36 downloads per hour then dropped sharply until about 5–6 a.m. when they reached their minimum at 2–3 per hour. They gradually increased to about 20 per hour by noon, and then stayed in the twenties until midnight, with a slight increase during the evening hours. When we ignored the time order, as we did earlier, we missed this pattern entirely.
[Time series plot of Downloads (vertical axis, 5 to 40) by Hour (horizontal axis, 12.00 a.m. to 11.00 p.m.).]
The histogram we saw in the beginning of the chapter (Figure 3.1) summarized the distribution of prices fairly well because during that period the prices were fairly stable; the price series appears to be stationary. However, when the time series is not stationary, as was the case for AIG prices after 2007, be careful. A histogram is unlikely to capture what is really of interest. In that case, a time series plot is the best graphical display of the behavior of the data.
*3.11 Transforming Skewed Data
When a distribution is skewed, it may not be appropriate to summarize the data simply with a center and spread, and it can be hard to decide whether the most extreme values are outliers or just part of the stretched-out tail. How can we say anything useful about such data? The secret is to apply a simple function to each data value. One function that can change the shape of a distribution is the logarithm function. Let’s examine an example in which a set of data is severely skewed. In 1980, the average CEO made about 42 times the average worker’s salary. In the two decades that followed, CEO compensation soared when compared with
the average worker’s pay; by 2000, the multiple had jumped to 525.9 What does the distribution of the largest 500 companies’ CEOs look like? Figure 3.17 shows a boxplot and a histogram of the CEO compensation from a recent year.
Figure 3.17 The total compensation for CEOs (in $M) of the 500 largest companies is skewed and includes some extraordinarily large values.
[Boxplot and histogram: # of CEOs (vertical axis, 0 to 350) against CEO compensation ($M) (horizontal axis, 20 to 100).]
These values are reported in millions of dollars. The boxplot indicates that some of the 500 CEOs received extraordinarily high compensation. The first bin of the histogram, containing more than half the CEOs, covers the range $0 to $10,000,000. The reason that the histogram seems to leave so much of the area blank is that the largest observations are so far from the bulk of the data, as we can see from the boxplot. Both the histogram and boxplot make it clear that this distribution is very skewed to the right.
Total compensation for CEOs consists of their base salaries, bonuses, and extra compensation, usually in the form of stock or stock options. Data that add together several variables, such as the compensation data, can easily have skewed distributions. It’s often a good idea to separate the component variables and examine them individually, but we don’t have that information for the CEOs.
Skewed distributions are difficult to summarize. It’s hard to know what we mean by the “center” of a skewed distribution, so it’s not obvious what value to use to summarize the distribution. What would you say was a typical CEO total compensation? The mean value in 2011 was $9,027,780, while the median was “only” $5,955,000. Each tells something different about how the data are distributed.
One way to make a skewed distribution more symmetric is to re-express, or transform, the data by applying a simple function to all the data values. It is common to take the logarithm of variables like income, corporate earnings, and prices, which tend to be skewed to the right. Economists do this as a matter of course in many of their models.
Dealing with Logarithms
You may think of logarithms as something technical, but they are just a function that can make some values easier to work with. You have probably already seen logarithmic scales in decibels, Richter scale values, pH values, and others. You may not have realized that logs had been used. Base 10 logs are the easiest to understand, but natural logs are often used as well. (Either one is fine.) You can think of the base 10 log of a number as roughly one less than the number of digits you need to write that number. So 100, which is the smallest number to require 3 digits, has a log10 of 2. And 1000 has a log10 of 3. The log10 of 500 is between 2 and 3, but you’d need a calculator to find that it’s approximately 2.7. All salaries of “six figures” have log10 between 5 and 6. Fortunately, with technology, it is easy to re-express data by logs.
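The digit-counting rule of thumb is easy to try out. The short Python sketch below (Python is our choice for illustration, not something the text requires) prints each value, the number of digits needed to write it, and its base 10 log. The list of values is arbitrary; 250,000 is included as a hypothetical “six figure” salary.

```python
import math

# 100, 500, and 1000 come from the discussion; 250_000 is a made-up six-figure salary.
values = [100, 500, 1000, 250_000]

for value in values:
    digits = len(str(value))
    # The whole-number part of log10 is one less than the digit count.
    print(value, digits, round(math.log10(value), 2))
```

The whole-number part of each log is one less than the digit count, and the fractional part shows how far the value lies toward the next power of 10.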
9 Sources: United for a Fair Economy, Business Week annual CEO pay surveys, Bureau of Labor Statistics, “Average Weekly Earnings of Production Workers, Total Private Sector.” Series ID: EEU00500004.
The histogram of the logs of the total CEO compensations in Figure 3.18 is nearly symmetric, so we can say that a typical log compensation is between 6.0 and 7.0, which means that it lies between $1 million and $10 million. To be more precise, the mean log10 value is 6.73, while the median is 6.67 (that’s $5,370,318 and $4,677,351, respectively). Note that nearly all the values are between 6.0 and 8.0; in other words, between $1,000,000 and $100,000,000 per year. Logarithmic transformations are common, but other transformations like square root and reciprocal are also used. Because computers and calculators are available to do the calculating, you should consider transformation as a helpful tool whenever you have skewed data.
Figure 3.18 Taking logs makes the histogram of CEO total compensation nearly symmetric.
[Histogram: # of CEOs (vertical axis, 25 to 125) against Logarithm (base 10) of CEO salary (horizontal axis, 4.5 to 8).]
For Example
Transforming skewed data
Question Fortune magazine publishes a list of the 100 best companies to work for (money.cnn.com/magazines/fortune/bestcompanies/2010/). One statistic often looked at is the average annual pay for the most common job title at the company. Can we characterize those pay values? Here is a histogram of the average annual pay values and a histogram of the logarithm of the pay values. Which would provide the better basis for summarizing pay?
[Two histograms: Pay (horizontal axis, 35000 to 285000) and Lpay, the logarithm of pay (horizontal axis, 4.5 to 5.5).]
Answer The pay values are skewed to the high end. The logarithmic transformation makes the distribution more nearly symmetric, making it more appropriate to summarize with a mean and standard deviation.
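A summary computed on the log scale can be translated back into the original units by raising 10 to that power. The following Python sketch illustrates the round trip under stated assumptions: the pay values are hypothetical, invented only to produce a right-skewed example (they are not the Fortune data), while the last line applies the same back-transformation to the CEO summaries of 6.73 and 6.67 quoted earlier in this section.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed pay values in dollars, invented for illustration.
pay = [38_000, 42_000, 45_000, 51_000, 58_000,
       66_000, 79_000, 98_000, 140_000, 260_000]
log_pay = [math.log10(p) for p in pay]

# On the raw scale, the long right tail pulls the mean above the median.
print(mean(pay), median(pay))

# Summaries on the log scale.
print(round(mean(log_pay), 3), round(median(log_pay), 3))

# Undo the transformation with 10**x to report in dollars again.
back_mean = 10 ** mean(log_pay)
back_median = 10 ** median(log_pay)
print(round(back_mean), round(back_median))

# The same back-transformation applied to the CEO log summaries in the text.
print(round(10 ** 6.73), round(10 ** 6.67))
```

Note that the back-transformed mean of the logs is not the same as the mean of the original values; it is a different measure of center that is less affected by the long right tail.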
What Can Go Wrong?
A data display should tell a story about the data. To do that it must speak in a clear language, making plain what variable is displayed, what any axis shows, and what the values of the data are. And it must be consistent in those decisions. The task of summarizing a quantitative variable requires that we follow a set of rules. We need to watch out for certain features of the data that make summarizing them with a number dangerous. Here’s some advice:
• Don’t make a histogram of a categorical variable. Just because the variable contains numbers doesn’t mean it’s quantitative. Here’s a histogram of the insurance policy numbers of some workers. It’s not very informative because the policy numbers are categorical. A histogram or stem-and-leaf display of a categorical variable makes no sense. A bar chart or pie chart may do better.
Figure 3.19 It’s not appropriate to display categorical data like policy numbers with a histogram.
[Histogram: # of Policies (vertical axis, 1000 to 4000) against Policy Number (horizontal axis, 10000 to 90000).]
• Choose a scale appropriate to the data. Computer programs usually do a pretty good job of choosing histogram bin widths. Often, there’s an easy way to adjust the width, sometimes interactively. Figure 3.20 shows the AIG price histogram with two other choices for the bin size. Neither seems to be the best choice.
Figure 3.20 Changing the bin width changes how the histogram looks. The AIG stock prices look very different with these two choices.
[Two histograms of Frequency against Price with different bin widths: one with Price from 50 to 75 and Frequency from 0 to 7, the other with Price from 40 to 80 and Frequency from 0 to 40.]
• Avoid inconsistent scales. Parts of displays should be mutually consistent; it’s not fair to change scales in the middle or to plot two variables on different scales on the same display. When comparing two groups, be sure to draw them on the same scale.
• Label clearly. Variables should be identified clearly and axes labeled so a reader knows what the plot displays.
Here’s a remarkable example of a plot gone wrong. It illustrated a news story about rising college costs. It uses time series plots, but it gives a misleading impression. First, think about the story you’re being told by this display. Then try to figure out what has gone wrong.
What’s wrong? Just about everything.
• The horizontal scales are inconsistent. Both lines show trends over time, but for what years? The tuition sequence starts in 1965, but rankings are graphed from 1989. Plotting them on the same (invisible) scale makes it seem that they’re for the same years.
• The vertical axis isn’t labeled. That hides the fact that it’s using two different scales. Does it graph dollars (of tuition) or ranking (of Cornell University)?
This display violates every rule we can think of. And it’s even worse than that. It violates a rule that we didn’t even consider. The two inconsistent scales for the vertical axis don’t point in the same direction! The line for Cornell’s rank shows that it has “plummeted” from 15th place to 6th place in academic rank. Most of us think that’s an improvement, but that’s not the message of this graph.
• Do a reality check. Don’t let the computer (or calculator) do your thinking for you. Make sure the calculated summaries make sense. For example, does the mean look like it is in the center of the histogram? Think about the spread. An IQR of 50 mpg would clearly be wrong for a family car. And no measure of spread can be negative. The standard deviation can take the value 0, but only in the very unusual case that all the data values equal the same number. If you see the IQR or standard deviation equal to 0, it’s probably a sign that something’s wrong with the data.
• Don’t compute numerical summaries of a categorical variable. The mean ZIP code or the standard deviation of Social Security numbers is not meaningful. If the variable is categorical, you should instead report summaries such as percentages. It is easy to make this mistake when you let technology do the summaries for you. After all, the computer doesn’t care what the numbers mean.
• Watch out for multiple modes. If the distribution (as seen in a histogram, for example) has multiple modes, consider separating the data into groups. If you cannot separate the data in a meaningful way, you should not summarize the center and spread of the variable.
• Beware of outliers. If the data have outliers but are otherwise unimodal, consider holding the outliers out of the further calculations and reporting them individually. If you can find a simple reason for the outlier (for instance, a data transcription error), you should remove or correct it. If you cannot do either of these, then choose the median and IQR to summarize the center and spread.
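The advice about outliers can be seen in numbers by reusing the house prices from the earlier For Example, where one sale was mistakenly listed at $339,435,000. The Python sketch below is only an illustration: it compares the mean, median, and standard deviation with and without that value, using functions from Python’s standard statistics module. The variable names are our own.

```python
from statistics import mean, median, stdev

# House prices from the earlier example, including the erroneous $339,435,000 entry.
prices = [
    155_000, 139_000, 158_000, 149_000,
    329_000, 178_000, 194_000, 160_000,
    172_000, 339_435_000, 279_000, 231_000,
    122_000, 136_000, 167_000, 136_000,
    260_000, 330_000, 159_000, 128_000,
]
without = [p for p in prices if p != 339_435_000]

# Compare the summaries with and without the outlier.
for label, data in (("with outlier", prices), ("without outlier", without)):
    print(label, round(mean(data)), median(data), round(stdev(data)))
```

With the erroneous value included, the mean is larger than every genuine sale price in the list, while the median moves only slightly. That contrast is the reason for preferring the median and IQR when an outlier cannot be corrected or set aside.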
Ethics in Action
Beth Tully owns Zenna’s Café, an independent coffee shop located in a small Midwestern city. Since opening Zenna’s in 2002, she has been steadily growing her business and now distributes her custom coffee blends to a number of regional restaurants and markets. She operates a microroaster that offers specialty grade Arabica coffees recognized by some as the best in the area. In addition to providing the highest quality coffees, Beth also wants her business to be socially responsible. Toward that end, she pays fair prices to coffee farmers and donates funds to help charitable causes in Panama, Costa Rica, and Guatemala. In addition, she encourages her employees to get involved in the local community. Recently, one of the well-known multinational coffeehouse chains announced plans to locate shops in her area. This chain is one of the few to offer Certified Free-Trade coffee products and work toward social justice in the global community. Consequently, Beth thought it might be a good idea for her to begin communicating Zenna’s socially-responsible efforts to the public, but with an emphasis on their commitment to the local community. Three months ago she began collecting data on the number of volunteer hours donated by her employees per week. She has a total of 12 employees, of whom 10 are full time. Most employees volunteered less than 2 hours per week, but Beth noticed that one part-time employee volunteered more than 20 hours per week. She discovered that her employees collectively volunteered an average of 15 hours per month (with a median of 8 hours). She planned to report the average number and believed most people would be impressed with Zenna’s level of commitment to the local community.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
What Have We Learned?
Learning Objectives
Make and interpret histograms to display the distribution of a variable.
• We understand distributions in terms of their shape, center, and spread.
Describe the shape of a distribution.
• A symmetric distribution has roughly the same shape reflected around the center.
• A skewed distribution extends farther on one side than on the other.
• A unimodal distribution has a single major hump or mode; a bimodal distribution has two; multimodal distributions have more.
• Outliers are values that lie far from the rest of the data.
Compute the mean and median of a distribution, and know when it is best to use each to summarize the center.
• The mean is the sum of the values divided by the count. It is a suitable summary for unimodal, symmetric distributions.
• The median is the middle value; half the values are above and half are below the median. It is a better summary when the distribution is skewed or has outliers.
Compute the standard deviation and interquartile range (IQR), and know when it is best to use each to summarize the spread.
• The standard deviation is the square root of the average squared difference between each data value and the mean. It is the summary of choice for the spread of unimodal, symmetric variables.
• The IQR is the difference between the quartiles. It is often a better summary of spread for skewed distributions or data with outliers.
Standardize values and use them for comparisons of otherwise disparate variables.
• We standardize by finding z-scores. To convert a data value to its z-score, subtract the mean and divide by the standard deviation.
• z-scores have no units, so they can be compared to z-scores of other variables.
• The idea of measuring the distance of a value from the mean in terms of standard deviations is a basic concept in Statistics and will return many times later in the course.
Find a five-number summary and, using it, make a boxplot. Use the boxplot’s outlier nomination rule to identify cases that may deserve special attention.
• A five-number summary consists of the median, the quartiles, and the extremes of the data.
• A boxplot shows the quartiles as the upper and lower ends of a central box, the median as a line across the box, and “whiskers” that extend to the most extreme values that are not nominated as outliers.
• Boxplots display separately any case that is more than 1.5 IQRs beyond each quartile. These cases should be considered as possible outliers.
Use boxplots to compare distributions.
• Boxplots facilitate comparisons of several groups. It is easy to compare centers (medians) and spreads (IQRs).
• Because boxplots show possible outliers separately, any outliers don’t affect comparisons.
Make and interpret time plots for time series data.
• Look for the trend and any changes in the spread of the data over time.
Terms
Bin
In a histogram, the range of possible values is split into intervals called bins, over which the frequencies are displayed.
Bimodal
Distributions with two modes.
Boxplot
A boxplot displays the 5-number summary as a central box with whiskers that extend to the nonoutlying values. Boxplots are particularly effective for comparing groups.
Center
The middle of the distribution, usually summarized numerically by the mean or the median.
Distribution
The distribution of a variable gives:
• possible values of the variable
• frequency or relative frequency of each value or range of values
Five-number summary
A five-number summary for a variable consists of:
• The minimum and maximum
• The quartiles Q1 and Q3
• The median
Gap
A region of a distribution where there are no values.
Histogram (relative frequency histogram)
A histogram uses adjacent bars to show the distribution of values in a quantitative variable. Each bar represents the frequency (relative frequency) of values falling in an interval of values.
Interquartile range (IQR)
The difference between the lower and upper quartiles. IQR = Q3 - Q1.
Mean
A measure of center found as $\bar{x} = \frac{\sum x}{n}$.
Median
The middle value with half of the data above it and half below it.
Mode
A peak or local high point in the shape of the distribution of a variable. The apparent location of modes can change as the scale of a histogram is changed.
Multimodal
Distributions with more than two modes.
Outliers
Extreme values that don’t appear to belong with the rest of the data. They may be unusual values that deserve further investigation or just mistakes; there’s no obvious way to tell.
Quartile
The lower quartile (Q1) is the value with a quarter of the data below it. The upper quartile (Q3) has a quarter of the data above it. The median and quartiles divide the data into four equal parts.
Range
The difference between the lowest and highest values in a data set: Range = max - min.
*Re-express or transform
To re-express or transform data, take the logarithm, square root, reciprocal, or some other mathematical operation on all values of the data set. Re-expression can make the distribution of a variable more nearly symmetric and the spread of groups more nearly alike.
Shape
The visual appearance of the distribution. To describe the shape, look for:
• single vs. multiple modes
• symmetry vs. skewness
Skewed
A distribution is skewed if one tail stretches out farther than the other.
Spread
The description of how tightly clustered the distribution is around its center. Measures of spread include the IQR and the standard deviation.
Standard deviation
A measure of spread found as $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.
Standardized value
We standardize a value by subtracting the mean and dividing by the standard deviation for the variable. These values, called z-scores, have no units.
Stationary
A time series is said to be stationary if its statistical properties don’t change over time.
*Stem-and-leaf display
A stem-and-leaf display shows quantitative data values in a way that sketches the distribution of the data. It’s best described in detail by example like the one on page 81.
Symmetric
A distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other.
Tail
The tails of a distribution are the parts that typically trail off on either side.
Time series plot
A time series plot displays the values of a time series plotted against time. Often, successive values are connected with lines to show trends more clearly.
Uniform
A distribution that’s roughly flat is said to be uniform.
Unimodal
Having one mode. This is a useful term for describing the shape of a histogram when it’s generally mound-shaped.
Variance
The standard deviation squared.
z-Score
A standardized value that tells how many standard deviations a value is from the mean; z-scores have a mean of 0 and a standard deviation of 1.
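The definitions of the mean, standard deviation, and z-score translate directly into code. The Python sketch below writes each out by hand, following the definitions (sum of values over n; square root of the summed squared deviations over n - 1; value minus mean, over standard deviation), and compares the result with Python’s built-in statistics functions. The function names and the five data values are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def standard_deviation(values):
    """Sample standard deviation: sqrt(sum((x - x_bar)^2) / (n - 1))."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def z_scores(values):
    """Standardize each value: subtract the mean, divide by the standard deviation."""
    x_bar = sum(values) / len(values)
    s = standard_deviation(values)
    return [(x - x_bar) / s for x in values]


# Hypothetical data, invented for illustration.
data = [12, 15, 9, 20, 14]
z = z_scores(data)

print(standard_deviation(data), stdev(data))  # hand-written vs. built-in
print([round(v, 2) for v in z])
print(round(mean(z), 10), round(stdev(z), 10))  # mean and SD of the z-scores
```

The last line checks the property stated in the z-Score entry: once standardized, the values have a mean of 0 and a standard deviation of 1, whatever the original units were.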
Technology Help: D isplaying and Summarizing Quantitative Variables
Almost any program that displays data can make a histogram, but some will do a better job of determining where the bars should start and how they should partition the span of the data (see the figure below).
It is usually easy to read the results and identify each computed summary statistic. You should be able to read the summary statistics produced by any computer package.
Many statistics packages offer a prepackaged collection of s ummary measures. The result might look like this:
Packages often provide many more summary statistics than you need. Of course, some of these may not be appropriate when the data are skewed or have outliers. It is your responsibility to check a histogram or stem-and-leaf display and decide which summary statistics to use.
Variable: Weight N = 234 Mean = 143.3 Median = 139 St. Dev = 11.1 IQR = 14
Alternatively, a package might make a table for several variables and summary measures: Variable Weight Height Score
N 234 234 234
mean 143.3 68.3 86
median 139 68.1 88
stdev 11.1 4.3 9
IQR 14 5 5
The vertical scale may be counts or proportions. Sometimes it isn't clear which. But the shape of the histogram is the same either way.
The axis should be clearly labeled so you can tell what "pile" each bar represents. You should be able to tell the lower and upper bounds of each bar.
To make a histogram in Excel 2010 or 2013, use the Data Analysis add-in. If you have not installed that, you must do that first: • On the File tab, click Options, and then click Add-Ins. • Near the bottom of the Excel Options dialog box, select Excel addins in the Manage box, and then click Go. • In the Add-Ins dialog box, select the check box for Analysis ToolPak, and then click OK. • If Excel displays a message that states it can’t run this add-in and prompts you to install it, click Yes to install the add-in. • From Data, select the Data Analysis add-in. • From its menu, select Histograms.
M03_SHAR8696_03_SE_C03.indd 108
Displays and summaries of quantitative variables are among the simplest things you can do in most statistics packages.
Most packages choose the number of bars for you automatically. Often you can adjust that choice.
28.0 29.0 30.0 31.0 32.0 33.0 34.0 35.0 Run Times
Excel
To make a histogram,
It is common for packages to report summary statistics to many decimal places of “accuracy.” Of course, it is rare to find data that have such accuracy in the original measurements. The ability to calculate to six or seven digits beyond the decimal point doesn’t mean that those digits have any meaning. Generally, it’s a good idea to round these values, allowing perhaps one more digit of precision than was given in the original data.
• Indicate the range of the data whose histogram you wish to draw. • Indicate the bin ranges that are up to and including the right end points of each bin. • Check Labels if your columns have names in the first cell. • Check Chart output and click OK. • Right-click on any bar of the resulting graph and, from the menu that drops down, select Format Data Series … • In the dialog box that opens, select Series Options from the sidebar. • Slide the Gap Width slider to No Gap, and click Close. • In the pivot table on the left, use your pointing tool to slide the bottom of the table up to get rid of the “more” bin. • Edit the bin names in Column A to properly identify the contents of each bin.
Technology Help • In the Distribution dialog, drag the name of the variable that you wish to analyze into the empty window beside the label “Y,Columns.” • Click OK. JMP computes standard summary statistics along with displays of the variables. To make boxplots:
• You can right click on the legend or axis names to edit or remove them.
• Following these instructions, you can reproduce Figure 3.1 using the data set AIG stock series.
Alternatively, you can set up your own bin boundaries and count the observations falling within each bin using an Excel function such as FREQUENCY (Data array, Bins array). Consult your Excel manual or help files for details of how to do this.
To create a time series plot in Excel:
• Open a time-series data file sorted in ascending order by time.
• Highlight the column(s) holding the values of the quantitative variable(s) measured over a period of time.
• Choose Insert + Charts + Line + 2-D Line.
• Click the chart and choose Select Data from the Chart Tools + Design menu.
• Choose Select Data Source + Edit under Horizontal (Category) Axis Labels and select the time data.
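The alternative described above for Excel histograms, setting your own bin boundaries and counting the observations that fall in each bin, can also be sketched outside Excel. The Python sketch below is only an illustration: the function name `count_by_bins` and the data values are hypothetical, and it assumes the convention stated in the histogram steps, namely that each bin runs up to and including its right end point, with one extra count for values above the last boundary (the “more” bin).

```python
from bisect import bisect_left


def count_by_bins(values, upper_bounds):
    """Count how many values fall in each bin.

    Each bin holds values greater than the previous upper bound and up to
    and including its own upper bound. The final extra count collects
    values above the last bound (the "more" bin).
    """
    bounds = sorted(upper_bounds)
    counts = [0] * (len(bounds) + 1)
    for v in values:
        # bisect_left gives the index of the first bound >= v,
        # so a value equal to a bound lands in that bound's bin.
        counts[bisect_left(bounds, v)] += 1
    return counts


# Hypothetical data, for illustration only.
run_times = [28.4, 29.0, 29.6, 30.0, 30.2, 31.5, 33.1, 35.2]
bin_tops = [29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

counts = count_by_bins(run_times, bin_tops)
for top, n in zip(bin_tops, counts):
    print(f"up to {top}: {n}")
print(f"more: {counts[-1]}")
```

The counts can then be drawn as bars of equal width, one per bin, to form the histogram.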
• Choose Fit Y By X. Assign a continuous response variable to Y, Response and a nominal group variable holding the group names to X, Factor, and click OK. JMP will offer (among other things) dotplots of the data. Click the red triangle and, under Display Options, select Box Plots. Note: If the variables are of the wrong type, the display options might not offer boxplots.
Alternatively,
• Choose Graph + Graph Builder. Drag the quantitative variable to Y and the categorical variable to X. Right click on the points and change Points to Box Plots.
To make a time series plot in JMP:
• From the Analyze menu, choose Fit Y by X.
• Move the y variable (measured over time) into the Y, Response box.
• Move the x variable (time) into the X, Factor box.
• Press OK.
• To connect the points, select Fit Each Value from the red triangle next to Bivariate Fit.
• To put a smooth through the points, select either Kernel Smoother or Fit Spline under the red triangle.
Comments
For either the Kernel Smoother or Spline, a slider bar appears to adjust the amount of smoothing.
Alternatively,
• Select Graph + Graph Builder.
• Drag the y variable into the Y window and the x variable (time) into the X window.
• The default shows the points and a smoother.
• Right click on the graph and select Smoother + Change to + Line to connect the points instead.
Minitab To make a histogram:
XLStat To make a box plot: • Choose Visualizing data, and then select Univariate plots. • Enter the cell range of your data in the Quantitative or Qualitative data field. • Select the type of chart on the Charts tab. Note: XLStat scales side-by-side boxplots individually, so they are not suitable for comparing groups.
JMP To make a histogram and find summary statistics:
• Choose Histogram from the Graph menu.
• Select Simple for the type of graph and click OK.
• Enter the name of the quantitative variable you wish to display in the box labeled “Graph variables.” Click OK.
To make a boxplot:
• Choose Boxplot from the Graph menu and specify your data format.
To calculate summary statistics:
• Choose Basic Statistics from the Stat menu. From the Basic Statistics submenu, choose Display Descriptive Statistics.
• Assign variables from the variable list box to the Variables box. MINITAB makes a Descriptive Statistics table.
• Choose Distribution from the Analyze menu.
CHAPTER 3 Displaying and Describing Quantitative Data
SPSS
To make a histogram or boxplot in SPSS, open the Chart Builder from the Graphs menu.
• Click the Gallery tab.
• Choose Histogram or Boxplot from the list of chart types.
• Drag the icon of the plot you want onto the canvas.
• Drag a scale variable to the y-axis drop zone.
• Click OK.
To make side-by-side boxplots, drag a categorical variable to the x-axis drop zone and click OK.
To calculate summary statistics:
• Choose Explore from the Descriptive Statistics submenu of the Analyze menu. In the Explore dialog, assign one or more variables from the source list to the Dependent List and click the OK button.
Brief Case
Detecting the Housing Bubble
The S&P/Case-Shiller Home Price Indices provide measures of the U.S. residential housing market. They track changes in the value of residential real estate nationally and in 20 metropolitan regions. (Some of these indices are actually traded on the Chicago Mercantile Exchange.) The dataset Case-Shiller gives the monthly index values for each of the 20 cities tracked by the Case-Shiller index and two national composite series. Examine these values and write a report on them. Some suggestions: First consider the Composite.20 series, which combines (seasonally adjusted) data for the 20 cities. Describe the distribution of prices overall, then look at a time series plot and discuss the trend over time, especially the period from 2007 to 2008. Then select several cities to compare. For example, you might compare Miami, Boston, and Detroit. Write a report discussing how trends in housing prices changed over time and how these changes differed from city to city.
Socio-Economic Data on States
The dataset States contains various educational and economic measures of the 50 U.S. states and the District of Columbia. Examine the variables, commenting on their shape, center, and spread and any unusual features. If you see any unusual cases, set them aside if appropriate, comment on why you took them out, and redo the analysis without them.
Exercises Section 3.1 1. As part of the marketing team at an Internet music site, you want to understand who your customers are. You send out a survey to 25 customers (you use an incentive of $50 worth of downloads to guarantee a high response rate) asking for demographic information. One of the variables is the customer’s age. For the 25 customers the ages are: 20 30 38 25 35
32 30 22 22 42
34 14 44 32 44
29 29 48 35 44
30 11 26 32 48
a) Make a histogram of the data using a bar width of 10 years. b) Make a histogram of the data using a bar width of 5 years. c) Make a relative frequency histogram of the data using a bar width of 5 years. d) *Make a stem-and-leaf plot of the data using 10s as the stems and putting the youngest customers on the top of the plot. 2. As the new manager of a small convenience store, you want to understand the shopping patterns of your customers. You randomly sample 20 purchases from yesterday’s records (all purchases in U.S. dollars):
39.05 37.91 56.95 21.57 75.16
2.73 34.35 81.58 40.83 74.30
32.92 64.48 47.80 38.24 47.54
47.51 51.96 11.72 32.98 65.62
a) Make a histogram of the data using a bar width of $20. b) Make a histogram of the data using a bar width of $10. c) Make a relative frequency histogram of the data using a bar width of $10. d) *Make a stem-and-leaf plot of the data using $10s as the stems, putting the smallest amounts on top and rounding the data to the nearest dollar.
Section 3.2
3. For the histogram you made in Exercise 1a: a) Is the distribution unimodal or multimodal? b) Where is (are) the mode(s)? c) Is the distribution symmetric? d) Are there any outliers?
4. For the histogram you made in Exercise 2a: a) Is the distribution unimodal or multimodal? b) Where is (are) the mode(s)? c) Is the distribution symmetric? d) Are there any outliers?
Section 3.3
5. For the data in Exercise 1: a) Would you expect the mean age to be smaller than, bigger than, or about the same size as the median? Explain. b) Find the mean age. c) Find the median age.
6. For the data in Exercise 2: a) Would you expect the mean purchase to be smaller than, bigger than, or about the same size as the median? Explain. b) Find the mean purchase. c) Find the median purchase.
Section 3.4
7. For the data in Exercise 1: a) Find the quartiles using your calculator. b) Find the quartiles using Tukey’s method (page 86). c) Find the IQR using the quartiles from part b. d) Find the standard deviation.
8. For the data in Exercise 2: a) Find the quartiles using your calculator. b) Find the quartiles using Tukey’s method (page 86). c) Find the IQR using the quartiles from part b. d) Find the standard deviation.
Section 3.5
9. The histogram shows the December charges (in $) for 5000 customers from one marketing segment from a credit card company. (Negative values indicate customers who received more credits than charges during the month.) a) Write a short description of this distribution (shape, center, spread, unusual features). b) Would you expect the mean or the median to be larger? Explain. c) Which would be a more appropriate summary of the center, the mean or the median? Explain.
[Figure: histogram of December Charge; vertical axis: Frequency]
10. Adair Vineyard is a 10-acre vineyard in New Paltz, New York. The winery itself is housed in a 200-year-old historic Dutch barn, with the wine cellar on the first floor and the tasting room and gift shop on the second. Since they are relatively small and considering an expansion, they are curious about how their size compares to that of other vineyards. The histogram shows the sizes (in acres) of 36 wineries in upstate New York. a) Write a short description of this distribution (shape, center, spread, unusual features). b) Would you expect the mean or the median to be larger? Explain. c) Which would be a more appropriate summary of the center, the mean or the median? Explain.
[Figure: histogram of Size (acres); vertical axis: # of Vineyards]
Section 3.6 11. Using the ages from Exercise 1: a) Standardize the minimum and maximum ages using the mean from Exercise 5b and the standard deviation from Exercise 7d. b) Which has the more extreme z-score, the min or the max? c) How old would someone with a z-score of 3 be? 12. Using the purchases from Exercise 2: a) Standardize the minimum and maximum purchase using the mean from Exercise 6b and the standard deviation from Exercise 8d. b) Which has the more extreme z-score, the min or the max? c) How large a purchase would a purchase with a z-score of 3.5 be?
Section 3.7
13. For the data in Exercise 1: a) Draw a boxplot using the quartiles from Exercise 7b. b) Does the boxplot nominate any outliers? c) What age would be considered a high outlier?
14. For the data in Exercise 2: a) Draw a boxplot using the quartiles from Exercise 8b. b) Does the boxplot nominate any outliers? c) What purchase amount would be considered a high outlier?
15. Here are summary statistics for the sizes (in acres) of upstate New York vineyards from Exercise 10.
Variable  N   Mean   StDev  Minimum  Q1     Median  Q3  Maximum
Acres     36  46.50  47.76  6        18.50  33.50   55  250
a) From the summary statistics, would you describe this distribution as symmetric or skewed? Explain. b) From the summary statistics, are there any outliers? Explain. c) Using these summary statistics, sketch a boxplot. What additional information would you need to complete the boxplot?
16. Local airport authorities were asked what percentage of flights arrive on time. Use the summary statistics given to answer these questions.
% on time
Count       480
Mean        69
Median      74
St. Dev     15
Min         60
Max         88
Range       45
25th %tile  60
75th %tile  75
a) Comment on the distribution of the data. b) Are there any outliers? Explain. c) Create a boxplot of these data.
Section 3.8
17. The survey from Exercise 1 had also asked the customers to say whether they were male or female. Here are the data:
Age  Sex  Age  Sex  Age  Sex  Age  Sex  Age  Sex
20   M    32   F    34   F    29   M    30   M
30   F    30   M    14   M    29   M    11   M
38   F    22   M    44   F    48   F    26   F
25   M    22   M    32   F    35   F    32   F
35   F    42   F    44   F    44   F    48   F
Construct boxplots to compare the ages of men and women and write a sentence summarizing what you find.
18. The store manager from Exercise 2 has collected data on purchases from weekdays and weekends. Here are some summary statistics (rounded to the nearest dollar):
Weekdays: n = 230 Min = 4, Q1 = 28, Median = 40, Q3 = 68, Max = 95
Weekends: n = 150 Min = 10, Q1 = 35, Median = 55, Q3 = 70, Max = 100
From these statistics, construct side-by-side boxplots and write a sentence comparing the two distributions.
19. Here are boxplots of the weekly sales (in $ U.S.) over a two-year period for a regional food store for two locations. Location #1 is a metropolitan area that is known to be residential where shoppers walk to the store. Location #2 is a suburban area where shoppers drive to the store. Assume that the two towns have similar populations and that the two stores are similar in square footage. Write a brief report discussing what these data show.
[Figure: side-by-side boxplots of Weekly Sales ($) for Location #1 and Location #2]
20. Recall the distributions of the weekly sales for the regional stores in Exercise 19. Following are boxplots of weekly sales for this same food store chain for three stores
of similar size and location for two different states: Massachusetts (MA) and Connecticut (CT). Compare the distribution of sales for the two states and describe in a report.
[Figure: side-by-side boxplots of Weekly Sales ($) for MA Stores and CT Stores]
Section 3.9
21. The five-number summary for the total sales (in $B) of the top 20 world corporations is described by the table below:
Min   Q1    Q2    Q3     Max
29.2  45.1  65.2  129.9  180
Are there any outliers in these data? How can you tell? What might your next steps in the analysis be?
22. The five-number summary for the ages of respondents to a survey on laptop use looks like this:
Min  Q1  Q2  Q3  Max
14   25  39  51  82
Are there any outliers in these data? How can you tell? What might your next steps in the analysis be?
Section 3.10
23. Are the following data time series? If not, explain why. a) Quarterly earnings of Microsoft Corp. b) Unemployment in August 2010 by education level. c) Time spent in training by workers in NewCo. d) Numbers of e-mails sent by employees of SynCo each hour in a single day.
24. Are the following data time series? If not, explain why. a) Reports from the Bureau of Labor Statistics on the number of U.S. adults who are employed full time in each major sector of the economy. b) The quarterly Gross Domestic Product (GDP) of France from 1980 to the present. c) The dates on which a particular employee was absent from work due to illness over the past two years. d) The number of cases of flu reported by the CDC each week during a flu season.
Section 3.11
25. The histogram of the total revenues (in $M) of the movies in exercise 21 looks like this:
[Figure: histogram of total revenues (in $M); vertical axis: Count]
What might you suggest for the next step of the analysis?
26. The histogram of the ages of the respondents in exercise 22 looks like this:
[Figure: histogram of ages; vertical axis: Count]
What might you suggest for the next step of the analysis?
Chapter Exercises 27. Statistics in business. Find a histogram that shows the distribution of a quantitative variable in a business publication (e.g., The Wall Street Journal, Business Week, etc.). a) Does the article identify the W’s? b) Discuss whether the display is appropriate for the data. c) Discuss what the display reveals about the variable and its distribution. d) Does the article accurately describe and interpret the data? Explain. 28. Statistics in business, part 2. Find a graph other than a histogram that shows the distribution of a quantitative variable in a business publication (e.g., The Wall Street Journal, Business Week, etc.). a) Does the article identify the W’s? b) Discuss whether the display is appropriate for the data. c) Discuss what the display reveals about the variable and its distribution. d) Does the article accurately describe and interpret the data? Explain.
29. Years in education. According to the OECD’s 2013 How’s life? study, the longer a person has been educated, the higher their well-being. The following histogram shows the distribution of years of education for 34 OECD countries, Russia, and Brazil.
[Figure: histogram of Years of Education; vertical axis: Count]
a) Write a short description of this distribution (shape, center, spread, unusual features). b) Which two bins hold the most values for years of education? c) If both primary and high school take six years in most countries, how many years of higher education do inhabitants of the OECD countries enjoy?
30. Gas prices 2013. The website LosAngelesGasPrices.com has current gasoline prices all over the United States. In the week of February 5, 2013, the following histogram shows the gas prices at 55 stations in the San Francisco Bay Area. Describe the shape of this distribution (shape, center, spread, unusual features).
[Figure: histogram of Price per Gallon ($); vertical axis: # of Stations]
31. Mutual funds 2013. In 2013, the Standard & Poor’s (S&P) 500 stock index reached a new all time high in early April. For the first quarter of 2013, the index was up 10.0%. Here is a histogram of the returns for Money Magazine’s top 70 mutual funds for the same period (money.cnn.com/magazines/moneymag/bestfunds/index.html).
[Figure: histogram of Year to Date Return (in %); vertical axis: # of Funds]
a) From the histogram, give a short summary of the distribution (shape, center, spread, unusual features). b) In general, how did these funds perform compared to the S&P 500?
32. Car discounts. A researcher, interested in studying gender differences in negotiations, collects data on the prices that men and women pay for new cars. Here is a histogram of the discounts (the amount in $ below the list price) that men and women received at one car dealership for the last 100 transactions (54 men and 46 women). Give a short summary of this distribution (shape, center, spread, unusual features). What do you think might account for this particular shape?
[Figure: histogram of Amount of Discount; vertical axis: Number of Shoppers]
33. Mutual funds 2013, part 2. Use the actual data set of Exercise 31 to answer the following questions. a) Find the five-number summary for these data. b) Find appropriate measures of center and spread for these data. c) Create a boxplot for these data. d) What can you see, if anything, in the histogram that isn’t clear in the boxplot?
34. Car discounts, part 2. Use the data set of Exercise 32 to answer the following questions. a) Find the five-number summary for these data. b) Create a boxplot for these data. c) What can you see, if anything, in the histogram of Exercise 32 that isn’t clear in the boxplot?
*35. Vineyards. The data set provided contains the data from Exercises 10 and 15. Create a stem-and-leaf display of the sizes of the vineyards in acres. Point out any unusual features of the data that you can see from the stem-and-leaf.
*36. Gas prices 2013, again. The data set provided contains the data from Exercise 30 on the price of gas for 55 stations around San Francisco in January 2013. Create a stem-and-leaf display of the data. Point out any unusual features of the data that you can see from the stem-and-leaf.
37. Gretzky. During his 20 seasons in the National Hockey League, Wayne Gretzky scored 50% more points than anyone else who ever played professional hockey. He accomplished this amazing feat while playing in 280 fewer games than Gordie Howe, the previous record holder. Here are the number of games Gretzky played during each season: 79, 80, 80, 80, 74, 80, 80, 79, 64, 78, 73, 78, 74, 45, 81, 48, 80, 82, 82, 70 *a) Create a stem-and-leaf display. b) Sketch a boxplot. c) Briefly describe this distribution. d) What unusual features do you see in this distribution? What might explain this?
38. McGwire. In his 16-year career as a player in major league baseball, Mark McGwire hit 583 home runs, placing him eighth on the all-time home run list (as of 2008). Here are the number of home runs that McGwire hit for each year from 1986 through 2001: 3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29 *a) Create a stem-and-leaf display. b) Sketch a boxplot. c) Briefly describe this distribution. d) What unusual features do you see in this distribution? What might explain this?
39. Gretzky returns. Look once more at data of hockey games played each season by Wayne Gretzky, seen in Exercise 37. a) Would you use the mean or the median to summarize the center of this distribution? Why? b) Without actually finding the mean, would you expect it to be lower or higher than the median? Explain. c) A student was asked to make a histogram of the data in Exercise 37 and produced the following. Comment.
[Figure: student’s display of Games Played against Year, 1979 to 1997]
40. McGwire, again. Look once more at data of home runs hit by Mark McGwire during his 16-year career as seen in Exercise 38. a) Would you use the mean or the median to summarize the center of this distribution? Why? b) Find the median. c) Without actually finding the mean, would you expect it to be lower or higher than the median? Explain. d) A student was asked to make a histogram of the data in Exercise 38 and produced the following. Comment.
[Figure: student’s display of Home Runs against Year, 1986 to 2000]
41. Pizza prices. The weekly prices of one brand of frozen pizza over a three-year period in Dallas are provided in the data file. Use the price data to answer the following questions. a) Find the five-number summary for these data. b) Find the range and IQR for these data. c) Create a boxplot for these data. d) Describe this distribution. e) Describe any unusual observations.
42. Pizza prices, part 2. The weekly prices of one brand of frozen pizza over a three-year period in Chicago are provided in the data file. Use the price data to answer the following questions. a) Find the five-number summary for these data. b) Find the range and IQR for these data. c) Create a boxplot for these data. d) Describe the shape (center and spread) of this distribution. e) Describe any unusual observations.
43. Student skills. Student skills relevant for the labor market contribute to a country’s level of well-being, according to the OECD’s 2013 How’s life? study. Higher skill levels lead to better well-being. The following data show the scores for student skills in 2013 for the OECD countries, Russia, and Brazil. Write a report on the student skills by country in 2013, being sure to include appropriate graphical displays and summary statistics.
Country          Student Skills   Country             Student Skills
Australia        519              Korea               541
Austria          487              Luxembourg          482
Belgium          509              Mexico              420
Canada           527              Netherlands         519
Chile            439              New Zealand         524
Czech Republic   490              Norway              500
Denmark          499              Poland              501
Estonia          514              Portugal            490
Finland          543              Slovak Republic     488
France           497              Slovenia            499
Germany          510              Spain               484
Greece           473              Sweden              496
Hungary          496              Switzerland         517
Iceland          501              Turkey              455
Ireland          497              United Kingdom      500
Israel           459              United States       496
Italy            486              Brazil              401
Japan            529              Russian Federation  469
44. OECD 2011. Established in Paris in 1961, the Organisation for Economic Co-operation and Development (OECD) (www.oecd.org) collects information on many economic and social aspects of countries around the world. Here are the 2011 GDP growth rates (in percentages) of 35 industrialized countries. Write a brief report on the 2011 GDP growth rates of these countries being sure to include appropriate graphical displays and summary statistics.
Country          2011 GDP Growth Rate   Country             2011 GDP Growth Rate
Australia        2.25                   Luxembourg          1.66
Austria          2.7                    Mexico              3.92
Belgium          1.78                   Netherlands         0.99
Canada           2.41                   New Zealand         1.08
Chile            5.99                   Norway              1.22
Czech Republic   1.89                   Poland              4.32
Denmark          1.1                    Portugal            -1.55
Estonia          8.28                   Slovak Republic     3.23
Finland          2.74                   Slovenia            0.6
France           1.7                    Spain               0.42
Germany          3.03                   Sweden              3.71
Greece           -7.11                  Switzerland         1.93
Hungary          1.65                   Turkey              8.5
Iceland          2.56                   United Kingdom      0.92
Israel           4.6                    United States       1.8
Italy            0.44                   Russian Federation  4.34
Japan            -0.75                  South Africa        3.46
Korea            3.63
45. Golf courses. A start-up company is planning to build a new golf course. For marketing purposes, the company would like to be able to advertise the new course as one of the more difficult courses in the state of Vermont. One measure of the difficulty of a golf course is its length: the total distance (in yards) from tee to hole for all 18 holes. Here are the histogram and summary statistics for the lengths of all the golf courses in Vermont.
[Figure: histogram of Total Length (yd); vertical axis: # of VT Golf Courses]
Count   45
Mean    5892.91 yd
StdDev  386.59
Min     5185
Q1      5585.75
Median  5928
Q3      6131
Max     6796
a) What is the range of these lengths? b) Between what lengths do the central 50% of these courses lie? c) What summary statistics would you use to describe these data? d) Write a brief description of these data (shape, center, and spread).
46. Real estate. A real estate agent has surveyed houses in 20 nearby ZIP codes in an attempt to put together a comparison for a new property that she would like to put on the market. She knows that the size of the living area of a house is a strong factor in the price, and she’d like to market this house as being one of the biggest in the area. Here is a histogram and summary statistics for the sizes of all the houses in the area.
[Figure: histogram of Living Space Area (sq. ft); vertical axis: Frequency]
Count    1057
Mean     1819.498 sq. ft
StdDev   662.9414
Min      672
Q1       1342
Median   1675
Q3       2223
Max      5228
Missing  0
a) What is the range of these sizes? b) Between what sizes do the central 50% of these houses lie? c) What summary statistics would you use to describe these data? d) Write a brief description of these data (shape, center, and spread). 47. Food sales. Sales (in $) for one week were collected for 18 stores in a food store chain in the northeastern United States. The stores and the towns they are located in vary in size. a) Make a suitable display of the sales from the data provided. b) Summarize the central value for sales for this week with a median and mean. Why do they differ?
c) Given what you know about the distribution, which of these measures does the better job of summarizing the stores’ sales? Why? d) Summarize the spread of the sales distribution with a standard deviation and with an IQR. e) Given what you know about the distribution, which of these measures does the better job of summarizing the spread of stores’ sales? Why? f) If we were to remove the outliers from the data, how would you expect the mean, median, standard deviation, and IQR to change? 48. Insurance profits. Insurance companies don’t know whether a policy they’ve written is profitable until the policy matures (expires). To see how they’ve performed recently, an analyst looked at mature policies and investigated the net profit to the company (in $). a) Make a suitable display of the profits from the data provided. b) Summarize the central value for the profits with a median and mean. Why do they differ? c) Given what you know about the distribution, which of these measures might do a better job of summarizing the company’s profits? Why? d) Summarize the spread of the profit distribution with a standard deviation and with an IQR. e) Given what you know about the distribution, which of these measures might do a better job of summarizing the spread in the company’s profits? Why? f) If we were to remove the outliers from the data, how would you expect the mean, median, standard deviation, and IQR to change? 49. iPod failures. In the early days of the iPod, MacInTouch (www.macintouch.com/reliability/ipodfailures.html) surveyed readers about reliability. Of the 8926 iPods owned at that time, 7510 were problem-free while the other 1416 failed. From the data on the CD, compute the failure rate for each of the 17 iPod models. Produce an appropriate graphical display of the failure rates and briefly describe the distribution. (To calculate the failure rate, divide the number failed by the sum of the number failed and the number OK for each model and then multiply by 100.) 
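The failure-rate calculation described in Exercise 49 (number failed divided by the sum of number failed and number OK, times 100) can be sketched in Python as follows. This is only an illustration: the function name `failure_rate` and the per-model counts are hypothetical, and only the overall totals (1416 failed, 7510 problem-free) come from the exercise.

```python
def failure_rate(failed, ok):
    """Return the failure rate in percent: failed / (failed + ok) * 100."""
    total = failed + ok
    if total == 0:
        raise ValueError("no units reported for this model")
    return 100 * failed / total


# Overall totals quoted in the exercise: 1416 failed, 7510 problem-free.
overall = failure_rate(1416, 7510)
print(round(overall, 2))

# Hypothetical per-model counts (failed, ok); not the real survey data.
models = {"Model A": (20, 180), "Model B": (5, 95)}
for name, (failed, ok) in models.items():
    print(name, failure_rate(failed, ok))
```

Applying the same function to each of the 17 models gives the values to display and describe.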
50. Unemployment 2012. The data set provided contains 2012 (2nd quarter) unemployment rates for 34 developed countries (www.oecd.org). Produce an appropriate graphical display and briefly describe the distribution of unemployment rates. Report and comment on any outliers you may see.
51. Gas prices 2012. A driver has recorded and posted on the Internet (www.randomuseless.info/gasprice/gasprice.html) the price he paid for gasoline at every purchase from 1979 to 2012. Since 1984 all purchases were self-serve and all were for premium (92-93 octane) gas. He has also standardized the prices to April 1979 dollars. Here are boxplots for 2003, 2006, 2009, and 2012:
[Figure: side-by-side boxplots of Price (1979 $) by Year for 2003, 2006, 2009, and 2012]
a) Compare the distribution of prices over the four years. b) In which year were the prices least stable (most volatile)? Explain.
52. Fuel economy. American automobile companies are becoming more motivated to improve the fuel efficiency of the automobiles they produce. It is well known that fuel efficiency is impacted by many characteristics of the car. Describe what these boxplots tell you about the relationship between the number of cylinders a car’s engine has and the car’s fuel economy (mpg).
[Figure: side-by-side boxplots of Fuel Efficiency (mpg) by Cylinders (4, 5, 6, 8)]
53. Wine prices 2013. The boxplots display bottle prices (in dollars) of dry Riesling wines produced by vineyards along three of the Finger Lakes in upstate New York.
[Figure: side-by-side boxplots of Price by Region (Cayuga, Keuka, Seneca)]
a) Which lake region produced the most expensive wine? b) Which lake region produced the cheapest wine? c) In which region were the wines generally more expensive? d) Write a few sentences describing these prices.
54. Ozone. Historic ozone levels (in parts per billion, ppb) were recorded at sites in New Jersey monthly. Here are boxplots of the data for each month (over 46 years) lined up in order (January = 1).
[Figure: side-by-side boxplots of Ozone (ppb) by Month (1 to 12)]
a) In what month was the highest ozone level ever recorded? b) Which month has the largest IQR? c) Which month has the smallest range? d) Write a brief comparison of the ozone levels in January and June. e) Write a report on the annual patterns you see in the ozone levels.
55. Derby speeds. How fast do horses run? Kentucky Derby winners top 30 mph, as shown in the graph. This graph shows the percentage of Kentucky Derby winners that have run slower than a given speed. Note that few have won running less than 33 mph, but about 95% of the winning horses have run less than 37 mph. (A cumulative frequency graph like this is called an ogive.)
[Figure: ogive of % Below against Winning Speed (mph)]
a) Estimate the median winning speed. b) Estimate the quartiles. c) Estimate the range and the IQR. d) Create a boxplot of these speeds. e) Write a few sentences about the speeds of the Kentucky Derby winners.
56. Mutual funds, historical. Here is an ogive of the distribution of monthly returns for a group of aggressive (or high growth) mutual funds over a period of 25 years. (Recall from Exercise 55 that an ogive, or cumulative relative frequency graph, shows the percent of cases at or below a certain value. Thus this graph always begins at 0% and ends at 100%.)
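To make the definition of an ogive concrete, the Python sketch below computes the points of a cumulative relative frequency graph (the percent of cases at or below each value) and reads a rough median off it. It is a minimal illustration under stated assumptions: the function names `ogive_points` and `value_at_percent` and the return values are hypothetical, not the data behind the graph in this exercise. A plotted ogive would also start at 0% just below the smallest value.

```python
def ogive_points(values):
    """Return (value, cumulative percent at or below value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered, start=1):
        if points and points[-1][0] == v:
            # Tied values share one point; keep the highest percent.
            points[-1] = (v, 100 * i / n)
        else:
            points.append((v, 100 * i / n))
    return points


def value_at_percent(points, percent):
    """Read the ogive: smallest value whose cumulative percent reaches `percent`."""
    for value, cum in points:
        if cum >= percent:
            return value
    return points[-1][0]


# Hypothetical monthly returns (%), for illustration only.
returns = [-12.0, -4.5, -1.0, 0.5, 0.5, 2.0, 3.5, 6.0, 9.0, 15.0]
pts = ogive_points(returns)

print(pts[-1])                    # the last point reaches 100%
print(value_at_percent(pts, 50))  # rough median read off the curve
print(value_at_percent(pts, 25), value_at_percent(pts, 75))  # rough quartiles
```

Reading the 25%, 50%, and 75% levels from such a curve is the same operation the exercise asks you to do by eye on the printed graph.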
[Figure: ogive of Cumulative Percent vs. Mutual Fund Returns (%)]
a) Estimate the median. b) Estimate the quartiles. c) Estimate the range and the IQR.
57. Test scores. Three Statistics classes all took the same test. Here are histograms of the scores for each class.
[Figure: histograms of # of Students vs. test score for Class 1, Class 2, and Class 3]
a) Which class had the highest mean score? b) Which class had the highest median score? c) For which class are the mean and median most different? Which is higher? Why? d) Which class had the smallest standard deviation? e) Which class had the smallest IQR?
58. Test scores, again. Look again at the histograms of test scores for the three Statistics classes in Exercise 57. a) Overall, which class do you think performed better on the test? Why? b) How would you describe the shape of each distribution? c) Match each class with the corresponding boxplot.
[Figure: boxplots A, B, and C of Scores]
59. Quality control holes. Engineers at a computer production plant tested two methods for accuracy in drilling holes into a PC board. They tested how fast they could set the drilling machine by running 10 boards at each of two different speeds. To assess the results, they measured the distance (in inches) from the center of a target on the board to the center of the hole. The data and summary statistics are shown in the table.
          Fast        Slow
          0.000101    0.000098
          0.000102    0.000096
          0.000100    0.000097
          0.000102    0.000095
          0.000101    0.000094
          0.000103    0.000098
          0.000104    0.000096
          0.000102    0.975600
          0.000102    0.000097
          0.000100    0.000096
Mean      0.000102    0.097647
StdDev    0.000001    0.308481
Write a report summarizing the findings of the experiment. Include appropriate visual and verbal displays of the distributions, and make a recommendation to the engineers if they are most interested in the accuracy of the method.
60. Fire sale. A real estate agent notices that houses with fireplaces often fetch a premium in the market and wants to assess the difference in sales price of 60 homes that recently sold. The data and summary are shown in the table.
No Fireplace: 142,212 206,512 50,709 108,794 68,353 123,266 80,248 135,708 122,221 128,440 221,925 65,325 87,588 88,207 148,246 205,073 185,323 71,904 199,684 81,762 45,004 62,105 79,893 88,770 115,312 118,952
  Mean 116,597.54    Median 112,053
Fireplace: 134,865 118,007 138,297 129,470 309,808 157,946 173,723 140,510 151,917 235,105,000 259,999 211,517 102,068 115,659 145,583 116,289 238,792 310,696 139,079 109,578 89,893 132,311 131,411 158,863 130,490 178,767 82,556 122,221 84,291 206,512 105,363 103,508 157,513 103,861
  Mean 7,061,657.74    Median 136,581
Write a report summarizing the findings of the investigation. Include appropriate visual and verbal displays of the distributions, and make a recommendation to the agent about the average premium that a fireplace is worth in this market. 61. Customer database. A philanthropic organization has a database of millions of donors that they contact by mail to raise money for charities. One of the variables in the database, Title, contains the title of the person or persons printed on the address label. The most common are Mr., Ms., Miss, and Mrs., but there are also Ambassador and Mrs., Your Imperial Majesty, and Cardinal, to name a few others. In all, there are over 100 different titles, each with a corresponding numeric code.
Code    Title
000     MR.
001     MRS.
1002    MR. and MRS.
003     MISS
004     DR.
005     MADAME
006     SERGEANT
009     RABBI
010     PROFESSOR
126     PRINCE
127     PRINCESS
128     CHIEF
129     BARON
130     SHEIK
131     PRINCE AND PRINCESS
132     YOUR IMPERIAL MAJESTY
135     M. ET MME.
210     PROF.
...     ...
An intern who was asked to analyze the organization’s fundraising efforts presented these summary statistics for the variable Title.
Mean      54.41
StdDev    957.62
Median    1
IQR       2
n         94,649
a) What does the mean of 54.41 mean? b) What are the typical reasons that cause measures of center and spread to be as different as those in this table? c) Is that why these are so different?
62. CEOs. For each CEO, a code is listed that corresponds to the industry of the CEO’s company. Here are a few of the codes and the industries to which they correspond:
Industry Code    Industry
1                Financial services
2                Food/drink/tobacco
3                Health
4                Insurance
6                Retailing
9                Forest products
11               Aerospace/defense
12               Energy
14               Capital goods
16               Computers/communications
17               Entertainment/information
18               Consumer non-durables
19               Electric utilities
A recently hired investment analyst has been assigned to examine the industries and the compensations of the CEOs. To start the analysis, he produces the following histogram of industry codes.
[Figure: histogram of # of Companies vs. Industry Code]
a) What might account for the gaps seen in the histogram? b) What advice might you give the analyst about the appropriateness of this display?
63. Respected jobs. Values, in general, do have a cultural aspect. Using the World Values Survey project data, we can investigate how important it is for a person to have a respected job. Compare the importance of a respected job for the nine cultural regions using an appropriate display and write a brief summary of the differences.
64. Importance of pay. Using data from the World Values Survey project, we can investigate how culturally important it is to have a job that pays well. Compare the importance of a well-paying job for the nine cultural regions using an appropriate display and write a brief summary of the differences.
65. Houses for sale. Each house listed on the multiple listing service (MLS) is assigned a sequential ID number. A recently hired real estate agent decided to examine the MLS numbers in a recent random sample of homes for sale by one real estate agency in nearby towns. To begin the analysis, the agent produces the following histogram of ID numbers.
[Figure: histogram of Frequency vs. ID]
a) What might account for the distribution seen in the histogram? b) What advice might you give the analyst about the appropriateness of this display?
66. ZIP codes. Holes-R-Us, an Internet company that sells piercing jewelry, keeps transaction records on its sales. At a recent sales meeting, one of the staff presented the following histogram and summary statistics of the ZIP codes of the last 500 customers, so that the staff might understand where sales are coming from. Comment on the usefulness and appropriateness of this display.
[Figure: histogram of # of Customers vs. ZIP Code]
67. Hurricanes. Buying insurance for property loss from hurricanes has become increasingly difficult since Hurricane Katrina caused record property loss damage. Many companies have refused to renew policies or write new ones. The data set provided contains the total number of hurricanes by every full decade from 1851 to 2000 (from the National Hurricane Center). Some scientists claim that there has been an increase in the number of hurricanes in recent years. a) Create a histogram of these data. b) Describe the distribution. c) Create a time series plot of these data. d) Discuss the time series plot. Does this graph support the claim of these scientists, at least up to the year 2000?
68. Hurricanes, part 2. Using the hurricanes data set, examine the number of major hurricanes (category 3, 4, or 5) by every full decade from 1851 to 2000. a) Create a histogram of these data. b) Describe the distribution. c) Create a timeplot of these data. d) Discuss the timeplot. Does this graph support the claim of scientists that the number of major hurricanes has been increasing (at least up through the year 2000)?
69. Productivity study. The National Center for Productivity releases information on the efficiency of workers. In a recent report, they included the following graph showing a rapid rise in productivity. What questions do you have about this?
[Figure: graph of Productivity]
70. Productivity study revisited. A second report by the National Center for Productivity analyzed the relationship between productivity and wages. They used the graph from Exercise 69, with the x-axis labeled “wages”. Comment on any problems you see with their analysis. 71. Finnish education. Finland has a longstanding reputation for its national educational system. According to the OECD’s most recent PISA study, average proficiency in mathematics for youth in 65 OECD countries equals 467.6, with a standard deviation of 59.8, whereas average reading proficiency equals 464.4, with a standard deviation of 51.6. Finnish youth scored 532 and 538 respectively. Which score is more unusual? Explain. 72. Tuition, 2008. The data set provided contains the average tuition of private four-year colleges and universities as well as the average 2007–2008 tuitions for each state. The mean tuition charged by a public two-year college was $2763, with a standard deviation of $988. For private
four-year colleges the mean was $21,259, with a standard deviation of $6241. Which would be more unusual: a state whose average public two-year college is $700 or a state whose average private four-year college tuition was $10,000? Explain.
73. Food consumption. FAOSTAT, the Food and Agriculture Organization of the United Nations, collects information on the production and consumption of more than 200 food and agricultural products for 200 countries around the world. Here are two tables, one for meat consumption (per capita in kg per year) and one for alcohol consumption (per capita in gallons per year). The United States leads in meat consumption with 267.30 pounds, while Ireland is the largest alcohol consumer at 55.80 gallons. Using z-scores, find which of these two countries is the larger consumer of both meat and alcohol together.
Country           Alcohol    Meat
Australia         29.56      242.22
Austria           40.46      242.22
Belgium           34.32      197.34
Canada            26.62      219.56
Czech Republic    43.81      166.98
Denmark           40.59      256.96
Finland           25.01      146.08
France            24.88      225.28
Germany           37.44      182.82
Greece            17.68      201.30
Hungary           29.25      179.52
Iceland           15.94      178.20
Ireland           55.80      194.26
Italy             21.68      200.64
Japan             14.59      93.28
Luxembourg        34.32      197.34
Mexico            13.52      126.50
Netherlands       23.87      201.08
New Zealand       25.22      228.58
Norway            17.58      129.80
Poland            20.70      155.10
Portugal          33.02      194.92
Slovakia          26.49      121.88
South Korea       17.60      93.06
Spain             28.05      259.82
Sweden            20.07      155.32
Switzerland       25.32      159.72
Turkey            3.28       42.68
United Kingdom    30.32      171.16
United States     26.36      267.30
74. World Bank. The World Bank, through their Doing Business project (www.doingbusiness.org), ranks nearly 200 economies on the ease of doing business. One of their rankings measures the ease of starting a business and is made up (in part) of the following variables: number of required start-up procedures, average start-up time (in days), and average start-up cost (in % of per capita income). The following table gives the mean and standard deviations of these variables for 95 economies.
        Procedures (#)    Time (Days)    Cost (%)
Mean    7.9               27.9           14.2
SD      2.9               19.6           12.9
Here are the data for three countries.
             Procedures    Time    Cost
Spain        10            47      15.1
Guatemala    11            26      47.3
Fiji         8             46      25.3
a) Use z-scores to combine the three measures. b) Which country has the best environment after combining the three measures? Be careful: a lower rank indicates a better environment to start up a business.
75. Youth unemployment rate. Global youth unemployment rate is increasing at such a pace over time that the International Labour Organization (ILO) opted for ‘A generation at risk’ in its recent report Global Employment Trends for Youth 2013. Data on youth unemployment is based on the ILO database. a) Create a histogram of the data and describe the distribution. b) Create a time series plot of the data and describe the trend. c) Which graphical display seems the more appropriate for these data? Explain.
76. Youth-to-adult unemployment rate ratio. The ILO’s concern about employment status of youth worldwide is based not only on the development of youth unemployment itself, but also the unfavorable comparison to the development in adult unemployment. In order to compare both unemployment rates, the ILO uses the youth-to-adult ratio of unemployment rate over time. a) Create a histogram of this ratio and describe the distribution. b) Create a time series plot of the ratio and describe the trend. c) Which graphical display seems the more appropriate for these data? Explain.
*77. Unemployment rate, 2013. The histogram shows the monthly U.S. unemployment rate from January 2003 to January 2013 (data.bls.gov/timeseries/LNS14000000).
[Figure: histogram of # of Months vs. Unemployment Rate (%)]
Here is the time series plot for the same data.
[Figure: time series plot of Unemployment Rate vs. Year, 2004 to 2012]
a) What features of the data can you see in the histogram that aren’t clear in the time series plot? b) What features of the data can you see in the time series plot that aren’t clear in the histogram? c) Which graphical display seems the more appropriate for these data? Explain. d) Write a brief description of unemployment rates over this time period in the United States.
*78. Consumer Price Index (CPI). Here is a histogram of the monthly CPI as reported by the Bureau of Labor Statistics (ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt) from 2008 through 2012:
[Figure: histogram of monthly CPI]
Here is the time series plot for the same data.
[Figure: time series plot of CPI vs. Year, 2008 to 2012]
a) What features of the data can you see in the histogram that aren’t clear from the time series plot? b) What features of the data can you see in the time series plot that aren’t clear in the histogram? c) Which graphical display seems the more appropriate for these data? Explain. d) Write a brief description of monthly CPI over this time period.
79. Assets. Here is a histogram of the assets (in millions of dollars) of 79 companies chosen from the Forbes list of the nation’s top corporations.
[Figure: histogram of # of Companies vs. Assets]
a) What aspect of this distribution makes it difficult to summarize, or to discuss, center and spread? b) What would you suggest doing with these data if we want to understand them better?
80. Assets, again. Here are the same data you saw in Exercise 79 after re-expressions as the square root of assets and the logarithm of assets.
[Figure: histograms of # of Companies vs. square root of Assets and vs. Log (Assets)]
a) Which re-expression do you prefer? Why? b) In the square root re-expression, what does the value 50 actually indicate about the company’s assets?
Just Checking Answers
1 Incomes are probably skewed to the right and not symmetric, making the median the more appropriate measure of center. The mean will be influenced by the high end of family incomes and not reflect the “typical” family income as well as the median would. It will give the impression that the typical income is higher than it is.
2 An IQR of 30 mpg would mean that only 50% of the cars get gas mileages in an interval 30 mpg wide. Fuel economy doesn’t vary that much. 3 mpg is reasonable. It seems plausible that 50% of the cars will be within about 3 mpg of each other. An IQR of 0.3 mpg would mean that the gas mileage of half the cars varies little from the estimate. It’s unlikely that cars, drivers, and driving conditions are that consistent.
3 We’d prefer a standard deviation of 2 months. Making a consistent product is important for quality. Customers want to be able to count on the MP3 player lasting somewhere close to 5 years, and a standard deviation of 2 years would mean that life spans were highly variable.
4 Correlation and Linear Regression
Amazon.com
Amazon.com opened for business in July 1995, billing itself even then as “Earth’s Biggest Bookstore,” with an unusual business plan: They didn’t plan to turn a profit for four to five years. Although some shareholders complained when the dotcom bubble burst, Amazon continued its slow, steady growth, becoming profitable for the first time in 2002. Since then, Amazon has remained profitable and has continued to grow. By 2011, sales had topped $48 billion, of which 44% were international sales. In 2012 Amazon was ranked the 20th most valuable brand by Business Week. Amazon’s selection of merchandise has expanded to include almost anything you can imagine, from $400,000 necklaces, to yak cheese from Tibet, to the largest book in the world. Amazon R&D is constantly monitoring and evolving their website to best serve their customers and maximize their sales performance. To make changes to the site, they experiment by collecting data and analyzing what works best. As Ronny Kohavi, former director of Data Mining and Personalization, said, “Data trumps intuition. Instead of using our intuition, we experiment on the live site and let our customers tell us what works for them.”
WHO Books sold by Amazon
WHAT List Price and Weight
UNITS $ and Ounces
WHEN 2012
WHERE Online
WHY Originally collected as a class project

Even with the rapid growth of e-books (about 15% growth per year), sales of traditional ink on paper books have held their own. Of course, one difference between e-books and print books is that a print book has weight, and that can help predict its price. The weight can account for the materials in the book’s manufacture as well as other related costs. Amazon makes a variety of facts available for most books it sells, including the shipping weight (in ounces), the number of pages, the dimensions, and of course, the price. Figure 4.1 shows a plot of List Price ($) against (shipping) Weight (oz) for 307 books selected¹ from Amazon’s offerings. Clearly price and weight are related. If you were asked to summarize the relationship, what would you say?
[Figure 4.1: List Price ($) vs. Weight (oz) for 307 books sold by Amazon]
There is an overall trend; heavier books tend to cost more. But the relationship is far from perfect. This plot is an example of a scatterplot. It plots one quantitative variable against another. Just by looking at a scatterplot, you can see patterns, trends, relationships, and the occasional unusual cases standing apart from the general pattern. Scatterplots are the best way to start observing the relationship between two quantitative variables. Relationships between variables are often at the heart of what we’d like to learn from data.
• Is consumer confidence related to oil prices?
• Are customers who consult online help sites as satisfied with a company’s customer relations as those who speak with human customer support specialists?
• Is an increase in money spent on advertising related to sales?
• What is the relationship between a stock’s sales volume and its price?
Questions such as these relate two quantitative variables and ask whether there is an association between them. Scatterplots are the ideal way to picture such associations.
4.1 Looking at Scatterplots
The Texas Transportation Institute, which studies the mobility provided by the nation’s transportation system, issues an annual report on traffic congestion and its costs to society and business. Figure 4.2 shows a scatterplot of the annual Congestion Cost Per Person of traffic delays (in dollars) in 65 cities in the United States against the Peak Period Freeway Speed (mph).

¹ The books were selected by “walking” randomly through Amazon’s offerings, starting at a variety of haphazardly selected books and then selecting a randomly chosen book from the “Customers who bought this item also bought” list, and continuing the randomly generated thread. Although it is not a random sample, for our purposes it can be regarded as a representative collection of books.
M04_SHAR8696_03_SE_C04.indd 126
14/07/14 7:26 AM
www.freebookslides.com
Looking at Scatterplots
UNITS
WHEN WHERE WHY
Cities in the United States Congestion Cost Per Person and Peak Period Freeway Speed Congestion Cost Per Person ($ per person per year); Peak Period Freeway Speed (mph) 2000 United States To examine the relationship between congestion on the highways and its impact on society and business
700 Congestion Cost Per Person ($ per year)
WHO WHAT
127
600 500 400 300 200 100 0 45.0
47.5 50.0 52.5 55.0 Peak Period Freeway Speed (mph)
57.5
60.0
Figure 4.2 Congestion Cost Per Person ($ per year) of traffic delays against Peak Period Freeway Speed (mph) for 65 U.S. cities.
Everyone looks at scatterplots. But many people would find it hard to say what to look for in a scatterplot. What do you see? Try to describe the scatterplot of Congestion Cost against Freeway Speed.

Look for Direction: What’s the sign (positive, negative, or neither)?
You might say that the direction of the association is important. As the peak freeway speed goes up, the cost of congestion goes down. A pattern that runs from the upper left to the lower right is said to be negative. A pattern running the other way, as we saw for the price and weight of books, is called positive.

Look for Form: Straight, curved, something exotic, or no pattern?
The second thing to look for in a scatterplot is its form. If there is a straight line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. For example, the scatterplot of traffic congestion has an underlying linear form, although some points stray away from it. Scatterplots can reveal many different kinds of patterns. Often they will not be straight, but straight line patterns are both the most common and the most useful for statistics. If the relationship isn’t straight, but curves gently, while still increasing or decreasing steadily, we can often find ways to straighten it out. But if it curves sharply (up and then down, for example), then you’ll need more advanced methods.

Look for Strength: How much scatter?
The third feature to look for in a scatterplot is the strength of the relationship. At one extreme, do the points appear tightly clustered in a single stream (whether straight, curved, or bending all over the place)? Or, at the other extreme, do the points seem to be so variable and spread out that we can barely discern any trend or pattern? The traffic congestion plot shows moderate scatter around a generally straight form. That indicates that there’s a moderately strong linear relationship between cost and speed.

Look for Unusual Features: Are there unusual observations or subgroups?
Finally, always look for the unexpected. Often the most interesting discovery in a scatterplot is something you never thought to look for. One example of such a surprise is an unusual observation, or outlier, standing away from the overall pattern of the scatterplot. Such a point is almost always interesting and deserves special attention. You may see entire clusters or subgroups that stand away or show a trend in a different direction than the rest of the plot. That should raise questions about why they are different. They may be a clue that you should split the data into subgroups instead of looking at them all together.
For Example
Creating a scatterplot
The first automobile crash in the United States occurred in New York City in 1896, when a motor vehicle collided with a “pedalcycle” rider. Cycle/car accidents are a serious concern for insurance companies. About 53,000 cyclists have died in traffic crashes in the United States since 1932. Demographic information such as this is often available from government agencies. It can be useful to insurers, who use it to set appropriate rates, and to retailers, who must plan what safety equipment to stock and how to present it to their customers. This becomes a more pressing concern when the demographic profiles change over time. Here’s data on the mean age of cyclists killed each year from 1998 to 2010. (Source: National Highway Traffic Safety Administration, found at www-nrd.nhtsa.dot.gov/Pubs/811624.pdf)
Year    Mean Age
1998    32
1999    33
2000    35
2001    36
2002    37
2003    36
2004    39
2005    39
2006    41
2007    40
2008    41
2009    41
2010    42
Question Make a scatterplot and summarize what it says.
Answer The mean age of cyclist traffic deaths has been increasing almost linearly during this period. The trend is a strong one.
[Figure: scatterplot of Mean Age vs. Year, 1998 to 2010]
4.2 Assigning Roles to Variables in Scatterplots
Scatterplots were among the first modern mathematical displays. The idea of using two axes at right angles to define a field on which to display values can be traced back to René Descartes (1596–1650), and the playing field he defined in this way is formally called a Cartesian plane, in his honor. The two axes Descartes specified characterize the scatterplot. The axis that runs up and down is, by convention, called the y-axis, and the one that runs from side to side is called the x-axis. These terms are standard.² To make a scatterplot of two quantitative variables, assign one to the y-axis and the other to the x-axis. As with any graph, be sure to label the axes clearly, and indicate the scales of the axes with numbers. Scatterplots display quantitative variables. Each variable has units, and these should appear with the display, usually near each axis. Each point is placed on a scatterplot at a position that corresponds to values of these two variables. Its horizontal location is specified by its x-value, and its vertical location is specified by its y-value. Together, these are known as coordinates and written (x, y).
Descartes was a philosopher, famous for his statement cogito, ergo sum: I think, therefore I am.
[Figure: a point with coordinates (x, y) plotted against the x- and y-axes]
Scatterplots made by computer programs (such as the two we’ve seen in this chapter) often do not (and usually should not) show the origin, the point at x = 0, y = 0 where the axes meet. If both variables have values near or on both sides of zero, then the origin will be part of the display. If the values are far from zero, though, there’s no reason to include the origin. In fact, it’s far better to focus on the part of the Cartesian plane that contains the data. In our example about books, none of the books were free and all weighed something, so the computer drew the scatterplot in Figure 4.1 with axes that don’t quite meet. Which variable should go on the x-axis and which on the y-axis? What we want to know about the relationship can tell us how to make the plot. Amazon may have questions such as:
• How are prices related to sales volume?
• Are increased sales at Amazon reflected in its stock price?
• What offers will encourage shoppers to browse the Amazon site for a longer time?
In all of these examples, one variable plays the role of the explanatory or predictor variable, while the other takes on the role of the response variable. We place the explanatory variable on the x-axis and the response variable on the y-axis. When you make a scatterplot, you can assume that those who view it will think this way, so choose which variables to assign to which axes carefully.

² The axes are also called the “ordinate” and the “abscissa,” but we can never remember which is which because statisticians don’t generally use these terms. In Statistics (and in all statistics computer programs) the axes are generally called “x” (abscissa) and “y” (ordinate) and are usually labeled with the names of the corresponding variables.
Notation Alert So x and y are reserved letters as well, but not just for labeling the axes of a scatterplot. In Statistics, the assignment of variables to the x- and y-axes (and choice of notation for them in formulas) often conveys information about their roles as predictor or response.

The roles that we choose for variables have more to do with how we think about them than with the variables themselves. Just placing a variable on the x-axis doesn’t necessarily mean that it explains or predicts anything, and the variable on the y-axis may not respond to it in any way. The Amazon marketing department may want to predict prices, leading to Price as the response variable in Figure 4.1. But the shipping department is likely to be more interested in predicting the weights of books, so for them, Weight would be a natural response variable. The x- and y-variables are sometimes referred to as the independent and dependent variables, respectively. The idea is that the y-variable depends on the x-variable and the x-variable acts independently to make y respond. These names, however, conflict with other uses of the same terms in Statistics. Instead, we’ll sometimes use the terms “explanatory” or “predictor variable” and “response variable” when we’re discussing roles, but we’ll often just say x-variable and y-variable.
For Example
Assigning roles to variables
Question When examining the ages of victims in cycle/car accidents, why does it make the most sense to plot year on the x-axis and mean age on the y-axis? (See the example on page 128.)
Answer We are interested in how the age of accident victims might change over time, so we think of the year as the basis for prediction and the mean age of victims as the variable that is predicted.
4.3 Understanding Correlation
If you had to put a number (say, between 0 and 1) on the strength of the linear association between book prices and weights in Figure 4.1, what would it be? Your measure shouldn’t depend on the choice of units for the variables. After all, Amazon sells books in euros as well as in dollars and book weights can be recorded in grams rather than ounces, but regardless of the units, the scatterplot would look the same. When we change units, the direction, form, and strength won’t change, so neither should our measure of the association’s (linear) strength. We saw a way to remove the units in the previous chapter. We can standardize each of the variables, finding $z_x = \dfrac{x - \bar{x}}{s_x}$ and $z_y = \dfrac{y - \bar{y}}{s_y}$. With these, we can compute a measure of strength that you’ve probably heard of, the correlation coefficient:

$$r = \frac{\sum z_x z_y}{n - 1}.$$

Notation Alert The letter r is always used for correlation, so you can’t use it for anything else in Statistics. Whenever you see an “r,” it’s safe to assume it’s a correlation.

Keep in mind that the x’s and y’s are paired. For each book we have a price and a weight. To find the correlation we multiply each standardized value by the standardized value it is paired with and add up those cross products. We divide the total by the number of pairs minus one, $n - 1$.³

³ The same $n - 1$ we used for calculating the standard deviation.
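The z-score recipe for correlation can be sketched in a few lines of Python. This is only an illustration: the five book weights and prices are made-up values (not the Amazon data), the helper names `z_scores` and `correlation` are our own, and the exchange rate of 0.9 euros per dollar is an assumption chosen for the example. The sketch also checks the claim that changing units leaves the correlation unchanged.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1 divisor)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """r = (sum of z_x * z_y over the pairs) / (n - 1)."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical weights (oz) and list prices ($) for five books.
weight_oz = [8.0, 12.0, 16.0, 20.0, 30.0]
price_usd = [9.99, 14.95, 12.50, 24.00, 27.95]
r_original = correlation(weight_oz, price_usd)

# Change units: ounces to grams, dollars to euros (assumed rate of 0.9).
weight_g = [w * 28.35 for w in weight_oz]
price_eur = [p * 0.9 for p in price_usd]
r_converted = correlation(weight_g, price_eur)

print(round(r_original, 4), round(r_converted, 4))
```

Because standardizing removes the units, the two printed values agree; only a change that reverses the direction of one variable would flip the sign of r.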
There are alternative formulas for the correlation in terms of the variables x and y. Here are two of the more common:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}.$$
These formulas can be more convenient for calculating correlation by hand, but the form using z-scores is best for understanding what correlation means. No matter which formula you use, the correlation between List Price and Weight for the Amazon books is 0.498.
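As a rough check that the z-score form and the two alternative formulas agree, here is a minimal sketch in Python. The weights and prices are made-up illustrative values, not the Amazon data, and the function names are our own.

```python
import math
import statistics

def corr_z(xs, ys):
    """Correlation as the sum of z-score cross products divided by n - 1."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (n - 1)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def corr_dev(xs, ys):
    """Correlation from sums of deviations (square-root form)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def corr_sd(xs, ys):
    """Correlation as the sum of cross products over (n - 1) * sx * sy."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / ((n - 1) * statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical book weights (oz) and list prices ($), for illustration only.
weights = [5.0, 8.0, 12.0, 16.0, 20.0, 28.0]
prices = [12.0, 9.0, 18.0, 15.0, 25.0, 22.0]

r_z = corr_z(weights, prices)
r_dev = corr_dev(weights, prices)
r_sd = corr_sd(weights, prices)
print(round(r_z, 4), round(r_dev, 4), round(r_sd, 4))
```

All three functions should print the same value, up to floating-point rounding.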
For Example
Finding the correlation coefficient
Question What is the correlation of mean age and year for the cyclist accident data on page 128?
Answer Working by hand: $\bar{x} = 2004$, $s_x = 3.89$, $\bar{y} = 37.85$, $s_y = 3.26$. The sum of the cross products of the deviations is found as follows: $\sum (x - \bar{x})(y - \bar{y}) = 147$.
Putting the sum of the cross products in the numerator and $(n - 1) \times s_x \times s_y$ in the denominator, we get
$$\frac{147}{(13 - 1) \times 3.89 \times 3.26} = 0.96.$$
For mean age and year, the correlation coefficient is 0.96. That indicates a strong linear association. Because this is a time series, we refer to it as a strong “trend.”
Correlation Conditions
Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check three conditions:
• Quantitative Variables Condition: Correlation applies only to quantitative variables. Don’t apply correlation to categorical data masquerading as quantitative. Check that you know the variables’ units and what they measure.
• Linearity Condition: Sure, you can calculate a correlation coefficient for any pair of variables. But correlation measures the strength only of the linear association and will be misleading if the relationship is not straight enough. What is “straight enough”? This question may sound too informal for a statistical condition, but that’s really the point. We can’t verify whether a relationship is linear or not. Very few relationships between variables are perfectly linear, even in theory, and scatterplots of real data are never perfectly straight. How nonlinear looking would the scatterplot have to be to fail the condition? This is a judgment call that you just have to think about. Do you think that the underlying relationship is curved? If so, then summarizing its strength with a correlation would be misleading.
• Outlier Condition: Unusual observations can distort the correlation and can make an otherwise small correlation look big or, on the other hand, hide a large correlation. An outlier can even give an otherwise positive association a negative correlation coefficient (and vice versa). When you see an outlier, it’s often a good idea to report the correlation both with and without the point.
Each of these conditions is easy to check with a scatterplot. Many correlations are reported without supporting data or plots. You should still think about the conditions. You should be cautious in interpreting (or accepting others’ interpretations of) the correlation when you can’t check the conditions for yourself. Throughout this course, you’ll see that doing Statistics right means selecting the proper methods. That means you have to think about the situation at hand. An important first step is to check that the type of analysis you plan is appropriate. These conditions are just the first of many such checks.
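To see why the Outlier Condition matters, the following sketch computes a correlation with and without one unusual point. The data are invented for illustration; the helper `pearson_r` is our own and simply applies the deviation formula for r.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Correlation coefficient from sums of deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: eight points that lie close to a rising straight line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.1]

# The same data plus one made-up outlier far from the pattern.
xs_out = xs + [20]
ys_out = ys + [-10.0]

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs_out, ys_out)
print(f"without the outlier: r = {r_without:.3f}")
print(f"with the outlier:    r = {r_with:.3f}")
```

In this made-up case a single point is enough to turn a strong positive correlation into a negative one, which is why reporting the correlation both with and without the point is good practice.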
Just Checking
For the years 1992 to 2002, the quarterly stock prices of the semiconductor companies Cypress and Intel have a correlation of 0.86.
1 Before drawing any conclusions from the correlation, what would you like to see? Why?
2 If your coworker tracks the same prices in euros, how will this change the correlation? Will you need to know the exchange rate between euros and U.S. dollars to draw conclusions?
3 If you standardize both prices, how will this affect the correlation?
4 In general, if on a given day the price of Intel is relatively low, is the price of Cypress likely to be relatively low as well?
5 If on a given day the price of Intel stock is high, is the price of Cypress stock definitely high as well?
Guided Example
Customer Spending A major credit card company sends an incentive to its best customers hoping that the customers will use the card more. Market analysts wonder how often they can offer the incentive. Will repeated offerings of the incentive result in repeated increased credit card use? To examine this question, an analyst took a random sample of 184 customers from their highest use segment and investigated the charges in each of the two months following the date that customers received the incentive.
Plan
Setup State the objective. Identify the quantitative variables to examine. Report the time frame over which the data have been collected and define each variable. (State the W’s.) Make the scatterplot and clearly label the axes to identify the scale and units.
Our objective is to investigate the association between the amount that a customer charges in each of the two months after they receive an incentive. The customers have been randomly selected from among the highest use segment of customers. The variables measured are the total credit card charges (in $) in the two months of interest.
✓ Quantitative Variable Condition. Both variables are quantitative. Both charges are measured in dollars. Because we have two quantitative variables measured on the same cases, we can make a scatterplot.
[Figure: scatterplot of Second Month’s Charge ($) against First Month’s Charge ($).]
Check the conditions.
✓ Linearity Condition. The scatterplot is straight enough.
✓ Outlier Condition. There are no obvious outliers.
Do
Mechanics Once the conditions are satisfied, calculate the correlation with technology.
The correlation is -0.391. The negative correlation coefficient confirms the impression from the scatterplot.
Report
Conclusion Describe the direction, form, and strength of the plot, along with any unusual points or features. Be sure to state your interpretation in the proper context.
Memo
Re: credit card spending
We have examined some of the data from the incentive program. In particular, we looked at the charges made in each of the first two months of the program. We noted that there was a negative association between charges in the second month and charges in the first month. The correlation was -0.391, which is only moderately strong, and indicates substantial variation. We’ve concluded that although the observed pattern is negative, these data do not allow us to find the causes of this behavior. It is likely that some customers were encouraged by the offer to increase their spending in the first month, but then returned to former spending patterns. It is possible that others didn’t change their behavior until the second month of the program, increasing their spending at that time. Without data on the customers’ pre-incentive spending patterns it would be hard to say more. We suggest further research, and we suggest that the next trial extend for a longer period of time to help determine whether the patterns seen here persist.
How Strong Is Strong? There’s little agreement on what the terms “weak,” “moderate,” and “strong” mean. The same correlation might be strong in one context and weak in another. A correlation of 0.7 between an economic index and stock market prices would be exciting, but finding “only” a correlation of 0.7 between a drug dose and blood pressure might be seen as a failure by a pharmaceutical company. Use these terms cautiously and be sure to report the correlation and show a scatterplot so others can judge the strength for themselves.
Correlation Properties
Because correlation is so widely used as a measure of association it’s a good idea to remember some of its basic properties. Here’s a useful list of facts about the correlation coefficient:
• The sign of a correlation coefficient gives the direction of the association.
• Correlation is always between −1 and +1. Correlation can be exactly equal to -1.0 or +1.0, but watch out. These values are unusual in real data because they mean that all the data points fall exactly on a single straight line.
• Correlation treats x and y symmetrically. The correlation of x with y is the same as the correlation of y with x.
• Correlation has no units. This fact can be especially important when the data’s units are somewhat vague to begin with (customer satisfaction, worker efficiency, productivity, and so on).
• Correlation is not affected by changes in the center or scale of either variable. Changing the units or baseline of either variable has no effect on the correlation coefficient because the correlation depends only on the z-scores.
• Correlation measures the strength of the linear association between the two variables. Variables can be strongly associated but still have a small correlation if the association is not linear.
• Correlation is sensitive to unusual observations. A single outlier can make a small correlation large or make a large one small.
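Several of these properties are easy to confirm numerically. The sketch below uses invented weights and prices; the exchange rate of 0.9 euros per dollar and the $5 baseline shift are arbitrary assumptions chosen only to change the units and center.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Correlation coefficient from sums of deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
weights_oz = [5.0, 8.0, 12.0, 16.0, 20.0, 28.0]
prices_usd = [12.0, 9.0, 18.0, 15.0, 25.0, 22.0]

# Change of units: ounces to grams, dollars to euros (assumed rate).
weights_g = [28.35 * w for w in weights_oz]
prices_eur = [0.9 * p for p in prices_usd]
# Change of baseline: add a flat $5 to every price.
prices_shifted = [p + 5.0 for p in prices_usd]

r = pearson_r(weights_oz, prices_usd)
r_swapped = pearson_r(prices_usd, weights_oz)   # x and y exchanged
r_units = pearson_r(weights_g, prices_eur)      # different units
r_shifted = pearson_r(weights_oz, prices_shifted)

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, "
      f"new units = {r_units:.4f}, shifted = {r_shifted:.4f}")
```

All four values should agree, and each lies between −1 and +1.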
Correlation Tables Sometimes you’ll see the correlations between each pair of variables in a dataset arranged in a table. The rows and columns of the table name the variables, and the cells hold the correlations.
            #Pages   Width   Thick   Pub year
#Pages       1.000
Width        0.003   1.000
Thick        0.813   0.074   1.000
Pub year     0.253   0.012   0.309   1.000
Table 4.1 A correlation table for some variables collected on a sample of Amazon books.
Correlation tables are compact and give a lot of summary information at a glance. The diagonal cells of a correlation table always show correlations of exactly 1.000, and the upper half of the table is symmetrically the same as the lower half (can you see why?), so by convention, only the lower half is shown. A table like this can be an efficient way to start looking at a large dataset, but be sure to check for linearity and unusual observations or the correlations in the table may be misleading or meaningless. Can you be sure, looking at Table 4.1, that the variables are linearly associated? Correlation tables are often produced by statistical software packages. Fortunately, these same packages often offer simple ways to make all the scatterplots you need to look at.4
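A correlation table like Table 4.1 can be produced with a few lines of code. This is only a sketch: the six books below are invented (they are not the sample behind Table 4.1), and a statistics package would normally do this for you.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Correlation coefficient from sums of deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements on six books, for illustration only.
data = {
    "#Pages":   [120, 250, 310, 400, 520, 640],
    "Width":    [5.5, 6.0, 5.2, 6.1, 5.8, 6.0],
    "Thick":    [0.4, 0.7, 0.9, 1.0, 1.4, 1.6],
    "Pub year": [1998, 2005, 2001, 2010, 2007, 2012],
}

names = list(data)
# Full symmetric matrix of pairwise correlations.
table = [[pearson_r(data[a], data[b]) for b in names] for a in names]

# Print only the lower half, as is conventional.
for i, name in enumerate(names):
    cells = "  ".join(f"{table[i][j]:6.3f}" for j in range(i + 1))
    print(f"{name:<9} {cells}")
```

The diagonal is 1.000 because each variable is perfectly correlated with itself, and the omitted upper half would mirror the lower half because correlation treats x and y symmetrically.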
4.4
Lurking Variables and Causation An educational researcher finds a strong association between height and reading ability among elementary school students in a nationwide survey. Taller children tend to have higher reading scores. Does that mean that students’ height causes their reading scores to go up? No matter how strong the correlation is between two variables, there’s no simple way to show from observational data that one variable causes the other. A high correlation just increases the temptation to think and to say that the x-variable causes the y-variable. Just to make sure, let’s repeat the point again. No matter how strong the association, no matter how large the r value, no matter how straight the form, there is no way to conclude from a high correlation alone that one variable causes the other. There’s always the possibility that some third variable—a lurking variable—is affecting both of the variables you have observed. In the reading score example, you may have already guessed that the lurking variable is the age of the child. Older children tend to be taller and have stronger reading skills. But even when the lurking variable isn’t as obvious, resist the temptation to think that a high correlation implies causation. Here’s another example.
4 A table of scatterplots arranged just like a correlation table is sometimes called a scatterplot matrix, or SPLOM, and is easily created using a statistics package.
Figure 4.3 Life Expectancy and numbers of Doctors per Person in 40 countries shows a fairly strong, positive, somewhat linear relationship with a correlation of 0.705.
Figure 4.3 shows the Life Expectancy (average of men and women, in years) for each of 40 countries of the world, plotted against the number of Doctors per Person in each country. The strong positive association (r = 0.705) seems to confirm our expectation that more Doctors per Person improves health care, leading to longer lifetimes and a higher Life Expectancy. Perhaps we should send more doctors to developing countries to increase life expectancy. If we increase the number of doctors, will the life expectancy increase? That is, would adding more doctors cause greater life expectancy? Could there be another explanation of the association?
Figure 4.4 shows another scatterplot. Life Expectancy is still the response, but this time the predictor variable is not the number of doctors, but the number of Televisions per Person in each country. The positive association in this scatterplot looks even stronger than the association in the previous plot. If we wanted to calculate a correlation, we should straighten the plot first, but even from this plot, it’s clear that higher life expectancies are associated with more televisions per person. Should we conclude that increasing the number of televisions extends lifetimes? If so, we should send televisions instead of doctors to developing countries. Not only is the association with life expectancy stronger, but televisions are cheaper than doctors.
Figure 4.4 Life Expectancy and number of Televisions per Person shows a strong, positive (although clearly not linear) relationship.
What’s wrong with this reasoning? Maybe we were a bit hasty earlier when we concluded that doctors cause greater life expectancy. Maybe there’s a lurking variable here. Countries with higher standards of living have both longer life expectancies and more doctors. Could higher living standards cause changes in the other variables? If so, then improving living standards might be expected to prolong lives, increase the number of doctors, and increase the number of televisions. From this example, you can see how easy it is to fall into the trap of mistakenly inferring causality from a correlation. For all we know, doctors (or televisions) do increase life expectancy. But we can’t tell that from data like these no matter how much we’d like to. Resist the temptation to conclude that x causes y from a correlation, no matter how obvious that conclusion seems to you.
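A small simulation can show how a lurking variable manufactures a correlation. The sketch below returns to the height and reading example: age drives both variables, and height has no effect on reading at all. Every number in it (the ages, the growth rate, the noise levels, the random seed) is an assumption made up for the illustration.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Correlation coefficient from sums of deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the run is repeatable

ages, heights, reading = [], [], []
for age in range(6, 12):          # simulated children aged 6 to 11
    for _ in range(200):
        ages.append(age)
        # Both variables depend on age, but not on each other.
        heights.append(100 + 6 * age + random.gauss(0, 5))   # cm
        reading.append(10 * age + random.gauss(0, 10))       # test score

r_all = pearson_r(heights, reading)

# Hold the lurking variable fixed: look only at the 8-year-olds.
idx = [i for i, a in enumerate(ages) if a == 8]
r_age8 = pearson_r([heights[i] for i in idx], [reading[i] for i in idx])

print(f"all children:     r = {r_all:.2f}")
print(f"8-year-olds only: r = {r_age8:.2f}")
```

Across all ages the correlation is strong, but among children of the same age it nearly vanishes, because within one age group nothing links height to reading in this simulated world.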
For Example
Understanding causation
Question An insurance company analyst suggests that the data on ages of cyclist accident deaths are actually due to the entire population of cyclists getting older and not to a change in the safe riding habits of older cyclists (see page 128). What would we call the mean cyclist age if we had that variable available?
Answer It would be a lurking variable. If the entire population of cyclists is aging then that would lead to the average age of cyclists in accidents increasing.
4.5
The Linear Model
“Statisticians, like artists, have the bad habit of falling in love with their models.” (George Box, Famous Statistician)
Let’s return to the relationship between Amazon book prices and weights. In Figure 4.1 (repeated here) we saw a moderate, positive, linear relationship, so we can summarize its strength with a correlation. For this relationship, the correlation is 0.498.
[Figure 4.1, repeated: scatterplot of List Price ($) against Weight (oz).]
That’s moderately strong, but the strength of the relationship is only part of the picture. Amazon’s management might want a deeper understanding of prices for trade books to compare with prices in the rapidly growing e-book market. That’s a reasonable business need but to meet it we’ll need a model for the trend. The correlation says that there seems to be a linear association between the variables, but it doesn’t tell us what that association is. Of course, we can say more. We can model the relationship with a line and give the equation. Specifically, we can find a linear model to describe the relationship we saw in Figure 4.1 between Price and Weight. A linear model is just an equation of a straight line through the data. The points in the scatterplot don’t all line up, but a straight line can summarize the general pattern with only a few parameters. This model can help us understand how the variables are associated.
Residuals
We know the model won’t be perfect. No matter what line we draw, it won’t go through many of the points. The best line might not even hit any of the points. Then how can it be the “best” line? We want to find the line that somehow comes closer to all the points than any other line. Some of the points will be above the line and some below. Any linear model can be written as $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are numbers estimated from the data and $\hat{y}$ (pronounced y-hat) is the predicted value. We use the hat to distinguish the predicted value from the observed value y. The difference between these two is called the residual: $e = y - \hat{y}$.
Positive or Negative? A negative residual means the predicted value is too big, an overestimate. A positive residual shows the model makes an underestimate. These may actually seem backwards at first.
Notation Alert “Putting a hat on it” is standard Statistics notation to indicate that something has been predicted by a model. Whenever you see a hat over a variable name or symbol, you can assume it is the predicted version of that variable or symbol.
The residual value tells us how far the model’s prediction is from the observed value at that point. To find the residuals, we always subtract the predicted values from the observed ones. Our question now is how to find the right line.
The Line of “Best Fit”
When we draw a line through a scatterplot, some residuals are positive, and some are negative. We can’t assess how well the line fits by adding up all the residuals, because the positive and negative ones would just cancel each other out. We need to find the line that’s closest to all the points, and to do that, we need to make all the distances positive. We faced the same issue when we calculated a standard deviation to measure spread. And we deal with it the same way here: by squaring the residuals to make them positive. The sum of all the squared residuals tells us how well the line we drew fits the data: the smaller the sum, the better the fit. A different line will produce a different sum, maybe bigger, maybe smaller. The line of best fit is the line for which the sum of the squared residuals is smallest, often called the least squares line. This line has the special property that the variation of the data around the model, as seen in the residuals, is the smallest it can be for any straight line model for these data. No other line has this property. Speaking mathematically, we say that this line minimizes the sum of the squared residuals. You might think that finding this “least squares line” would be difficult. Surprisingly, it’s not, although it was an exciting mathematical discovery when Legendre published it in 1805.
Interpreting the Intercept Are e-books just “weightless” books? They don’t have the paper, ink, binding, and other physical attributes that make up the weight of a book and are responsible for some of its cost. The typical price of e-books seems to be settling down near $10, so maybe our intercept can be interpreted in that way. But we should be cautious about extrapolating to a prediction that far from the data we have. (See “Let the Ebook Price Wars Begin: Three Ebook Pricing Predictions,” Forbes 12.10.2012 found at www.forbes.com/sites/jeremygreenfield/2012/12/10/let-the-ebook-price-wars-begin-three-ebook-pricing-predictions/.)
4.6
Correlation and the Line
Any straight line can be written as: $y = b_0 + b_1 x$. If we were to plot all the (x, y) pairs that satisfy this equation, they’d fall exactly on a straight line. We’ll use this form for our linear model. Of course, with real data, the points won’t all fall on the line. So, we write our model as $\hat{y} = b_0 + b_1 x$, using $\hat{y}$ for the predicted values, because it’s the predicted values (not the data values) that fall on the line. If the model is a good one, the data values will scatter closely around it. For the Amazon book data, the line is: Price = 10.35 + 0.477 Weight. What does this mean? The slope, 0.477, says that we can expect a book that weighs an ounce more to cost about $0.48 more. Slopes are always expressed in y-units per x-units. They tell you how the response variable changes for a one unit step in the predictor variable. So we’d say that the slope is 0.477 dollars per ounce.
The intercept, 10.35, is the value of the line when the x-variable is zero. Of course a weightless physical book isn’t possible, so we wouldn’t use these data to predict such a price, and we would choose to treat the intercept as just a “starting point” for our model.
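To make the idea of a line of “best fit” concrete, the sketch below fits a least squares line to a few invented points and then nudges the intercept and slope to confirm that every nudge increases the sum of squared residuals. The data are hypothetical, and the slope is computed with the standard closed form (sum of cross products of deviations divided by the sum of squared x deviations).

```python
import statistics

def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = my - b1 * mx          # the line passes through (x-bar, y-bar)
    return b0, b1

def sse(b0, b1, xs, ys):
    """Sum of squared residuals e = y - y-hat for the line (b0, b1)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical book weights (oz) and list prices ($), for illustration only.
weights = [5.0, 8.0, 12.0, 16.0, 20.0, 28.0]
prices = [12.0, 9.0, 18.0, 15.0, 25.0, 22.0]

b0, b1 = least_squares(weights, prices)
best = sse(b0, b1, weights, prices)
print(f"least squares line: price-hat = {b0:.2f} + {b1:.3f} * weight")
print(f"sum of squared residuals: {best:.2f}")

# Any other line does worse: try small changes to intercept and slope.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05)]
worse = [sse(b0 + d0, b1 + d1, weights, prices) for d0, d1 in nudges]
for (d0, d1), value in zip(nudges, worse):
    print(f"  intercept {d0:+.2f}, slope {d1:+.2f}: {value:.2f}")
```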
Just Checking
A scatterplot of sales per month (in thousands of dollars) vs. number of employees for all the outlets of a large computer chain shows a relationship that is straight, with only moderate scatter and no outliers. The correlation between Sales and Employees is 0.85, and the equation of the least squares model is:
Sales = 9.564 + 122.74 Employees
6 What does the slope of 122.74 mean?
7 What are the units of the slope?
8 The outlet in Dallas, Texas, has 10 more employees than the outlet in Cincinnati. How much more Sales do you expect it to have?
How do we find the slope and intercept of the least squares line? The formulas are simple. The model is built from the summary statistics we’ve used before. We’ll need the correlation (to tell us the strength of the linear association), the standard deviations (to give us the units), and the means (to tell us where to locate the line). The slope of the line is computed as:
$$b_1 = r\,\frac{s_y}{s_x}.$$
We’ve already seen that the correlation tells us the sign and the strength of the relationship, so it should be no surprise to see that the slope inherits this sign as well. If the correlation is positive, the scatterplot runs from lower left to upper right, and the slope of the line is positive. Correlations don’t have units, but slopes do. How x and y are measured (what units they have) doesn’t affect their correlation, but does change the slope. The slope gets its units from the ratio of the two standard deviations. Each standard deviation has the units of its respective variable. So, the units of the slope are a ratio, too, and are always expressed in units of y per unit of x.
How do we find the intercept? If you had to predict the y-value for a data point whose x-value was average, what would you say? The best fit line predicts $\bar{y}$ for points whose x-value is $\bar{x}$. Putting that into our equation and using the slope we just found gives: $\bar{y} = b_0 + b_1 \bar{x}$ and we can rearrange the terms to find: $b_0 = \bar{y} - b_1 \bar{x}$.
It’s easy to use the estimated linear model to predict the price of a book from its weight. Consider, for example, Peter Drucker’s book Innovation and Entrepreneurship. According to Amazon, the book weighs 6.4 ounces. Our model predicts its price as: Price = 10.35 + 0.477 × 6.4 = $13.40. In fact, the book’s list price is $16.99. The difference between the observed value and the value predicted by the regression equation is called the residual. For Drucker’s book, the residual is $16.99 − $13.40 = $3.59.
Least squares lines are commonly called regression lines. Although this name is an accident of history (as we’ll soon see), “regression” almost always means “the linear model fit by least squares.” Clearly, regression and correlation are closely related. We’ll need to check the same conditions before computing a regression as we did for correlation:
1. Quantitative Variables Condition
2. Linearity Condition
3. Outlier Condition
A little later in the chapter we’ll add two more.
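The prediction and residual for Drucker’s book can be reproduced directly from the fitted equation. In this short sketch the intercept, slope, weight, and list price come from the text; the function name `predict_price` is our own.

```python
def predict_price(weight_oz, b0=10.35, b1=0.477):
    """Predicted list price ($) from weight (oz) using the Amazon model."""
    return b0 + b1 * weight_oz

observed = 16.99                 # list price of Drucker's book ($)
predicted = predict_price(6.4)   # the book weighs 6.4 ounces
residual = observed - predicted  # observed minus predicted

print(f"predicted price: ${predicted:.2f}")   # $13.40
print(f"residual:        ${residual:.2f}")    # $3.59
```

The residual is positive, so the model underestimates the price of this book.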
For Example
Interpreting the equation of a linear model
Question The data on cyclist accident deaths show a linear pattern. Find and interpret the equation of a linear model for that pattern. Refer to the values given in the answer to the example on page 131.
Answer
$$b_1 = 0.96 \times \frac{3.26}{3.89} = 0.80$$
$$b_0 = 37.85 - 0.80 \times 2004 = -1565.35$$
MeanAge = -1565.35 + 0.80 Year
The mean age of cyclists killed in vehicular accidents has increased by about 0.80 years of age (about 10 months) per year during the years observed by these data.
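The same arithmetic can be written as a short script that builds the line from summary statistics alone. The inputs are the values quoted in the example. Rounding the slope to two decimals before computing the intercept mirrors the hand calculation and is our assumption about how the printed intercept was obtained.

```python
# Summary statistics from the cyclist example.
r = 0.96
x_bar, s_x = 2004, 3.89     # Year
y_bar, s_y = 37.85, 3.26    # MeanAge

b1_unrounded = r * s_y / s_x
b1 = round(b1_unrounded, 2)        # 0.80, as in the hand calculation
b0 = y_bar - b1 * x_bar            # intercept from the rounded slope

print(f"MeanAge = {b0:.2f} + {b1:.2f} Year")   # MeanAge = -1565.35 + 0.80 Year

# Because Year is so far from zero, the intercept is very sensitive to
# rounding in the slope; compare the intercept from the unrounded slope.
b0_unrounded = y_bar - b1_unrounded * x_bar
print(f"intercept with unrounded slope: {b0_unrounded:.2f}")
```

The slope is the quantity with a useful interpretation here. The intercept describes year 0, far outside the data, so it serves only as a starting point for the line.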
Understanding Regression from Correlation
The correlation coefficient has no units. (It is found from z-scores, which have no units.) When we multiply r by the ratio of the standard deviations of the two variables to find the slope, we introduce their units. As a result, $b_1 = r\,\frac{s_y}{s_x}$ is in y-units per x-unit. For the Amazon books, the slope is in dollars per ounce. What happens to the regression equation if we standardize both the predictor and response variables and regress $z_y$ on $z_x$? Both standardized variables have standard deviation = 1 and mean = 0. So, the slope is just r, and the intercept is 0 (because both $\bar{y}$ and $\bar{x}$ are now 0), and we have the simple equation $\hat{z}_y = r z_x$. Although we don’t usually standardize variables for regression, thinking in z-scores is a good way to understand what the regression equation is doing. The equation says that cases that deviate by one standard deviation from the mean in x are predicted to have a value of y that is r standard deviations away from the mean in y. Let’s be more specific. For the Amazon books the correlation is 0.498. So, we know immediately that: $\hat{z}_{Price} = 0.498\, z_{Weight}$. That means that a book that is one standard deviation heavier than the mean book is predicted by our model to cost about half a standard deviation more than the mean book.
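One way to convince yourself of this is to standardize some data and fit the line. The sketch below uses invented weights and prices (not the Amazon sample) and our own helper functions. It checks that the least squares slope for the z-scores equals r and that the intercept is 0.

```python
import math
import statistics

def z_scores(values):
    """Standardize using the mean and the sample standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def pearson_r(xs, ys):
    """Correlation coefficient from sums of deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
weights = [5.0, 8.0, 12.0, 16.0, 20.0, 28.0]
prices = [12.0, 9.0, 18.0, 15.0, 25.0, 22.0]

r = pearson_r(weights, prices)
intercept_z, slope_z = least_squares(z_scores(weights), z_scores(prices))

print(f"r = {r:.4f}")
print(f"standardized slope = {slope_z:.4f}, intercept = {intercept_z:.4f}")
```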
Sir Francis Galton was the first to speak of “regression,” although others had fit lines to data by the same method.
The First Regression Sir Francis Galton related the heights of sons to the heights of their fathers with a regression line. The slope of his line was less than 1. That is, sons of tall fathers were tall, but not as much above the average height as their fathers had been above their mean. Sons of short fathers were short, but generally not as far from their mean as their fathers. Galton interpreted the slope correctly as indicating a “regression” toward the mean height, and “regression” stuck as a description of the method he had used to find the line.
Harold Hotelling was a prominent statistician and economic theorist. He taught statistics to (among others) Nobel prize winners Milton Friedman and Kenneth Arrow.
4.7
Regression to the Mean
Suppose you were told of a book and, without any additional information, you were asked to guess its price. What would be your guess? A good guess would be the mean price of books. Now suppose you are also told that this book has an ISBN (International Standard Book Number) that is 2 standard deviations (SDs) above the mean ISBN. Would that change your guess? Probably not. The correlation between ISBN and Price is near 0, so knowing the ISBN doesn’t tell you anything and doesn’t move your guess. (And the standardized regression equation, $\hat{z}_y = r z_x$, tells us that as well, since it says that we should move 0 × 2 SDs from the mean.) On the other hand, if you were told that, measured in euros, the book’s price was 2 SDs above the mean, you’d know the price in dollars. There’s a perfect correlation between Price in dollars and Price in euros (r = 1), so you know it’s 2 SDs above mean Price in dollars as well. What if you were told that the book was 2 SDs above the mean in number of pages? Would you still guess that its list price is average? You might guess that it costs more than average, since there’s a positive correlation between Price and number of pages. But would you guess 2 SDs above the mean? When there was no correlation, we didn’t move away from the mean at all. With a perfect correlation, we moved our guess the full 2 SDs. Any correlation between these extremes should lead us to move somewhere between 0 and 2 SDs above the mean. (To be exact, our best guess would be to move r × 2 standard deviations away from the mean.) Notice that if x is 2 SDs above its mean, we won’t ever move more than 2 SDs away for y, since r can’t be bigger than 1.0. So each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. This property of the linear model is called regression to the mean. This is why the line is called the regression line.
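The three guesses above follow from the standardized equation. In this small sketch the correlations of 0 (ISBN) and 1 (price in euros) come from the text, while the value 0.5 for number of pages is an assumed figure used only for illustration.

```python
def predicted_z_y(r, z_x):
    """Standardized regression prediction: z-hat_y = r * z_x."""
    return r * z_x

z_x = 2.0  # the book is 2 SDs above the mean on the predictor

cases = [
    ("ISBN", 0.0),                       # no correlation with Price
    ("number of pages (assumed r)", 0.5),
    ("Price in euros", 1.0),             # perfect correlation
]
for label, r in cases:
    print(f"{label}: predicted Price is {predicted_z_y(r, z_x):.1f} SDs above its mean")
```

Because r can never exceed 1 in size, the prediction never sits farther from the mean (in SDs) than the predictor value did.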
Business Tales about Regression to the Mean During the Great Depression, Northwestern University professor Horace Secrist followed the fortunes of 49 department stores, measuring the ratio of net profit or loss to net sales. He divided the stores into four groups, took the average performance of each group, and followed those averages from 1920 to 1930. He found that stores that had been above average tended to perform worse, while those that were below average tended to improve. A careful scientist, he then examined other types of business. All 73 of the different industries he examined showed the same pattern! He solicited comments and criticisms from economists and statisticians. He then published a book entitled Triumph of Mediocrity in Business (Secrist 1933) announcing his discovery. Initial reviews were favorable, but the prominent statistician Harold Hotelling pointed out in a scathing review that Secrist had simply rediscovered regression to the mean, which is a mathematical certainty and not a new principle of economics or business. Although Secrist wrote 13 books on economics and statistics and was director of Northwestern University’s bureau of business research, his name is still largely associated with the fallacious interpretation of regression to the mean. More recently, the psychologist Daniel Kahneman, winner of the 2002 Nobel Prize in Economics, explained (see his book Thinking, Fast and Slow) that regression to the mean can make managers believe that praising employees who do well doesn’t work, but punishing them when they do badly does. Employees who score a success are likely to do a bit less well the next time (just due to random fluctuation), whether they are praised or not. Those who make a big error are likely to do better next time, whether they were chastised or not. 
But managers like to believe that their management actions have consequences and will therefore conclude that praise causes workers to do worse, but criticism encourages workers to improve.
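A simulation in the spirit of Secrist’s study shows the effect appearing with no economics involved. This is a sketch under stated assumptions, not his data: each simulated store has a fixed underlying “skill” plus independent random luck in each of two periods, and the sample size, distributions, and seed are arbitrary.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

n = 2000
skill = [random.gauss(0, 1) for _ in range(n)]
# Performance in each period = lasting skill + that period's luck.
period1 = [s + random.gauss(0, 1) for s in skill]
period2 = [s + random.gauss(0, 1) for s in skill]

# Group stores by their period-1 performance.
order = sorted(range(n), key=lambda i: period1[i])
bottom = order[: n // 4]      # worst quarter in period 1
top = order[-(n // 4):]       # best quarter in period 1

top1 = statistics.mean(period1[i] for i in top)
top2 = statistics.mean(period2[i] for i in top)
bottom1 = statistics.mean(period1[i] for i in bottom)
bottom2 = statistics.mean(period2[i] for i in bottom)

print(f"top quarter:    period 1 mean {top1:.2f}, period 2 mean {top2:.2f}")
print(f"bottom quarter: period 1 mean {bottom1:.2f}, period 2 mean {bottom2:.2f}")
```

The best performers fall back toward the average and the worst improve, even though nothing about any store changed between the periods. Praise, punishment, or “mediocrity” need not be involved.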
math box Equation of the line of best fit Where does the equation of the line of best fit come from? To write the equation of any line, we need to know a point on the line and the slope. It’s logical to expect that an average x will correspond to an average y, and, in fact, the line does pass through the point 1x, y2. (This is not hard to show as well.) To think about the slope, we look once again at the z-scores. We need to remember a few things. 1. The mean of any set of z-scores is 0. This tells us that the line that best fits the z-scores passes through the origin (0, 0). 2. The standard deviation of a set of z-scores is 1, so the variance is also 1. g 1zy - zy 2 2 g 1zy - 02 2 g z2y This means that = = = 1, a fact that n - 1 n - 1 n - 1 will be important soon. Σzxzy 3. The correlation is r = , also important soon. n - 1 Remember that our objective is to find the slope of the best fit line. Because it passes through the origin, the equation of the best fit line willbe of the form zny = mzx. We want to find the value for m that will minimize the sum of the squared errors. Actually we’ll divide that sum by n - 1 and minimize this mean squared error (MSE). Here goes: Minimize:
Who Was First? One of history’s most famous disputes of authorship was between Gauss and Legendre over the method of “least squares.” Legendre was the first to publish the solution to finding the best fit line through data in 1805, at which time Gauss claimed to have known it for years. There is some evidence that, in fact, Gauss may have been right, but he hadn’t bothered to publish it, and had been unable to communicate its importance to other scientists.5 Gauss later referred to the solution as “our method” (principium nostrum), which certainly didn’t help his relationship with Legendre.
\[ \text{MSE} = \frac{\sum (z_y - \hat{z}_y)^2}{n - 1} \]
Since \(\hat{z}_y = m z_x\):
\[ = \frac{\sum (z_y - m z_x)^2}{n - 1} \]
Square the binomial:
\[ = \frac{\sum (z_y^2 - 2 m z_x z_y + m^2 z_x^2)}{n - 1} \]
Rewrite the summation:
\[ = \frac{\sum z_y^2}{n - 1} - 2m \frac{\sum z_x z_y}{n - 1} + m^2 \frac{\sum z_x^2}{n - 1} \]
4. Substitute from (2) and (3):
\[ = 1 - 2mr + m^2 \]
This last expression is a quadratic. A parabola in the form \(y = ax^2 + bx + c\) reaches its minimum at its turning point, which occurs when \(x = \frac{-b}{2a}\). We can minimize the mean of squared errors by choosing
\[ m = \frac{-(-2r)}{2(1)} = r. \]
The slope of the best fit line for z-scores is the correlation, r. This fact leads us immediately to two important additional results: A slope with value r for z-scores means that a difference of 1 standard deviation in \(z_x\) corresponds to a difference of r standard deviations in \(\hat{z}_y\).
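The result of the derivation can be checked numerically. The following is a minimal sketch, not part of the original text: the data values are made up, and the helper names `zscores` and `mse` are hypothetical. It standardizes two variables, searches a grid of candidate slopes, and compares the best one with r.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize with the sample standard deviation (n - 1 denominator)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def mse(m, zx, zy):
    """Mean squared error of the line z_y-hat = m * z_x, dividing by n - 1."""
    return sum((b - m * a) ** 2 for a, b in zip(zx, zy)) / (len(zx) - 1)

# Made-up illustrative data (an assumption, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

zx, zy = zscores(x), zscores(y)
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)   # fact (3) above

# Try candidate slopes from -1.5 to 1.5 in steps of 0.001 and keep the best.
candidates = [k / 1000 for k in range(-1500, 1501)]
best_m = min(candidates, key=lambda m: mse(m, zx, zy))

print(f"r = {r:.4f}, best slope on the grid = {best_m:.3f}")
print(f"MSE at m = r: {mse(r, zx, zy):.6f}, 1 - r^2 = {1 - r * r:.6f}")
```

The grid search is deliberately naive. It uses no calculus, so its agreement with r is an independent check on the algebra rather than a restatement of it.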
5 Stigler, Steven M., “Gauss and the Invention of Least Squares,” Annals of Statistics, 9, (3), 1981, pp. 465–474.
Why r for Correlation? In his original paper on correlation, Galton used r for the “index of correlation”—what we now call the correlation coefficient. He calculated it from the regression of y on x or of x on y after standardizing the variables, just as we have done. It’s fairly clear from the text that he used r to stand for (standardized) regression.
4.8
Assumptions and Conditions Most models are useful only when specific assumptions are true. Of course, assumptions are hard, often impossible, to check. That’s why we assume them. But we should check to see whether the assumptions are reasonable. Fortunately, there are often conditions that we can check. Checking the conditions provides information about whether the assumptions are reasonable, and whether it’s safe to proceed with the model.
Check the Scatterplot! Check the scatterplot. The shape must be linear, or you can’t use regression for the variables in their current form. And watch out for outliers.
Translate that back to the original x and y values: “Over one standard deviation in x, up r standard deviations in \(\hat{y}\).” The slope of the regression line is \( b = \frac{r s_y}{s_x} \). We know choosing m = r minimizes the sum of the squared errors (SSE), but how small does that sum get? Equation (4) told us that the mean of the squared errors is \(1 - 2mr + m^2\). When m = r, \(1 - 2mr + m^2 = 1 - 2r^2 + r^2 = 1 - r^2\). This is the percentage of variability not explained by the regression line. Since \(1 - r^2\) of the variability is not explained, the percentage of variability in y that is explained by x is \(r^2\). This important fact will help us assess the strength of our models. And there’s still another bonus. Because \(r^2\) is the percent of variability explained by our model, \(r^2\) is at most 100%. If \(r^2 \le 1\), then \(-1 \le r \le 1\), proving that correlations are always between −1 and +1.
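As a rough illustration of these two results, the sketch below computes the slope as r times the ratio of standard deviations and then checks that the fraction of variability left in the residuals equals 1 − r². The data are made up and the variable names are hypothetical; only the formulas come from the math box.

```python
from statistics import mean, stdev

# Made-up illustrative data (an assumption, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]
n = len(x)
x_bar, y_bar, s_x, s_y = mean(x), mean(y), stdev(x), stdev(y)

# Correlation as the sum of z-score products divided by n - 1.
r = sum((a - x_bar) / s_x * (b - y_bar) / s_y for a, b in zip(x, y)) / (n - 1)

b1 = r * s_y / s_x          # slope: over one SD in x, up r SDs in y-hat
b0 = y_bar - b1 * x_bar     # the line passes through (x-bar, y-bar)

# Fraction of the variability in y that is left in the residuals.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)
unexplained = sse / sst

print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
print(f"unexplained = {unexplained:.4f}, 1 - r^2 = {1 - r * r:.4f}")
```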
Checking the Model The linear regression model may be the most widely used model in all of Statistics. It has everything we could want in a model: two easily estimated parameters, a meaningful measure of how well the model fits the data, and the ability to predict new values. It even provides a self-check in plots of the residuals to help us avoid all kinds of mistakes. For the linear model, we start by checking the same conditions we checked earlier in this chapter for using correlation. Linear models only make sense for quantitative data. The Quantitative Data Condition is pretty easy to check, but don’t be fooled by categorical data recorded as numbers. You probably don’t want to predict zip codes from credit card account numbers. The regression model assumes that the relationship between the variables is, in fact, linear—the Linearity Assumption. If you try to model a curved relationship with a straight line, you’ll usually get what you deserve. We can’t ever verify that the underlying relationship between two variables is truly linear, but an examination of the scatterplot will let you check the Linearity Condition as we did for correlation. If you don’t judge the scatterplot to be straight enough, stop. You can’t use a linear model for just any two variables, even if they are related. The two variables must have a linear association, or the model won’t mean a thing and decisions you base on the model may be wrong. Some nonlinear relationships can be saved by re-expressing—or transforming—the data to make the scatterplot more linear. (See Section 4.11.) Watch out for outliers. The linearity assumption also requires that no points lie far enough away to distort the line of best fit. Check the Outlier Condition to make sure no point needs special attention. Outlying values may have large residuals, and squaring makes their influence that much greater. Outlying points can dramatically change a regression model. 
Unusual observations can even change the sign of the slope, misleading us about the direction of the underlying relationship between the variables. Another assumption that is usually made when fitting a linear regression is that the residuals are independent of each other. We don’t strictly need this assumption to fit the line, but we will need it to draw conclusions beyond the data. We’ll come back to it when we discuss inference. We can’t be sure that the Independence
Assumption is true, but we are more willing to believe that the cases are independent if the cases are a random sample from the population. We can also check displays of the regression residuals for evidence of patterns, trends, or clumping, any of which would suggest a failure of independence. In the special case when we have a time series, a common violation of the Independence Assumption is for successive errors to be correlated with each other (autocorrelation). The error our model makes today may be similar to the one it made yesterday. We can check this violation by plotting the residuals against time and looking for patterns. When our goal is just to explore and describe the relationship, independence isn’t essential. However, when we want to go beyond the data at hand and make inferences for other situations (in Chapter 15) this will be a crucial assumption, so it’s good practice to think about it even now, especially for time series.
We always check conditions with a scatterplot of the data, but we can learn even more after we’ve fit the regression model. There’s extra information in the residuals that we can use to help us decide how reasonable our model is and how well the model fits. So, we plot the residuals and check the conditions again. The residuals are the part of the data that hasn’t been modeled. We can write
Data = Predicted + Residual
or, equivalently,
Residual = Data − Predicted.
Or, as we showed earlier, in symbols: \(e = y - \hat{y}\).
Why e for Residual? The easy answer is that r is already taken for correlation, but the truth is that e stands for “error.” It’s not that the data point is a mistake but that statisticians often refer to variability not explained by a model as error.
A scatterplot of the residuals versus the x-values should be a plot without patterns. It shouldn’t have any interesting features: no direction, no shape. It should stretch horizontally, showing no bends, and it should have no outliers. If you see nonlinearities, outliers, or clusters in the residuals, find out what the regression model missed. Let’s examine the residuals from our regression of Amazon book prices on weight.⁶
Figure 4.5 Residuals of the regression model predicting Amazon book prices from weights. [Scatterplot: Residuals vs. Predicted]
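The decomposition Data = Predicted + Residual can be sketched in a few lines. The weights and prices below are hypothetical stand-ins, not the Amazon data behind Figure 4.5, and `fit_line` is an illustrative helper rather than any package’s function. The sketch also shows why a residual plot from a least squares fit has no overall direction: the residuals average to zero and have no linear association with x.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    return y_bar - b1 * x_bar, b1

# Hypothetical weights and prices (assumed values, not the book's data).
weight = [0.8, 1.1, 1.5, 1.9, 2.4, 2.6, 3.0, 3.3]
price = [9.5, 13.0, 12.2, 17.8, 19.0, 24.5, 22.1, 27.0]

b0, b1 = fit_line(weight, price)
predicted = [b0 + b1 * w for w in weight]
residuals = [p - p_hat for p, p_hat in zip(price, predicted)]   # e = y - y-hat

mean_resid = mean(residuals)
# Sum of (x - x-bar) * e; least squares forces this to zero, so any pattern
# left in a residual plot must be a bend, an outlier, or a change in spread.
w_bar = mean(weight)
cross = sum((w - w_bar) * e for w, e in zip(weight, residuals))

print(f"mean residual = {mean_resid:.2e}, cross sum = {cross:.2e}")
for p_hat, e in zip(predicted, residuals):
    print(f"predicted {p_hat:6.2f}   residual {e:6.2f}")
```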
6 Most computer statistics packages plot the residuals as we did in Figure 4.5, against the predicted values, rather than against x. When the slope is positive, the scatterplots are virtually identical except for the axes labels. When the slope is negative, the two versions are mirror images. Since all we care about is the patterns (or, better, lack of patterns) in the plot, either plot is useful.
Not only can the residuals help check the conditions, but they can also tell us how well the model performs. The better the model fits the data, the less the residuals will vary around the line. The standard deviation of the residuals, \(s_e\), gives us a measure of how much the points spread around the regression line. Of course, for this summary to make sense, the residuals should all share the same underlying spread. So we must assume that the standard deviation around the line is the same wherever we want the model to apply. This new assumption about the standard deviation around the line gives us a new condition, called the Equal Spread Condition. The associated question to ask is: Does the plot have a consistent spread, or does it fan out? We check to make sure that the spread of the residuals is about the same everywhere. We can check that either in the original scatterplot of y against x or in the scatterplot of residuals (or, preferably, in both plots).
Equal Spread Condition This condition requires that the scatter is about equal for all x-values. It’s often checked using a plot of residuals against predicted values. The underlying assumption of equal variance is also called homoscedasticity.
We estimate the standard deviation of the residuals in almost the way you’d expect:
\[ s_e = \sqrt{\frac{\sum e^2}{n - 2}} \]
We don’t need to subtract the mean of the residuals because \(\bar{e} = 0\). Why divide by n − 2 rather than n − 1? We used n − 1 for s when we estimated the mean. Now we’re estimating both a slope and an intercept. Looks like a pattern, and it is. We subtract one more for each parameter we estimate. When we predicted the price of Peter Drucker’s book Innovation and Entrepreneurship our model made an error of $3.59. The standard deviation of the errors is \(s_e\) = $5.49, so that’s a fairly typical size for a residual because it’s within one standard deviation.
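A small worked sketch of the formula for the standard deviation of the residuals follows. The four data points are made up so the arithmetic can be followed by hand, and `residual_sd` is a hypothetical helper name. The last lines use the two dollar figures quoted in the text.

```python
from math import sqrt
from statistics import mean

def residual_sd(x, y):
    """s_e = sqrt(sum(e^2) / (n - 2)) for the least squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    b0 = y_bar - b1 * x_bar
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    # No mean is subtracted: least squares residuals already average to zero.
    return sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

# Made-up data: the fitted line is y-hat = -0.5 + 2.2x, the residuals are
# 0.3, 0.1, -1.1, 0.7, and their squares sum to 1.8.
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]
s_e = residual_sd(x, y)
print(f"s_e = {s_e:.4f}")            # sqrt(1.8 / 2)

# From the text: an error of $3.59 against s_e = $5.49.
ratio = 3.59 / 5.49
print(f"error is {ratio:.2f} residual standard deviations")
```

Because the ratio is below 1, the Drucker residual is within one standard deviation, which is what the text means by a fairly typical size.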
For Example
Examining the residuals
Here is a scatterplot of the residuals for the linear model found in the example on page 139 plotted against the predicted values:
[Scatterplot: Residuals vs. Predicted]
Question Show how the plotted values were calculated. What does the plot suggest about the model?
Answer The predicted values are the values of MeanAge found for each year by substituting the year value in the linear model. The residuals are the differences between the actual mean ages and the predicted values for each year. The plot shows some remaining pattern in the form of four possible, nearly parallel, lower left to upper right trends. A further analysis may want to determine the reason for this pattern.
4.9
Variation in the Model and R² How is the thickness of a book related to the number of pages? We certainly expect books with more pages to be thicker, but can we find a model to relate these two variables? Here’s a scatterplot of the data:
Figure 4.6 Naturally enough, the thickness of a book grows with the number of pages. [Scatterplot: Thickness vs. # Pages]
The regression model is
Thickness = 0.28 + 0.00189 Pages
which says that books start out about 0.28 inches thick (due to covers and binding, perhaps) and then have about 0.00189 inches per page of thickness. We can see that the residuals vary less than the original thickness values did. That shows that we can get a better prediction of the thickness of a book by using a model rather than just using the mean to estimate the thickness of all the books.
Figure 4.7 Thickness and the regression residuals compared. The thickness values have their mean subtracted for the comparison. The smaller variation of the residuals shows the success of the regression model. [Plot comparing Thickness and Residuals on a common scale]
If the linear model were perfect, the residuals would all be zero and would have a standard deviation of 0. If knowing about the number of pages gave us no information about the thickness of a book, then we’d just use the mean thickness and not bother with the regression. We can construct a measure that tells us where our model falls between being perfect and being useless. One measure we could use is the correlation between the data y and the predicted values \(\hat{y}\). In a perfect regression model, the predictions would match the observed values and the correlation would be 1.0. In the worthless regression, we’d expect a correlation of 0. All regression models fall somewhere between the two extremes of zero correlation and perfect correlation. We’d like to gauge where our model falls. But a regression model with correlation +0.5 is doing as well as one with correlation −0.5. They just have different directions. If we square the correlation coefficient, we’ll get a value between 0 and 1, and the direction won’t matter. But that’s not the real reason for squaring the correlation. In fact, the squared correlation, r², gives the fraction of the data’s variation accounted for by the model, and 1 − r² is the fraction of the original variation left in the residuals. For the thickness and pages model, r² = 0.822² = 0.676, and 1 − r² is 0.324, so 32.4% of the variability in Thickness has been left in the residuals.
All regression analyses include this statistic, although by tradition, it is written with a capital letter, R², and pronounced “R-squared.” Because R² is a fraction of a whole, it is often given as a percentage.⁷ An R² of 0% means that none of the variance in the data is in the model; all of it is still in the residuals. An R² of 100% sounds great, but means you’ve probably made a mistake.⁸ We can see how R² relates to the variance. The variance of Thickness is 0.128. The variance of the residuals is 0.0415.⁹ That’s 0.0415/0.128 = 0.324, or 32.4% of the variance of Thickness left behind in the residuals. So 100% − 32.4% = 67.6% is the R² of the regression. The appropriate way to report this as part of a regression analysis is to say that 67.6% of the variance is accounted for by the regression.
r and R² Is a correlation of 0.80 twice as strong as a correlation of 0.40? Not if you think in terms of R². A correlation of 0.80 means an R² of 0.80² = 64%. A correlation of 0.40 means an R² of 0.40² = 16%, only a quarter as much of the variability accounted for. A correlation of 0.80 gives an R² four times as strong as a correlation of 0.40 and accounts for four times as much of the variability.
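The arithmetic in this section is easy to reproduce. The sketch below uses only the numbers quoted in the text (r = 0.822, a Thickness variance of 0.128, and a residual variance of 0.0415); the variable names are illustrative.

```python
# Values reported in the text for the Thickness vs. Pages regression.
r = 0.822
var_thickness = 0.128
var_residuals = 0.0415

r_squared = r ** 2                                   # fraction accounted for
left_in_residuals = var_residuals / var_thickness    # fraction left behind

print(f"r^2 = {r_squared:.3f}")
print(f"fraction of variance left in residuals = {left_in_residuals:.3f}")
print(f"R^2 from the variances = {1 - left_in_residuals:.3f}")

# Sidebar comparison: doubling the correlation quadruples R^2.
ratio = (0.80 ** 2) / (0.40 ** 2)
print(f"R^2 ratio for r = 0.80 vs. r = 0.40: {ratio:.1f}")
```

Both routes, squaring the correlation and comparing the two variances, land on the same 67.6% after rounding to three decimal places.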
Just Checking
Let’s go back to our regression of sales ($000) on number of employees again.
Sales = 9.564 + 122.74 Employees
The R² value is reported as 72.25%.
9 What does the R² value mean about the relationship of Sales and Employees?
10 Is the correlation of Sales and Employees positive or negative? How do you know?
11 If we measured the Sales in thousands of euros instead of thousands of dollars, would the R² value change? How about the slope?
Some Extreme Tales One major company developed a method to differentiate between proteins. To do so, they had to distinguish between regressions with R² of 99.99% and 99.98%. For this application, 99.98% was not high enough. The president of a financial services company reports that although his regressions give R² below 2%, they are highly successful because those used by his competition are even lower.
How Big Should R² Be? The value of R² is always between 0% and 100%. But what is a “good” R² value? The answer depends on the kind of data you are analyzing and on what you want to do with it. Just as with correlation, there is no value for R² that automatically determines that the regression is “good.” Data from scientific experiments often have R² in the 80% to 90% range and even higher. Data from observational studies and surveys, though, often show relatively weak associations because it’s so difficult to measure reliable responses. An R² of 30% to 50% or even lower might be taken as evidence of a useful regression. The standard deviation of the residuals can give us more information about the usefulness of the regression by telling us how much scatter there is around the line.
7 By contrast, we usually give correlation coefficients as decimal values between −1.0 and 1.0.
8 Well, actually, it means that you have a perfect fit. But perfect models don’t happen with real data unless you accidentally try to predict a variable from itself.
9 This isn’t quite the same as squaring \(s_e\), which we discussed previously, but it’s very close.
For Example
Understanding R²
Question Find and interpret the R² for the regression of cyclist death ages vs. time found in the example on page 128. (Hint: The calculation is a simple one.)
Answer We are given the correlation, r = 0.96. R² is the square of this, or 0.92. It tells us that 92% of the variation in the mean age of cyclist deaths can be accounted for by the trend of increasing age over time.
As we’ve seen, an R² of 100% is a perfect fit, with no scatter around the line. The \(s_e\) would be zero. All of the variance would be accounted for by the model with none left in the residuals. This sounds great, but it’s too good to be true for real data.¹⁰
4.10
Reality Check: Is the Regression Reasonable? Statistics don’t come out of nowhere. They are based on data. The results of a statistical analysis should reinforce common sense. If the results are surprising, then either you’ve learned something new about the world or your analysis is wrong. Whenever you perform a regression, think about the coefficients and ask whether they make sense. Is the slope reasonable? Does the direction of the slope seem right? The small effort of asking whether the regression equation is plausible will be repaid whenever you catch errors or avoid saying something silly or absurd about the data. It’s too easy to take something that comes out of a computer at face value and assume that it makes sense. Always be skeptical and ask yourself if the answer is reasonable.
Guided Example
Home Size and Price Real estate agents know the three most important factors in determining the price of a house are location, location, and location. But what other factors help determine the price at which a house should be listed? Number of bathrooms? Size of the yard? A student amassed publicly available data on thousands of homes in upstate New York. We’ve drawn a random sample of 1000 homes from that larger dataset to examine house pricing. Among the variables she collected were the total living area (in square feet), number of bathrooms, number of bedrooms, size of lot (in acres), and age of house (in years). We will investigate how well the size of the house, as measured by living area, can predict the selling price.
Plan
Setup State the objective of the study. Identify the variables and their context.
We want to find out how well the living area of a house in upstate NY can predict its selling price. We have two quantitative variables: the living area (in square feet) and the selling price ($). These data come from public records in upstate New York in 2006.
10 If you see an R² of 100%, it’s a good idea to investigate what happened. You may have accidentally regressed two variables that measure the same thing.
Model We need to check the same conditions for regression as we did for correlation. To do that, make a picture. Never fit a regression without looking at the scatterplot first. Check the Linearity, Equal Spread, and Outlier Conditions.
✓ Quantitative Variables Condition
[Scatterplot: Price ($000) vs. Living Area]
✓ Linearity Condition The scatterplot shows two variables that appear to have a fairly strong positive association. The plot appears to be fairly linear.
✓ Equal Spread Condition The scatterplot shows a consistent spread across the x-values.
✓ Outlier Condition There appear to be a few possible outliers, especially among large, relatively expensive houses. A few smaller houses are expensive for their size. We will check their influence on the model later.
We have two quantitative variables that appear to satisfy the conditions, so we will model this relationship with a regression line.
Do
Mechanics Find the equation of the regression line using a statistics package. Remember to write the equation of the model using meaningful variable names.
Our software produces the following output.
Dependent variable is: Price
1000 total cases
R squared = 62.43%
s = 57930 with 1000 - 2 = 998 df
Variable       Coefficient
Intercept      6378.08
Living Area    115.13
Once you have the model, plot the residuals and check the Equal Spread Condition again.
[Scatterplot: Residuals ($000) vs. Predicted ($000)]
The residual plot appears generally patternless. The few relatively expensive small houses are evident, but setting them aside and refitting the model did not change either the slope or intercept very much so we left them in. There is a slight tendency for cheaper houses to have less variation, but the spread is roughly the same throughout.
Report
Conclusion Interpret what you have found in the proper context.
Memo
Re: Report on housing prices
We examined how well the size of a house could predict its selling price. Data were obtained from recent sales of 1000 homes in upstate New York. The model is:
Price = $6378.08 + 115.13 × Living Area
In other words, from a base of $6378.08, houses cost about $115.13 per square foot in upstate NY. This model appears reasonable from both a statistical and real estate perspective. Although we know that size is not the only factor in pricing a house, the model accounts for 62.4% of the variation in selling price. As a reality check, we checked with several real estate pricing sites (www.realestateabc.com, www.zillow.com) and found that houses in this region were averaging $100 to $150 per square foot, so our model is plausible. Of course, not all house prices are predicted well by the model. We computed the model without several of these houses, but their impact on the regression model was small. We believe that this is a reasonable place to start to assess whether a house is priced correctly for this market. Future analysis might benefit by considering other factors.
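To show how the memo’s model might be used, here is a minimal sketch that plugs a living area into the fitted equation. The coefficients and the residual standard deviation are taken from the software output above; the 2,000 square foot house and its $250,000 price are hypothetical, and the function names are illustrative.

```python
INTERCEPT = 6378.08      # dollars, from the regression output
SLOPE = 115.13           # dollars per square foot of living area
S_E = 57930              # residual standard deviation (s) from the output

def predicted_price(living_area_sqft):
    """Predicted selling price in dollars from the fitted model."""
    return INTERCEPT + SLOPE * living_area_sqft

def residual(actual_price, living_area_sqft):
    """Actual minus predicted price; positive means priced above the model."""
    return actual_price - predicted_price(living_area_sqft)

# Hypothetical house (assumed values): 2,000 square feet, priced at $250,000.
pred = predicted_price(2000)
resid = residual(250_000, 2000)
print(f"predicted price: ${pred:,.2f}")
print(f"residual: ${resid:,.2f} ({resid / S_E:.2f} residual SDs)")
```

A residual that is a small fraction of s, as here, suggests the asking price is unremarkable for a house of that size. A residual of two or more standard deviations would be worth a closer look.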
4.11
Nonlinear Relationships Everything we’ve discussed in this chapter requires that the underlying relationship between two variables be linear. But what should we do when the relationship is nonlinear and we can’t use the correlation coefficient or a linear model? There are three basic approaches, each with its advantages and disadvantages. Let’s consider an example. The Human Development Index (HDI) was developed by the United Nations as a general measure of quality of life in countries around the world. It combines economic information (GDP), life expectancy, and education. The growth of cell phone usage has been phenomenal worldwide. Is cell phone usage related to the developmental state of a country? Figure 4.8 shows a scatterplot of number of Cell Phones vs. HDI for 152 countries of the world. We can look at the scatterplot and see that cell phone usage increases with increasing HDI. But the relationship is not straight. In Figure 4.8, we can easily see the bend in the form. But that doesn’t help us summarize or model the relationship.
Figure 4.8 The scatterplot of number of Cell Phones (000s) vs. HDI for countries shows a bent relationship not suitable for correlation or regression.
You might think that we should just fit some curved function such as an exponential or quadratic to a shape like this. But using curved functions is complicated, and the resulting model can be difficult to interpret. And many of the convenient associated statistics (which we’ll see in Chapters 17 and 18) are not appropriate for such models. So this approach isn’t often used.
Another approach allows us to summarize the strength of the association between the variables even when we don’t have a linear relationship. The Spearman rank correlation11 works with the ranks of the data rather than their values. To find the ranks we simply count from the lowest value to the highest so that rank 1 is assigned to the lowest value, rank 2 to the next lowest, and so on. Using ranks for both variables generally straightens out the relationship, as Figure 4.9 shows.
Figure 4.9 Plotting the ranks results in a plot with a straight relationship. [Scatterplot: Rank of Cell Phone vs. Rank of HDI]
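The ranking procedure can be sketched as follows. This is an illustration under stated assumptions: the data are made up (not the 152-country dataset), the function names are hypothetical, and tied values are given the average of the ranks they occupy. That tie rule is a common convention which the text does not discuss.

```python
from statistics import mean, stdev

def ranks(values):
    """Rank 1 for the lowest value, 2 for the next lowest, and so on.
    Tied values share the average of the ranks they occupy (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Ordinary correlation: sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

def spearman(x, y):
    """Spearman rank correlation: the ordinary correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data with a curved but steadily increasing relationship.
hdi = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
phones = [3, 8, 20, 60, 200, 900]

print(f"correlation of the values: {pearson(hdi, phones):.3f}")
print(f"correlation of the ranks:  {spearman(hdi, phones):.3f}")
```

Because the made-up relationship never decreases, the ranks line up perfectly and the rank correlation is 1, while the ordinary correlation of the raw values is lower because of the bend.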
Now we can calculate a correlation on the ranks. The resulting correlation¹² summarizes the degree of relationship between two variables, but not, of course, the degree of linear relationship. The Spearman correlation for these variables is 0.876. That says there’s a reasonably strong relationship between Cell Phones and HDI. We don’t usually fit a linear model to the ranks because that would be difficult to interpret and because the supporting statistics (as we said, see Chapters 17 and 18) wouldn’t be appropriate.
A third approach to a nonlinear relationship is to transform or re-express one or both of the variables by a function such as the square root, logarithm, or reciprocal. We saw in Chapter 3 that a transformation can improve the symmetry of the distribution of a single variable. In the same way, and often with the same transforming function, transformations can make a relationship more nearly linear. Figure 4.10, for example, shows the relationship between the log of the number of Cell Phones and the HDI for the same countries.
Figure 4.10 Taking the logarithm of Cell Phones results in a more nearly linear relationship. [Scatterplot: Log Cell Phone vs. HDI]
11 Due to Charles Spearman, a psychologist who did pioneering work in intelligence testing.
12 Spearman rank correlation is a nonparametric statistical method. You’ll find other nonparametric methods in Chapter 22.
The advantage of re-expressing variables is that we can use regression models, along with all the supporting statistics still to come. The disadvantage is that we must interpret our results in terms of the re-expressed data, and it can be difficult to explain what we mean by the logarithm of the number of cell phones in a country. We can, of course, reverse the transformation to transform a predicted value or residual back to the original units. (In the case of a logarithmic transformation, calculate \(10^y\) to get back to the original units.) Which approach you choose is likely to depend on the situation and your needs. Statisticians, economists, and scientists generally prefer to transform their data, and many of their laws and theories include transforming functions.¹³ But for just understanding the shape of a relationship, a scatterplot does a fine job, and as a summary of the strength of a relationship, a Spearman correlation is a good general-purpose tool.
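The re-expression approach, including the step back to the original units, might look like the sketch below. The HDI and cell phone values are invented for illustration, and `fit_line` and `predict_phones` are hypothetical names. The only ideas taken from the text are fitting a line to the base-10 logarithm and computing 10 to the power of the prediction to undo it.

```python
from math import log10
from statistics import mean

def fit_line(x, y):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    return y_bar - b1 * x_bar, b1

# Made-up values (an assumption): cell phones in thousands, growing
# roughly tenfold for every 0.2 increase in HDI.
hdi = [0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
phones = [10, 30, 100, 300, 1000, 3000]

# Fit the line to log10(phones), not to phones.
log_phones = [log10(p) for p in phones]
b0, b1 = fit_line(hdi, log_phones)

def predict_phones(h):
    """Predict on the log scale, then undo the log with 10 ** y."""
    return 10 ** (b0 + b1 * h)

print(f"log10(Cell Phones) = {b0:.3f} + {b1:.3f} HDI")
print(f"predicted phones (000s) at HDI 0.70: {predict_phones(0.70):.0f}")
```

One side effect of modeling the logarithm is that back-transformed predictions are always positive, which suits a count such as the number of phones.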
For Example
Re-expressing for linearity
Consider the relationship between a company’s Assets and its Sales as reported in annual financial statements. Here’s a scatterplot of those variables for 79 of the largest companies:
[Scatterplot: Sales vs. Assets]
The Pearson correlation is 0.746, and the Spearman rank correlation is 0.50. Taking the logarithm of both variables produces the following scatterplot:
[Scatterplot: Log Sales vs. Log Assets]
Question What should we say about the relationship between Assets and Sales?
Answer The Pearson correlation is not appropriate because the scatterplot of the data is not linear. The Spearman correlation is a more appropriate summary. The scatterplot of the log transformed variables is linear and shows a strong pattern. We could find a linear model for this relationship, but we’d have to interpret it in terms of log Sales and log Assets.
13 In fact, the HDI itself includes such transformed variables in its construction.
What Can Go Wrong?
• Don’t say “correlation” when you mean “association.” How often have you heard the word “correlation”? Chances are pretty good that when you’ve heard the term, it’s been misused. It’s one of the most widely misused Statistics terms, and given how often Statistics are misused, that’s saying a lot. One of the problems is that many people use the specific term correlation when they really mean the more general term association. Association is a deliberately vague term used to describe the relationship between two variables. Correlation is a precise term used to describe the strength and direction of a linear relationship between quantitative variables.
• Don’t correlate categorical variables. Be sure to check the Quantitative Variables Condition. It makes no sense to compute a correlation of categorical variables.
• Make sure the association is linear. Not all associations between quantitative variables are linear. Correlation can miss even a strong nonlinear association. And linear regression models are never appropriate for relationships that are not linear. A company, concerned that customers might use ovens with imperfect temperature controls, performed a series of experiments¹⁴ to assess the effect of baking temperature on the quality of brownies made from their freeze-dried reconstituted brownies. The company wants to understand the sensitivity of brownie quality to variation in oven temperatures around the recommended baking temperature of 325°F. The lab reported a correlation of −0.05 between the scores awarded by a panel of trained taste-testers and baking temperature and a regression slope of −0.02, so they told management that there is no relationship. Before printing directions on the box telling customers not to worry about the temperature, a savvy intern asks to see the scatterplot.
Figure 4.11 The relationship between brownie taste Score and Baking Temperature is strong, but not linear. [Scatterplot: Score vs. Baking Temperature (°F)]
The plot actually shows a strong association, but not a linear one. Don’t forget to check the Linearity Condition.
• Beware of outliers. You can’t interpret a correlation coefficient or a regression model safely without a background check for unusual observations.
14 Experiments designed to assess the impact of environmental variables outside the control of the company on the quality of the company’s products were advocated by the Japanese quality expert Dr. Genichi Taguchi starting in the 1980s in the United States.
Here’s an example. The relationship between IQ and Shoe Size among comedians shows a surprisingly strong positive correlation of 0.50. To check assumptions, we look at the scatterplot.
Figure 4.12 IQ vs. Shoe Size [Scatterplot of IQ against Shoe Size.]
From this “study,” what can we say about the relationship between the two? The correlation is 0.50. But who does that point in the upper right-hand corner belong to? The outlier is Bozo the Clown, known for his large shoes and widely acknowledged to be a comic “genius.” Without Bozo the correlation is near zero. Even a single unusual observation can dominate the correlation value. That’s why you need to check the Unusual Observations Condition.
• Don’t confuse correlation with causation. Once we have a strong correlation, it’s tempting to try to explain it by imagining that the predictor variable has caused the response to change. Putting a regression line on a scatterplot tempts us even further. Humans are like that; we tend to see causes and effects in everything. Just because two variables are related does not mean that one causes the other.
Does Cancer Cause Smoking? Even if the correlation of two variables is due to a causal relationship, the correlation itself cannot tell us what causes what. Sir Ronald Aylmer Fisher (1890–1962) was one of the greatest statisticians of the 20th century. Fisher testified in court (paid by the tobacco companies) that a causal relationship might underlie the correlation of smoking and cancer: “Is it possible, then, that lung cancer … is one of the causes of smoking cigarettes? I don’t think it can be excluded … the pre-cancerous condition is one involving a certain amount of slight chronic inflammation … A slight cause of irritation … is commonly accompanied by pulling out a cigarette, and getting a little compensation for life’s minor ills in that way. And … is not unlikely to be associated with smoking more frequently.” Ironically, the proof that smoking indeed is the cause of many cancers came from experiments conducted following the principles of experiment design and analysis that Fisher himself developed.
Scatterplots, correlation coefficients, and regression models never prove causation. This is, for example, partly why it took so long for the U.S. Surgeon General to get warning labels on cigarettes. Although there was plenty of evidence that increased smoking was associated with increased levels of lung cancer, it took years to provide evidence that smoking actually causes lung cancer. (The tobacco companies used this to great advantage.)
• Watch out for lurking variables. A scatterplot of the damage (in dollars) caused to a house by fire would show a strong correlation with the number of firefighters at the scene. Surely the damage doesn’t cause firefighters. And firefighters actually do cause damage, spraying water all around and chopping holes, but does that mean we shouldn’t call the fire department? Of course not. There is an underlying variable that leads to both more damage and more firefighters: the size of the blaze. You can often debunk claims made about data by finding a lurking variable behind the scenes.
• Don’t fit a straight line to a nonlinear relationship. Linear regression is suited only to relationships that are, in fact, linear.
• Beware of extraordinary points. Data values can be extraordinary or unusual in a regression in two ways. They can have y-values that stand off from the linear pattern suggested by the bulk of the data. These are what we have been calling outliers; although with regression, a point can be an outlier by being far from the linear pattern even if it is not the largest or smallest y-value. Points can also be extraordinary in their x-values. Such points can exert a strong influence on the line. Both kinds of extraordinary points require attention.
• Don’t predict far beyond the data. A linear model will often do a reasonable job of summarizing a relationship in the range of observed x-values. Once we have a working model for the relationship, it’s tempting to use it. But beware of predicting y-values for x-values that lie too far outside the range of the original data. The model may no longer hold there, so such extrapolations too far from the data are dangerous.
• Don’t choose a model based on $R^2$ alone. Although $R^2$ measures the strength of the linear association, a high $R^2$ does not demonstrate the appropriateness of the regression. A single unusual observation, or data that separate into two groups, can make the $R^2$ seem quite large when, in fact, the linear regression model is simply inappropriate. Conversely, a low $R^2$ value may be due to a single outlier. It may be that most of the data fall roughly along a straight line, with the exception of a single point. Always look at the scatterplot.
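Two of the pitfalls above can be illustrated numerically: a strong nonlinear association that correlation misses, and a single outlier that dominates the correlation. The following is a minimal Python sketch. The data values and the helper name `pearson_r` are hypothetical, chosen only to mimic the brownie and Bozo examples; they are not the data behind Figures 4.11 and 4.12.

```python
import math


def pearson_r(xs, ys):
    """Correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    z_products = (
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
    )
    return sum(z_products) / (n - 1)


# Hypothetical brownie-style data: the score peaks near the middle of the
# temperature range and falls off symmetrically on both sides.
temps = [150, 200, 250, 300, 350, 400, 450, 500, 550, 600]
scores = [10 - ((t - 375) / 75) ** 2 for t in temps]
r_curved = pearson_r(temps, scores)

# Hypothetical comedian-style data with no linear association...
shoe_sizes = [8, 9, 10, 11, 8, 9, 10, 11]
iqs = [100, 110, 110, 100, 110, 100, 100, 110]
r_without = pearson_r(shoe_sizes, iqs)

# ...plus one extreme point in the upper right-hand corner.
r_with = pearson_r(shoe_sizes + [22], iqs + [170])

print(f"curved relationship:  r = {r_curved:.3f}")
print(f"without the outlier:  r = {r_without:.3f}")
print(f"with the outlier:     r = {r_with:.3f}")
```

In this made-up data the curved relationship is perfectly predictable yet has a correlation of essentially zero, while one extreme point turns uncorrelated data into a strongly correlated set. Neither fact is visible from the correlation alone, which is why the scatterplot comes first.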
Ethics in Action
Rebekkah Greene, owner of Up with Life Café and Marketplace, is a true believer in the health and healing benefits of food. She offers her customers pure, wholesome, and locally sourced products. Recently, she decided to add a line of hearty “made to order” cereals that she prepares according to each customer’s expressed preferences. To do this, she keeps a wide variety of ingredients on hand from which her customers can choose to “design” their own unique cereal mix. These not only include organic grains, nuts, and dried fruits typically found in cereals, like oat bran and wheat germ, but also a wide assortment of sprouted grains.
Rebekkah is following the lead of food innovators who understand that sprouting grains activates food enzymes, increases vitamin content, and decreases starch, all of which improve digestion and absorption. She finds that sprouted grains are particularly delicious choices for warm breakfast cereals. Being a purist, Rebekkah decided that she would take care of the sprouting process herself. She therefore invested in the equipment and designed a “sprouting” space with appropriate temperature and humidity controls. At the outset, her “made to order” cereal mixes were a big hit. But after the novelty wore off, she noticed that sales began to decline. Since most of her regular customers are young
to middle-aged women, Rebekkah now considered the possibility that most are weight conscious as well as health conscious. She suspects that many ultimately decided to give up eating cereal for breakfast, even if it can provide superior health benefits, to opt for lower carb alternatives to maintain or lose weight. Given her sizeable investment in sprouting grains, Rebekkah realizes she needs to do something to get more of her cereals back on the breakfast table! She began to do some research on the topic to better educate her customers. While she found a number of studies suggesting that sprouted grains are particularly healthful, she focused her attention on findings that emphasize the relationship between eating breakfast (specifically cereal) and weight loss. In her weekly flyer on “Up
with Life Food Facts” she cited one study she found in a dietetic association journal that showed regular cereal eaters had fewer weight problems than infrequent cereal eaters. She stressed this positive correlation in her advice to customers. The more often you eat cereal for breakfast the more weight you can lose… more cereal = more weight lost! She did fail to mention, however, that this particular study was funded by the big cereal companies.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
What Have We Learned? Learning Objectives
Make a scatterplot to display the relationship between two quantitative variables.
• Look at the direction, form, and strength of the relationship, and any outliers that stand away from the overall pattern.
Provided the form of the relationship is linear, summarize its strength with a correlation, r.
• The sign of the correlation gives the direction of the relationship.
• $-1 \le r \le 1$; a correlation of 1 or -1 is a perfect linear relationship. A correlation of 0 is a lack of linear relationship.
• Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value.
• A large correlation is not a sign of a causal relationship.
Model a linear relationship with a least squares regression model.
• The regression (best fit) line doesn’t pass through all the points, but it is the best compromise in the sense that the sum of squares of the residuals is the smallest possible.
• The slope tells us the change in y per unit change in x.
• The $R^2$ gives the fraction of the variation in y accounted for by the linear regression model.
Recognize regression to the mean when it occurs in data.
• A deviation of one standard deviation from the mean in one variable is predicted to correspond to a deviation of r standard deviations from the mean in the other. Because r is never more than 1, we predict a change toward the mean.
Examine the residuals from a linear model to assess the quality of the model.
• When plotted against the predicted values, the residuals should show no pattern and no change in spread.
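The regression-to-the-mean objective can be made concrete with a short sketch: in standardized units, the least squares prediction is r times the standardized x-value. The function name `predicted_z_y` and the correlation value 0.5 are hypothetical, used only for illustration.

```python
def predicted_z_y(r, z_x):
    """Predicted number of SDs of y from its mean, given x is z_x SDs from its mean."""
    if abs(r) > 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * z_x


r = 0.5  # hypothetical correlation, not taken from any data set in the chapter
for z_x in (-2.0, -1.0, 1.0, 2.0):
    z_y = predicted_z_y(r, z_x)
    print(f"x is {z_x:+.1f} SD from its mean -> predicted y is {z_y:+.1f} SD from its mean")
```

Because the magnitude of r never exceeds 1, every predicted value sits no farther from its mean (in standard deviations) than the x-value that produced it.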
Terms
Association
• Direction: A positive direction or association means that, in general, as one variable increases, so does the other. When increases in one variable generally correspond to decreases in the other, the association is negative.
• Form: The form we care about most is straight, but you should certainly describe other patterns you see in scatterplots.
• Strength: A scatterplot is said to show a strong association if there is little scatter around the underlying relationship.
Correlation coefficient
A numerical measure of the direction and strength of a linear association. $r = \frac{\sum z_x z_y}{n - 1}$
Explanatory or independent variable (x-variable)
The variable that accounts for, explains, predicts, or is otherwise responsible for the y-variable.
Intercept
The intercept, $b_0$, gives a starting value in y-units. It’s the $\hat{y}$-value when x is 0. $b_0 = \bar{y} - b_1\bar{x}$
Least squares
A criterion that specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals.
Linear model (Line of best fit)
The linear model of the form $\hat{y} = b_0 + b_1 x$ fit by least squares. Also called the regression line. To interpret a linear model, we need to know the variables and their units.
Lurking variable
A variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two.
Outlier
A point that does not fit the overall pattern seen in the scatterplot.
Predicted value
The prediction for y found for each x-value in the data. A predicted value, $\hat{y}$, is found by substituting the x-value in the regression equation. The predicted values are the values on the fitted line; the points $(x, \hat{y})$ lie exactly on the fitted line.
Re-expression or transformation
Re-expressing one or both variables using functions such as log, square root, or reciprocal can improve the straightness of the relationship between them.
Residual
The difference between the actual data value and the corresponding value predicted by the regression model (or, more generally, predicted by any model): $e = y - \hat{y}$.
Regression line
The particular linear equation that satisfies the least squares criterion, often called the line of best fit.
Regression to the mean
Because the correlation is always less than 1.0 in magnitude, each predicted y tends to be fewer standard deviations from its mean than its corresponding x is from its mean.
Response or dependent variable (y-variable)
The variable that the model is intended to explain or predict.
$R^2$
• The square of the correlation between y and x • The fraction of the variability of y accounted for by the least squares linear regression on x • An overall measure of how successful the regression is in linearly relating y to x
Scatterplot
A graph that shows the relationship between two quantitative variables measured on the same cases.
Slope
The slope, $b_1$, is given in y-units per x-unit. Differences of one unit in x are associated with differences of $b_1$ units in predicted values of y: $b_1 = r\,\frac{s_y}{s_x}$.
Spearman rank correlation
The correlation between the ranks of two variables may be an appropriate measure of the strength of a relationship when the form isn’t straight.
Standard deviation of the residuals
$s_e$ is found by: $s_e = \sqrt{\frac{\sum e^2}{n - 2}}$.
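The formulas in the terms above fit together in a few lines of code. The sketch below computes the correlation from z-scores, then the slope, intercept, residuals, standard deviation of the residuals, and $R^2$, directly from the definitions. The function name `least_squares_summary` and the five data points are hypothetical and serve only as an illustration; statistics software does the same arithmetic with more numerical care.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def least_squares_summary(xs, ys):
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sd(xs), sd(ys)
    # r = sum of z_x * z_y, divided by n - 1
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x                # slope
    b0 = y_bar - b1 * x_bar           # intercept
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # e = y - y_hat
    s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return {"r": r, "b1": b1, "b0": b0,
            "residuals": residuals, "s_e": s_e, "r_squared": r ** 2}


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = least_squares_summary(xs, ys)
print(f"r = {fit['r']:.4f}, R^2 = {fit['r_squared']:.4f}")
print(f"y_hat = {fit['b0']:.3f} + {fit['b1']:.3f} x, s_e = {fit['s_e']:.3f}")
```

Two checks are worth running on any fit produced this way: the residuals should sum to zero (up to rounding), and $R^2$ should equal one minus the ratio of the residual sum of squares to the total sum of squares of y.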
Technology Help: Correlation and Regression
All statistics packages make a table of results for a regression. These tables may differ slightly from one package to another, but all are essentially the same, and all include much more than we need to know for now. Every computer regression table includes a section that looks something like this:

Dependent variable is: Sales
R squared = 69.0%    s = 9.277

Variable       Coefficient    SE(Coeff)    t-ratio    P-value
Intercept      6.83077        2.664        2.56       0.0158
Shelf Space    0.971381       0.1209       8.04       ≤0.0001

• “Dependent variable is” names the “dependent,” response, or y-variable (here, Sales).
• “R squared” is $R^2$.
• “s” is the standard deviation of the residuals ($s_e$).
• The “Variable” column names the “independent,” predictor, or x-variable (here, Shelf Space).
• The “Coefficient” column holds the intercept and the slope.
• SE(Coeff), t-ratio, and P-value: We'll deal with all of these later in the book. You may ignore them for now.

The slope and intercept coefficient are given in a table such as this one. Usually the slope is labeled with the name of the x-variable, and the intercept is labeled “Intercept” or “Constant.” So the regression equation shown here is Sales = 6.83077 + 0.971381 Shelf Space.
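Once the coefficients have been read off a table like this one, using the model is simple arithmetic. The sketch below plugs the intercept and slope from the table into the regression equation. The function name `predicted_sales`, the shelf-space value of 10, and the observed sales value of 20 are hypothetical; the units are whatever the original Sales and Shelf Space data used, which the table does not state.

```python
# Coefficients read from the regression table for Sales on Shelf Space
INTERCEPT = 6.83077
SLOPE = 0.971381


def predicted_sales(shelf_space):
    """Predicted Sales from the fitted line: b0 + b1 * Shelf Space."""
    return INTERCEPT + SLOPE * shelf_space


# Hypothetical case: a product with 10 units of shelf space that actually
# recorded sales of 20 (both values are made up for illustration).
shelf_space = 10
observed_sales = 20

y_hat = predicted_sales(shelf_space)
residual = observed_sales - y_hat  # e = y - y_hat

print(f"predicted sales: {y_hat:.3f}")
print(f"residual:        {residual:.3f}")
```

A positive residual means the model underestimated the observed value; a negative residual means it overestimated.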
Excel
To make a scatterplot in Excel:
• Arrange data in worksheet so that x-variable and y-variable are in columns next to each other in that order.
• Highlight data in x-variable and y-variable columns.
• Navigate to Insert + Scatter. (Mac users choose Marked Scatter.)
• Choose Scatter With Only Markers.
• The scatterplot appears. Shown here is a scatterplot of the Real Estate data.
• To move the scatterplot to a new worksheet choose Chart Tools + Design + Move Chart.
• To add a linear regression line to a scatterplot, choose Chart Tools + Layout + Trendline + Linear Trendline. (Mac users choose Analysis + Trendline + Linear Trendline.)
• Design, Layout, and Format options now show at the top of the screen as Chart Tools. Use these to change chart layouts, labels, and colors.
• By default, Excel includes the intercept in the plot. If your data values are all far from zero, you may need to re-format your scatterplot. (See the discussion of Format axis below.)

To carry out a Linear Regression in Excel:
• First, make sure that you’ve installed the Data Analysis add-in, as follows:
• On the File tab, click Options, and then click Add-Ins.
• Near the bottom of the Excel Options dialog box, select Excel Add-ins in the Manage box, and then click Go.
• In the Add-Ins dialog box, select the check box for Analysis ToolPak, and then click OK.
• If Excel displays a message that states it can’t run this add-in and prompts you to install it, click Yes to install the add-in.
• Navigate to Data + Data Analysis.
• Select Regression.
• Choose the cells of the spreadsheet holding the x and y variables.
• Select where the output will appear.
• Check the Labels box if your data columns have variable names in the first row.
• Check the Line Fit Plots box to display a scatterplot of the data with the Least Squares Regression Line.
• The series displayed in red is the predicted values for price based on the Linear Regression Model. These points make up the Least Squares Regression Line.
• The $R^2$ value is in cell B5.
• The y-intercept and slope are in cells B17 and B18 respectively.
• The Design, Layout, and Format options now show at the top of the screen as Chart Tools. Use these to change chart layouts, labels, and colors.
But we aren’t quite done yet. Excel always scales the axes of a scatterplot to show the origin (0, 0). But most data are not near the origin, so you may get a plot that, like this one, is bunched up in one corner.
• Right-click on the x-axis labels. From the menu that drops down, choose Format axis…
• Choose Scale.
• Set the x-axis minimum value. One useful trick is to use the dialog box itself as a straightedge to read over to the x-axis so you can estimate a good minimum value. Here 500 seems appropriate.
• Repeat the process with the y-axis if necessary.

JMP
To make a scatterplot and compute correlation:
• Choose Fit Y by X from the Analyze menu.
• In the Fit Y by X dialog, drag the Y variable into the “Y, Response” box, and drag the X variable into the “X, Factor” box.
• Click the OK button.
Once JMP has made the scatterplot, click on the red triangle next to the plot title to reveal a menu of options.
• Select Density Ellipse and select .95. JMP draws an ellipse around the data and reveals the Correlation tab.
• Click the blue triangle next to Correlation to reveal a table containing the correlation coefficient.
To compute a regression,
• Choose Fit Y by X from the Analyze menu. Specify the Y variable in the Select Columns box and click the Y, Response button.
• Specify the x-variable and click the X, Factor button.
• Click OK to make a scatterplot.
• In the scatterplot window, click on the red triangle beside the heading labeled “Bivariate Fit…” and choose Fit Line. JMP draws the least squares regression line on the scatterplot and displays the results of the regression in tables below the plot.

Minitab
To make a scatterplot,
• Choose Scatterplot from the Graph menu.
• Choose Simple for the type of graph. Click OK.
• Enter variable names for the Y-variable and X-variable into the table. Click OK.
To compute a correlation coefficient,
• Choose Basic Statistics from the Stat menu.
• From the Basic Statistics submenu, choose Correlation. Specify the names of at least two quantitative variables in the “Variables” box.
• Click OK to compute the correlation table.

SPSS
To make a scatterplot in SPSS, open the Chart Builder from the Graphs menu. Then
• Click the Gallery tab.
• Choose Scatterplot from the list of chart types.
• Drag the scatterplot onto the canvas.
• Drag a scale variable you want as the response variable to the y-axis drop zone.
• Drag a scale variable you want as the factor or predictor to the x-axis drop zone.
• Click OK.
To compute a correlation coefficient,
• Choose Correlate from the Analyze menu.
• From the Correlate submenu, choose Bivariate.
• In the Bivariate Correlations dialog, use the arrow button to move variables between the source and target lists. Make sure the Pearson option is selected in the Correlation Coefficients field.
To compute a regression, from the Analyze menu, choose
• Regression + Linear … In the Linear Regression dialog, specify the Dependent (y), and Independent (x) variables.
• Click the Plots button to specify plots and Normal Probability Plots of the residuals. Click OK.

Brief Case
Fuel Efficiency
Both drivers and auto companies are motivated to raise the fuel efficiency of cars. Recent information posted by the U.S. government proposes some simple ways to increase fuel efficiency (see www.fueleconomy.gov): avoid rapid acceleration, avoid driving over 60 mph, reduce idling, and reduce the vehicle’s weight. An extra 100 pounds can reduce fuel efficiency (mpg) by up to 2%. A marketing executive is studying the relationship between the fuel efficiency of cars (as measured in miles per gallon) and their weight to design a new compact car campaign. In the file Fuel Efficiency you’ll find data on the variables below.¹⁵
• Model of Car
• Engine Size
• Cylinders
• MSRP (Manufacturer’s Suggested Retail Price in $)
• City (mpg)
• Highway (mpg)
• Carbon footprint
• Transmission type (Automatic or Manual)
• Fuel type (Regular or Premium)
Describe the relationship of MSRP and Engine Size with Fuel Efficiency (both City and Highway) in a written report. Only in the U.S. is fuel efficiency measured in miles per gallon. The rest of the world uses liters per 100 kilometers. To convert mpg to L/100 km, compute 235.215/mpg. Try that form of the variable and compare the resulting models. Be sure to plot the residuals.
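The conversion described above is a reciprocal re-expression, which can be sketched in a few lines. The function name `mpg_to_l_per_100km` and the sample mpg values are hypothetical; they are not taken from the Fuel Efficiency file.

```python
MPG_TO_L_PER_100KM = 235.215  # conversion constant given in the text


def mpg_to_l_per_100km(mpg):
    """Re-express fuel efficiency from miles per gallon to liters per 100 km."""
    if not mpg > 0:
        raise ValueError("mpg must be positive")
    return MPG_TO_L_PER_100KM / mpg


# Hypothetical fuel-efficiency values, equally spaced in mpg
for mpg in (15, 20, 25, 30, 35):
    print(f"{mpg} mpg = {mpg_to_l_per_100km(mpg):.2f} L/100 km")
```

Equal steps in mpg do not give equal steps in L/100 km, so a relationship that bends when plotted against mpg may look straighter (or more bent) against L/100 km. That is the reason the brief case asks for both forms to be modeled and their residuals compared.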
Cost of Living
The Numbeo website (www.numbeo.com) provides access to a variety of data. One table lists prices of certain items in selected cities around the world. They also report an overall cost-of-living index for each city compared to the costs of hundreds of items in New York City. For example, London at 110.69 is 10.69% more expensive than New York. You’ll find the data for 322 cities as of March 23, 2013 in the file Cost of living 2013. Included are the Cost of Living Index, a Rent Index, a Groceries Index, a Restaurant price Index, and a Local Purchasing Power Index that measures the ability of the average wage earner in a city to buy goods and services. All indices are measured relative to New York City, which is scored 100. Examine the relationship between the Cost of Living Index and the Cost Index for each of these individual items. Verify the necessary conditions and describe the relationship in as much detail as possible. (Remember to look at direction, form, and strength.) Identify any unusual observations. Based on the correlations and linear regressions, which item would be the best predictor of overall cost in these cities? Which would be the worst? Are there any surprising relationships? Write a short report detailing your conclusions.
¹⁵ Data are from the 2004 model year and were compiled from www.Edmonds.com.
Mutual Funds
According to the U.S. Securities and Exchange Commission (SEC), a mutual fund is a professionally-managed collection of investments for a group of investors in stocks, bonds, and other securities. The fund manager manages the investment portfolio and tracks the wins and losses. Eventually the dividends are passed along to the individual investors in the mutual fund. The first group fund was founded in 1924, but the spread of these types of funds was slowed by the stock market crash in 1929. Congress passed the Securities Act in 1933 and the Securities Exchange Act in 1934 to require that investors be provided disclosures about the fund, the securities, and the fund manager. The SEC drafted the Investment Company Act, which provided guidelines for registering all funds with the SEC. By the end of the 1960s, funds reported $48 billion in assets and, by October 2007 there were over 8,000 mutual funds with combined assets under management of over $12 trillion. Investors often choose mutual funds on the basis of past performance, and many brokers, mutual fund companies, and other websites offer such data. In the file Mutual fund returns 2013, you’ll find the 3-month return, the annualized 1 yr, 5 yr, and 10 yr returns, and the return since inception of 64 funds of various types. Which data from the past provides the best predictions of the recent 3 months? Examine the scatterplots and regression models for predicting 3-month returns and write a short report containing your conclusions.
Exercises The calculations for correlation and regression models can be very sensitive to how intermediate results are rounded. If you find your answers using a calculator and writing down intermediate results, you may obtain slightly different answers than you would have had you used statistics software. Different programs can also yield different results. So your answers may differ in the trailing digits from those in the Appendix. That should not concern you. The meaningful digits are the first few; the trailing digits may be essentially random results of the rounding of intermediate results.
Section 4.1
1. Consider the following data from a small bookstore.

Number of Sales People Working    Sales (in $1000)
2                                 10
3                                 11
7                                 13
9                                 14
10                                18
10                                20
12                                20
15                                22
16                                22
20                                26
$\bar{x} = 10.4$, SD(x) = 5.64    $\bar{y} = 17.6$, SD(y) = 5.34

a) Prepare a scatterplot of Sales against Number of Sales People Working. b) What can you say about the direction of the association? c) What can you say about the form of the relationship? d) What can you say about the strength of the relationship? e) Does the scatterplot show any outliers?
2. Disk drives have been getting larger. Their capacity is now often given in terabytes (TB) where 1 TB = 1000 gigabytes, or about a trillion bytes. A survey of prices for external disk drives found the following data:

Capacity (in TB)    Price (in $)
.150                35.00
.200                299.00
.250                39.95
.320                49.95
1.0                 75.00
2.0                 110.00
3.0                 140.00
4.0                 325.00

a) Prepare a scatterplot of Price against Capacity. b) What can you say about the direction of the association? c) What can you say about the form of the relationship? d) What can you say about the strength of the relationship? e) Does the scatterplot show any outliers?
14/07/14 7:27 AM
www.freebookslides.com
Exercises 161
Section 4.2
Section 4.5
3. The finance department at a national medical corporation wants to know and predict the average salary of physicians compared to the number of medical operations performed during their years of service in the profession. Data was collected on salaries and years of experience for a random sample of physicians. a) Which variable is the predictor or explanatory variable? b) Which variable is the response variable? c) Which variable would plot on the y axis?
9. True or False. If False, explain briefly. a) We choose the linear model that passes through the most data points on the scatterplot. b) The residuals are the observed y-values minus the y-values predicted by the linear model. c) Least squares means that the square of the largest residual is as small as it could possibly be.
4. A company that relies on Internet-based advertising linked to key search terms wants to understand the relationship between the amount it spends on this advertising and revenue (in $). a) Which variable is the explanatory or predictor variable? b) Which variable is the response variable? c) Which variable would you plot on the x axis?
Section 4.3 5. If we assume that the conditions for correlation are met, which of the following are true? If false, explain briefly. a) A correlation factor of 0.92 indicates a strong, positive association. b) Dividing every value of x by 2 will half the correlation. c) The units of the correlation factor are the same as the units of x. 6. If we assume that the conditions for correlation are met, which of the following are true? If false, explain briefly. a) A correlation of 0.02 indicates a strong positive association. b) Standardizing the variables will make the correlation 0. c) Adding an outlier can dramatically change the correlation.
Section 4.4 7. A larger firm is considering acquiring the bookstore of Exercise 1. An analyst for the firm, noting the relationship seen in Exercise 1, suggests that when they acquire the store they should hire more people because that will drive higher sales. Is his conclusion justified? What alternative explanations can you offer? Use appropriate statistics terminology. 8. A recent survey found that car sales during a holiday weekend are highly associated with the number of advertisements posted on the local media channels; the more advertisements made using local television, newspapers, and internet media, the more the sales. The regional manager of the car sales association suggests that each local dealer should consider using the suggested media to increase sales. Comment.
M04_SHAR8696_03_SE_C04.indd 161
10. True or False. If False, explain briefly. a) Some of the residuals from a least squares linear model will be positive and some will be negative. b) Least Squares means that some of the squares of the residuals are minimized. c) We write yn to denote the predicted values and y to d enote the observed values.
Section 4.6 11. For the bookstore sales data in Exercise 1, the correlation of number of sales people and sales is 0.965. a) If the number of people working is 2 standard deviations above the mean, how many standard deviations above or below the mean do you expect sales to be? b) What value of sales does that correspond to? c) If the number of people working is 1 standard deviation below the mean, how many standard deviations above or below the mean do you expect sales to be? d) What value of sales does that correspond to? 12. For the hard drive data in Exercise 2, some research on the prices discovered that the 200 GB hard drive was a special “hardened” drive designed to resist physical shocks and work under water. Because it is completely different from the other drives, it was removed from the data. For the remaining 7 drives, the correlation is now 0.927 and other summary statistics are: Capacity (in TB)
Price (in $)
x = 1.531 SD1x2 = 1.515
y = 110.70 SD1y2 = 102.05
a) If a drive has a capacity of 3.046 TB (or 1 SD above the mean of 1.531 TB), how many standard deviations above or below the mean price of $110.70 do you expect the drive to cost? b) What price does that correspond to? 13. For the bookstore of Exercise 1, the manager wants to predict Sales from Number of Sales People Working. a) Find the slope estimate, b1. b) What does it mean, in this context?
14/07/14 7:27 AM
www.freebookslides.com CHAPTER 4 Correlation and Linear Regression
c) Find the intercept, b0. d) What does it mean, in this context? Is it meaningful? e) Write down the equation that predicts Sales from Number of Sales People Working. f) If 18 people are working, what Sales do you predict? g) If sales are actually $25,000 when 18 people are working, what is the value of the residual? h) Have we overestimated or underestimated the sales?
14. For the disk drives in Exercise 2 (as corrected in Exercise 12), we want to predict Price from Capacity. a) Find the slope estimate, b1. b) What does it mean, in this context? c) Find the intercept, b0. d) What does it mean, in this context? Is it meaningful? e) Write down the equation that predicts Price from Capacity. f) What would you predict for the price of a 3.0 TB disk? g) You have found a 3.0 TB drive for $175. Is this a good buy? How much would you save compared to what you expected to pay? h) Did the model overestimate or underestimate the pricing?
Section 4.7
15. A CEO complains that the winners of his “rookie junior executive of the year” award often turn out to have less impressive performance the following year. He wonders whether the award actually encourages them to slack off. Can you offer a better explanation?
16. An online investment blogger advises investing in mutual funds that have performed badly the past year because “regression to the mean tells us that they will do well next year.” Is he correct?
Section 4.8
17. Here are the residuals for a regression of Sales on Number of Sales People Working for the bookstore of Exercise 1:
Sales People Working | Residual
2 | 0.07
3 | 0.16
7 | -1.49
9 | -2.32
10 | 0.77
10 | 2.77
12 | 0.94
15 | 0.20
16 | -0.72
20 | -0.37
a) What are the units of the residuals? b) Which residual contributes the most to the sum that was minimized according to the Least Squares Criterion to find this regression? c) Which residual contributes least to that sum?
18. Here are residual plots (residuals plotted against predicted values) for three linear regression models. Indicate which condition appears to be violated (linearity, outlier, or equal spread) in each case.
[Residual plots a), b), and c): Residual vs. Predicted Value]
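The Least Squares Criterion behind Exercise 17 minimizes the sum of the squared residuals, so each residual's contribution to that sum is its square, not its raw value. The following is a minimal sketch in Python using the ten residuals listed in Exercise 17; the variable names are our own and are not part of the exercise.

```python
# Residuals from Exercise 17, in the order listed. Two days had 10 sales
# people working, so parallel lists are used instead of a dictionary.
sales_people = [2, 3, 7, 9, 10, 10, 12, 15, 16, 20]
residuals = [0.07, 0.16, -1.49, -2.32, 0.77, 2.77, 0.94, 0.20, -0.72, -0.37]

# Each case contributes its squared residual to the least squares sum.
squared = [e ** 2 for e in residuals]
sum_of_squares = sum(squared)

# Positions of the largest and smallest contributions.
largest = max(range(len(squared)), key=lambda i: squared[i])
smallest = min(range(len(squared)), key=lambda i: squared[i])

print("Sum of squared residuals:", round(sum_of_squares, 4))
print("Largest contribution: residual", residuals[largest],
      "with", sales_people[largest], "sales people")
print("Smallest contribution: residual", residuals[smallest],
      "with", sales_people[smallest], "sales people")
```

Note that the sign of a residual does not matter here: a residual of -2.32 contributes more than one of 0.94 because squaring removes the sign.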
Section 4.9
19. For the regression model for the bookstore of Exercise 1, what is the value of R² and what does it mean?
20. For the disk drive data of Exercise 2 (as corrected in Exercise 12), find and interpret the value of R².
Section 4.11
21. When analyzing data on the number of employees in small companies in one town, a researcher took square roots of the counts. Some of the resulting values, which are reasonably symmetric, were: 4, 4, 6, 7, 7, 8, 10. What were the original values, and how are they distributed?
22. You wish to explain to your boss what effect taking the base-10 logarithm of the salary values in the company’s database will have on the data. As simple example values, you compare a salary of $10,000 earned by a part-time shipping clerk, a salary of $100,000 earned by a manager, and the CEO’s $1,000,000 compensation package. Why might the average of these values be a misleading summary? What would the logarithms of these three values be?
Chapter Exercises
23. Association. Suppose you were to collect data for each pair of variables. You want to make a scatterplot. Which variable would you use as the explanatory variable and which as the response variable? Why? What would you expect to see in the scatterplot? Discuss the likely direction and form. a) Cell phone bills: number of text messages, cost. b) Automobiles: Fuel efficiency (mpg), sales volume (number of autos). c) For each week: Ice cream cone sales, air conditioner sales. d) Product: Price ($), demand (number sold per day).
24. Association, part 2. Suppose you were to collect data for each pair of variables. You want to make a scatterplot. Which variable would you use as the explanatory variable and which as the response variable? Why? What would you expect to see in the scatterplot? Discuss the likely direction and form. a) T-shirts at a store: price each, number sold. b) Real estate: house price, house size (square footage). c) Economics: Interest rates, number of mortgage applications. d) Employees: Salary, years of experience.
25. Scatterplots. Which of the scatterplots show: a) Little or no association? b) A negative association? c) A linear association? d) A moderately strong association? e) A very strong association?
[Scatterplots (1)–(4)]
26. Scatterplots, part 2. Which of the scatterplots show: a) Little or no association? b) A negative association? c) A linear association? d) A moderately strong association? e) A very strong association?
[Scatterplots (1)–(4)]
27. Manufacturing. A ceramics factory can fire eight large batches of pottery a day. Sometimes a few of the pieces break in the process. In order to understand the problem better, the factory records the number of broken pieces in each batch for three days and then creates the scatterplot shown.
[Scatterplot: # of Broken Pieces (0–6) vs. Batch Number (1–8)]
a) Make a histogram showing the distribution of the number of broken pieces in the 24 batches of pottery examined. b) Describe the distribution as shown in the histogram. What feature of the problem is more apparent in the histogram than in the scatterplot? c) What aspect of the company’s problem is more apparent in the scatterplot? 28. Coffee sales. Owners of a new coffee shop tracked sales for the first 20 days and displayed the data in a scatterplot (by day).
[Scatterplot: Sales ($100) vs. Day]
a) Make a histogram of the daily sales since the shop has been in business. b) State one fact that is obvious from the scatterplot, but not from the histogram. c) State one fact that is obvious from the histogram, but not from the scatterplot. 29. Matching. Here are several scatterplots. The calculated correlations are -0.923, -0.487, 0.006, and 0.777. Which is which?
[Scatterplots (a)–(d)]
30. Matching, part 2. Here are several scatterplots. The calculated correlations are -0.977, -0.021, 0.736, and 0.951. Which is which?
[Scatterplots (a)–(d)]
31. Pizza sales and price. A linear model fit to predict weekly Sales of frozen pizza (in pounds) from the average Price ($/unit) charged by a sample of stores in the city of Dallas in 39 recent weeks is: Sales = 141,865.53 - 24,369.49 Price. a) What is the explanatory variable? b) What is the response variable? c) What does the slope mean in this context? d) What does the y-intercept mean in this context? Is it meaningful? e) What do you predict the sales to be if the average price charged was $3.50 for a pizza? f) If the sales for a price of $3.50 turned out to be 60,000 pounds, what would the residual be? 32. Student skills surplus. According to the OECD’s How’s life? 2013 study, the dimension of education and skills consists of three different indicators: ‘Educational attainment’, ‘Student skills’, and ‘Years in education’. A linear model to predict Student skills from Years in education was fit to the 34 OECD countries, Russia, and Brazil. The model was: Student skills = 213.645 + 16.028 Years in education a) What is the explanatory variable? b) What is the response variable? c) What does the slope mean in this context? d) What does the y-intercept mean in this context? Is it meaningful? e) What do you predict average student skills to be for South Korea, with on average 17.7 years of education? f) In fact, South Korea students achieve an average student skills score of 541 after staying on average 17.7 years in education. What is the residual? g) How would you judge the quality of the South Korean educational system in preparing its students for the labor market, in comparison to other OECD countries? Explain. 33. Football salaries 2013. Is there a relationship between total team salary and the performance of teams in the National Football League (NFL)? 
For the 2012–2013 season, a linear model predicting Wins (out of 16 regular season games) from the total team Salary ($M) for the 32 teams in the league is: Wins = -16.32 + 0.219 Salary a) What is the explanatory variable? b) What is the response variable? c) What does the slope mean in this context? d) What does the y-intercept mean in this context? Is it meaningful? e) If one team spends $10 million more than another on salary, how many more games on average would you predict them to win? f) If a team spent $120 million on salaries and won 8 games, would they have done better or worse than predicted? g) What would the residual of the team in part f be? h) The residual standard deviation is 2.78 games. What does that tell you about the likely practical use of this model for predicting wins?
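Exercises 31–36 rely on two forms of the same linear model. When the fitted equation is given, a prediction is the intercept plus the slope times x, and the residual is the observed value minus the predicted value. When only means, standard deviations, and the correlation are given (as in Exercises 35 and 36), the prediction follows from the z-score form of the regression line, in which the predicted z-score of y equals r times the z-score of x. The sketch below, in Python, applies both forms to numbers quoted in the exercises; the function names are our own.

```python
def predict(intercept, slope, x):
    """Predicted response from a fitted line: y-hat = b0 + b1 * x."""
    return intercept + slope * x


def residual(observed, predicted):
    """Residual = observed - predicted; negative means the model predicted too high."""
    return observed - predicted


def predict_from_summary(mean_y, sd_y, r, z_x):
    """Prediction from summary statistics only: predicted z_y = r * z_x."""
    return mean_y + r * z_x * sd_y


# Exercise 33: Wins = -16.32 + 0.219 Salary, for a team with
# Salary = 120 ($M) that actually won 8 games.
wins_hat = predict(-16.32, 0.219, 120)
wins_resid = residual(8, wins_hat)

# Exercise 35: Price is one SD above its mean, so z_x = +1.
sales_hat = predict_from_summary(52697, 10261, -0.547, 1.0)

# Exercise 36: Years in education is one SD below its mean, so z_x = -1.
skills_hat = predict_from_summary(493.25, 30.285, 0.633, -1.0)

print("Predicted wins:", round(wins_hat, 2), "residual:", round(wins_resid, 2))
print("Predicted pizza sales (pounds):", round(sales_hat, 1))
print("Predicted student skills score:", round(skills_hat, 1))
```

Because the correlation is less than 1 in magnitude, each summary-based prediction lies closer to the mean of y (in SD units) than x lies to its own mean, which is the regression to the mean discussed in Section 4.7.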
34. Baseball salaries 2012. In 2012, the New York Yankees won 95 games and spent $198 million on salaries for their players (USA Today). Is there a relationship between salary and team performance in Major League Baseball? For the 2012 season, a linear model fit to the number of Wins (out of 162 regular season games) from the team Salary ($M) for the 30 teams in the league is: Wins = 76.45 + 0.046 Salary. a) What is the explanatory variable? b) What is the response variable? c) What does the slope mean in this context? d) What does the y-intercept mean in this context? Is it meaningful? e) If one team spends $10 million more than another on salaries, how many more games on average would you predict them to win? f) If a team spent $110 million on salaries and won half (81) of their games, would they have done better or worse than predicted? g) What would the residual of the team in part f be? h) The R² for this model is 2.05% and the residual standard deviation is 12.0 games. How useful is this model likely to be for predicting the number of wins?
35. Pizza sales and price, part 2. For the data in Exercise 31, the average Sales was 52,697 pounds (SD = 10,261 pounds), and the correlation between Price and Sales was -0.547. If the Price in a particular week was one SD higher than the mean Price, how much pizza would you predict was sold that week?
36. Student skills surplus, part 2. The 36 countries in Exercise 32 had an average score for ‘Student skills’ of 493.25 (SD = 30.285), and the correlation between ‘Student skills’ and ‘Years in education’ was 0.633. In Chile, students stay an average of 16.2 years in education, which is exactly one SD below the OECD average. What would you predict to be the average score for student skills in Chile?
37. Packaging. A CEO announces at the annual shareholders meeting that the new see-through packaging for the company’s flagship product has been a success. In fact, he says, “There is a strong correlation between packaging and sales.” Criticize this statement on statistical grounds.
38. Insurance. Insurance companies carefully track claims histories so that they can assess risk and set rates appropriately. The National Insurance Crime Bureau reports that Honda Accords, Honda Civics, and Toyota Camrys are the cars most frequently reported stolen, while Ford Tauruses, Pontiac Vibes, and Buick LeSabres are stolen least often. Is it correct to say that there’s a correlation between the type of car you own and the risk that it will be stolen?
39. Sales by region. A sales manager for a major pharmaceutical company analyzes last year’s sales data for her 96 sales representatives, grouping them by region (1 = East Coast United States; 2 = Mid West United States; 3 = West United States; 4 = South United States; 5 = Canada; 6 = Rest of World). She plots Sales (in $1000) against Region (1–6) and sees a strong negative correlation.
[Scatterplot: Total Sales 2008 ($1000) vs. Region]
She fits a regression to the data and finds: Sales = 1002.5 - 102.7 Region. The R² is 70.5%. Write a few sentences interpreting this model and describing what she can conclude from this analysis.
40. Salary by job type. At a small company, the head of human resources wants to examine salary to prepare annual reviews. He selects 28 employees at random with job types ranging from 01 = Stocking clerk to 99 = President. He plots Salary ($) against Job Type and finds a strong linear relationship with a correlation of 0.96.
[Scatterplot: Salary vs. Job Type]
The regression output gives: Salary = 15827.9 + 1939.1 Job Type. Write a few sentences interpreting this model and describing what he can conclude from this analysis.
41. Carbon footprint 2013. The scatterplot shows, for 2013 cars, the carbon footprint (tons of CO2 per mile) vs. the new Environmental Protection Agency (EPA) highway mileage for 76 family sedans as reported by the U.S. government (www.fueleconomy.gov/feg/byclass.htm). There are seven cars (two points in the scatterplot are pairs of overlapping cars) with high highway mpg and low carbon footprint. They are all hybrids.
[Scatterplot: Carbon Footprint vs. Highway mpg]
a) The correlation is -0.959. Describe the association. b) Are the assumptions and conditions met for computing correlation? c) Using technology, find the correlation of the data when the hybrid cars are not included with the others. Can you explain why it changes in that way?
42. EPA mpg 2013. In 2008, the EPA revised their methods for estimating the fuel efficiency (mpg) of cars, a factor that plays an increasingly important role in car sales. How do the new highway and city estimated mpg values relate to each other? Here’s a scatterplot for 76 family sedans as reported by the U.S. government. These are the same cars as in Exercise 41.
[Scatterplot: City mpg vs. Highway mpg]
a) The correlation of these two variables is 0.896. Describe the association. b) If the hybrids were removed from the data, what would you expect to happen to the slope (increase, decrease, or stay pretty much the same) and to the correlation (increase, decrease, the same)? Try it using technology. Report and discuss what you find.
43. Real estate. Is the number of total rooms in the house associated with the price of a house? Here is the scatterplot of a random sample of homes for sale:
[Scatterplot: Homes for Sale, Price ($000,000) vs. Rooms]
a) Is there an association? b) Check the assumptions and conditions for correlation.
44. Economic analysis 2012. An economics student is studying the American economy and finds that the correlation between the inflation-adjusted Dow Jones Industrial Average and the Gross Domestic Product (GDP) (also inflation adjusted) is 0.81 for the years 1946 to 2011 (www.measuringworth.com). From that he concludes that there is a strong linear relationship between the two series and predicts that a drop in the GDP will make the stock market go down. Here is a scatterplot of the adjusted DJIA against the GDP (in the years 1946 to 2011). Describe the relationship and comment on the student’s conclusions.
[Scatterplot: Inflation Adjusted Dow Jones Yearly Average vs. United States GDP (Inflation Adjusted)]
45. GDP growth 2012. Is economic growth in the developing world related to growth in the industrialized countries? Here’s a scatterplot of the growth (in % of Gross Domestic Product) of 180 developing countries vs. the growth of 33 developed countries as grouped by the World Bank (www.ers.usda.gov/data/macroeconomics). Each point represents one of the years from 1970 to 2011. The output of a regression analysis follows.
[Scatterplot: Annual GDP Growth Rates, Developing Countries (%) vs. Annual GDP Growth Rates, Developed Countries (%)]
Dependent variable: GDP Growth Developing Countries
R² = 31.64%   s = 1.201
Variable | Coefficient
Intercept | 3.38
GDP Growth Developed Countries | 0.468
a) Check the assumptions and conditions for the linear model. b) Explain the meaning of R² in this context. c) What are the cases in this model?
46. European GDP growth 2012. Is economic growth in Europe related to growth in the United States? Here’s a scatterplot of the average growth in 25 European countries (in % of Gross Domestic Product) vs. the growth in the United States. Each point represents one of the years from 1970 to 2011.
[Scatterplot: Annual GDP Growth Rates, 27 European Countries (%) vs. Annual GDP Growth Rates, United States (%)]
Dependent variable: European Countries GDP Growth
R² = 44.92%   s = 1.352
Variable | Coefficient
Intercept | 0.693
U.S. GDP Growth | 0.534
a) Check the assumptions and conditions for the linear model. b) Explain the meaning of R² in this context.
47. GDP growth 2012, part 2. From the linear model fit to the data on GDP growth in Exercise 45: a) Write the equation of the regression line. b) What is the meaning of the intercept? Does it make sense in this context? c) Interpret the meaning of the slope. d) In a year in which the developed countries grow 4%, what do you predict for the developing world? e) In 2007, the developed countries experienced a 2.65% growth, while the developing countries grew at a rate of 6.09%. Is this more or less than you would have predicted? f) What is the residual for this year?
48. European GDP growth 2012, part 2. From the linear model fit to the data on GDP growth of Exercise 46: a) Write the equation of the regression line. b) What is the meaning of the intercept? Does it make sense in this context? c) Interpret the meaning of the slope. d) In a year in which the United States grows at 0%, what do you predict for European growth? e) In 2010, the United States experienced a 3.00% growth, while Europe grew at a rate of 1.78%. Is this more or less than you would have predicted? f) What is the residual for this year?
49. Attendance 2012. American League baseball games are played under the designated hitter rule, meaning that weak-hitting pitchers do not come to bat. Baseball owners believe that the designated hitter rule means more runs scored, which in turn means higher attendance. Is there evidence that more fans attend games if the teams score more runs? Data collected from Major League games from both major leagues during the 2012 season have a correlation of 0.477 between Runs Scored and the Home Attendance (espn.go.com/mlb/attendance).
[Scatterplot: Attendance vs. Runs Scored]
a) Does the scatterplot indicate that it’s appropriate to calculate a correlation? Explain. b) Describe the association between attendance and runs scored. c) Does this association prove that the owners are right that more fans will come to games if the teams score more runs?
50. Attendance 2012, part 2. Perhaps fans are just more interested in teams that win. Here are displays of other variables in the dataset of Exercise 49 (espn.go.com). Are the teams that win necessarily those that score the most runs?
Correlation | Wins | Runs | Attend
Wins | 1.000 | |
Runs | 0.437 | 1.000 |
Attend | 0.133 | 0.477 | 1.000
[Scatterplot: Attendance vs. Wins]
a) Do winning teams generally enjoy greater attendance at their home games? Describe the association. b) Is attendance more strongly associated with winning or scoring runs? Explain. c) How strongly is scoring more runs associated with winning more games?
51. Mutual fund flows. As the nature of investing shifted in the 1990s (more day traders and faster flow of information using technology), the relationship between mutual fund monthly performance (Return) in percent and money flowing (Flow) into mutual funds ($ million) shifted. Using only the values for the 1990s (we’ll examine later years in later chapters), answer the following questions. (You may assume that the assumptions and conditions for regression are met.) The least squares linear regression is: Flow = 9747 + 771 Return. a) Interpret the intercept in the linear model. b) Interpret the slope in the linear model. c) What is the predicted fund Flow for a month that had a market Return of 0%? d) If during this month, the recorded fund Flow was $5 billion, what is the residual using this linear model? Did the model provide an underestimate or overestimate for this month?
52. Online clothing purchases. An online clothing retailer examined their transactional database to see if total yearly Purchases ($) were related to customers’ Incomes ($). (You may assume that the assumptions and conditions for regression are met.) The least squares linear regression is: Purchases = -31.6 + 0.012 Income. a) Interpret the intercept in the linear model. b) Interpret the slope in the linear model. c) If a customer has an Income of $20,000, what is his predicted total yearly Purchases? d) This customer’s yearly Purchases were actually $100. What is the residual using this linear model? Did the model provide an underestimate or overestimate for this customer?
53. Residual plots. Tell what each of the following residual plots indicates about the appropriateness of the linear model that was fit to the data.
[Residual plots a), b), and c)]
54. Residual plots, again. Tell what each of the following residual plots indicates about the appropriateness of the linear model that was fit to the data.
[Residual plots a), b), and c)]
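Exercises 51 and 52 both ask whether the model overestimated or underestimated, and Exercise 51 adds a units trap: Flow is modeled in $ million, while the observed value is quoted as $5 billion. A minimal sketch in Python follows; the helper names are our own, and the only conversion assumed is that $5 billion equals 5000 $ million.

```python
def residual(observed, intercept, slope, x):
    """Observed minus predicted, with both in the units of the response."""
    return observed - (intercept + slope * x)


def verdict(resid):
    """A negative residual means the prediction was too high (an overestimate)."""
    if resid < 0:
        return "model overestimated"
    if resid > 0:
        return "model underestimated"
    return "model was exact"


# Exercise 51: Flow = 9747 + 771 Return, with Flow in $ million.
# The observed $5 billion must be expressed as 5000 ($ million) first.
flow_resid = residual(5000, 9747, 771, 0)

# Exercise 52: Purchases = -31.6 + 0.012 Income, both in dollars.
purchases_resid = residual(100, -31.6, 0.012, 20000)

print("Exercise 51 residual ($ million):", flow_resid, "->", verdict(flow_resid))
print("Exercise 52 residual ($):", round(purchases_resid, 1), "->", verdict(purchases_resid))
```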
55. Consumer spending. An analyst at a large credit card bank is looking at the relationship between customers’ charges to the bank’s card in two successive months. He selects 150 customers at random, regresses charges in March ($) on charges in February ($), and finds an R² of 79%. The intercept is $730.20, and the slope is 0.79. After verifying all the data with the company’s CPA, he concludes that the model is a useful one for predicting one month’s charges from the other. Examine the data on the CD and comment on his conclusions.
56. Insurance policies. An actuary at a mid-sized insurance company is examining the sales performance of the company’s sales force. She has data on the average size of the policy ($) written in two consecutive years by 200 salespeople. She fits a linear model and finds the slope to be 3.00 and the R² is 99.92%. She concludes that the predictions for next year’s policy size will be very accurate. Examine the data on the CD and comment on her conclusions.
57. What slope? If you create a regression model for predicting the sales ($ million) from money spent on advertising the prior month ($ thousand), is the slope most likely to be closer to 0.03, 300, or 3000? Explain.
58. What slope, part 2? If you create a regression model for estimating a student’s business school GPA (on a scale of 1–5) based on his math SAT (on a scale of 200–800), is the slope most likely to be closer to 0.01, 1, or 10? Explain.
59. Misinterpretations. An advertising agent who created a regression model using amount spent on Advertising to predict annual Sales for a company made these two statements. Assuming the calculations were done correctly, explain what is wrong with each interpretation. a) My R² of 93% shows that this linear model is appropriate. b) If this company spends $1.5 million on advertising, then annual sales will be $10 million.
60. More misinterpretations. An economist investigated the association between a country’s Literacy Rate and Gross Domestic Product (GDP) and used the association to draw the following conclusions. Explain why each statement is incorrect. (Assume that all the calculations were done properly.) a) The Literacy Rate determines 64% of the GDP for a country. b) The slope of the line shows that an increase of 5% in Literacy Rate will produce a $1 billion improvement in GDP.
61. Business admissions. An analyst at a business school’s admissions office claims to have developed a valid linear model predicting success (measured by starting salary ($) at time of graduation) from a student’s undergraduate performance (measured by GPA). Describe how you would check each of the four regression conditions in this context.
62. School rankings. A popular magazine annually publishes rankings of both U.S. business programs and international business programs. The latest issue claims to have developed a linear model predicting the school’s ranking (with “1” being the highest ranked school) from its financial resources (as measured by size of the school’s endowment). Describe how you would apply each of the four regression conditions in this context.
63. Rooms per person. Personal earnings and favorable living conditions are both among contributors to well-being. The number of rooms per person is one of these living conditions. But in order to allow oneself a spacious house, adequate personal earnings are needed. These two variables thus won’t be unrelated. The OECD has collected data on personal earnings and number of rooms per person for the 34 OECD countries, Russia, and Brazil. a) Make a scatterplot relating the number of rooms per person to personal earnings. b) Describe the association between the two variables. c) Do you think a linear model is appropriate? d) Computer software says that R² = 58.1%. What is the correlation between number of rooms per person and personal earnings? e) Explain the meaning of R² in this context. f) Why doesn’t this model explain 100% of the variability in number of rooms per person?
64. Rooms per person, part 2. Use the OECD data on number of rooms per person and personal earnings to create a linear model for the relationship between Rooms per person and Personal earnings. a) Find the equation of the regression line. b) Explain the meaning of the slope of the line. c) Explain the meaning of the intercept of the line. d) Amongst all OECD countries, what living conditions may a household expect with personal earnings equal to $45,000? e) Would you prefer to live in a country having a positive residual in this regression equation, or a country with a negative residual, if other circumstances are comparable?
65. Expensive cities. The Worldwide Cost of Living Survey City Rankings determine the cost of living in the most expensive cities in the world as an index. This index scales New York City as 100 and expresses the cost of living in other cities as a percentage of the New York cost. For example, in 2007, the cost of living index in Tokyo was 122.1, which means that it was 22% higher than New York. The scatterplot shows the index for 2013 plotted against the 2007 index for the 15 most expensive cities of 2007.
[Scatterplot: Index 2013 vs. Index 2007]
a) Describe the association between cost of living indices in 2007 and 2013. b) The R² for the regression equation is 0.070. Interpret the value of R². c) Find the correlation. d) Using the data provided, find the least squares fit of the 2013 index to the 2007 index. e) Predict the 2013 cost of living index of Moscow and find its residual.
66. Lobster prices 2012. The demand for lobster has grown steadily for several decades. The Maine lobster fishery is carefully controlled to protect the lobster population from over-fishing. The number of fishing licenses and the number of traps are both limited. During the years from 1974 to 2006, the price of lobster also grew, as shown in this plot. a) Describe the increase in the Price of lobster during this period. b) The R² for the regression equation is 87.42%. Interpret the value of R². c) Find the correlation. d) Find the linear model.
[Scatterplot for Exercise 66: Price/lb vs. Year, 1975–2005]
The years from 2007 to 2012 have seen a change in this pattern. Here is the scatterplot.
[Scatterplot: Price/lb vs. Year, 1980–2010]
e) How would you suggest dealing with these new cases to model the change in prices?
67. El Niño. Concern over the weather associated with El Niño has increased interest in the possibility that the climate on Earth is getting warmer. The most common theory relates an increase in atmospheric levels of carbon dioxide (CO2), a greenhouse gas, to increases in temperature. Here is a scatterplot showing the mean annual CO2 concentration in the atmosphere, measured in parts per million (ppm) at the top of Mauna Loa in Hawaii, and the mean annual air temperature over both land and sea across the globe, in degrees Celsius (°C).
[Scatterplot: Mean Temperature (°C) vs. CO2 (ppm)]
A regression predicting Mean Temperature from CO2 produces the following output table (in part).
Dependent variable: Temperature
R-squared = 33.4%
Variable | Coefficient
Intercept | 15.3066
CO2 | 0.004
a) What is the correlation between CO2 and Mean Temperature? b) Explain the meaning of R-squared in this context. c) Give the regression equation. d) What is the meaning of the slope in this equation? e) What is the meaning of the intercept of this equation? f) Here is a scatterplot of the residuals vs. CO2. Does this plot show evidence of the violations of any of the assumptions of the regression model? If so, which ones?
[Residual plot: Residuals vs. CO2 (ppm)]
g) CO2 levels may reach 364 ppm in the near future. What Mean Temperature does the model predict for that value?
68. U.S. birthrates. The table shows the number of live births per 1000 women aged 15–44 years in the United States, starting in 1965. (National Center for Health Statistics, www.cdc.gov/nchs/)
Year | Rate
1965 | 19.4
1970 | 18.4
1975 | 14.8
1980 | 15.9
1985 | 15.6
1990 | 16.4
1995 | 14.8
2000 | 14.4
2005 | 14.0
2010 | 13.0
a) Make a scatterplot and describe the general trend in Birthrates. (Enter Year as years since 1900: 65, 70, 75, etc.) b) Find the equation of the regression line. c) Check to see if the line is an appropriate model. Explain. d) Interpret the slope of the line. e) The table gives rates only at 5-year intervals. Estimate what the rate was in 1978. f) In 1978, the birthrate was actually 15.0. What was the residual? g) Predict what the Birthrate will be in 2012. Comment on your faith in this prediction. h) Predict the Birthrate for 2050. Comment on your faith in this prediction.
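Exercise 68 supplies its data in full, so the least squares line can be computed directly from the usual formulas: the slope is the sum of cross-products of deviations divided by the sum of squared deviations in x, and the intercept makes the line pass through the point of means. The sketch below, in plain Python, is one way to do this under the exercise's own instruction to code Year as years since 1900; the variable names are our own. It also checks that R² equals the square of the correlation, with the correlation taking the sign of the slope, which is the relationship needed whenever an exercise reports R² and asks for the correlation.

```python
import math

# Table from Exercise 68, with Year coded as years since 1900.
years = [65, 70, 75, 80, 85, 90, 95, 100, 105, 110]
rates = [19.4, 18.4, 14.8, 15.9, 15.6, 16.4, 14.8, 14.4, 14.0, 13.0]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(rates) / n

# Sums of squared deviations and cross-products.
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in rates)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, rates))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)

# R-squared as the fraction of variation in the rates accounted for by the line.
fitted = [intercept + slope * x for x in years]
sse = sum((y - f) ** 2 for y, f in zip(rates, fitted))
r_squared = 1 - sse / syy


def predict(years_since_1900):
    """Predicted birthrate from the fitted line."""
    return intercept + slope * years_since_1900


resid_1978 = 15.0 - predict(78)  # observed minus predicted, part f)

print("Slope:", round(slope, 4), "Intercept:", round(intercept, 3))
print("r:", round(r, 3), "R-squared:", round(r_squared, 3))
print("Prediction for 1978:", round(predict(78), 2), "residual:", round(resid_1978, 2))
print("Prediction for 2012:", round(predict(112), 2))
print("Prediction for 2050:", round(predict(150), 2))
```

The code will return a number for 2050 just as readily as for 1978, but 2050 lies 40 years beyond the last observation, so that value is an extrapolation and deserves far less trust.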
Just Checking Answers
1 We know the scores are quantitative. We should check to see if the Linearity Condition and the Outlier Condition are satisfied by looking at a scatterplot of the two scores.
2 It won’t change.
3 It won’t change.
4 They are more likely to do poorly. The positive correlation means that low closing prices for Intel are associated with low closing prices for Cypress.
5 No, the general association is positive, but daily closing prices may vary.
6 For each additional employee, monthly sales increase, on average, $122,740.
7 Thousands of $ per employee.
8 $1,227,400 per month.
9 Differences in the number of employees account for about 71.4% of the variation in the monthly sales.
10 It’s positive. The correlation and the slope have the same sign.
11 R², No. Slope, Yes.
Case Study Paralyzed Veterans of America
Philanthropic organizations often rely on contributions from individuals to finance the work that they do, and a national veterans’ organization is no exception. The Paralyzed Veterans of America (PVA) was founded as a congressionally chartered veterans’ service organization more than 60 years ago. It provides a range of services to veterans who have experienced spinal cord injury or dysfunction. Some of the services offered include medical care, research, education, and accessibility and legal consulting. In 2008, this organization had total revenue of more than $135 million, with more than 99% of this revenue coming from contributions.
An organization that depends so heavily on contributions needs a multifaceted fundraising program, and PVA solicits donations in a number of ways. From its website (www.pva.org), people can make a one-time donation, donate monthly, donate in honor or in memory of someone, and shop in the PVA online store. People can also support one of the charity events, such as its golf tournament, National Veterans Wheelchair Games, and Charity Ride.
Traditionally, one of PVA’s main methods of soliciting funds was the use of return address labels and greeting cards (although still used, this method has declined in recent years). Typically, these gifts were sent to potential donors about every six weeks with a request for a contribution. From its established donors, PVA could expect a response rate of about 5%, which, given the relatively small cost to produce and send the gifts, kept the organization well funded. But fundraising accounts for 28% of expenses, so PVA wanted to know who its donors are, what variables might be useful in predicting whether a donor is likely to give to an upcoming campaign, and what the size of that gift might be.
On your CD is a dataset Case Study 1, which includes data designed to be very similar to part of the data that this organization works with. Here is a description of some of the variables. Keep in mind, however, that in the real dataset, there would be hundreds more variables given for each donor.
Variable Name | Units (if applicable) | Description | Remarks
Age | Years | |
Own Home? | H = Yes; U = No or unknown | |
Children | Counts | |
Income | 1 = Lowest; 7 = Highest | | Based on national medians and percentiles
Sex | M = Male; F = Female | |
Total Wealth | 1 = Lowest; 9 = Highest | | Based on national medians and percentiles
Gifts to Other Orgs | Counts | Number of Gifts (if known) to other philanthropic organizations in the same time period |
Number of Gifts | Counts | Number of Gifts to this organization in this time period |
Time Between Gifts | Months | Time between first and second gifts |
Smallest Gift | $ | Smallest Gift (in $) in the time period | See also Sqrt(Smallest Gift)
Largest Gift | $ | Largest Gift (in $) in the time period | See also Sqrt(Largest Gift)
Previous Gift | $ | Gift (in $) for previous campaign | See also Sqrt(Previous Gift)
Average Gift | $ | Total amount donated divided by total number of gifts | See also Sqrt(Average Gift)
Current Gift | $ | Gift (in $) to organization this campaign | See also Sqrt(Current Gift)
Sqrt(Smallest Gift) | Sqrt($) | Square Root of Smallest Gift in $ |
Sqrt(Largest Gift) | Sqrt($) | Square Root of Largest Gift in $ |
Sqrt(Previous Gift) | Sqrt($) | Square Root of Previous Gift in $ |
Sqrt(Average Gift) | Sqrt($) | Square Root of Average Gift in $ |
Sqrt(Current Gift) | Sqrt($) | Square Root of Current Gift in $ |
Let’s see what the data can tell us. Are there any interesting relationships between the current gift and other variables? Is it possible to use the data to predict who is going to respond to the next direct-mail campaign? Recall that when variables are highly skewed or the relationship between variables is not linear, reporting a correlation coefficient is not appropriate. You may want to consider a transformed version of those variables (square roots are provided for all the variables concerning gifts) or a correlation based on the ranks of the values rather than the values themselves.
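The three options mentioned above (a correlation on the raw values, on square-root transformed values, and on ranks) can be compared side by side. The following is a minimal sketch in plain Python. The gift amounts are made-up illustrative numbers, not values from the Case Study 1 dataset, and the helper names `pearson`, `ranks`, and `spearman` are hypothetical.

```python
import math

def pearson(x, y):
    """Ordinary (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Rank-based correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical, right-skewed gift amounts in dollars (illustration only).
previous_gift = [5, 10, 10, 15, 20, 25, 50, 100, 250]
current_gift = [0, 5, 10, 10, 25, 20, 40, 150, 200]

r_raw = pearson(previous_gift, current_gift)
r_sqrt = pearson([math.sqrt(v) for v in previous_gift],
                 [math.sqrt(v) for v in current_gift])
r_rank = spearman(previous_gift, current_gift)

print(f"raw values:   r = {r_raw:.3f}")
print(f"square roots: r = {r_sqrt:.3f}")
print(f"ranks:        r = {r_rank:.3f}")
```

With skewed data, a few very large gifts can dominate the raw correlation, so it is worth checking whether the three numbers tell the same story before reporting any of them.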
Suggested Study Plan and Questions
Write a report of what you discover about the donors to this organization. Be sure to follow the Plan, Do, Report outline for your report. Include a basic description of each variable (shape, center, and spread), point out any interesting features, and explore the relationships between the variables. In particular you should describe any interesting relationships between the current gift and other variables. Use these questions as a guide:
• Is the age distribution of the clients a typical one found in most businesses?
• Do people who give more often make smaller gifts on average?
• Do people who give to other organizations tend to give to this organization?
• Describe the relationship between the Income and Wealth rankings. How do you explain this relationship (or lack of one)? (Hint: Look at the age distribution.)
• What variables (if any) seem to have an association with the Current Gift? Do you think the organization can use any of these variables to predict the gift for the next campaign?
Optional: This file includes people who did not give to the current campaign. Do your answers to any of the questions above change if you consider only those who gave to this campaign?
Chapter 5 Randomness and Probability
Credit Reports and the Fair Isaac Corporation
You’ve probably never heard of the Fair Isaac Corporation, but they probably know you. Whenever you apply for a loan, a credit card, or even a job, your credit “score” will be used to determine whether you are a good risk. And because the most widely used credit scores are Fair Isaac’s FICO® scores, the company may well be involved in the decision. The Fair Isaac Corporation (FICO) was founded in 1956, with the idea that data, used intelligently, could improve business decision making. Today, Fair Isaac claims that their services provide companies around the world with information for more than 180 billion business decisions a year.
Your credit score is a number between 350 and 850 that summarizes your credit “worthiness.” It’s a snapshot of credit risk today based on your credit history and past behavior. Lenders of all kinds use credit scores to predict behavior, such as how likely you are to make your loan payments on time or to default on a loan. Lenders use the score to determine not only whether to give credit, but also the cost of the credit that they’ll offer. There are no established boundaries, but generally scores over 750 are considered excellent, and applicants with those scores get the best rates. An applicant with a score below 620 is generally considered to be a poor risk. Those with very low scores may be denied credit outright or only offered “subprime” loans at substantially higher rates.
It’s important that you be able to verify the information that your score is based on, but until recently, you could only hope that your score was based on correct information. That changed in 2000, when a California law gave mortgage applicants the right to see their credit scores. Today, the credit industry is more open about giving consumers access to their scores, and the U.S. government, through the Fair and Accurate Credit Transactions Act (FACTA), now guarantees that you can access your credit report at no cost, at least once a year.¹
Companies have to manage risk to survive, but by its nature, risk carries uncertainty. A bank can’t know for certain that you’ll pay your mortgage on time, or at all. What can they do with events they can’t predict? They start with the fact that, although individual outcomes cannot be anticipated with certainty, random phenomena do, in the long run, settle into patterns that are consistent and predictable. It’s this property of random events that makes Statistics practical.
5.1 Random Phenomena and Probability
A random phenomenon consists of trials. Each trial has an outcome. Outcomes combine to make events.
When a customer calls the 800 number of a credit card company, he or she is asked for a card number before being connected with an operator. As the connection is made, the purchase records of that card and the demographic information of the customer are retrieved and displayed on the operator’s screen. If the customer’s FICO score is high enough, the operator may be prompted to “cross-sell” another service, perhaps a new “platinum” card for customers with a credit score of at least 750.
Of course, the company doesn’t know which customers are going to call. Call arrivals are an example of a random phenomenon. With random phenomena, we can’t predict the individual outcomes, but we can hope to understand characteristics of their long-run behavior. We don’t know whether the next caller will qualify for the platinum card, but as calls come into the call center, the company will find that the percentage of callers who qualify for cross-selling will settle into a pattern, like that shown in the graph in Figure 5.1.
Part of a call center operator’s earnings might be based on the number of platinum cards she sells. To figure out what her potential bonus might be, an operator might first want to know what percentage of all callers qualifies. She decides to write down whether the caller from each call she gets today qualifies or not. The first caller today qualified. Then the next five callers’ qualifications were no, yes, yes, no, and no. If we plot the percentage who qualify against the number of calls she’s taken so far, the graph would start at 100% because the first caller qualified (1 out of 1, for 100%). The next caller didn’t qualify, so the accumulated percentage dropped to 50% (1 out of 2). The third caller qualified (2 out of 3, or 67%), then yes again (3 out of 4, or 75%), then no twice in a row (3 out of 5, for 60%, and then 3 out of 6, for 50%), and so on (Table 5.1).
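The running percentages worked out above can be reproduced mechanically. This is a small sketch in Python; the list of outcomes comes from the six callers described in the text, while the function name `running_percentages` is just an illustrative choice.

```python
# Outcomes for the first six callers (True = qualified): yes, no, yes, yes, no, no.
qualified = [True, False, True, True, False, False]

def running_percentages(outcomes):
    """Percentage of callers who qualified after each call so far."""
    percentages = []
    count = 0
    for n, outcome in enumerate(outcomes, start=1):
        if outcome:
            count += 1
        percentages.append(100 * count / n)
    return percentages

running = running_percentages(qualified)
for call, pct in enumerate(running, start=1):
    print(f"After call {call}: {pct:.1f}% qualify")
```

Each new call changes the denominator by one, which is why the percentage moves in smaller and smaller steps as the number of calls grows.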
Each new call is a smaller fraction of the total number, so the percentages change less after each call. After a while, the graph starts to settle down and we can see that the fraction of customers who qualify is about 35% (Figure 5.1).
Figure 5.1 The percentage of credit card customers who qualify for the premium card. (Vertical axis: Percent Qualifying; horizontal axis: Number of Callers.)
Call | FICO Score | Qualify? | Running % Qualify
1 | 750 | Yes | 100
2 | 640 | No | 50
3 | 765 | Yes | 66.7
4 | 780 | Yes | 75
5 | 680 | No | 60
6 | 630 | No | 50
⋮ | ⋮ | | ⋮
Table 5.1 Data on the first six callers showing their FICO score, whether they qualified for the platinum card offer, and a running percentage of callers who qualified.
When talking about long-run behavior, it helps to define our terms. For any random phenomenon, each attempt, or trial, generates an outcome. For the call center, each call is a trial. Something happens on each trial, and we call whatever happens the outcome. Here the outcome is whether the caller qualifies or not. We use the more general term event to refer to outcomes or combinations of outcomes. For example, suppose we categorize callers into 6 risk categories and number these outcomes from 1 to 6 (of increasing credit worthiness). The three outcomes 4, 5, or 6 could make up the event “caller is at least a category 4.” We sometimes talk about the collection of all possible outcomes, a special event that we’ll refer to as the sample space. We denote the sample space S; you may also see the Greek letter Ω used. But whatever symbol we use, the sample space is the set that contains all the possible outcomes. For the calls, if we let Q = qualified and N = not qualified, the sample space is simple: $S = \{Q, N\}$. If we look at two calls together, the sample space has four outcomes: $S = \{QQ, QN, NQ, NN\}$. If we were interested in at least one qualified caller from the two calls, we would be interested in the event (call it A) consisting of the three outcomes QQ, QN, and NQ, and we’d write $A = \{QQ, QN, NQ\}$.
¹ However, the score you see in your report will be an “educational” score intended to show consumers how scoring works. You still have to pay a “reasonable fee” to see your FICO score.
Although we may not be able to predict a particular individual outcome, such as which incoming call represents a potential upgrade sale, we can say a lot about the long-run behavior. Look back at Figure 5.1. If you were asked for the probability that a random caller will qualify, you might say that it was 35% because, in the long run, the percentage of the callers who qualify is about 35%. That’s exactly what we mean by probability.
Probability as Long-Run Frequency: The probability of an event is its long-run relative frequency. A relative frequency is a fraction, so we can write it as $\frac{35}{100}$, as a decimal, 0.35, or as a percentage, 35%.
When we think about what happens with a series of trials, it really simplifies things if the individual trials are independent. Roughly speaking, independence means that the outcome of one trial doesn’t influence or change the outcome of another. Recall that in Chapter 2 we called two variables independent if the value of one categorical variable did not influence the value of another categorical variable.
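The settling-down behavior in Figure 5.1 can be imitated with a short simulation. The sketch below assumes that callers are independent and that each one qualifies with probability 0.35 (the long-run value quoted in the text). The function name, the seed, and the number of simulated calls are arbitrary illustrative choices, not part of the original example.

```python
import random

def simulate_running_fraction(p, n_calls, seed=0):
    """Simulate independent callers and track the running fraction who qualify."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    qualified = 0
    fractions = []
    for n in range(1, n_calls + 1):
        if rng.random() < p:  # this caller qualifies with probability p
            qualified += 1
        fractions.append(qualified / n)
    return fractions

fractions = simulate_running_fraction(p=0.35, n_calls=100_000, seed=1)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} calls: running fraction = {fractions[n - 1]:.3f}")
```

Early values can be far from 0.35, but the later ones should sit close to it, which is the pattern the graph shows.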
(We checked for independence by comparing relative frequency distributions across variables.) There’s no reason to think that whether the one caller qualifies influences whether another caller qualifies, so these are independent trials. We’ll see a more formal definition of independence later in the chapter.
You might think that we just got lucky when the percentage of the qualifying calls settled down to a number. But for independent events, we can depend on a principle called the Law of Large Numbers (LLN), which states that if the events are independent, then as the number of trials increases, the long-run relative frequency of any outcome gets closer and closer to a single value. This gives us the guarantee we need and makes probability a useful concept. Because the LLN guarantees that relative frequencies settle down in the long run, we can give a name to the value that they approach. We call it the probability of that event. For the call center, we can write P(qualified) = 0.35. Because it is based on repeatedly observing the event’s outcome, this definition of probability is often called empirical probability.
Law of Large Numbers (LLN): The long-run relative frequency of repeated, independent events eventually produces the true relative frequency as the number of trials increases.
You may think it’s obvious that the frequency of repeated events settles down in the long run to a single number. The discoverer of the Law of Large Numbers thought so, too. The way Jacob Bernoulli put it was: “For even the most stupid of men is convinced that the more observations have been made, the less danger there is of wandering from one’s goal.”
5.2 The Nonexistent Law of Averages
“Slump? I ain’t in no slump. I just ain’t hittin’.” (Yogi Berra)
The Law of Large Numbers is often misunderstood to be a “law of averages.” Many people believe, for example, that an outcome of a random event that hasn’t occurred in many trials is “due” to occur. The original “dogs of the Dow” strategy for buying stocks recommended buying the 10 worst performing stocks of the 30 that make up the Dow Jones Industrial Average, figuring that these “dogs” were bound to do better next year. The thinking was that, since the relative frequency will settle down to the probability of that outcome in the long run, we’ll have some “catching up” to do. That may seem logical, but random events don’t work that way. In fact, Louis Rukeyser (the former host of Wall Street Week) said of the “dogs of the Dow” strategy, “that theory didn’t work as promised.”
Here’s why. We actually know very little about the behavior of random events in the short run. The fact that we are seeing independent random events makes each individual result impossible to predict. Relative frequencies even out only in the long run. And the long run referred to in the LLN is really long. The “Large” in the law’s name means infinitely large. Sequences of random events don’t compensate in the short run and don’t need to do so to get back to the right long-run probability. Any short-run deviations will be overwhelmed in the long run. If the probability of an outcome doesn’t change and the events are independent, the probability of any outcome in another trial never changes, no matter what has happened in other trials.
Many people confuse the Law of Large Numbers with the so-called Law of Averages, which says that things have to even out in the short run. But even though the Law of Averages doesn’t exist at all, you’ll hear people talk about it as if it does. Is a good hitter in baseball who has struck out the last six times due for a hit his next time up? If the stock market has been down for the last three sessions, is it due to increase today?
No. This isn’t the way random phenomena work. There is no Law of Averages for short runs, no “Law of Small Numbers.” A belief in such a “law” can lead to poor business decisions.
Keno and the Law of Averages
Of course, sometimes an apparent drift from what we expect means that the probabilities are, in fact, not what we thought. If you get 10 heads in a row, maybe the coin has heads on both sides! Here’s a true story that illustrates this. Keno is a simple casino game in which numbers from 1 to 80 are chosen. The numbers, as in most lottery games, are supposed to be equally likely. Payoffs are made depending on how many of those numbers you match on your card. A group of graduate students from a Statistics department decided to take a field trip to Reno. They (very discreetly) wrote down the outcomes of the games for a couple of days, then drove back to test whether the numbers were, in fact, equally likely. It turned out that some numbers were more likely to come up than others. Rather than bet on the Law of Averages and put their money on the numbers that were “due,” the students put their faith in the LLN, and put all their (and their friends’) money on the numbers that had come up before. After they pocketed more than $50,000, they were escorted off the premises and invited never to show their faces in that casino again. Not coincidentally, the ringleader of that group currently makes his living on Wall Street.
“In addition, in time, if the roulette-betting fool keeps playing the game, the bad histories [outcomes] will tend to catch up with him.” (Nassim Nicholas Taleb in Fooled by Randomness)
The Law of Averages Debunked
You’ve just flipped a fair coin and seen six heads in a row. Does the coin “owe” you some tails? Suppose you spend that coin and your friend gets it in change. When she starts flipping the coin, should she expect a run of tails? Of course not. Each flip is a new event. The coin can’t “remember” what it did in the past, so it can’t “owe” any particular outcomes in the future. Just to see how this works in practice, we simulated 100,000 flips of a fair coin on a computer. In our 100,000 “flips,” there were 2981 streaks of at least 5 heads. The “Law of Averages” suggests that the next flip after a run of 5 heads should be tails more often to even things out. Actually, in this particular simulation the next flip was heads more often than tails: 1550 times to 1431 times. That’s about 52% heads. You can perform a similar simulation easily.
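One way to run such a simulation is sketched below in Python. The text does not say exactly how the streaks were counted, so this sketch makes an assumption: every flip that is immediately preceded by at least five consecutive heads counts as one “next flip after a streak.” The seed and function name are arbitrary, and the counts will differ from the 2981 reported above because the random flips differ.

```python
import random

def next_flip_after_streak(n_flips, streak_len=5, seed=0):
    """Count heads and tails on flips that follow at least `streak_len` heads."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    heads_next = 0
    tails_next = 0
    run = 0  # number of consecutive heads immediately before the current flip
    for _ in range(n_flips):
        is_heads = rng.random() < 0.5  # fair coin
        if run >= streak_len:
            if is_heads:
                heads_next += 1
            else:
                tails_next += 1
        run = run + 1 if is_heads else 0
    return heads_next, tails_next

heads_next, tails_next = next_flip_after_streak(100_000, streak_len=5, seed=2)
total = heads_next + tails_next
print(f"flips following a streak of 5+ heads: {total}")
print(f"heads on the next flip: {heads_next} ({100 * heads_next / total:.1f}%)")
print(f"tails on the next flip: {tails_next} ({100 * tails_next / total:.1f}%)")
```

If the coin “owed” tails, the tails percentage would be clearly above 50%. In runs like this it hovers near 50%, sometimes a little above and sometimes a little below.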
Just Checking
1 It has been shown that the stock market fluctuates randomly. Nevertheless, some investors believe that they should buy right after a day when the market goes down because it is bound to go up soon. Explain why this is faulty reasoning.
5.3 Different Types of Probability
Model-Based (Theoretical) Probability
We’ve discussed empirical probability: the relative frequency of an event’s occurrence taken as the probability of the event. There are other ways to define probability as well. Probability was first studied extensively by a group of French mathematicians who were interested in games of chance. Rather than experiment with the games and risk losing their money, they developed mathematical models of probability. To make things simple (as we usually do when we build models), they started by looking at games in which the different outcomes were equally likely. Fortunately, many games of chance are like that. Any of 52 cards is equally likely to be the next one dealt from a well-shuffled deck. Each face of a die is equally likely to land up (or at least it should be).
Model-Based Probability: We can write
\[ P(A) = \frac{\text{\# of outcomes in } A}{\text{total \# of outcomes}} \]
and call this the (theoretical) probability of the event.
When we have equally likely outcomes, we write the (theoretical) probability of an event A as $P(A) = \dfrac{\text{\# of outcomes in } A}{\text{total \# of outcomes possible}}$. When outcomes are equally likely, the probability that one of them occurs is easy to compute: it’s just 1 divided by the number of possible outcomes. So the probability of rolling a 3 with a fair die is one in six, which we write as 1/6. The probability of picking the ace of spades from a well-shuffled deck is 1/52.
It’s almost as simple to find probabilities for events that are made up of several equally likely outcomes. We just count all the outcomes that the event contains. The probability of the event is the number of outcomes in the event divided by the total number of possible outcomes. For example, Pew Research² reports that of 10,190 randomly generated working phone numbers called for a survey, the initial results of the calls were as follows:
Result | Number of Calls
No Answer | 311
Busy | 61
Answering Machine | 1336
Callbacks | 189
Other Non-Contacts | 893
Contacted Numbers | 7400
Equally Likely? In an attempt to understand why someone would buy a lottery ticket, an interviewer asked someone who had just purchased one “What do you think your chances are of winning the lottery?” The reply was, “Oh, about 50–50.” The shocked interviewer asked, “How do you get that?” to which the response was, “Well, the way I figure it, either I win or I don’t!” The moral of this story is that events are not always equally likely.
The phone numbers were generated randomly, so each was equally likely. To find the probability of a contact, we just divide the number of contacts by the number of calls: 7400/10,190 = 0.7262. But don’t get trapped into thinking that random events are always equally likely. The chance of winning a lottery, especially one with a very large payoff, is small. Regardless, people continue to buy tickets.
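The counting argument for the phone survey can be written out as a short calculation. In this Python sketch the counts are the ones from the table of call results; the dictionary and the `probability` helper are illustrative names, and the second event (“no answer or busy”) is an extra example not worked in the text.

```python
# Initial results of the 10,190 randomly generated phone numbers.
call_results = {
    "No Answer": 311,
    "Busy": 61,
    "Answering Machine": 1336,
    "Callbacks": 189,
    "Other Non-Contacts": 893,
    "Contacted Numbers": 7400,
}

total_calls = sum(call_results.values())

def probability(event_outcomes):
    """P(event) = (# of calls in the event) / (total # of calls).

    Valid because every phone number was equally likely to be chosen.
    """
    return sum(call_results[name] for name in event_outcomes) / total_calls

p_contact = probability(["Contacted Numbers"])
p_no_answer_or_busy = probability(["No Answer", "Busy"])

print(f"total calls: {total_calls}")
print(f"P(contact) = {p_contact:.4f}")
print(f"P(no answer or busy) = {p_no_answer_or_busy:.4f}")
```

The same helper works for any event built from the listed results, because an event is just a collection of outcomes whose counts are added.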
Personal Probability
What’s the probability that gold will sell for more than $2000 an ounce at the end of next year? You may be able to come up with a number that seems reasonable. Of course, no matter what your guess is, your probability should be between 0 and 1. In our discussion of probability, we’ve defined probability in two ways: 1) in terms of the relative frequency (the fraction of times) that an event occurs in the long run or 2) as the number of outcomes in the event divided by the total number of outcomes. Neither situation applies to your assessment of gold’s chances of selling for more than $2000.
We use the language of probability in everyday speech to express a degree of uncertainty without necessarily basing it on long-run relative frequencies. Your personal assessment of an event expresses your uncertainty about the outcome. We call this kind of probability a subjective, or personal, probability. Although personal probabilities may be based on experience, they are typically not based on long-run relative frequencies or on equally likely events. But, like empirical and theoretical probabilities, they need to satisfy the rules that we’ll discuss in the next section.
² www.pewinternet.org/pdfs/PIP_Digital_Footprints.pdf.
5.4 Probability Rules
Notation Alert: We often represent events with capital letters (such as A and B), so P(A) means “the probability of event A.”
“Baseball is 90% mental. The other half is physical.” (Yogi Berra)
For some people, the phrase “50/50” means something vague like “I don’t know” or “whatever.” But when we discuss probabilities, 50/50 has the precise meaning that two outcomes are equally likely. Speaking vaguely about probabilities can get you into trouble, so it’s wise to develop some formal rules about how probability works. These rules apply to probability whether we’re dealing with empirical, theoretical, or personal probability.
Rule 1. If the probability of an event occurring is 0, the event won’t occur; likewise if the probability is 1, the event will always occur. Even if you think an event is very unlikely, its probability can’t be negative, and even if you’re sure it will happen, its probability can’t be greater than 1. So we require that:
A probability is a number between 0 and 1. For any event A, $0 \le P(A) \le 1$.
Rule 2. If a random phenomenon has only one possible outcome, it’s not very interesting (or very random). So we need to distribute the probabilities among all the outcomes a trial can have. How can we do that so that it makes sense? For example, consider the behavior of a certain stock. The possible daily outcomes might be:
A: The stock price goes up.
B: The stock price goes down.
C: The stock price remains the same.
When we assign probabilities to these outcomes, we should be sure to distribute all of the available probability. Something always occurs, so the probability of something happening is 1. This is called the Probability Assignment Rule:
The probability of the set of all possible outcomes must be 1. $P(S) = 1$, where S is the sample space.
Rule 3. Suppose the probability that you get to class on time is 0.8. What’s the probability that you don’t get to class on time? Yes, it’s 0.2. The set of outcomes that are not in the event A is called the “complement” of A and is denoted $A^C$. This leads to the Complement Rule:
The probability of an event occurring is 1 minus the probability that it doesn’t occur. $P(A) = 1 - P(A^C)$
Figure: The set A and its complement $A^C$. Together, they make up the entire sample space S.
For Example
Applying the Complement Rule
Lee’s Lights sells lighting fixtures. Some customers are there only to browse, so Lee records the behavior of all customers for a week to assess how likely it is that a customer will make a purchase. Lee finds that of 1000 customers entering the store during the week, 300 make purchases. Lee concludes that the probability of a customer making a purchase is 0.30.
Question If P(purchase) = 0.30, what is the probability that a customer doesn’t make a purchase?
Answer Because “no purchase” is the complement of “purchase,” $P(\text{no purchase}) = 1 - P(\text{purchase}) = 1 - 0.30 = 0.70$. There is a 70% chance a customer won’t make a purchase.
Rule 4. Whether or not a caller qualifies for a platinum card is a random outcome. Suppose the probability of qualifying is 0.35. What’s the chance that the next two callers qualify? The Multiplication Rule says that to find the probability that two independent events occur, we multiply the probabilities. For two independent events A and B, the probability that both A and B occur is the product of the probabilities of the two events:
$P(A \text{ and } B) = P(A) \times P(B)$, provided that A and B are independent.
Thus if A = {customer 1 qualifies} and B = {customer 2 qualifies}, the chance that both qualify is: $0.35 \times 0.35 = 0.1225$. Of course, to calculate this probability, we have used the assumption that the two events are independent. We’ll expand the multiplication rule to be more general later in this chapter.
For Example
Using the Multiplication Rule
Lee knows that the probability that a customer will make a purchase is 30%.
Question If we can assume that customers behave independently, what is the probability that the next two customers entering Lee’s Lights both make purchases?
Answer Because the events are independent, we can use the multiplication rule. $P(\text{first customer makes a purchase and second customer makes a purchase}) = P(\text{purchase}) \times P(\text{purchase}) = 0.30 \times 0.30 = 0.09$. There’s about a 9% chance that the next two customers will both make purchases.
Figure: Two disjoint sets, A and B.
Rule 5. Suppose the card center operator has more options. She can A: offer a special travel deal, B: offer a platinum card, or C: decide to send information about a new affinity card. If she can do one, but only one, of these, then these outcomes are disjoint (or mutually exclusive). To see whether two events are disjoint, we separate them into their component outcomes and check whether they have any outcomes in common. For example, if the operator can choose to both offer the travel deal and send the affinity card information, those events would not be disjoint. The Addition Rule allows us to add the probabilities of disjoint events to get the probability that either event occurs:
$P(A \text{ or } B) = P(A) + P(B)$, provided that A and B are disjoint.
Thus the probability that the caller either is offered a platinum card or is sent the affinity card information is the sum of the two probabilities, since the events are disjoint.
Figure: Two sets A and B that are not disjoint. The event (A and B) is their intersection.
For Example
Using the Addition Rule
Some customers prefer to see the merchandise but then make their purchase later using Lee’s Lights’ website. Lee offers a promotion to attempt to track customer behavior. Customers leaving the store without making a purchase are offered a “bonus code” to use at the Internet site. Using these codes, Lee determines that there’s a 9% chance of a customer making a purchase using the code later. We know that about 30% of customers make purchases when they enter the store.
Question What is the probability that a customer who enters the store will not make a purchase at all?
Answer We can use the Addition Rule because the alternatives “no purchase at all,” “purchase in the store,” and “purchase online” are disjoint events.
$P(\text{purchase in the store or online}) = P(\text{purchase in store}) + P(\text{purchase online}) = 0.30 + 0.09 = 0.39$
$P(\text{no purchase at all}) = P(\text{not (purchase in the store or purchase online)}) = 1 - P(\text{in store or online}) = 1 - 0.39 = 0.61$
Notation Alert: You may see the event (A or B) written as (A ∪ B). The symbol ∪ means “union” and represents the outcomes in event A or event B. Similarly the symbol ∩ means “intersection” and represents outcomes that are in both event A and event B. You may see the event (A and B) written as (A ∩ B).
Rule 6. Suppose we would like to know the probability that either of the next two callers qualifies for a platinum card. We know $P(A) = P(B) = 0.35$, but $P(A \text{ or } B)$ is not simply the sum $P(A) + P(B)$ because the events A and B are not disjoint in this case. Both customers could qualify. So we need a new probability rule. We can’t simply add the probabilities of A and B because that would count the outcome of both customers qualifying twice. So, if we started by adding the two probabilities, we could compensate by subtracting out the probability that both qualify. In other words,
\[
\begin{aligned}
P(\text{customer A or customer B qualifies}) &= P(\text{customer A qualifies}) + P(\text{customer B qualifies}) - P(\text{both customers qualify}) \\
&= (0.35) + (0.35) - (0.35 \times 0.35) \quad \text{(since events are independent)} \\
&= (0.35) + (0.35) - (0.1225) \\
&= 0.5775
\end{aligned}
\]
It turns out that this method works in general. We add the probabilities of two events and then subtract out the probability of their intersection. This gives us the General Addition Rule, which does not require disjoint events:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$ for any two events A and B.
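The platinum-card calculation can be checked by listing the sample space for two callers and adding up outcome probabilities directly. This Python sketch assumes, as the text does, that the two callers are independent with P(qualify) = 0.35; the variable and function names are illustrative.

```python
from itertools import product

p_qualify = 0.35
p_single = {"Q": p_qualify, "N": 1 - p_qualify}  # Q = qualified, N = not qualified

# Sample space for two callers: QQ, QN, NQ, NN.
# Each outcome's probability uses the Multiplication Rule (independence assumed).
sample_space = {
    first + second: p_single[first] * p_single[second]
    for first, second in product("QN", repeat=2)
}

def prob(event):
    """Probability of an event, given as a set of outcomes from the sample space."""
    return sum(sample_space[outcome] for outcome in event)

A = {o for o in sample_space if o[0] == "Q"}  # first caller qualifies
B = {o for o in sample_space if o[1] == "Q"}  # second caller qualifies

by_enumeration = prob(A | B)                          # add up QQ, QN, NQ directly
by_general_rule = prob(A) + prob(B) - prob(A & B)     # General Addition Rule
by_complement = 1 - prob({"NN"})                      # Complement Rule

print(f"P(A or B) by enumeration:           {by_enumeration:.4f}")
print(f"P(A or B) by General Addition Rule: {by_general_rule:.4f}")
print(f"P(A or B) by Complement Rule:       {by_complement:.4f}")
```

All three routes give the same value, which is a useful sanity check: subtracting P(A and B) removes exactly the double-counted outcome QQ.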
For Example
Using the General Addition Rule
Lee notices that when two customers enter the store together, their purchases are not disjoint. In fact, there’s a 20% chance they’ll both make a purchase.
Question When two customers enter the store together, what is the probability that at least one of them makes a purchase?
Answer Now we know that the events are not disjoint, so we must use the General Addition Rule.
$P(\text{at least one purchases}) = P(\text{A purchases or B purchases}) = P(\text{A purchases}) + P(\text{B purchases}) - P(\text{A and B both purchase}) = 0.30 + 0.30 - 0.20 = 0.40$
Just Checking
2 Even successful companies sometimes make products with high failure rates. One (in)famous example is the Apple 40GB click wheel iPod, which used a tiny disk drive for storage. According to Macintouch.com, 30% of those devices eventually failed. It is reasonable to assume that the failures were independent. What would a store that sold these devices have seen?
a) What is the probability that a particular 40GB click wheel iPod failed?
b) What is the probability that two 40GB click wheel iPods sold together both failed?
c) What is the probability that the store’s first failure problem was the third one they sold?
d) What is the probability the store had a failure problem with at least one of the five that they sold on a particular day?
Guided Example
M&M’s Modern Market Research
In 1941, when M&M’s® milk chocolate candies were introduced to American GIs in World War II, there were six colors: brown, yellow, orange, red, green, and violet. Mars®, the company that manufactures M&M’s, has used the introduction of a new color as a marketing and advertising event several times in the years since then. In 1980, the candy went international, adding 16 countries to their markets. In 1995, the company conducted a “worldwide survey” to vote on a new color. Over 10 million people voted to add blue. They even got the lights of the Empire State Building in New York City to glow blue to help announce the addition. In 2002, they used the Internet to help pick a new color. Children from over 200 countries were invited to respond via the Internet, telephone, or mail. Millions of voters chose among purple, pink, and teal. The global winner was purple, and for a brief time, purple M&M’s could be found in packages worldwide (although in 2013, the colors were brown, yellow, red, blue, orange, and green). In the United States, 42% of those who voted said purple, 37% said teal, and only 19% said pink. But in Japan the percentages were 38% pink, 36% teal, and only 16% purple. Let’s use Japan’s percentages to ask some questions.
1. What’s the probability that a Japanese M&M’s survey respondent selected at random preferred either pink or teal?
2. If we pick two respondents at random, what’s the probability that they both selected purple?
3. If we pick three respondents at random, what’s the probability that at least one preferred purple?
Plan

Setup: The probability of an event is its long-term relative frequency. This can be determined in several ways: by looking at many replications of an event, by deducing it from equally likely events, or by using some other information. Here, we are told the relative frequencies of the three responses.

The M&M’s website reports the proportions of Japanese votes by color. These give the probability of selecting a voter who preferred each of the colors:

$$P(\text{pink}) = 0.38 \qquad P(\text{teal}) = 0.36 \qquad P(\text{purple}) = 0.16$$

Make sure the probabilities are legitimate. Here, they’re not. Either there was a mistake or the other voters must have chosen a color other than the three given. A check of other countries shows a similar deficit, so probably we’re seeing those who had no preference or who wrote in another color.

Each is between 0 and 1, but these don’t add up to 1. The remaining 10% of the voters must have not expressed a preference or written in another color. We’ll put them together into “other” and add $P(\text{other}) = 0.10$. With this addition, we have a legitimate assignment of probabilities.
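The legitimacy check described here is mechanical enough to script. The following is a minimal Python sketch, not part of the original example: the function name `complete_assignment` and the tolerance values are our own assumptions. It confirms that each probability lies between 0 and 1 and assigns any shortfall to an “other” category.

```python
# Reported Japanese vote shares from the guided example.
reported = {"pink": 0.38, "teal": 0.36, "purple": 0.16}

def complete_assignment(probs, other_label="other"):
    """Return a copy of probs with any missing probability assigned to other_label.

    Hypothetical helper for illustration; raises if the assignment cannot be legitimate.
    """
    if any(p < 0 or p > 1 for p in probs.values()):
        raise ValueError("each probability must be between 0 and 1")
    shortfall = 1 - sum(probs.values())
    if shortfall < -1e-9:
        raise ValueError("probabilities sum to more than 1")
    completed = dict(probs)
    if shortfall > 1e-9:
        # Round only to suppress floating-point noise in the remainder.
        completed[other_label] = round(shortfall, 10)
    return completed

japan = complete_assignment(reported)
print(japan)
```

With the Japanese figures, the sketch adds an “other” entry of 0.10, matching the assignment used in the rest of the example.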
Question 1: What’s the probability that a Japanese M&M’s survey respondent selected at random preferred either pink or teal?
Plan

Setup: Decide which rules to use and check the conditions they require.

The events “pink” and “teal” are individual outcomes (a respondent can’t choose both colors), so they are disjoint. We can apply the General Addition Rule anyway.

Do

Mechanics: Show your work.

$$P(\text{pink or teal}) = P(\text{pink}) + P(\text{teal}) - P(\text{pink and teal}) = 0.38 + 0.36 - 0 = 0.74$$

The probability that both pink and teal were chosen is zero, since respondents were limited to one choice.

Report

Conclusion: Interpret your results in the proper context.

The probability that the respondent said pink or teal is 0.74.
Question 2: If we pick two respondents at random, what’s the probability that they both said purple?
Plan

Setup: The word “both” suggests we want $P(A \text{ and } B)$, which calls for the Multiplication Rule. Check the required condition.

Independence: It’s unlikely that the choice made by one respondent affected the choice of the other, so the events seem to be independent. We can use the Multiplication Rule.

Do

Mechanics: Show your work. For both respondents to pick purple, each one has to pick purple.

$$P(\text{both purple}) = P(\text{first respondent picks purple and second respondent picks purple}) = P(\text{first respondent picks purple}) \times P(\text{second respondent picks purple}) = 0.16 \times 0.16 = 0.0256$$

Report

Conclusion: Interpret your results in the proper context.

The probability that both respondents pick purple is 0.0256.
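The arithmetic in this guided example is short enough to check by machine. The following Python sketch (the variable names are ours, not the text’s) evaluates the Addition Rule and Multiplication Rule results for Questions 1 and 2, together with the “at least one” complement calculation that Question 3 calls for. It assumes, as the example does, that respondents answer independently.

```python
# Japanese vote shares from the guided example.
p = {"pink": 0.38, "teal": 0.36, "purple": 0.16}

# Question 1: General Addition Rule. The joint term is 0 because
# each respondent could choose only one color.
p_pink_or_teal = p["pink"] + p["teal"] - 0.0

# Question 2: Multiplication Rule for two independent respondents.
p_both_purple = p["purple"] * p["purple"]

# Question 3: Multiplication Rule for "none of three picked purple",
# then the Complement Rule for "at least one picked purple".
p_none_purple = (1 - p["purple"]) ** 3
p_at_least_one_purple = 1 - p_none_purple

print(round(p_pink_or_teal, 4))         # 0.74
print(round(p_both_purple, 4))          # 0.0256
print(round(p_at_least_one_purple, 4))  # 0.4073
```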
Question 3: If we pick three respondents at random, what’s the probability that at least one preferred purple?
Plan

Setup: The phrase “at least one” often flags a question best answered by looking at the complement, and that’s the best approach here. The complement of “at least one preferred purple” is “none of them preferred purple.” Check the conditions.

$$P(\text{at least one picked purple}) = P(\{\text{none picked purple}\}^{C}) = 1 - P(\text{none picked purple})$$

Independence: These are independent events because they are choices by three random respondents. We can use the Multiplication Rule.

Do

Mechanics: We calculate P(none purple) by using the Multiplication Rule.

$$P(\text{none picked purple}) = P(\text{first not purple}) \times P(\text{second not purple}) \times P(\text{third not purple}) = [P(\text{not purple})]^{3}$$

$$P(\text{not purple}) = 1 - P(\text{purple}) = 1 - 0.16 = 0.84$$

So $P(\text{none picked purple}) = (0.84)^{3} = 0.5927$.

Then we can use the Complement Rule to get the probability we want.

$$P(\text{at least 1 picked purple}) = 1 - P(\text{none picked purple}) = 1 - 0.5927 = 0.4073$$

Report

Conclusion: Interpret your results in the proper context.

There’s about a 40.7% chance that at least one of the respondents picked purple.

5.5 Joint Probability and Contingency Tables

As part of a Pick Your Prize Promotion, a chain store invited customers to choose which of three prizes they’d like to win (while providing name, address, phone number, and e-mail address). At one store, the responses could be placed in the contingency table in Table 5.2.
| Sex / Prize preference | MP3 | Camera | Bike | Total |
|---|---|---|---|---|
| Man | 117 | 50 | 60 | 227 |
| Woman | 130 | 91 | 30 | 251 |
| Total | 247 | 141 | 90 | 478 |

Table 5.2 Prize preference for 478 customers.
If the winner is chosen at random from these customers, the probability we select a woman is just the corresponding relative frequency (since we’re equally likely to select any of the 478 customers). There are 251 women in the data out of a total of 478, giving a probability of:

$$P(\text{woman}) = 251/478 = 0.525$$

This is called a marginal probability because it depends only on totals found in the margins of the table. A marginal probability uses a marginal frequency (from either the Total row or Total column) to compute the probability.

The same method works for more complicated events. For example, what’s the probability of selecting a woman whose preferred prize is the camera? Well, 91 women named the camera as their preference, so the probability is:

$$P(\text{woman and camera}) = 91/478 = 0.190$$

Probabilities such as these are called joint probabilities because they give the probability of two events occurring together. The probability of selecting a customer whose preferred prize is a bike is:

$$P(\text{bike}) = 90/478 = 0.188$$
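Marginal and joint probabilities can be read off the counts in Table 5.2 with a few lines of code. The sketch below is illustrative only; the dictionary layout and the helper names (`joint`, `marginal_sex`, `marginal_prize`) are our own choices, not anything prescribed by the text.

```python
# Counts from Table 5.2: rows are sex, columns are preferred prize.
counts = {
    "man":   {"mp3": 117, "camera": 50, "bike": 60},
    "woman": {"mp3": 130, "camera": 91, "bike": 30},
}
total = sum(sum(row.values()) for row in counts.values())  # 478 customers

def joint(sex, prize):
    """Joint probability: one interior cell divided by the grand total."""
    return counts[sex][prize] / total

def marginal_sex(sex):
    """Marginal probability from a row total."""
    return sum(counts[sex].values()) / total

def marginal_prize(prize):
    """Marginal probability from a column total."""
    return sum(row[prize] for row in counts.values()) / total

print(round(marginal_sex("woman"), 3))     # 0.525
print(round(joint("woman", "camera"), 3))  # 0.19
print(round(marginal_prize("bike"), 3))    # 0.188
```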
For Example: Marginal probabilities

Lee suspects that men and women make different kinds of purchases at Lee’s Lights (see the example on page 181). The table shows the purchases made by the last 100 customers.

| | Utility Lighting | Fashion Lighting | Total |
|---|---|---|---|
| Men | 40 | 20 | 60 |
| Women | 10 | 30 | 40 |
| Total | 50 | 50 | 100 |

Question: What’s the probability that one of Lee’s customers is a woman? What is the probability that a random customer is a man who purchases fashion lighting?

Answer: From the marginal totals we can see that 40% of Lee’s customers are women, so the probability that a customer is a woman is 0.40. The cell of the table for Men who purchase Fashion lighting has 20 of the 100 customers, so the probability of that event is 0.20.
5.6 Conditional Probability

Since our sample space is these 478 customers, we can recognize the relative frequencies as probabilities. What if we are given the information that the selected customer is a woman? Would that change the probability that the selected customer’s preferred prize is a bike? You bet it would! The pie charts in Figure 5.2 on the next page show that women are much less likely to say their preferred prize is a bike than are men. When we restrict our focus to women, we look only at the women’s row of the table, which gives the conditional distribution of preferred prizes given “woman.” Of the 251 women, only 30 of them said their preferred prize was a bike. We write the probability that a selected customer wants a bike given that we have selected a woman as:

$$P(\text{bike} \mid \text{woman}) = 30/251 = 0.120$$
Figure 5.2 Conditional distributions of Prize Preference for Women and for Men (pie charts with slices for Bike, MP3, and Camera).

For men, we look at the conditional distribution of preferred prizes given “man” shown in the top row of the table. There, of the 227 men, 60 said their preferred prize was a bike. So, $P(\text{bike} \mid \text{man}) = 60/227 = 0.264$, more than twice the women’s probability (see Figure 5.2). In general, when we want the probability of an event from a conditional distribution, we write $P(B \mid A)$ and pronounce it “the probability of B given A.” A probability that takes into account a given condition such as this is called a conditional probability.

Let’s look at what we did. We worked with the counts, but we could work with the probabilities just as well. There were 30 women who selected a bike as a prize, and there were 251 women customers. So we found the probability to be 30/251. To find the probability of the event B given the event A, we restrict our attention to the outcomes in A. We then find in what fraction of those outcomes B also occurred. Formally, we write:

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$

We can use the formula directly with the probabilities derived from the contingency table (Table 5.2) to find:

$$P(\text{bike} \mid \text{woman}) = \frac{P(\text{bike and woman})}{P(\text{woman})} = \frac{30/478}{251/478} = \frac{30}{251} = 0.120$$

as before.
The formula for conditional probability requires one restriction. The formula works only when the event that’s given has probability greater than 0. The formula doesn’t work if P(A) is 0 because that would mean we had been “given” the fact that A was true even though the probability of A is 0, which would be a contradiction.

Notation Alert: $P(B \mid A)$ is the conditional probability of B given A.

Rule 7. Remember the Multiplication Rule for the probability of A and B? It said

$$P(A \text{ and } B) = P(A) \times P(B)$$

when A and B are independent. Now we can write a more general rule that doesn’t require independence. In fact, we’ve already written it. We just need to rearrange the equation a bit. The equation in the definition for conditional probability contains the probability of A and B. Rearranging the equation gives the General Multiplication Rule for compound events that does not require the events to be independent:

$$P(A \text{ and } B) = P(A) \times P(B \mid A)$$

for any two events A and B. The probability that two events, A and B, both occur is the probability that event A occurs multiplied by the probability that event B also occurs given that event A occurs. Of course, there’s nothing special about which event we call A and which one we call B. We should be able to state this the other way around. Indeed we can. It is equally true that:

$$P(A \text{ and } B) = P(B) \times P(A \mid B)$$

Let’s return to the question of just what it means for events to be independent. We said informally in Chapter 2 that what we mean by independence is that the outcome of one event does not influence the probability of the other. With our new notation for conditional probabilities, we can write a formal definition. Events A and B are independent whenever:

$$P(B \mid A) = P(B)$$
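The conditional probability formula and the General Multiplication Rule can both be checked against the counts in Table 5.2. The Python sketch below is one way to do so; the helper `conditional` is a hypothetical name of ours, and its guard mirrors the restriction that the given event must have probability greater than 0.

```python
# Counts from Table 5.2 (rows: sex, columns: preferred prize).
counts = {
    "man":   {"mp3": 117, "camera": 50, "bike": 60},
    "woman": {"mp3": 130, "camera": 91, "bike": 30},
}
total = sum(sum(row.values()) for row in counts.values())

def conditional(p_a_and_b, p_a):
    """P(B | A) = P(A and B) / P(A); defined only when P(A) > 0."""
    if p_a <= 0:
        raise ValueError("P(A) must be greater than 0")
    return p_a_and_b / p_a

p_woman = sum(counts["woman"].values()) / total
p_bike = sum(row["bike"] for row in counts.values()) / total
p_bike_and_woman = counts["woman"]["bike"] / total

p_bike_given_woman = conditional(p_bike_and_woman, p_woman)
p_woman_given_bike = conditional(p_bike_and_woman, p_bike)

print(round(p_bike_given_woman, 3))  # 0.12

# General Multiplication Rule, stated both ways round,
# recovers the same joint probability 30/478.
print(round(p_woman * p_bike_given_woman, 4))  # 0.0628
print(round(p_bike * p_woman_given_bike, 4))   # 0.0628
```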
Independence If we had to pick one key idea in this chapter that you should understand and remember, it’s the definition and meaning of independence.
Now we can see that the Multiplication Rule for independent events is just a special case of the General Multiplication Rule. The general rule says $P(A \text{ and } B) = P(A) \times P(B \mid A)$ whether the events are independent or not. But when events A and B are independent, we can write $P(B)$ for $P(B \mid A)$ and we get back our simple rule: $P(A \text{ and } B) = P(A) \times P(B)$. Sometimes people use this statement as the definition of independent events, but we find the other definition more intuitive. When events are independent, the fact that one has occurred does not affect the probability of the other.

Using our earlier example, is the probability of the event choosing a bike independent of the sex of the customer? We need to check whether

$$P(\text{bike} \mid \text{man}) = \frac{P(\text{bike and man})}{P(\text{man})} = \frac{0.126}{0.475} = 0.264$$

is the same as $P(\text{bike}) = 0.188$. Because these probabilities aren’t equal, we can say that prize preference is not independent of the sex of the customer. Whenever at least one of the joint probabilities in the table is not equal to the product of the marginal probabilities, we say that the variables are not independent.
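The cell-by-cell criterion in the last sentence translates directly into a loop. The following sketch applies it to Table 5.2; the function name `is_independent` and its tolerance are our own assumptions, and the tolerance is there only to absorb floating-point rounding, not to judge sampling variability.

```python
# Counts from Table 5.2 (rows: sex, columns: preferred prize).
counts = {
    "man":   {"mp3": 117, "camera": 50, "bike": 60},
    "woman": {"mp3": 130, "camera": 91, "bike": 30},
}
total = sum(sum(row.values()) for row in counts.values())
row_totals = {sex: sum(row.values()) for sex, row in counts.items()}
col_totals = {prize: sum(row[prize] for row in counts.values())
              for prize in counts["man"]}

def is_independent(tol=1e-9):
    """True only if every joint probability equals the product of its marginals."""
    for sex, row in counts.items():
        for prize, n in row.items():
            joint = n / total
            product = (row_totals[sex] / total) * (col_totals[prize] / total)
            if abs(joint - product) > tol:
                return False
    return True

p_bike = col_totals["bike"] / total
p_bike_given_man = counts["man"]["bike"] / row_totals["man"]

print(round(p_bike_given_man, 3), round(p_bike, 3))  # 0.264 0.188
print(is_independent())                              # False
```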
Independent vs. Disjoint

Are disjoint events independent? Both concepts seem to have similar ideas of separation and distinctness about them, but in fact disjoint events cannot be independent.³ Let’s see why. Consider the two disjoint events {you get an A in this course} and {you get a B in this course}. They’re disjoint because they have no outcomes in common. Suppose you learn that you did get an A in the course. Now what is the probability that you got a B? You can’t get both grades, so it must be 0. Think about what that means. The fact that the first event (getting an A) occurred changed the probability for the second event (down to 0). So these events aren’t independent. Mutually exclusive events can never be independent. They have no outcomes in common, so knowing that one occurred means the other didn’t. A common error is to treat disjoint events as if they were independent and apply the Multiplication Rule for independent events. Don’t make that mistake.

Are events A and B independent or disjoint?

Independent
• Check whether $P(B \mid A) = P(B)$, or
• Check whether $P(A \mid B) = P(A)$, or
• Check whether $P(A \text{ and } B) = P(A) \times P(B)$

Disjoint
• Check whether $P(A \text{ and } B) = 0$, or
• Check whether events A and B overlap in a sample space diagram, or
• Check whether the two events can occur together

³ Technically two disjoint events can be independent, but only if the probability of one of the events is 0. For practical purposes, we can ignore this case, since we don’t anticipate collecting data about things that don’t happen.
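The argument from the grades example can be compressed into one line of algebra. As a sketch, assume A and B are disjoint and that both have probability greater than 0 (the case the footnote sets aside is excluded):

```latex
P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{0}{P(A)} = 0 \neq P(B)
```

Because $P(B \mid A)$ differs from $P(B)$, the formal definition of independence fails for any such pair of disjoint events.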
For Example: Conditional probability

Question: Using the table from the example on page 187, if a customer purchases a Fashion light, what is the probability that the customer is a woman?

Answer:

$$P(\text{Woman} \mid \text{Fashion}) = \frac{P(\text{Woman and Fashion})}{P(\text{Fashion})} = \frac{0.30}{0.50} = 0.60$$
5.7 Constructing Contingency Tables

Sometimes we’re given probabilities without a contingency table. You can often construct a simple table to correspond to the probabilities. A survey of real estate in upstate New York classified homes into two price categories (Low: less than $175,000; High: over $175,000). It also noted whether the houses had at least 2 bathrooms or not (True or False). We are told that 56% of the houses had at least 2 bathrooms, 62% of the houses were Low priced, and 22% of the houses were both. That’s enough information to fill out the table. Translating the percentages to probabilities, we have:

| Price | At Least 2 Bathrooms: True | At Least 2 Bathrooms: False | Total |
|---|---|---|---|
| Low | 0.22 | | 0.62 |
| High | | | |
| Total | 0.56 | | 1.00 |

The 0.56 and 0.62 are marginal probabilities, so they go in the margins. What about the 22% of houses that were both low priced and had at least 2 bathrooms? That’s a joint probability, so it belongs in the interior of the table. Because the cells of the table show disjoint events, the probabilities always add to the marginal totals going across rows or down columns.
| Price | At Least 2 Bathrooms: True | At Least 2 Bathrooms: False | Total |
|---|---|---|---|
| Low | 0.22 | 0.40 | 0.62 |
| High | 0.34 | 0.04 | 0.38 |
| Total | 0.56 | 0.44 | 1.00 |

Now, finding any other probability is straightforward. For example, what’s the probability that a high-priced house has at least 2 bathrooms?

$$P(\text{at least 2 bathrooms} \mid \text{high-priced}) = \frac{P(\text{at least 2 bathrooms and high-priced})}{P(\text{high-priced})} = \frac{0.34}{0.38} = 0.895 \text{ or } 89.5\%$$
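Filling in the table is a matter of subtracting from the margins, which the following Python sketch makes explicit. The function `fill_two_by_two` and its parameter names are hypothetical, introduced only for this illustration; the cell labels are specific to the housing example.

```python
def fill_two_by_two(p_row1, p_col1, p_both):
    """Complete a 2x2 joint probability table from two marginals and one joint cell.

    p_row1: P(Low), p_col1: P(at least 2 bathrooms), p_both: P(Low and at least 2 bathrooms).
    """
    table = {
        ("low", "true"): p_both,
        ("low", "false"): p_row1 - p_both,           # rest of the Low row
        ("high", "true"): p_col1 - p_both,           # rest of the True column
        ("high", "false"): 1 - p_row1 - p_col1 + p_both,
    }
    if any(p < -1e-12 for p in table.values()):
        raise ValueError("inputs do not describe a legitimate probability table")
    return table

table = fill_two_by_two(p_row1=0.62, p_col1=0.56, p_both=0.22)

# Conditional probability read from the completed table.
p_high = table[("high", "true")] + table[("high", "false")]
p_two_baths_given_high = table[("high", "true")] / p_high

for cell, prob in table.items():
    print(cell, round(prob, 2))
print(round(p_two_baths_given_high, 3))  # 0.895
```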
Just Checking

3. Suppose a supermarket is conducting a survey to find out the busiest time and day for shoppers. Survey respondents are asked 1) whether they shopped at the store on a weekday or on the weekend and 2) whether they shopped at the store before or after 5 p.m. The survey revealed that:
• 48% of shoppers visited the store before 5 p.m.
• 27% of shoppers visited the store on a weekday (Mon.–Fri.)
• 7% of shoppers visited the store before 5 p.m. on a weekday.
a) Make a contingency table for the variables time of day and day of week.
b) What is the probability that a randomly selected shopper who shops on a weekday also shops before 5 p.m.?
c) Are time and day of the week disjoint events?
d) Are time and day of the week independent events?

5.8 Probability Trees

Some business decisions involve more subtle evaluation of probabilities. Given the probabilities of various states of nature, we can use a picture called a probability tree or tree diagram to help think through the decision-making process. A tree shows sequences of events as paths that look like branches of a tree. This can enable us to compare several possible scenarios.

Here’s a manufacturing example. Personal electronic devices, such as smart phones and tablets, are getting more capable all the time. Manufacturing components for these devices is a challenge, and at the same time, consumers are demanding more and more functionality and increasing sturdiness. Microscopic and even submicroscopic flaws that can cause intermittent performance failures can develop during their fabrication. Defects will always occur, so the quality engineer in charge of the production process must monitor the number of defects and take action if the process seems out of control.

Let’s suppose that the engineer is called down to the production line because the number of defects has crossed a threshold and the process has been declared to be out of control. She must decide between two possible actions. She knows that a small adjustment to the robots that assemble the components can fix a variety of problems, but for more complex problems, the entire production line needs to be shut down in order to pinpoint the problem. The adjustment requires that production be stopped for about an hour. But shutting down the line takes at least an entire shift (8 hours). Naturally, her boss would prefer that she make the simple adjustment. But without knowing the source or severity of the problem, she can’t be sure whether that will be successful. If the engineer wants to predict whether the smaller adjustment will work, she can use a probability tree to help make the decision.
Based on her experience, the engineer thinks that there are three possible problems: (1) the motherboards could have faulty connections, (2) the memory could be the source of the faulty connections, or (3) some of the cases may simply be seating incorrectly in the assembly line. She knows from past experience how often these types of problem crop up and how likely it is that just making an adjustment will fix each type of problem. Motherboard problems are rare (10%), memory problems have been showing up about 30% of the time, and case alignment issues occur most often (60%). We can put those probabilities on the first set of branches in Figure 5.3 on the next page.
Figure 5.3 Possible problems and their probabilities: Case 0.60, Memory 0.30, Motherboard 0.10.
Notice that we’ve covered all the possibilities, and so the probabilities sum to one. To this diagram we can now add the conditional probabilities that a minor adjustment will fix each type of problem. Most likely the engineer will rely on her experience or assemble a team to help determine these probabilities. For example, the engineer knows that motherboard connection problems are not likely to be fixed with a simple adjustment: $P(\text{Fix} \mid \text{Motherboard}) = 0.10$. After some discussion, she and her team determine that $P(\text{Fix} \mid \text{Memory}) = 0.50$ and $P(\text{Fix} \mid \text{Case alignment}) = 0.80$. At the end of each branch representing the problem type, we draw two possibilities (Fixed or Not Fixed) and write the conditional probabilities on the branches.

Figure 5.4 Extending the tree diagram, we can show both the problem class and the outcome probabilities. The outcome (Fixed or Not Fixed) probabilities are conditional on the problem type, and they change depending on which branch we follow.
Case (0.60): Fixed 0.80, Not Fixed 0.20
Memory (0.30): Fixed 0.50, Not Fixed 0.50
Motherboard (0.10): Fixed 0.10, Not Fixed 0.90
At the end of each second branch, we write the joint event corresponding to the combination of the two branches. For example, the top branch is the combination of the problem being Case alignment, and the outcome of the small adjustment is that the problem is now Fixed. For each of the joint events, we can use the General Multiplication Rule to calculate their joint probability. For example:

$$P(\text{Case and Fixed}) = P(\text{Case}) \times P(\text{Fixed} \mid \text{Case}) = 0.60 \times 0.80 = 0.48$$

We write this probability next to the corresponding event. Doing this for all branch combinations gives us Figure 5.5.
Figure 5.5 We can find the probabilities of compound events by multiplying the probabilities along the branch of the tree that leads to the event, just the way the General Multiplication Rule specifies.
Case and Fixed: 0.60 × 0.80 = 0.48
Case and Not Fixed: 0.60 × 0.20 = 0.12
Memory and Fixed: 0.30 × 0.50 = 0.15
Memory and Not Fixed: 0.30 × 0.50 = 0.15
Motherboard and Fixed: 0.10 × 0.10 = 0.01
Motherboard and Not Fixed: 0.10 × 0.90 = 0.09
All the outcomes at the far right are disjoint because at every node, all the choices are disjoint alternatives. And those alternatives are all the possibilities, so the probabilities on the far right must add up to one. Because the final outcomes are disjoint, we can add up any combination of probabilities to find probabilities for compound events. In particular, the engineer can answer her question: What’s the probability that the problem will be fixed by a simple adjustment? She finds all the outcomes on the far right in which the problem was fixed. There are three (one corresponding to each type of problem), and she adds their probabilities: 0.48 + 0.15 + 0.01 = 0.64. So 64% of all problems are fixed by the simple adjustment. The other 36% require a major investigation.
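The tree calculation amounts to multiplying along each branch and then adding the leaves that share an outcome. The Python sketch below reproduces it with the engineer’s numbers; the dictionary names are our own and the structure is just one convenient way to hold a two-level tree.

```python
# First-level branches: probability of each problem type.
p_problem = {"case": 0.60, "memory": 0.30, "motherboard": 0.10}
# Second-level branches: P(Fixed | problem type).
p_fixed_given = {"case": 0.80, "memory": 0.50, "motherboard": 0.10}

# General Multiplication Rule along each path of the tree.
joint = {}
for problem, p in p_problem.items():
    joint[(problem, "fixed")] = p * p_fixed_given[problem]
    joint[(problem, "not fixed")] = p * (1 - p_fixed_given[problem])

# The leaves are disjoint, so P(Fixed) is the sum over the "fixed" leaves.
p_fixed = sum(p for (_, outcome), p in joint.items() if outcome == "fixed")

print(round(joint[("case", "fixed")], 2))  # 0.48
print(round(p_fixed, 2))                   # 0.64
print(round(sum(joint.values()), 2))       # 1.0
```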
*5.9 Reversing the Conditioning: Bayes’ Rule

The engineer in our story decided to try the simple adjustment and, fortunately, it worked. Now she needs to report to the quality engineer on the next shift what she thinks the problem was. Was it more likely to be a case alignment problem or a motherboard problem? We know the probabilities of those problems beforehand, but they change now that we have more information. What are the likelihoods that each of the possible problems was, in fact, the one that occurred?

Unfortunately, we can’t read those probabilities from the tree in Figure 5.5. For example, the tree gives us $P(\text{Fixed and Case}) = 0.48$, but we want $P(\text{Case} \mid \text{Fixed})$. We know $P(\text{Fixed} \mid \text{Case}) = 0.80$, but that’s not the same thing. It isn’t valid to reverse the order of conditioning in a conditional probability statement. To “turn the probability around,” we need to go back to the definition of conditional probability.

$$P(\text{Case} \mid \text{Fixed}) = \frac{P(\text{Case and Fixed})}{P(\text{Fixed})}$$

We can read the probability in the numerator from the tree, and we’ve already calculated the probability in the denominator by adding all the probabilities on the final branches that correspond to the event Fixed. Putting those values in the formula, the engineer finds:

$$P(\text{Case} \mid \text{Fixed}) = \frac{0.48}{0.48 + 0.15 + 0.01} = 0.75$$

She knew that 60% of all problems were due to case alignment, but now that she knows the problem has been fixed, she knows more. Given the additional information that a simple adjustment was able to fix the problem, she now can increase the probability that the problem was case alignment to 0.75.

It’s usually easiest to solve problems like this by reading the appropriate probabilities from the tree. However, we can write a general formula for finding the reverse conditional probability. To understand it, let’s review our example again. Let $A_1 = \{\text{Case}\}$, $A_2 = \{\text{Memory}\}$, and $A_3 = \{\text{Motherboard}\}$ represent the three types of problems. Let $B = \{\text{Fixed}\}$, meaning that the simple adjustment fixed the problem. We know $P(B \mid A_1) = 0.80$, $P(B \mid A_2) = 0.50$, and $P(B \mid A_3) = 0.10$. We want to find the reverse probabilities, $P(A_i \mid B)$, for the three possible problem types. From the definition of conditional probability, we know (for any of the three types of problems):

$$P(A_i \mid B) = \frac{P(A_i \text{ and } B)}{P(B)}$$

We still don’t know either of these quantities, but we use the definition of conditional probability again to find $P(A_i \text{ and } B) = P(B \mid A_i)P(A_i)$, both of which we know. Finally, we find $P(B)$ by adding up the probabilities of the three events.

$$P(B) = P(A_1 \text{ and } B) + P(A_2 \text{ and } B) + P(A_3 \text{ and } B) = P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + P(B \mid A_3)P(A_3)$$

In general, we can write this for n events $A_i$ that are mutually exclusive (each pair is disjoint) and exhaustive (their union is the whole space). Then:

$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{\sum_j P(B \mid A_j)P(A_j)}$$

This formula is known as Bayes’ rule, after the Reverend Thomas Bayes (1702–1761), even though historians don’t really know if Bayes first came up with the reverse conditioning probability. When you need to find reverse conditional probabilities, we recommend drawing a tree and finding the appropriate probabilities as we did at the beginning of the section, but the formula gives the general rule.
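Bayes’ rule for a finite set of mutually exclusive, exhaustive events can be sketched in a few lines of Python. The function name `bayes_posterior` and its argument names are our own, and the check that the priors sum to 1 is an assumption we add to enforce the “exhaustive” condition; the numbers are the engineer’s from the tree.

```python
def bayes_posterior(prior, likelihood):
    """Return P(A_i | B) for mutually exclusive, exhaustive events A_i.

    prior[a] is P(A_i); likelihood[a] is P(B | A_i).
    """
    if abs(sum(prior.values()) - 1) > 1e-9:
        raise ValueError("priors must sum to 1 (the events must be exhaustive)")
    # Numerators: P(B | A_i) * P(A_i) = P(A_i and B).
    joint = {a: likelihood[a] * prior[a] for a in prior}
    # Denominator: P(B), the sum of the disjoint joint probabilities.
    p_b = sum(joint.values())
    if p_b == 0:
        raise ValueError("P(B) is 0, so P(A_i | B) is undefined")
    return {a: j / p_b for a, j in joint.items()}

prior = {"case": 0.60, "memory": 0.30, "motherboard": 0.10}
p_fixed_given = {"case": 0.80, "memory": 0.50, "motherboard": 0.10}

posterior = bayes_posterior(prior, p_fixed_given)
for problem, prob in posterior.items():
    print(problem, round(prob, 4))
```

For the case alignment problem this gives 0.75, the value found from the tree, and the three reverse probabilities sum to 1.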
What Can Go Wrong?

• Beware of probabilities that don’t add up to 1. To be a legitimate assignment of probability, the sum of the probabilities for all possible outcomes must total 1. If the sum is less than 1, you may need to add another category (“other”) and assign the remaining probability to that outcome. If the sum is more than 1, check that the outcomes are disjoint. If they’re not, then you can’t assign probabilities by counting relative frequencies. (And if they are, you must locate the error.)
• Don’t add probabilities of events if they’re not disjoint. Events must be disjoint to use the Addition Rule. The probability of being under 80 or a female is not the probability of being under 80 plus the probability of being female. That sum may be more than 1.
• Don’t multiply probabilities of events if they’re not independent. The probability of selecting a customer at random who is over 70 years old and retired is not the probability the customer is over 70 years old times the probability the customer is retired. Knowing that the customer is over 70 changes the probability of his or her being retired. You can’t multiply these probabilities. The multiplication of probabilities of events that are not independent is one of the most common errors people make in dealing with probabilities.
• Don’t confuse disjoint and independent. Disjoint events can’t be independent. If A = {you get a promotion} and B = {you don’t get a promotion}, A and B are disjoint. Are they independent? If you find out that A is true, does that change the probability of B? Yes, if A is true, then B cannot be true, so they are not independent.
Ethics in Action

Fabrizio Rivetti is an entrepreneur who has recently started a wine importing business. While he currently has an exclusive relationship with only one premier winery in Tuscany, he is hoping to expand his importing business to include other wineries as well as artisan Italian food products, such as cheeses and specialty meats. With plans to expand, Fabrizio is in need of extra funds. As a first step, he approaches a friend and fellow entrepreneur who has considerable experience dealing with angel investors, Chas Mulligan. Chas has successfully obtained funds from angel investors for his social networking start-up company, so Fabrizio is hopeful that Chas can provide some sound advice.

Chas explains to Fabrizio that most angel investors bear considerable risk and consequently favor ventures that are in high-growth areas, such as software, healthcare, and biotech. He also mentions that many angel investors, like venture capitalists, want to exercise some control over the start-up companies in which they invest, either by securing a seat on the company’s board of directors or having veto power. Fabrizio is now a bit unsure about seeking angel investments, so Chas puts him in contact with a consultant, Paula Foxx, who can help him make the right decision. Paula is well connected with a network of angel investors, understands the types of start-ups they prefer to invest in, and, most importantly, knows how to prepare the perfect pitch.

At their first meeting, Paula is quick to inform Fabrizio of her consultancy fee schedule. Next, she assures Fabrizio that she is acquainted with a number of angels whom she believes might be interested in his wine importing business. Fabrizio expresses to Paula his reservations about sharing too much control of his start-up with investors, and is particularly wary of granting investors veto power. He is reluctant to hire Paula on the spot, so Paula suggests they meet again after she has had the opportunity to pull together data on some of her most successful clients.

Paula’s objective is to direct clients to angels who tend to make large initial investments. In this way, her clients reach their goals more quickly and she can spend less time with each client. She decides to compile some data only for angel investors who have made significant initial investments in her clients’ startups (in excess of $250,000). She came up with the following contingency table for this group of investors.

| Board Seat? | Veto Power: Yes | Veto Power: No | Total |
|---|---|---|---|
| Yes | 0.05 | 0.45 | 0.50 |
| No | 0.45 | 0.05 | 0.50 |
| Total | 0.50 | 0.50 | 1.00 |

She was happy to find that 50% did not get veto power and 50% did not sit on the board. By multiplying these two probabilities, she arrived at a figure she thought would help persuade Fabrizio to pursue angels and hire her to do so. She planned to tell him that 25% of angels who make large investments are not interested in either veto power or a seat on the board in the start-ups they fund. She called her administrative assistant to arrange another meeting with Fabrizio as soon as possible.

• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
What Have We Learned?

Learning Objectives

Apply the facts about probability to determine whether an assignment of probabilities is legitimate.
• Probability is long-run relative frequency.
• Individual probabilities must be between 0 and 1.
• The sum of probabilities assigned to all outcomes must be 1.

Understand the Law of Large Numbers and that the common understanding of the “Law of Averages” is false.

Know the rules of probability and how to apply them.
• The Complement Rule says that $P(\text{not } A) = P(A^{C}) = 1 - P(A)$.
• The Multiplication Rule for independent events says that $P(A \text{ and } B) = P(A) \times P(B)$ provided events A and B are independent.
• The General Multiplication Rule says that $P(A \text{ and } B) = P(A) \times P(B \mid A)$ for any events A and B.
• The Addition Rule for disjoint events says that $P(A \text{ or } B) = P(A) + P(B)$ provided events A and B are disjoint.
• The General Addition Rule says that $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$ for any events A and B.

Know how to construct and read a contingency table.

Know how to define and use independence.
• Events A and B are independent if $P(A \mid B) = P(A)$.

Know how to construct tree diagrams and use them to calculate and understand conditional probabilities.

Know how to use Bayes’ Rule to compute conditional probabilities.
Terms
Addition Rule
If A and B are disjoint events, then the probability of A or B is P(A or B) = P(A) + P(B).
Complement Rule
The probability of an event occurring is 1 minus the probability that it doesn’t occur: P(A) = 1 − P(A^C).
Conditional probability
P(B | A) = P(A and B) / P(A). P(B | A) is read “the probability of B given A.”
Disjoint (or Mutually Exclusive) Events
Two events are disjoint if they have no outcomes in common. If A and B are disjoint, then the fact that A occurs tells us that B cannot occur. Disjoint events are also called “mutually exclusive.”
Empirical probability
When the probability comes from the long-run relative frequency of the event’s occurrence, it is an empirical probability.
Event
A collection of outcomes. Usually, we identify events so that we can attach probabilities to them. We denote events with bold capital letters such as A, B, or C.
General Addition Rule
For any two events, A and B, the probability of A or B is: P(A or B) = P(A) + P(B) − P(A and B).
General Multiplication Rule
For any two events, A and B, the probability of A and B is: P(A and B) = P(A) × P(B | A).
Independence (informally)
Two events are independent if the fact that one event occurs does not change the probability of the other.
Independence (used formally)
Events A and B are independent when P(B | A) = P(B).
Joint probabilities
The probability that two events both occur.
Law of Large Numbers (LLN)
The Law of Large Numbers states that the long-run relative frequency of repeated, independent events settles down to the true relative frequency as the number of trials increases.
Marginal probability
In a joint probability table a marginal probability is the probability distribution of either variable separately, usually found in the rightmost column or bottom row of the table.
Multiplication Rule
If A and B are independent events, then the probability of A and B is: P(A and B) = P(A) × P(B).
Outcome
The outcome of a trial is the value measured, observed, or reported for an individual instance of that trial.
Personal probability
When the probability is subjective and represents one’s personal degree of belief, it is called a personal probability.
Probability
The probability of an event is a number between 0 and 1 that reports the likelihood of the event’s occurrence. A probability can be derived from a model (such as equally likely outcomes), from the long-run relative frequency of the event’s occurrence, or from subjective degrees of belief. We write P(A) for the probability of the event A.
Probability Assignment Rule
The probability of the entire sample space must be 1: P(S) = 1.
Random phenomenon
A phenomenon is random if we know what outcomes could happen, but not which particular values will happen in any given trial.
Sample space
The collection of all possible outcome values. The sample space has a probability of 1.
Theoretical probability
When the probability comes from a mathematical model (such as, but not limited to, equally likely outcomes), it is called a theoretical probability.
Trial
A single attempt or realization of a random phenomenon.
Tree diagram (or probability tree)
A display of conditional events or probabilities that is helpful in thinking through conditioning.
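The Law of Large Numbers defined in the terms above can be illustrated by simulation. The sketch below is an illustration only: the function name `relative_frequency`, the success probability of 0.5, and the fixed seed are all assumptions made for the example, not anything prescribed by the text.

```python
import random

def relative_frequency(n_trials, p=0.5, seed=1):
    """Simulate n_trials independent trials, each succeeding with
    probability p, and return the fraction of successes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials

# The relative frequency tends to settle near p as the number of trials grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

Short runs can stray far from 0.5, which is why the “Law of Averages” fails: nothing forces a small number of trials to even out.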
Technology Help: Generating Random Numbers
Most statistics packages can generate a single random number or a list of random numbers. You may find them useful for introducing randomness in a study or drawing a random sample. Excel can generate random numbers with the RAND() function.
Excel
To generate a random number in Excel:
• In a cell, type =RAND(). A random number between 0 and 1 (a real number to 9 decimal places) appears in the cell.
• To generate more random numbers, copy and paste this cell or select it and Fill Down to obtain more random values.
• To generate a random number within a range, type =RAND()*(b-a)+a into the formula bar, where a is the number at the low end of the range and b is the number at the high end of the range.
• You can also use the function =RANDBETWEEN(a, b) to generate an integer between a and b.
Random numbers are re-generated each time a change is made to the spreadsheet. To avoid this:
• Highlight the cell containing the random number.
• Copy the value and paste it into the same cell using the Paste Values: Values command.
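For readers working outside a spreadsheet, the same ideas can be sketched in Python's standard `random` module. This is an assumed analogue of the Excel steps, not something the text itself covers; the seed, the range endpoints `a` and `b`, and the population of IDs 1 to 100 are hypothetical.

```python
import random

rng = random.Random(42)  # arbitrary seed, used only so results repeat

u = rng.random()                 # like =RAND(): a real number in [0, 1)

a, b = 5, 10                     # hypothetical low and high ends of a range
x = rng.random() * (b - a) + a   # like =RAND()*(b-a)+a: a real number between a and b

k = rng.randint(a, b)            # like =RANDBETWEEN(a, b): an integer, both ends included

# Drawing a random sample: 5 distinct IDs from a population numbered 1..100
sample = rng.sample(range(1, 101), 5)

print(u, x, k, sample)
```

Unlike a spreadsheet cell, these values do not change once assigned, so there is no need for a Paste Values step; rerunning with the same seed reproduces them.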
Brief Case
Global Markets
A global survey firm reports data from surveys taken in several countries. The data file Global holds data for 800 respondents in each of five countries. The variables provide demographic information (sex, age, education, marital status) and responses to questions of interest to marketers on personal finance and purchasing. Write a report that discusses how decisions about personal finance and shopping vary by country and by sex. You’ll want to make contingency tables of some variables and consider the contingent probabilities that they show. You may also want to restrict your attention to one country and then consider relationships between variables within that country.
Exercises
Section 5.1
1. Indicate which of the following represent independent events. Explain briefly. a) The employment status of customers waiting in line at the post office. b) The first four digits of the phone numbers of students attending a seminar. c) The outcomes when rolling a fair die twice.
2. Indicate which of the following represent independent events. Explain briefly. a) Prices of houses on the same block. b) Successive measurements of your heart rate as you exercise on a treadmill. c) Measurements of the heart rates of all students in the gym.
Section 5.2
3. In many state lotteries, you can choose which numbers to play. Consider a common form in which you choose 5 numbers. Which of the following strategies can improve your chance of winning? If the method works, explain why. If not, explain why using appropriate statistics terms. a) Always play 1, 2, 3, 4, 5. b) Choose the numbers that did come up in the most recent lottery drawing because they are “hot.”
4. For the same kind of lottery as in Exercise 3, which of the following strategies can improve your chance of winning? If the method works, explain why. If not, explain why using appropriate statistics terms. a) Choose randomly from among the numbers that have not come up in the last 3 lottery drawings because they are “due.” b) Generate random numbers using a computer or calculator and play those.
Section 5.4
5. A recent survey found that, despite airline requests, about 40% of passengers don’t fully turn off their cell phones during takeoff and landing (although they may put them in “airplane mode”). The two passengers across the aisle (in seats A and B) clearly do not know each other. a) What is the probability that the passenger in seat A does not turn off his phone? b) What is the probability that he does turn off his phone? c) What is the probability that both of them turn off their phones? d) What is the probability that at least one of them turns off his or her phone?
6. Assume that 15% of the people on your expedition trip own a mobile phone, and you are randomly assigned to two partners in your group. Find the probability that: a) The first partner will have a mobile phone. b) The first partner will not have a mobile phone. c) Both partners will have mobile phones. d) At most one will have a mobile phone.
Section 5.5
7. The following contingency table shows opinion about global warming among U.S. adults, broken down by political party affiliation (based on a poll in October 2012 by Pew Research found at http://www.people-press.org/2012/10/15/more-say-there-is-solid-evidence-of-global-warming/).
                   Opinion on Global Warming
Political Party    Nonissue    Serious Concern    Total
Democratic               85                415      500
Republican              290                210      500
Independent              70                130      200
Total                   445                755     1200
a) What is the probability that a U.S. adult selected at random believes that global warming is a serious issue? b) What type of probability did you find in part a? c) What is the probability that a U.S. adult selected at random is a Republican and believes that global warming is a serious issue? d) What type of probability did you find in part c?
8. Multigenerational families can be categorized as having two adult generations such as parents living with adult children, “skip” generation families, such as grandparents living with grandchildren, and three or more generations living in the household. Pew Research surveyed multigenerational households. This table is based on their reported results.
            2 Adult Gens    2 Skip Gens    3 or More Gens    Total
White                509             55               222      786
Hispanic             139             11               142      292
Black                119             32                99      250
Asian                 61              1                48      110
Total                828             99               511     1438
a) What is the probability that a multigenerational family is Hispanic? b) What is the probability that a multigenerational family selected at random is a Black, two-adult-generation family? c) What type of probability did you find in parts a and b?
Section 5.6
9. Using the table from Exercise 7, a) What is the probability that a randomly selected U.S. adult who is a Republican believes that global warming is a serious issue? b) What is the probability that a randomly selected U.S. adult is a Republican given that he or she believes global warming is a serious issue? c) What is P(Serious Concern | Democratic)?
10. Using the table from Exercise 8, a) What is the probability that a randomly selected Black multigenerational family is a 2 Adult Generation family? b) What is the probability that a randomly selected multigenerational family is White, given that it is a “skip” generation family? c) What is P(3 or more Generations | Asian)?
Section 5.7
11. A locally generated report indicated that 40% of adults conducted their purchases online. The report also found that 30% are under the age of 40, and that 20% are under the age of 40 and conduct their purchases online. a) What percent of adults do not use the internet for making purchases? b) What type of probability is the 20% mentioned above? c) Construct a contingency table showing all related joint and marginal probabilities. d) Find the probability that a randomly selected individual conducts purchases using the internet given that the individual is under the age of 40. e) Are Conducting purchases online and Age independent events?
12. Facebook reports that 70% of their users are from outside the United States and that 50% of their users log on to Facebook every day. Suppose that 20% of their users are United States users who log on every day. a) What percentage of Facebook’s users are from the United States? b) What type of probability is the 20% mentioned above? c) Construct a contingency table showing all the joint and marginal probabilities. d) What is the probability that a user is from the United States given that he or she logs on every day? e) Are From United States and Log on Every Day independent? Explain.
Section 5.8
13. Summit Projects provides marketing services and website management for many companies that specialize in outdoor products and services (www.summitprojects.com). To understand customer Web behavior, the company experiments with different offers and website design. The results of such experiments can help to maximize the probability that customers purchase products during a visit to a website. Possible actions by the website include offering the customer an instant discount, offering the customer free shipping, or doing nothing. A recent experiment found that customers make purchases 6% of the time when offered the instant discount, 5% when offered free shipping, and 2% when no special offer was given. Suppose 20% of the customers are offered the discount and an additional 30% are offered free shipping. a) Construct a probability tree for this experiment. b) What percent of customers who visit the site made a purchase? c) Given that a customer made a purchase, what is the probability that they were offered free shipping?
14. The company in Exercise 13 performed another experiment in which they tested three website designs to see which one would lead to the highest probability of purchase. The first (design A) used enhanced product information, the second (design B) used extensive iconography, and the third (design C) allowed the customer to submit their own product ratings. After 6 weeks of testing, the designs delivered probabilities of purchase of 4.5%, 5.2%, and 3.8%, respectively. Equal numbers of customers were sent randomly to each website design. a) Construct a probability tree for this experiment. b) What percent of customers who visited the site made a purchase? c) What is the probability that a randomly selected customer was sent to design C? d) Given that a customer made a purchase, what is the probability that the customer had been sent to design C?
Section 5.9
15. According to U.S. Census data, 68% of the civilian U.S. labor force self-identifies as White, 11% as Black, and the remaining 21% as Hispanic/Latino or Other. Among Whites in the labor force, 54% are Male, and 46% Female. Among Blacks, 52% are Male and 48% Female, and among Hispanic/Latino/Other, 58% are Male and 42% are Female. a) Polling companies need to sample an appropriate number of respondents of each gender from each ethnic group. For a randomly selected U.S. worker, fill in the probabilities in this tree:
White
    Male
    Female
Black
    Male
    Female
Hispanic/Latino/Other
    Male
    Female
b) What is the probability that a randomly selected worker is a Black Female? c) For a randomly selected worker, what is P(Female | White)? d) For a randomly selected worker, what is P(White | Female)?
16. U.S. Customs and Border Protection has been testing automated kiosks that may be able to detect lies (www.wired.com/threatlevel/2013/01/ff-lie-detector/all/). One measurement used (among several) is involuntary eye movements. Using this method alone, tests show that it can detect 60% of lies, but incorrectly identifies 15% of true statements as lies. Suppose that 95% of those entering the country tell the truth. The immigration kiosk asks questions such as “Have you ever been arrested for a crime?” Naturally, all the applicants answer “No,” but the kiosk identifies some of those answers as lies, and refers the entrant to a human interviewer. a) Here is the outline of a probability tree for this situation. Fill in the probabilities:
Truth
    Kiosk says Lie
    Kiosk says True
Lie
    Kiosk says Lie
    Kiosk says True
b) What is the probability that a random person will be telling the truth and will be cleared by the kiosk? c) What is the probability that a person who is rejected by the kiosk was actually telling the truth?
Chapter Exercises 17. What does it mean? part 1. Respond to the following questions: a) A casino claims that its roulette wheel is truly random. What should that claim mean? b) A reporter on Market Place says that there is a 50% chance that the NASDAQ will hit a new high in the next month. What is the meaning of such a phrase? 18. What does it mean? part 2. Respond to the following questions: a) After an unusually dry autumn, a radio announcer is heard to say, “Watch out! We’ll pay for these sunny days later on this winter.” Explain what he’s trying to say, and comment on the validity of his reasoning. b) A batter who had failed to get a hit in seven consecutive times at bat then hits a game-winning home run. When talking to reporters afterward, he says he was very confident that last time at bat because he knew he was “due for a hit.” Comment on his reasoning. 19. Airline safety. Even though commercial airlines have excellent safety records, in the weeks following a crash, airlines often report a drop in the number of passengers, probably because people are afraid to risk flying. a) A travel agent suggests that since the law of averages makes it highly unlikely to have two plane crashes within a few weeks of each other, flying soon after a crash is the safest time. What do you think? b) If the airline industry proudly announces that it has set a new record for the longest period of safe flights, would you be reluctant to fly? Are the airlines due to have a crash? 20. Economic predictions. An investment newsletter makes general predictions about the economy to help their clients make sound investment decisions. a) Recently they said that because the stock market had been up for the past three months in a row that it was “due for a correction” and advised their client to reduce their holdings. What “law” are they applying? Comment. 
b) They advised buying a stock that had gone down in the past four sessions because they said that it was clearly “due to bounce back.” What “law” are they applying? Comment.
21. Fire insurance. Insurance companies collect annual payments from homeowners in exchange for paying to rebuild houses that burn down. a) Why should you be reluctant to accept a $300 payment from your neighbor to replace his house should it burn down during the coming year? b) Why can the insurance company make that offer?
22. Casino gambling. Recently, the International Gaming Technology company issued the following press release: (LAS VEGAS, Nev.)—Cynthia Jay was smiling ear to ear as she walked into the news conference at the Desert Inn Resort in Las Vegas today, and well she should. Last night, the 37-year-old cocktail waitress won the world’s largest slot jackpot—$34,959,458—on a Megabucks machine. She said she had played $27 in the machine when the jackpot hit. Nevada Megabucks has produced 49 major winners in its 14-year history. The top jackpot builds from a base amount of $7 million and can be won with a 3-coin ($3) bet. a) How can the Desert Inn afford to give away millions of dollars on a $3 bet? b) Why did the company issue a press release? Wouldn’t most businesses want to keep such a huge loss quiet?
23. Toy company. A toy company is preparing to market an electronic game for young children that “randomly” generates a color. They suspect, however, that the way the random color is determined may not be reliable, so they ask the programmers to perform tests and report the frequencies of each outcome. Are each of the following probability assignments possible? Why or why not?
Probabilities of …
       Red     Yellow    Green    Blue
a)     0.25    0.25      0.25     0.25
b)     0.10    0.20      0.30     0.40
c)     0.20    0.30      0.40     0.50
d)     0       0         1.00     0
e)     0.10    0.20      1.20     −1.50
24. Store discounts. Many stores run “secret sales”: Shoppers receive cards that determine how large a discount they get, but the percentage is revealed by scratching off that black stuff (what is that?) only after the purchase has been totaled at the cash register. The store is required to reveal (in the fine print) the distribution of discounts available. Are each of these probability assignments plausible? Why or why not?
Probabilities of …
       10% Off    20% Off    30% Off    50% Off
a)     0.20       0.20       0.20       0.20
b)     0.50       0.30       0.20       0.10
c)     0.80       0.10       0.05       0.05
d)     0.75       0.25       0.25       −0.25
e)     1.00       0          0          0
25. Quality control. A tire manufacturer recently announced a recall because 2% of its tires are defective. If you just bought a new set of four tires from this manufacturer, what is the probability that at least one of your new tires is defective?
26. Pepsi promotion. For a sales promotion, the manufacturer places winning symbols under the caps of 10% of all Pepsi bottles. If you buy a six-pack of Pepsi, what is the probability that you win something?
27. Auto warranty. In developing their warranty policy, an automobile company estimates that over a 1-year period 17% of their new cars will need to be repaired once, 7% will need repairs twice, and 4% will require three or more repairs. If you buy a new car from them, what is the probability that your car will need: a) No repairs? b) No more than one repair? c) Some repairs?
28. Consulting team. You work for a large global management consulting company. Of the entire work force of analysts, 55% have had no experience in the telecommunications industry, 32% have had limited experience (less than 5 years), and the rest have had extensive experience (5 years or more). On a recent project, you and two other analysts were chosen at random to constitute a team. It turns out that part of the project involves telecommunications. What is the probability that the first teammate you meet has: a) Extensive telecommunications experience? b) Some telecommunications experience? c) No more than limited telecommunications experience?
29. Auto warranty, part 2. Consider again the auto repair rates described in Exercise 27. If you bought two new cars, what is the probability that: a) Neither will need repair? b) Both will need repair? c) At least one car will need repair?
30. Consulting team, part 2. You are assigned to be part of a team of three analysts of a global management consulting company as described in Exercise 28. What is the probability that of your other two teammates: a) Neither has any telecommunications experience?
b) Both have some telecommunications experience? c) At least one has had extensive telecommunications experience? 31. Auto warranty, again. You used the Multiplication Rule to calculate repair probabilities for your cars in Exercise 29. a) What must be true about your cars in order to make that approach valid? b) Do you think this assumption is reasonable? Explain.
32. Final consulting team project. You used the Multiplication Rule to calculate probabilities about the telecommunications experience of your consulting teammates in Exercise 30. a) What must be true about the groups in order to make that approach valid? b) Do you think this assumption is reasonable? Explain. 33. Real estate. In a sample of real estate ads, 64% of homes for sale had garages, 21% have swimming pools, and 17% have both features. What is the probability that a home for sale has: a) A pool, a garage, or both? b) Neither a pool nor a garage? c) A pool but no garage? 34. Human resource data. Employment data at a large company reveal that 72% of the workers are married, 44% are college graduates, and half of the college grads are married. What’s the probability that a randomly chosen worker is: a) Neither married nor a college graduate? b) Married but not a college graduate? c) Married or a college graduate? 35. Mars product information. The Mars company says that before the introduction of purple, yellow made up 20% of their plain M&M candies, red made up another 20%, and orange, blue, and green each made up 10%. The rest were brown. a) If you picked an M&M at random from a pre-purple bag of candies, what is the probability that it was: i) Brown? ii) Yellow or orange? iii) Not green? iv) Striped? b) Assuming you had an infinite supply of M&M’s with the older color distribution, if you picked three M&M’s in a row, what is the probability that: i) They are all brown? ii) The third one is the first one that’s red? iii) None are yellow? iv) At least one is green? 36. American Red Cross. The American Red Cross must track their supply and demand for various blood types. They estimate that about 45% of the U.S. population has Type O blood, 40% Type A, 11% Type B, and the rest Type AB. a) If someone volunteers to give blood, what is the probability that this donor: i) Has Type AB blood? ii) Has Type A or Type B blood? iii) Is not Type O? 
b) Among four potential donors, what is the probability that: i) All are Type O? ii) None have Type AB blood? iii) Not all are Type A? iv) At least one person is Type B?
37. More Mars product information. In Exercise 35, you calculated probabilities of getting various colors of M&M’s. a) If you draw one M&M, are the events of getting a red one and getting an orange one disjoint or independent or neither? b) If you draw two M&M’s one after the other, are the events of getting a red on the first and a red on the second disjoint or independent or neither? c) Can disjoint events ever be independent? Explain. 38. American Red Cross, part 2. In Exercise 36, you calculated probabilities involving various blood types. a) If you examine one donor, are the events of the donor being Type A and the donor being Type B disjoint or independent or neither? Explain your answer. b) If you examine two donors, are the events that the first donor is Type A and the second donor is Type B disjoint or independent or neither? c) Can disjoint events ever be independent? Explain. 39. Tax accountant. A recent study of IRS audits showed that, for estates worth less than $5 million, about 1 out of 7 of all estate tax returns are audited, but that probability increases to 50% for estates worth over $5 million. Suppose a tax accountant has three clients who have recently filed returns for estates worth more than $5 million. What are the probabilities that: a) All three will be audited? b) None will be audited? c) At least one will be audited? d) What did you assume in calculating these probabilities? 40. Casinos. Because gambling is big business, calculating the odds of a gambler winning or losing in every game is crucial to the financial forecasting for a casino. A standard slot machine has three wheels that spin independently. Each has 10 equally likely symbols: 4 bars, 3 lemons, 2 cherries, and a bell. If you play once, what is the probability that you will get: a) 3 lemons? b) No fruit symbols? c) 3 bells (the jackpot)? d) No bells? e) At least one bar (an automatic loser)? 41. Spam filter. 
A company has recently replaced their e-mail spam filter because investigations had found that the volume of spam e-mail was interrupting productive work on about 15% of workdays. To see how bad the situation was, calculate the probability that during a 5-day work week, e-mail spam would interrupt work: a) On Monday and again on Tuesday? b) For the first time on Thursday? c) Every day? d) At least once during the week?
42. Tablet tech support. The technical support desk at a college has set up a special service for tablets. A survey shows that 54% of tablets on campus run Apple’s iOS, 43% run Google’s Android OS, and 3% run Microsoft’s Windows. Assuming that users of each of the operating systems are equally likely to call in for technical support, what is the probability that of the next three calls: a) All are iOS? b) None are Android? c) At least one is a Windows machine? d) All are Windows machines?
43. Casinos, part 2. In addition to slot machines, casinos must understand the probabilities involved in card games. Suppose you are playing at the blackjack table, and the dealer shuffles a deck of cards. The first card shown is red. So is the second and the third. In fact, you are surprised to see 5 red cards in a row. You start thinking, “The next one is due to be black!” a) Are you correct in thinking that there’s a higher probability that the next card will be black than red? Explain. b) Is this an example of the Law of Large Numbers? Explain.
44. Inventory. A shipment of road bikes has just arrived at The Spoke, a small bicycle shop, and all the boxes have been placed in the back room. The owner asks her assistant to start bringing in the boxes. The assistant sees 20 identical-looking boxes and starts bringing them into the shop at random. The owner knows that she ordered 10 women’s and 10 men’s bicycles, and so she’s surprised to find that the first six are all women’s bikes. As the seventh box is brought in, she starts thinking, “This one is bound to be a men’s bike.” a) Is she correct in thinking that there’s a higher probability that the next box will contain a men’s bike? Explain. b) Is this an example of the Law of Large Numbers? Explain.
45. Brazil’s future economic situation.
As part of the Pew Research Global Attitudes project, people from 39 countries are asked the question ‘Do you expect our country’s economic situation to improve, remain the same or worsen in the next year?’ As can be seen from the table, in 2013, Brazil, along with China, had the most optimistic responses with regard to the 2014 economic situation.
Future Economic Situation Response    Percentage of Respondents
Improve                               79%
Remain the same                       15%
Worsen                                6%
If we select a person at random from this Brazilian sample of adults: a) What is the probability that the person foresees an improving economy? b) What is the probability that the person foresees a stable economy or better?
46. More on Brazil’s future economic situation. Exercise 45 shows responses from a Brazilian sample about the country’s future economic situation. Suppose we select three adults at random from this sample. a) What is the probability that all three responded “Improve”? b) What is the probability that none responded “Improve”? c) What assumption did you make in computing these probabilities? d) Explain why you think that assumption is reasonable.
47. Mobile Technology in South Africa. According to the Pew Research report Emerging Nations Embrace Internet, Mobile Technology, emerging and developing countries adopt modern communication technology very rapidly (http://www.pewglobal.org/2014/02/13/). South Africa is an example of such a country, as seen from the table.
Cell Phone, Smartphone Ownership Response    Percentage of Respondents
Own Smartphone                               33%
Own Cell phone                               91%
No Cell phone ownership                      9%
a) If we select a random person from this sample of South African adults, what is the probability they own a smartphone? b) What is the probability of owning a cell phone that is not a smartphone? c) Are there more smartphones, or regular cell phones in South Africa? 48. Mobile Technology in South Africa, part 2. Exercise 47 shows the results of a poll that asked about cell phone ownership in South Africa. Suppose we select three adults at random from this sample. a) What is the probability that all three own a smartphone? b) What is the probability that none owns a smartphone? c) What assumption did you make in computing these probabilities? d) Explain why you think that assumption is reasonable. 49. Contract bidding. As manager for a construction firm, you are in charge of bidding on two large contracts. You believe the probability you get contract #1 is 0.8. If you get contract #1, the probability you also get contract #2 will be 0.2, and if you do not get #1, the probability you get #2 will be 0.4. a) Sketch the probability tree. b) What is the probability you will get both contracts? c) Your competitor hears that you got the second contract but hears nothing about the first contract. Given that you got the second contract, what is the probability that you also got the first contract?
50. Extended warranties. A company that manufactures and sells consumer video cameras sells two versions of their popular hard disk camera, a basic camera for $750, and a deluxe version for $1250. About 75% of customers select the basic camera. Of those, 60% purchase the extended warranty for an additional $200. Of the people who buy the deluxe version, 90% purchase the extended warranty.
a) Sketch the probability tree for total purchases.
b) What is the percentage of customers who buy an extended warranty?
c) What is the expected revenue of the company from a camera purchase (including warranty if applicable)?
d) Given that a customer purchases an extended warranty, what is the probability that he or she bought the deluxe version?

51. Tweeting. According to the Pew 2012 News Consumption survey, 50% of adults who post news on Twitter (“tweet”) are younger than 30. But according to the U.S. Census, only 23% of adults are less than 30 years old. A separate survey by Pew in 2012 found that 15% of adults tweet.
a) Find the probability that a random adult is both less than 30 years old and a Twitter poster. That is, find \(P(\text{Tweet and } < 30)\).
b) For a random young (< 30) adult, what is the probability he or she is a tweeter? That is, find \(P(\text{Tweet} \mid < 30)\). (Hint: Use Bayes’ theorem.)

52. Titanic survival. Of the 2201 people on the RMS Titanic, only 711 survived. The practice of “women and children first” was first used to describe the chivalrous actions of the sailors during the sinking of the HMS Birkenhead in 1852, but became popular after the sinking of the Titanic, during which 53% of the children and 73% of the women survived, but only 21% of the men survived. Part of the protocol stated that passengers enter lifeboats by ticket class as well. Here is a table showing survival by ticket class.

          First         Second        Third         Crew          Total
Alive     203  28.6%    118  16.6%    178  25.0%    212  29.8%     711  100%
Dead      122   8.2%    167  11.2%    528  35.4%    673  45.2%    1490  100%

a) Find the conditional probability of survival for each type of ticket.
b) Draw a probability tree for this situation.
c) Given that a passenger survived, what is the probability they had a first-class ticket?

53. Coffeehouse survey. A 2011 Mintel report on coffeehouses asked consumers if they were spending more time in coffeehouses. The table below gives the responses classified by age:

Responses: Less = “I am spending less time at coffeehouses and donut shops this year than last year.” Same = “I am spending about the same time at coffeehouses and donut shops this year as last year.” More = “I am spending more time at coffeehouses and donut shops this year than last year.”

Age       Less    Same    More    Total
18–24       78      82      30      190
25–34       93     109      30      232
35–44      102     106      18      226
45–54      104      89      19      212
55–64       68      75      11      154
65+         48      67       6      121
Total      493     528     114     1135

Source: 2011 Mintel Report. Reprinted by permission of Mintel, a leading market research company. www.mintel.com

a) What is the probability that a randomly selected respondent is spending more time at coffeehouses and donut shops this year than last year?
b) What is the probability that the person is younger than 25 years old?
c) What is the probability that the person is younger than 25 years old and is spending more time at coffeehouses and donut shops compared to last year?
d) What is the probability that the person is younger than 25 years old or is spending more time at coffeehouses and donut shops compared to last year?
54. Electronic communications. A Mintel study asked consumers if electronic communications devices influenced whether or not they bought a certain car. The table below gives the results classified by household income. If we select a person at random from this sample:
a) What is the probability that electronic communication devices somewhat influenced their decisions?
b) What is the probability that the person is earning at least $100K?
c) What is the probability that the person was somewhat influenced by electronic communications and earns at least $100K?
d) What is the probability that electronic communications somewhat influenced the purchase or that the person earns at least $100K?

Communications influence on car purchase, by household income, July 2011

Communication (e.g., hands-free calling)    Income: < $50K    $50K–99.9K    $100K+    Total
Very Much                                               30            57        41      128
Somewhat                                                26            39        62      127
Not At All                                              23            39        35       97
Total                                                   79           135       138      352

Source: Mintel
55. Red Cross Rh. Exercises 36 and 38 discussed the challenges faced by the Red Cross in finding enough blood of various types. But blood typing also depends on the Rh factor, which can be negative or positive. Here is a table of the estimated proportions worldwide for blood types categorized on both type and Rh factor:

        Blood Type
Rh      O         A         B         AB
+       36.44%    28.27%    20.59%    5.06%
−        4.33%     3.52%     1.39%    0.45%

For a randomly selected human, what is the probability that he …
a) Is Rh negative given that he is type O?
b) Is type O given that he is Rh negative?
c) A person with type A− blood can accept donated blood only of types A− and O−. What is the probability that a randomly selected donor can donate to a recipient given that the recipient’s blood type is A−?

56. Automobile inspection. Twenty percent of cars that are inspected have faulty pollution control systems. The cost of repairing a pollution control system exceeds $100 about 40% of the time. When a driver takes her car in for inspection, what’s the probability that she will end up paying more than $100 to repair the pollution control system?

57. Pharmaceutical company. A U.S. pharmaceutical company is considering manufacturing and marketing a pill that will help to lower both an individual’s blood pressure and cholesterol. The company is interested in understanding the demand for such a product. The joint probabilities that an adult American man has high blood pressure and/or high cholesterol are shown in the table.

                Blood Pressure
Cholesterol     High     OK
High            0.11     0.21
OK              0.16     0.52

a) What’s the probability that an adult American male has both conditions?
b) What’s the probability that an adult American male has high blood pressure?
c) What’s the probability that an adult American male with high blood pressure also has high cholesterol?
d) What’s the probability that an adult American male has high blood pressure if it’s known that he has high cholesterol?

58. International relocation. A European department store is developing a new advertising campaign for their new U.S. location, and their marketing managers need to understand their target market better. A survey of adult shoppers gave the probabilities, classified by age, that an adult would shop at the new U.S. store. They are shown below.

           Shop
Age        Yes     No      Total
< 20       0.26    0.04    0.30
20–40      0.24    0.10    0.34
> 40       0.12    0.24    0.36
Total      0.62    0.38    1.00

a) What’s the probability that a survey respondent will shop at the U.S. store?
b) What is the probability that a survey respondent will shop at the store given that they are younger than 20 years old?
c) What is the probability that a survey respondent who is older than 40 shops at the store?
d) What is the probability that a survey respondent is younger than 20 or will shop at the store?

59. Pharmaceutical company, again. Given the table of probabilities compiled for marketing managers in Exercise 57, are high blood pressure and high cholesterol independent? Explain.

60. International relocation, again. Given the table of probabilities compiled for a department store chain in Exercise 58, are age and shopping at the department store independent? Explain.

61. Coffeehouse survey, part 2. Look again at the data from the coffeehouse survey in Exercise 53.
a) If we select a person at random, what’s the probability we choose a person between 18 and 24 years old who is spending more at coffeehouses?
b) Among the 18- to 24-year olds, what is the probability that the person responded that they are not spending more time at coffeehouses?
c) What’s the probability that a person who spends the same amount of time at coffeehouses is between 35 and 44 years old?
d) If the person responded that they spend more time, what’s the probability that they are at least 65 years old?
e) What’s the probability that a person at least 65 years old spends the same amount of time?
f) Are the responses to the question and age independent?

62. Electronic communications, part 2. Look again at the electronic communications data in Exercise 54.
a) If we select a respondent at random, what’s the probability that we choose a person who earns less than $50K and responded “somewhat”?
b) Among those earning $50–99.9K, what is the probability that the person responded “not at all”?
c) What’s the probability that a person who responded “very much” was earning at least $100K?
d) If the person responded “very much,” what is the probability that they earn between $50K and $99.9K?
e) Are the responses to the question and income level independent?

63. Real estate, part 2. In the real estate research described in Exercise 33, 64% of homes for sale have garages, 21% have swimming pools, and 17% have both features.
a) What is the probability that a home for sale has a garage, but not a pool?
b) If a home for sale has a garage, what’s the probability that it has a pool, too?
c) Are having a garage and a pool independent events? Explain.
d) Are having a garage and a pool mutually exclusive? Explain.

64. Polling. Professional polling organizations face the challenge of selecting a representative sample of U.S. adults by telephone. This has been complicated by people who only use cell phones and by others whose landline phones are unlisted. A careful survey by Democracy Corps determined the following proportions:

Cell phone only             39%
Both cell and landline      29%
Landline only, listed       22%
Landline only, unlisted      7%

a) What’s the probability that a randomly selected U.S. adult has a landline?
b) What’s the probability that a U.S. adult has a landline given that he or she has a cell phone?
c) Are having a cell phone and a landline independent? Explain.
d) Are having a cell phone and a landline disjoint? Explain.

65. Above or below average skills. According to the OECD’s Skills Outlook 2013, labor market skills are crucial to the development of international economies. The following table indicates whether mean literacy and numeracy proficiencies of 16 to 24 year-olds are above or below the OECD average for a sample of six countries.

Skills of 16 to 24 Year Olds Relative to OECD Average

Country          Literacy Proficiency    Numeracy Proficiency    Problem Solving Skills
Australia        above                   below                   below
Finland          above                   above                   above
France           below                   below                   (no score)
Germany          below                   above                   above
Japan            above                   above                   above
United States    below                   below                   below
a) In this sample, what proportion of countries score above average for numeracy?
b) In this sample, what proportion of countries scoring above average for literacy also score above average for numeracy?
c) Are literacy and numeracy proficiency scores independent? Explain.

66. Above or below average skills, part 2. The third column of the table in Exercise 65 gives scores for the highest level of problem solving skills in technology-rich environments: whether the percentage of 16 to 24 year-olds achieving level 3 is above or below the OECD average of 9%. There is no score for France. Are problem solving skills in this example independent from numeracy skills, and from literacy skills?

67. Used cars. A business student is searching for a used car to purchase, so she posts an ad to a website saying she wants to buy a used Jeep between $18,000 and $20,000. From Kelly’s BlueBook.com, she learns that there are 149 cars matching that description within a 30-mile radius of her home. If we assume that those are the people who will call her and that they are equally likely to call her:

                  Price
Car Make          $18,000–$18,999    $19,000–$19,999    Total
Commander                       3                  6        9
Compass                         6                  1        7
Grand Cherokee                 33                 33       66
Liberty                        17                  6       23
Wrangler                       33                 11       44
Total                          92                 57      149

a) What is the probability that the first caller will be a Jeep Liberty owner?
b) What is the probability that the first caller will own a Jeep Liberty that costs between $18,000 and $18,999?
c) If the first call offers her a Jeep Liberty, what is the probability that it costs less than $19,000?
d) Suppose she decides to ignore calls with cars whose cost is ≥ $19,000. What is the probability that the first call she takes will offer to sell her a Jeep Liberty?

68. CEO relocation. The CEO of a mid-sized company has to relocate to another part of the country. To make it easier, the company has hired a relocation agency to help purchase a house. The CEO has 5 children and so has specified that the house have at least 5 bedrooms, but hasn’t put any other constraints on the search. The relocation agency has narrowed the search down to the houses in the table and has selected one house to showcase to the CEO and family on their trip out to the new site. The agency doesn’t know it, but the family has its heart set on a Cape Cod house with a fireplace. If the agency selected the house at random, without regard to this:

              Fireplace?
House Type    No    Yes    Total
Cape Cod       7      2        9
Colonial       8     14       22
Other          6      5       11
Total         21     21       42

a) What is the probability that the selected house is a Cape Cod?
b) What is the probability that the house is a Colonial with a fireplace?
c) If the house is a Cape Cod, what is the probability that it has a fireplace?
d) What is the probability that the selected house is what the family wants?

*69. Computer reliability. Laptop computers have been growing in popularity according to a study by Current Analysis Inc. Laptops now represent more than half the computer sales in the United States. A campus bookstore sells both types of computers and in the last semester sold 56% laptops and 44% desktops. Reliability rates for the two types of machines are quite different, however. In the first year, 5% of desktops require service, while 15% of laptops have problems requiring service.
a) Sketch a probability tree for this situation.
b) What percentage of computers sold by the bookstore last semester required service?
c) Given that a computer required service, what is the probability that it was a laptop?

Just Checking Answers

1 The probability of going up on the next day is not affected by the previous day’s outcome.

2 a) 0.30
b) \(0.30(0.30) = 0.09\)
c) \((1 - 0.30)^2(0.30) = 0.147\)
d) \(1 - (1 - 0.30)^5 = 0.832\)

3 a)

               Weekday
Before Five    Yes     No      Total
Yes            0.07    0.41    0.48
No             0.20    0.32    0.52
Total          0.27    0.73    1.00

b) \(P(BF \mid WD) = P(BF \text{ and } WD)/P(WD) = 0.07/0.27 = 0.259\)
c) No, shoppers can do both (and 7% do).
d) To be independent, we’d need \(P(BF \mid WD) = P(BF)\). \(P(BF \mid WD) = 0.259\), but \(P(BF) = 0.48\). They do not appear to be independent.
6
Random Variables and Probability Models
Metropolitan Life Insurance Company

In 1863, at the height of the U.S. Civil War, a group of businessmen in New York City decided to form a new company to insure Civil War soldiers against disabilities and injuries suffered from the war. After the war ended, they changed direction and decided to focus on selling life insurance. The new company was named Metropolitan Life (MetLife) because the bulk of the company’s clients were in the “metropolitan” area of New York City. Although an economic depression in the 1870s put many life insurance companies out of business, MetLife survived, modeling their business on similar successful programs in England. Taking advantage of spreading industrialism and the selling methods of British insurance agents, the company soon was enrolling as many as 700 new policies per day. By 1909, MetLife was the largest life insurer in the United States.

During the Great Depression of the 1930s, MetLife expanded their public service by promoting public health campaigns, focusing on educating the urban poor in major U.S. cities about the risk of tuberculosis. Because the company invested primarily in urban and farm mortgages, as opposed to the stock market, they survived the crash of 1929 and ended up investing heavily in the postwar U.S. housing boom. They were the principal investors in both the Empire State Building (1929) and Rockefeller Center (1931). During World War II, the company was the single largest contributor to the Allied cause, investing more than half of their total assets in war bonds.
Today, in addition to life insurance, MetLife manages pensions and investments. In 2000, the company held an initial public offering and entered the retail banking business in 2001 with the launch of MetLife Bank. The company’s public face is well known because of their use of Snoopy, the dog from the cartoon strip “Peanuts.”
Insurance companies make bets all the time. For example, they bet that you’re going to live a long life. Ironically, you bet that you’re going to die sooner. Both you and the insurance company want the company to stay in business, so it’s important to find a “fair price” for your bet. Of course, the right price for you depends on many factors, and nobody can predict exactly how long you’ll live. But when the company averages its bets over enough customers, it can make reasonably accurate estimates of the amount it can expect to collect on a policy before it has to pay out the benefit. To do that effectively, it must model the situation with a probability model. Using the resulting probabilities, the company can find the fair price of almost any situation involving risk and uncertainty.

Here’s a simple example. An insurance company offers a “death and disability” policy that pays $100,000 when a client dies or $50,000 if the client is permanently disabled. It charges a premium of $500 per year for this benefit. Is the company likely to make a profit selling such a plan? To answer this question, the company needs to know the probability that a client will die or become disabled in any year. From such actuarial information and the appropriate model, the company can calculate the expected value of this policy.
6.1 Expected Value of a Random Variable

Notation Alert: The most common letters for random variables are X, Y, and Z, but any capital letter might be used.

To model the insurance company’s risk, we need to define a few terms. The amount the company pays out on an individual policy is an example of a random variable, called that because its value is based on the outcome of a random event. We use a capital letter, in this case, X, to denote a random variable. We’ll denote a particular value that it can have by the corresponding lowercase letter, in this case, x. For the insurance company, x can be $100,000 (if you die that year), $50,000 (if you are disabled), or $0 (if neither occurs). Because we can list all the outcomes, we call this random variable a discrete random variable. A random variable that can take on any value (possibly bounded on one or both sides) is called a continuous random variable. Continuous random variables are common in business applications for modeling physical quantities like heights and weights, and monetary quantities such as profits, revenues, and spending.

Sometimes it is obvious whether to treat a random variable as discrete or continuous, but at other times the choice is more subtle. Age, for example, might be viewed as discrete if it is measured only to the nearest decade with possible values 10, 20, 30, . . . . In a scientific context, however, it might be measured more precisely and treated as continuous.

For both discrete and continuous variables, the collection of all the possible values and the probabilities associated with them is called the probability model for the random variable. For a discrete random variable, we can list the probability of all possible values in a table, or describe it by a formula. For example, to model the possible outcomes of a fair die, we can let X be the number showing on the face. The probability model for X is simply:

\[
P(X = x) =
\begin{cases}
1/6 & \text{if } x = 1, 2, 3, 4, 5, \text{ or } 6 \\
0 & \text{otherwise}
\end{cases}
\]
Suppose in our insurance risk example that the death rate in any year is 1 out of every 1000 people and that another 2 out of 1000 suffer some kind of disability. The loss, which we’ll denote as X, is a discrete random variable because it takes on only 3 possible values. We can display the probability model for X in a table, as in Table 6.1.

Policyholder Outcome    Payout x (cost)    Probability P(X = x)
Death                   100,000            1/1000
Disability               50,000            2/1000
Neither                       0            997/1000

Table 6.1 Probability model for an insurance policy.
Notation Alert: The expected value (or mean) of a random variable is written \(E(X)\) or \(\mu\). (Be sure not to confuse the mean of a random variable, calculated from probabilities, with the mean of a collection of data values, which is denoted by \(\bar{y}\) or \(\bar{x}\).)

Of course, we can’t predict exactly what will happen during any given year, but we can say what we expect to happen: in this case, what we expect the profit of a policy will be. The expected value of a policy is a parameter (a numerically valued attribute) of the probability model. In fact, it’s the mean. We’ll signify this with the notation \(E(X)\), for expected value (or sometimes \(\mu\) to indicate that it is a mean). This isn’t an average of data values, so we won’t estimate it. Instead, we calculate it directly from the probability model for the random variable. Because it comes from a model and not data, we use the parameter \(\mu\) to denote it (and not \(\bar{y}\) or \(\bar{x}\)).

To see what the insurance company can expect, think about some convenient number of outcomes. For example, imagine that they have exactly 1000 clients and that the outcomes in one year followed the probability model exactly: 1 died, 2 were disabled, and 997 survived unscathed. Then our expected payout would be:

\[
\mu = E(X) = \frac{100{,}000(1) + 50{,}000(2) + 0(997)}{1000} = 200
\]

So our expected payout comes to $200 per policy. Instead of writing the expected value as one big fraction, we can rewrite it as separate terms, each divided by 1000.

\[
\mu = E(X) = \$100{,}000\left(\frac{1}{1000}\right) + \$50{,}000\left(\frac{2}{1000}\right) + \$0\left(\frac{997}{1000}\right) = \$200
\]

Writing it this way, we can see that for each policy, there’s a 1/1000 chance that we’ll have to pay $100,000 for a death and a 2/1000 chance that we’ll have to pay $50,000 for a disability. Of course, there’s a 997/1000 chance that we won’t have to pay anything. So the expected value of a (discrete) random variable is found by multiplying each possible value of the random variable by the probability that it occurs and then summing all those products. This gives the general formula for the expected value of a discrete random variable:¹

\[
E(X) = \sum x\,P(x).
\]

¹ The concept of expected values for continuous random variables is similar, but the calculation requires calculus and is beyond the scope of this text.
Be sure that every possible outcome is included in the sum. Verify that you have a valid probability model to start with—the probabilities should each be between 0 and 1 and should sum to one. (Recall the rules of probability in Chapter 5.)
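The check and the sum described above can be sketched in a few lines of Python. This is only an illustration, not part of the text's method: the names `payout_model`, `is_valid_model`, and `expected_value` are made up for the example, and exact fractions are used so the probabilities add to exactly one.

```python
from fractions import Fraction

# Probability model from Table 6.1: payout x -> P(X = x).
# Fractions keep the arithmetic exact (an illustrative choice).
payout_model = {
    100_000: Fraction(1, 1000),  # death
    50_000: Fraction(2, 1000),   # disability
    0: Fraction(997, 1000),      # neither
}


def is_valid_model(model):
    """Each probability lies in [0, 1] and together they sum to 1."""
    return all(0 <= p <= 1 for p in model.values()) and sum(model.values()) == 1


def expected_value(model):
    """E(X): sum of x * P(x) over every possible value x."""
    return sum(x * p for x, p in model.items())


assert is_valid_model(payout_model)
print(expected_value(payout_model))  # 200
```

Leaving out the "neither" row would not change the sum here, because its payout is $0, but the validity check would fail. That is a useful reminder that the model itself was incomplete.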
For Example: Calculating the expected value of a random variable

Questions: A fund-raising lottery offers 500 tickets for $3 each. If the grand prize is $250 and 4 second prizes are $50 each, what is the expected value of a single ticket? (Don’t count the cost of the ticket in this yet.) Now, including its cost, what is the expected value of the ticket? (Knowing this value, does it make any “sense” to buy a lottery ticket?) The fund-raising group has a target of $1000 to be raised by the lottery. Can they expect to make this much?

Answers: Each ticket has a 1/500 chance of winning the grand prize of $250, a 4/500 chance of winning $50, and a 495/500 chance of winning nothing. So

\[
E(X) = (1/500) \times \$250 + (4/500) \times \$50 + (495/500) \times \$0 = \$0.50 + \$0.40 + \$0.00 = \$0.90.
\]

Including the cost, the expected value is \(\$0.90 - \$3 = -\$2.10\). Although no single person will lose $2.10 (they either lose $3 or win $50 or $250), $2.10 is the amount, on average, that the lottery gains per ticket. Therefore, they can expect to make \(500 \times \$2.10 = \$1050\).
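As a cross-check on the lottery arithmetic, here is a minimal Python sketch. It assumes all 500 tickets are sold, and the variable names (`prize_counts`, `expected_raised`, and so on) are invented for the illustration.

```python
from fractions import Fraction

TICKETS = 500
PRICE = 3  # dollars per ticket

# Prize -> number of tickets that win it (495 tickets win nothing).
prize_counts = {250: 1, 50: 4, 0: 495}
assert sum(prize_counts.values()) == TICKETS

# Expected winnings: each prize weighted by its chance of being drawn.
expected_winnings = sum(Fraction(prize * n, TICKETS) for prize, n in prize_counts.items())
expected_net = expected_winnings - PRICE   # from the buyer's point of view
expected_raised = -expected_net * TICKETS  # the organizers gain what buyers lose on average

print(float(expected_winnings))  # 0.9
print(float(expected_net))       # -2.1
print(float(expected_raised))    # 1050.0
```

The same $1050 also falls out directly as ticket revenue minus prizes: 500 × $3 − ($250 + 4 × $50).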
6.2 Standard Deviation of a Random Variable

Of course, this expected value (or mean) is not what actually happens to any particular policyholder. No individual policy actually costs the company $200. We are dealing with random events, so some policyholders receive big payouts and others nothing. Because the insurance company must anticipate this variability, it needs to know the standard deviation of the random variable.

For data, we calculate the standard deviation by first computing the deviation of each data value from the mean and squaring it. We perform a similar calculation when we compute the standard deviation of a (discrete) random variable as well. First, we find the deviation of each payout from the mean (expected value). (See Table 6.2.)

Policyholder Outcome    Payout x (cost)    Probability P(X = x)    Deviation (x − EV)
Death                   100,000            1/1000                  (100,000 − 200) = 99,800
Disability               50,000            2/1000                  (50,000 − 200) = 49,800
Neither                       0            997/1000                (0 − 200) = −200

Table 6.2 Deviations between the expected value and each payout (cost).

Next, we square each deviation. The variance is the expected value of those squared deviations. To find it, we multiply each by the appropriate probability and sum those products:

\[
Var(X) = 99{,}800^2\left(\frac{1}{1000}\right) + 49{,}800^2\left(\frac{2}{1000}\right) + (-200)^2\left(\frac{997}{1000}\right) = 14{,}960{,}000.
\]

Finally, we take the square root to get the standard deviation:

\[
SD(X) = \sqrt{14{,}960{,}000} \approx \$3867.82
\]

The insurance company can expect an average payout of $200 per policy, with a standard deviation of $3867.82. Think about that. The company charges $500 for each policy and expects to pay out $200 per policy. Sounds like an easy way to make $300. (In fact, most of the time, with probability 997/1000, the company pockets the entire $500.) But would you be willing to take on this risk yourself and sell all your friends policies like this? The problem is that occasionally the company loses big. With a probability of 1/1000, it will pay out $100,000, and with a probability of 2/1000, it will pay out $50,000. That may be more risk than you’re willing to take on. The standard deviation of $3867.82 gives an indication of the uncertainty of the profit, and that seems like a fairly large spread (and risk) for an average profit of $300.

Here are the formulas for these arguments. Because these are parameters of our probability model, the variance and standard deviation can also be written as \(\sigma^2\) and \(\sigma\), respectively (sometimes with the name of the random variable as a subscript). You should recognize both kinds of notation:

\[
\sigma^2 = Var(X) = \sum (x - \mu)^2 P(x) = \sum (x - E(X))^2 P(x), \quad \text{and} \quad \sigma = SD(X) = \sqrt{Var(X)}.
\]
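The variance and standard deviation formulas translate directly into code. The sketch below applies them to the insurance model of Table 6.1. It is an illustration only, with `model`, `mu`, `var`, and `sd` as assumed names.

```python
import math

# Table 6.1 model as (payout x, P(X = x)) pairs.
model = [(100_000, 0.001), (50_000, 0.002), (0, 0.997)]

mu = sum(x * p for x, p in model)               # E(X)
var = sum((x - mu) ** 2 * p for x, p in model)  # expected squared deviation
sd = math.sqrt(var)                             # back to dollar units

print(round(mu, 2), round(var), round(sd, 2))  # 200.0 14960000 3867.82
```

Note that the variance is in squared dollars. Taking the square root is what puts the standard deviation back on the same scale as the payouts.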
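For Example: Calculating the standard deviation of a random variable

Question: In the lottery example in Section 6.1, we found the expected value of a ticket, including its cost, to be −$2.10. What is the standard deviation? What does it say about your chances in the lottery? Comment.

Answer: Let X be the winnings on a ticket before its cost, so \(E(X) = \$0.90\). (Subtracting the $3 price from every outcome lowers the mean to −$2.10 but leaves each deviation from the mean, and so the standard deviation, unchanged.)

\[
\begin{aligned}
\sigma^2 = Var(X) &= \sum (x - E(X))^2 P(x) = \sum (x - 0.90)^2 P(x) \\
&= (250 - 0.90)^2 \frac{1}{500} + (50 - 0.90)^2 \frac{4}{500} + (0 - 0.90)^2 \frac{495}{500} \\
&= 62{,}050.81 \times \frac{1}{500} + 2{,}410.81 \times \frac{4}{500} + 0.81 \times \frac{495}{500} \\
&= 144.19
\end{aligned}
\]

so \(\sigma = \sqrt{144.19} \approx \$12.01\).

That’s a lot of variation for a mean of −$2.10, which reflects the fact that there is a small chance that you’ll win a lot but a large chance you’ll win nothing.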
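One point worth checking numerically is that deviations must be measured from the mean of the variable actually being described. The hedged sketch below (the helper `mean_and_sd` and the two models are names chosen for illustration) computes the spread twice: once for the prize money alone, and once for the net gain after the $3 ticket price.

```python
import math


def mean_and_sd(model):
    """Return (E(X), SD(X)) for a list of (value, probability) pairs."""
    mu = sum(x * p for x, p in model)
    var = sum((x - mu) ** 2 * p for x, p in model)
    return mu, math.sqrt(var)


probs = [1 / 500, 4 / 500, 495 / 500]
winnings = list(zip([250, 50, 0], probs))   # prize money only
net_gain = list(zip([247, 47, -3], probs))  # prize minus the $3 ticket price

mu_w, sd_w = mean_and_sd(winnings)
mu_n, sd_n = mean_and_sd(net_gain)
print(round(mu_w, 2), round(sd_w, 2))  # 0.9 12.01
print(round(mu_n, 2), round(sd_n, 2))  # -2.1 12.01
```

The means differ by exactly the ticket price, while the standard deviation is the same either way, about $12.01.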
Guided Example: Computer Inventory

As the head of inventory for a computer company, you’ve had a challenging couple of weeks. One of your warehouses recently had a fire, and you had to flag all the computers stored there to be recycled. On the positive side, you were thrilled that you had managed to ship two computers to your biggest client last week. But then you discovered that your assistant hadn’t heard about the fire and had mistakenly transported a whole truckload of computers from the damaged warehouse into the shipping center. It turns out that 30% of all the computers shipped last week were damaged. You don’t know whether your biggest client received two damaged computers, two undamaged ones, or one of each. Computers were selected at random from the shipping center for delivery.

If your client received two undamaged computers, everything is fine. If the client gets one damaged computer, it will be returned at your expense ($100) and you can replace it. However, if both computers are damaged, the client will cancel all other orders this month, and you’ll lose $10,000.

Question: What are the expected value and the standard deviation of your loss under this scenario?

Plan

Setup: State the problem.

We want to analyze the potential consequences of shipping damaged computers to a large client. We’ll look at the expected value and standard deviation of the amount we’ll lose. Let X = amount of loss. We’ll denote the receipt of an undamaged computer by U and the receipt of a damaged computer by D. The three possibilities are: two undamaged computers (U and U), two damaged computers (D and D), and one of each (UD or DU). Because the computers were selected randomly and the number in the warehouse is large, we can assume independence.

Model: List the possible values of the random variable, and compute all the values you’ll need to determine the probability model.

Because the events are independent, we can use the multiplication rule (Chapter 5) and find:

\[
\begin{aligned}
P(UU) &= P(U) \times P(U) = 0.7 \times 0.7 = 0.49 \\
P(DD) &= P(D) \times P(D) = 0.3 \times 0.3 = 0.09 \\
\text{So, } P(UD \text{ or } DU) &= 1 - (0.49 + 0.09) = 0.42
\end{aligned}
\]

We have the following model for all possible values of X.

Outcome            x         P(X = x)
Two damaged        10,000    P(DD) = 0.09
One damaged        100       P(UD or DU) = 0.42
Neither damaged    0         P(UU) = 0.49

Do

Mechanics: Find the expected value.

\[
E(X) = 0(0.49) + 100(0.42) + 10{,}000(0.09) = \$942.00
\]

Find the variance.

\[
Var(X) = (0 - 942)^2 \times (0.49) + (100 - 942)^2 \times (0.42) + (10{,}000 - 942)^2 \times (0.09) = 8{,}116{,}836
\]

Find the standard deviation.

\[
SD(X) = \sqrt{8{,}116{,}836} = \$2849.01
\]

Report

Conclusion: Interpret your results in context.

Memo
Re: Damaged computers
The recent shipment of two computers to our large client may have some serious problems. Even though there is about a 50% chance that they will receive two perfectly good computers, there is a 9% chance that they will receive two damaged computers and will cancel the rest of their monthly order. We have analyzed the expected loss to the firm as $942 with a standard deviation of $2849.01. The large standard deviation reflects the fact that there is a real possibility of losing $10,000 from the mistake.

Both numbers seem reasonable. The expected value of $942 is between the extremes of $0 and $10,000, and there’s great variability in the outcome values.
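The Plan and Do steps of this example can be reproduced with a short Python sketch. The names are hypothetical, and the probability of exactly one damaged computer is computed directly as 2 × 0.7 × 0.3 rather than as a complement, which gives the same 0.42.

```python
import math

p_damaged = 0.30
p_ok = 1 - p_damaged

# Loss x -> P(X = x), assuming the two computers are independent draws.
loss_model = {
    0: p_ok * p_ok,                 # UU: both fine
    100: 2 * p_ok * p_damaged,      # UD or DU: one return at $100
    10_000: p_damaged * p_damaged,  # DD: client cancels
}
assert math.isclose(sum(loss_model.values()), 1.0)  # valid probability model

mean_loss = sum(x * p for x, p in loss_model.items())
var_loss = sum((x - mean_loss) ** 2 * p for x, p in loss_model.items())
sd_loss = math.sqrt(var_loss)

print(round(mean_loss, 2), round(var_loss), round(sd_loss, 2))  # 942.0 8116836 2849.01
```

Changing `p_damaged` shows how sensitive the expected loss is to the damage rate, since the $10,000 outcome has probability `p_damaged` squared.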
6.3
Properties of Expected Values and Variances Our example insurance company expected to pay out an average of $200 per policy, with a standard deviation of about $3868. The expected profit then was +500 - +200 = +300 per policy. Suppose that the company decides to lower the price of the premium by $50 to $450. It’s pretty clear that the expected profit would drop an average of $50 per policy, to +450 - +200 = +250. What about the standard deviation? We know that adding or subtracting a constant from data shifts the mean but doesn’t change the variance or standard deviation. The same is true of random variables:2 E1X { c2 = E1X2 { c, Var1X { c2 = Var1X2, and SD1X { c2 = SD1X2. What if the company decides to double all the payouts—that is, pay $200,000 for death and $100,000 for disability? This would double the average payout per policy and also increase the variability in payouts. In general, multiplying each value of a random variable by a constant multiplies the mean by that constant and multiplies the variance by the square of the constant: E1aX2 = aE1X2, and Var1aX2 = a2Var1X2. Taking square roots of the last equation shows that the standard deviation is multiplied by the absolute value of the constant: SD1aX2 = a SD1X2. This insurance company sells policies to more than just one person. We’ve just seen how to compute means and variances for one person at a time. What happens to the mean and variance when we have a collection of customers? The profit on a group of customers is the sum of the individual profits, so we’ll need to know how to find expected values and variances for sums. To start, consider a simple case with just two customers who we’ll call Mr. Ecks and Ms. Wye. With an expected payout of $200 on each policy, we might expect a total of +200 + +200 = +400 to be paid out on the two policies—nothing surprising there. 
In other words, we have the Addition Rule for Expected Values of Random Variables: The expected value of the sum (or difference) of random variables is the sum (or difference) of their expected values: \(E(X \pm Y) = E(X) \pm E(Y)\).
² The rules in this section are true for both discrete and continuous random variables.
The variability is another matter. Is the risk of insuring two people the same as the risk of insuring one person for twice as much? We wouldn’t expect both clients to die or become disabled in the same year. In fact, because we’ve spread the risk, the standard deviation should be smaller. Indeed, this is the fundamental principle behind insurance. By spreading the risk among many policies, a company can keep the standard deviation quite small and predict costs more accurately. It’s much less risky to insure thousands of customers than one customer when the total expected payout is the same, assuming that the events are independent. Catastrophic events such as hurricanes or earthquakes that affect large numbers of customers at the same time destroy the independence assumption, and often the insurance company along with it.
But how much smaller is the standard deviation of the sum? It turns out that, if the random variables are independent, we have the Addition Rule for Variances of (Independent) Random Variables: The variance of the sum or difference of two independent random variables is the sum of their individual variances: \(\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)\) if X and Y are independent.
Math Box
Pythagorean Theorem of Statistics
We often use the standard deviation to measure variability, but when we add independent random variables, we use their variances. Think of the Pythagorean Theorem. In a right triangle (only), the square of the length of the hypotenuse is the sum of the squares of the lengths of the other two sides: \(c^2 = a^2 + b^2\).
[Figure: right triangle with legs a and b and hypotenuse c.]
For independent random variables (only), the square of the standard deviation of their sum is the sum of the squares of their standard deviations: \(\mathrm{SD}^2(X + Y) = \mathrm{SD}^2(X) + \mathrm{SD}^2(Y)\). It’s simpler to write this with variances: \(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)\), but we’ll use the standard deviation formula often as well: \(\mathrm{SD}(X + Y) = \sqrt{\mathrm{Var}(X) + \mathrm{Var}(Y)}\).
For Mr. Ecks and Ms. Wye, the insurance company can expect their outcomes to be independent, so (using X for Mr. Ecks’s payout and Y for Ms. Wye’s):
\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) = 14{,}960{,}000 + 14{,}960{,}000 = 29{,}920{,}000. \]
Let’s compare the variance of writing two independent policies to the variance of writing only one for twice the size. If the company had insured only Mr. Ecks for twice as much, the variance would have been \(\mathrm{Var}(2X) = 2^2\,\mathrm{Var}(X) = 4 \times 14{,}960{,}000 = 59{,}840{,}000\), or twice as big as with two independent policies, even though the expected payout is the same.
Of course, variances are in squared units. The company would prefer to know standard deviations, which are in dollars. The standard deviation of the payout for two independent policies is \(\mathrm{SD}(X + Y) = \sqrt{\mathrm{Var}(X + Y)} = \sqrt{29{,}920{,}000} = \$5469.92\). But the standard deviation of the payout for a single policy of twice the size is twice the standard deviation of a single policy: \(\mathrm{SD}(2X) = 2\,\mathrm{SD}(X) = 2(\$3867.82) = \$7735.64\), or about 40% more than the standard deviation of the sum of the two independent policies.
If the company has two customers, then it will have an expected annual total payout (cost) of $400 with a standard deviation of about $5470. If they write one policy with an expected annual payout of $400, they increase the standard deviation by about 40%. Spreading risk by insuring many independent customers is one of the fundamental principles in insurance and finance.
For Random Variables, Does X + X + X = 3X?
Maybe, but be careful. As we’ve just seen, insuring one person for $300,000 is not the same risk as insuring three people for $100,000 each. When each instance represents a different outcome for the same random variable, though, it’s easy to fall into the trap of writing all of them with the same symbol. Don’t make this common mistake. Make sure you write each instance as a different random variable. Just because each random variable describes a similar situation doesn’t mean that each random outcome will be the same. What you really mean is \(X_1 + X_2 + X_3\). Written this way, it’s clear that the sum shouldn’t necessarily equal 3 times anything.
Let’s review the rules of expected values and variances for sums and differences.
• The expected value of the sum of two random variables is the sum of the expected values.
• The expected value of the difference of two random variables is the difference of the expected values: \(E(X \pm Y) = E(X) \pm E(Y)\).
• If the random variables are independent, the variance of their sum or difference is always the sum of the variances: \(\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)\).
Do we always add variances? Even when we take the difference of two random quantities? Yes! Think about the two insurance policies. Suppose we want to know the mean and standard deviation of the difference in payouts to the two clients. Since each policy has an expected payout of $200, the expected difference is $200 - $200 = $0. If we computed the variance of the difference by subtracting variances, we would get $0 for the variance. But that doesn’t make sense. Their difference won’t always be exactly $0. In fact, the difference in payouts could range from $100,000 to -$100,000, a spread of $200,000. The variability in differences increases as much as the variability in sums. If the company has two customers, the difference in payouts has a mean of $0 and a standard deviation of about $5470.
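The comparison between two independent policies and one double-size policy can be checked numerically. The following is a minimal sketch that takes the single-policy variance of 14,960,000 from the text as given; the variable names are illustrative only.

```python
import math

# Variance of the payout on one policy, as given in the text (dollars squared).
VAR_SINGLE = 14_960_000

# Two independent policies: the variances add.
var_two_policies = VAR_SINGLE + VAR_SINGLE
sd_two_policies = math.sqrt(var_two_policies)

# One policy for twice the amount: Var(2X) = 2^2 * Var(X).
var_double_policy = 2 ** 2 * VAR_SINGLE
sd_double_policy = math.sqrt(var_double_policy)

# Difference of two independent payouts: the variances still add.
sd_difference = math.sqrt(VAR_SINGLE + VAR_SINGLE)

print(f"SD, two independent policies: ${sd_two_policies:,.2f}")
print(f"SD, one double-size policy:   ${sd_double_policy:,.2f}")
print(f"Ratio of the two SDs:         {sd_double_policy / sd_two_policies:.3f}")
print(f"SD, difference in payouts:    ${sd_difference:,.2f}")
```

The ratio of the two standard deviations is the square root of 2, roughly 1.41, which is the "about 40% more" quoted above. The double-size figure computed here can differ from $7735.64 by a cent because the text doubles the already rounded value $3867.82.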
For Example
Sums of random variables
You are considering investing $1000 into one or possibly two different investment funds. Historically, each has delivered 5% a year in profit with a standard deviation of 3%. So, a $1000 investment would produce $50 with a standard deviation of $30.
Question Assuming the two funds are independent, what are the relative advantages and disadvantages of putting $1000 into one, or splitting the $1000 and putting $500 into each? Compare the means and SDs of the profit from the two strategies.
Answer Let X = amount gained by putting $1000 into one fund. Then \(E(X) = 0.05 \times 1000 = \$50\) and \(\mathrm{SD}(X) = 0.03 \times 1000 = \$30\).
Let W = amount gained by putting $500 into each. \(W_1\) and \(W_2\) are the amounts from each fund, respectively. \(E(W_1) = E(W_2) = 0.05 \times 500 = \$25\). So \(E(W) = E(W_1) + E(W_2) = \$25 + \$25 = \$50\). The expected values of the two strategies are the same. You expect on average to earn $50 on $1000.
\[ \mathrm{SD}(W) = \sqrt{\mathrm{SD}^2(W_1) + \mathrm{SD}^2(W_2)} = \sqrt{(0.03 \times 500)^2 + (0.03 \times 500)^2} = \sqrt{15^2 + 15^2} = \$21.213 \]
The standard deviation of the amount earned is $21.213 by splitting the investment amount compared to $30 for investing in one. The expected values are the same. Spreading the investment into more than one vehicle reduces the variation. On the other hand, keeping it all in one vehicle increases the chances of both extremely good and extremely bad returns. Which one is better depends on an individual’s appetite for risk.³
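The same arithmetic extends to splitting the money among more than two funds. Here is a small sketch under the example's assumptions (independent funds, each returning 5% on average with a 3% standard deviation); the helper name `split_profit_stats` is hypothetical and not from the text.

```python
import math

def split_profit_stats(total, n_funds, mean_rate=0.05, sd_rate=0.03):
    """Mean and SD of total profit when `total` dollars are split equally
    among `n_funds` independent funds (hypothetical helper for illustration)."""
    per_fund = total / n_funds
    mean = n_funds * (mean_rate * per_fund)          # expected values always add
    variance = n_funds * (sd_rate * per_fund) ** 2   # variances add only if independent
    return mean, math.sqrt(variance)

for k in (1, 2, 4):
    mean, sd = split_profit_stats(1000, k)
    print(f"{k} fund(s): mean = ${mean:.2f}, SD = ${sd:.3f}")
```

Under these assumptions the mean stays at $50 however the money is split, while the standard deviation shrinks by a factor of the square root of the number of funds.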
Covariance
In Chapter 4 we saw that the association of two variables could be measured with their correlation. What about random variables? We can talk about the correlation between random variables, too. But it’s easier to start with a related concept called covariance. If X is a random variable with expected value \(E(X) = \mu\) and Y is a random variable with expected value \(E(Y) = \nu\), then the covariance of X and Y is defined as \(\mathrm{Cov}(X, Y) = E((X - \mu)(Y - \nu))\). The covariance, like the correlation, measures how X and Y vary together (co = together).
The covariance gives us the extra information we need to find the variance of the sum or difference of two random variables when they are not independent: \(\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \pm 2\,\mathrm{Cov}(X, Y)\). When X and Y are independent, their covariance is zero, so we have the Pythagorean theorem of statistics, as we saw earlier. When the variables are positively associated, the variance of their sum is increased (and the variance of their difference is decreased). This is why it’s riskier to invest in two things that are related and less risky to diversify by finding two investments that are nearly independent (and thus have covariance near zero).
Covariance, unlike correlation, doesn’t have to be between -1 and 1, which can make it harder to interpret. To fix this “problem,” we divide the covariance by each of the standard deviations to get the correlation:
\[ \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \]
³ The assumption of independence is crucial, but not always (or ever) reasonable. As a March 3, 2010, article on CNN Money stated: “It’s only when economic conditions start to return to normal . . . that investors, and investments, move independently again. That’s when diversification reasserts its case. . . .” money.cnn.com/2010/03/03/pf/funds/diversification.moneymag/index.htm
This is the random variable analogue of the correlation coefficient, r, which we saw in Chapter 4 for data. For random variables, correlation is usually denoted with the Greek letter \(\rho\).
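To see the covariance rule at work, the sketch below uses a small made-up joint distribution for the profits of two related investments. The four outcomes and their probabilities are assumptions chosen purely for illustration; they do not come from the text. The code computes the covariance and correlation from their definitions and checks the variance of the sum and of the difference against the formula.

```python
import math

# Hypothetical joint distribution: (x, y) -> probability. Illustrative only.
joint = {(-10, -10): 0.3, (-10, 30): 0.2, (30, -10): 0.2, (30, 30): 0.3}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(prob * g(x, y) for (x, y), prob in joint.items())

mu = expect(lambda x, y: x)                      # E(X)
nu = expect(lambda x, y: y)                      # E(Y)
var_x = expect(lambda x, y: (x - mu) ** 2)
var_y = expect(lambda x, y: (y - nu) ** 2)
cov = expect(lambda x, y: (x - mu) * (y - nu))   # Cov(X, Y)
corr = cov / math.sqrt(var_x * var_y)            # Corr(X, Y)

# Variances of X + Y and X - Y computed directly from the definition...
var_sum_direct = expect(lambda x, y: (x + y - (mu + nu)) ** 2)
var_diff_direct = expect(lambda x, y: (x - y - (mu - nu)) ** 2)
# ...and from Var(X +/- Y) = Var(X) + Var(Y) +/- 2 Cov(X, Y).
var_sum_rule = var_x + var_y + 2 * cov
var_diff_rule = var_x + var_y - 2 * cov

print(f"Cov = {cov:.1f}, Corr = {corr:.2f}")
print(f"Var(X + Y): direct {var_sum_direct:.1f}, rule {var_sum_rule:.1f}")
print(f"Var(X - Y): direct {var_diff_direct:.1f}, rule {var_diff_rule:.1f}")
```

Because the covariance in this made-up example is positive, the variance of the sum comes out larger than the sum of the two variances, and the variance of the difference comes out smaller.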
Just Checking
1 Suppose that the time it takes a customer to get and pay for seats at the ticket window of a baseball park is a random variable with a mean of 100 seconds and a standard deviation of 50 seconds. When you get there, you find only two people in line in front of you.
a) How long do you expect to wait for your turn to get tickets?
b) What’s the standard deviation of your wait time?
c) What assumption did you make about the two customers in finding the standard deviation?
6.4
Bernoulli Trials
When Google Inc. designed their web browser Chrome, they worked hard to minimize the probability that their browser would have trouble displaying a website. Before releasing the product, they had to test many websites to discover those that might fail. Although web browsers are relatively new, quality control inspection such as this is common throughout manufacturing worldwide and has been in use in industry for nearly 100 years.
The developers of Chrome sampled websites, recording whether the browser displayed the website correctly or had a problem. We call the act of inspecting a website a trial. There are two possible outcomes: either the website is displayed correctly or it isn’t. The developers thought that whether any particular website displayed correctly was independent from other sites. Situations like this occur often and are called Bernoulli trials. To summarize, trials are Bernoulli if:
• There are only two possible outcomes (called success and failure) for each trial.
• The probability of success, denoted p, is the same on every trial. (The probability of failure, \(1 - p\), is often denoted q.)
• The trials are independent. Finding that one website does not display correctly does not change what might happen with the next website.
Notation Alert
Now we have two more reserved letters. Whenever we deal with Bernoulli trials, p represents the probability of success, and q represents the probability of failure. (Of course, \(q = 1 - p\).)
[Cartoon: Calvin and Hobbes © 1993 Watterson. Distributed by Universal Uclick. Reprinted with permission. All rights reserved.]
Common examples of Bernoulli trials include tossing a coin, collecting responses on Yes/No questions from surveys, or even shooting free throws in a basketball game. Bernoulli trials are remarkably versatile and can be used to model a wide variety of real-life situations. The specific question you might ask in different situations will give rise to different random variables that, in turn, have different probability models. Of course, the Chrome developers wanted to find websites that wouldn’t display so they could fix any problems in the browser. So for them a “success” was finding a failed website. The labels “success” and “failure” are often applied arbitrarily, so be sure you know what they mean in any particular situation.
One of the important requirements for Bernoulli trials is that the trials be independent. Sometimes that’s a reasonable assumption. Is it true for our example? It’s easy to imagine that related sites might have similar problems, but if the sites are selected at random, whether one has a problem should be independent of others.
The 10% Condition: Bernoulli trials must be independent. In theory, we need to sample from a population that’s infinitely big. However, if the population is finite, it’s still okay to proceed as long as the sample is smaller than 10% of the population. In Google’s case, they just happened to have a directory of millions of websites, so most samples would easily satisfy the 10% condition.
6.5
Discrete Probability Models
Sam Savage, Professor at Stanford University, says in his book, The Flaw of Averages, that plans based only on averages are, on average, wrong. Unfortunately, many business owners make decisions based on averages: the average amount sold last year, the average number of customers seen last month, etc. But averages are just too simple to represent real-world business practice. Fortunately, we can do better by modeling business situations with a probability model. Probability models can play an important role in helping decision makers predict both the outcomes and the consequences of their decision alternatives. In this section we’ll see that some fairly simple models let us model a wide variety of business phenomena.
The Uniform Model
We’ll start with the simplest probability model of all, the Uniform model. When we first studied probability in Chapter 5, we saw that equally likely events were the simplest case. For example, a single die can turn up 1, 2, …, 6 on one toss. A probability model for the toss is Uniform because each of the outcomes has the same probability (1/6) of occurring. Similarly if X is a random variable with possible outcomes 1, 2, …, n and \(P(X = i) = 1/n\) for each value of i, then we say X has a discrete Uniform distribution, U[1, …, n].
Unfortunately, some business decision makers take only one step away from averages and assume that all their unknown outcomes are equally likely. That can put them with the lottery ticket purchaser who thought her chances were 50/50: “either I win or I don’t.” Let’s look at some more realistic (and more useful) probability models.
Daniel Bernoulli (1700–1782) was the nephew of Jacob, whom you saw in Chapter 5. He was the first to work out the mathematics for what we now call Bernoulli trials.
The Geometric Model
What’s the probability that when Google tests Chrome on new websites, the first website that fails to display is the second one that they test? They can use Bernoulli trials to build a probability model. Let X denote the number of trials (websites) until the first such “success.” For X to be 2, the first website must have displayed correctly (which has probability \(1 - p\)), and then the second one must have not displayed correctly, which is a success, with probability p.⁴ Since the trials are independent, these probabilities can be multiplied, and so \(P(X = 2) = (1 - p)(p)\) or \(qp\). Maybe Google won’t find a success until the fifth trial. What are the chances of that? Chrome would have to display the first four websites correctly and then choke on the fifth one, so \(P(X = 5) = (1 - p)^4(p) = q^4p\).
Whenever the question is how long (how many trials) it will take to achieve the first success, the model that gives this probability is the geometric probability model. Geometric models are completely specified by one parameter, p, the probability of success. We denote them Geom(p).
Geometric Probability Model for Bernoulli Trials: Geom(p)
p = probability of success (and q = 1 - p = probability of failure)
X = number of trials until the first success occurs
\(P(X = x) = q^{x-1}p\)
Expected value: \(\mu = \frac{1}{p}\)
Standard deviation: \(\sigma = \sqrt{\frac{q}{p^2}}\)
⁴ This is an example of applying the term “success” to something we care about: a failure of the browser. Don’t be confused.
The geometric distribution can tell Google something important about its software. No large complex program is entirely free of bugs. So before releasing a program or upgrade, developers typically ask not whether it is free of bugs, but how long it is likely to be until the next bug is discovered. If the expected number of trials until the next bug discovery is high enough, then it makes business sense to ship the product rather than wait for that next bug report.
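The geometric formulas are easy to evaluate directly. The sketch below assumes, for illustration, a failure-to-display probability of p = 0.10 (the value the chapter uses for its binomial example); the function name `geom_pmf` is our own.

```python
import math

def geom_pmf(x, p):
    """P(X = x): the first success occurs on trial x (x = 1, 2, ...)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return (1 - p) ** (x - 1) * p

p = 0.10  # assumed probability that a website fails to display (a "success")

p_second = geom_pmf(2, p)            # first failure found on the 2nd site tested
p_fifth = geom_pmf(5, p)             # first failure found on the 5th site tested
mean_trials = 1 / p                  # expected number of sites until the first failure
sd_trials = math.sqrt((1 - p) / p ** 2)

print(f"P(X = 2) = {p_second:.4f}")
print(f"P(X = 5) = {p_fifth:.5f}")
print(f"mean = {mean_trials:.1f} trials, SD = {sd_trials:.2f} trials")
```

With this assumed p, the developers would expect to test about 10 websites before hitting the first display problem, though the standard deviation is nearly as large as the mean.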
The Binomial Model
Suppose Google tests 5 websites. What’s the probability that exactly 2 of them have problems (2 “successes”)? The geometric model tells how long it should take until the first success. Now we want to find the probability of getting exactly 2 successes among the 5 trials. We are still talking about Bernoulli trials, but we’re asking a different question.
This time we’re interested in the number of successes in the 5 trials, which we’ll denote by X. We want to find \(P(X = 2)\). Whenever the random variable of interest is the number of successes in a series of Bernoulli trials, it’s called a Binomial random variable. It takes two parameters to define this Binomial probability model: the number of trials, n, and the probability of success, p. We denote this model Binom(n, p).
Suppose that in an early phase of development, 10% of the sites exhibited some sort of problem so that p = 0.10. Exactly 2 successes in 5 trials means 2 successes and 3 failures. It seems logical that the probability should be \(p^2(1 - p)^3\). Unfortunately, it’s not quite that easy. That calculation would give you the probability of finding two successes and then three failures, in that order. But you could find the two successes in a lot of other ways, for example in the 2nd and 4th website you test. The probability of that sequence is \((1 - p)p(1 - p)p(1 - p)\), which is also \(p^2(1 - p)^3\). In fact, as long as there are two successes and three failures, the probability will always be the same, regardless of the order of the sequence of successes and failures. The probability will be \(p^2(1 - p)^3\). To find the probability of getting 2 successes in 5 trials in any order, we just need to know how many ways that outcome can occur.
Fortunately, all the sequences that lead to the same number of successes are disjoint. (For example, if your successes came on the first two trials, they couldn’t come on the last two.) So once we find all the different sequences, we can add up their probabilities. And since the probabilities are all the same, we just need to find how many sequences there are and multiply \(p^2(1 - p)^3\) by that number.
Each different order in which we can have k successes in n trials is called a “combination.” The total number of ways this can happen is written \(\binom{n}{k}\) or \({}_nC_k\) and pronounced “n choose k”:
\[ \binom{n}{k} = {}_nC_k = \frac{n!}{k!(n - k)!} \quad \text{where } n! = n \times (n - 1) \times \cdots \times 1. \]
For 2 successes in 5 trials,
\[ \binom{5}{2} = \frac{5!}{2!(5 - 2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1)(3 \times 2 \times 1)} = \frac{5 \times 4}{2 \times 1} = 10. \]
So there are 10 ways to get 2 successes in 5 websites, and the probability of each is \(p^2(1 - p)^3\). To find the probability of exactly 2 successes in 5 trials, we multiply the probability of any particular order by the number of possible different orders:
\[ P(\text{exactly 2 successes in 5 trials}) = 10p^2(1 - p)^3 = 10(0.10)^2(0.90)^3 = 0.0729 \]
In general, we can write the probability of exactly k successes in n trials as
\[ P(X = k) = \binom{n}{k} p^k q^{n-k}. \]
If the probability that any single website has a display problem is 0.10, what’s the expected number of websites with problems if we test 100 sites? You probably said 10. We suspect you didn’t use the formula for expected value that involves multiplying each value times its probability and adding them up. In fact, there is an easier way to
find the expected value for a Binomial random variable. You just multiply the probability of success by n. In other words, \(E(X) = np\). We prove this in the next Math Box. The standard deviation is less obvious and you can’t just rely on your intuition. Fortunately, the formula for the standard deviation also boils down to something simple: \(\mathrm{SD}(X) = \sqrt{npq}\). If you’re curious to know where that comes from, it’s in the Math Box, too. In our website example, with n = 100, \(E(X) = np = 100(0.10) = 10\) so we expect to find 10 successes out of the 100 trials. The standard deviation is \(\sqrt{100 \times 0.10 \times 0.90} = 3\) websites.
Binomial Model for Bernoulli Trials: Binom(n, p)
n = number of trials
p = probability of success (and q = 1 - p = probability of failure)
X = number of successes in n trials
\(P(X = x) = \binom{n}{x} p^x q^{n-x}\), where \(\binom{n}{x} = \frac{n!}{x!(n - x)!}\)
Mean: \(\mu = np\)
Standard deviation: \(\sigma = \sqrt{npq}\)
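The Binomial formulas in the box can be evaluated with a few lines of code. This is a minimal sketch using Python's standard `math.comb` for the "n choose k" count; the function name `binom_pmf` is our own, and the numbers are the website example's (p = 0.10).

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p): exactly k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Exactly 2 problem sites among 5 tested, with p = 0.10.
ways = math.comb(5, 2)
p_two_of_five = binom_pmf(2, 5, 0.10)

# Mean and SD for 100 tested sites, from the shortcut formulas...
n, p = 100, 0.10
mean = n * p
sd = math.sqrt(n * p * (1 - p))
# ...and the mean again from the definition (sum of value times probability).
mean_from_definition = sum(k * binom_pmf(k, n, p) for k in range(n + 1))

print(f"Number of orderings: {ways}")
print(f"P(X = 2) with n = 5: {p_two_of_five:.4f}")
print(f"n = 100: mean = {mean:.1f}, SD = {sd:.1f}")
print(f"Mean from the definition: {mean_from_definition:.4f}")
```

Summing value times probability over all 101 possible outcomes gives the same mean as the shortcut np, which is the result the Math Box derives.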
Math Box
Mean and Standard Deviation of the Binomial Model
To derive the formulas for the mean and standard deviation of the Binomial model we start with the most basic situation. Consider a single Bernoulli trial with probability of success p. Let’s find the mean and variance of the number of successes.
Here’s the probability model for the number of successes:
x           0    1
P(X = x)    q    p
Find the expected value:
\[ E(X) = 0q + 1p = p \]
Now the variance:
\[
\begin{aligned}
\mathrm{Var}(X) &= (0 - p)^2 q + (1 - p)^2 p \\
&= p^2 q + q^2 p \\
&= pq(p + q) \\
&= pq(1) \\
&= pq
\end{aligned}
\]
What happens when there is more than one trial? A Binomial model simply counts the number of successes in a series of n independent Bernoulli trials. That makes it easy to find the mean and standard deviation of a binomial random variable, Y.
\[
\begin{aligned}
\text{Let } Y &= X_1 + X_2 + X_3 + \cdots + X_n \\
E(Y) &= E(X_1 + X_2 + X_3 + \cdots + X_n) \\
&= E(X_1) + E(X_2) + E(X_3) + \cdots + E(X_n) \\
&= p + p + p + \cdots + p \quad (\text{There are } n \text{ terms.})
\end{aligned}
\]
So, as we thought, the mean is \(E(Y) = np\).
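And since the trials are independent, the variances add:
\[
\begin{aligned}
\mathrm{Var}(Y) &= \mathrm{Var}(X_1 + X_2 + X_3 + \cdots + X_n) \\
&= \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \mathrm{Var}(X_3) + \cdots + \mathrm{Var}(X_n) \\
&= pq + pq + pq + \cdots + pq \quad (\text{Again, } n \text{ terms.}) \\
&= npq
\end{aligned}
\]
Voila! The standard deviation is \(\mathrm{SD}(Y) = \sqrt{npq}\).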
Guided Example
The American Red Cross
Every two seconds someone in America needs blood. The American Red Cross is a nonprofit organization that runs like a large business. It serves over 3000 hospitals around the United States, providing a wide range of high-quality blood products and blood donor and patient testing services. It collects blood from over 4 million donors, provides blood to millions of patients, and is dedicated to meeting customer needs.
Balancing supply and demand is complicated not only by the logistics of finding donors that meet health criteria, but also by the fact that the blood type of donor and patient must be matched. People with O-negative blood are called “universal donors” because O-negative blood can be given to patients with any blood type. Only about 6% of people have O-negative blood, which presents a challenge in managing and planning. This is especially true, since, unlike a manufacturer who can balance supply by planning to produce or to purchase more or less of a key item, the Red Cross gets its supply from volunteer donors who show up more-or-less at random (at least in terms of blood type). Modeling the arrival of samples with various blood types helps the Red Cross managers to plan their blood allocations.
Here’s a small example of the kind of planning required. Of the next 20 donors to arrive at a blood donation center, how many universal donors can be expected? Specifically, what are the mean and standard deviation of the number of universal donors? What is the probability that there are 2 or 3 universal donors?
Question 1: What are the mean and standard deviation of the number of universal donors? Question 2: What is the probability that there are exactly 2 or 3 universal donors out of the 20 donors?
Plan
Setup State the question. Check to see that these are Bernoulli trials.
We want to know the mean and standard deviation of the number of universal donors among 20 people and the probability that there are 2 or 3 of them.
✓ There are two outcomes: success = O-negative, failure = other blood types
✓ p = 0.06
✓ 10% Condition: Fewer than 10% of all possible donors have shown up.
Variable Define the random variable.
Let X = number of O-negative donors among n = 20 people.
Model Specify the model.
We can model X with a Binom(20, 0.06).
Do
Mechanics Find the expected value and standard deviation. Calculate the probability of 2 or 3 successes.
\(E(X) = np = 20(0.06) = 1.2\)
\(\mathrm{SD}(X) = \sqrt{npq} = \sqrt{20(0.06)(0.94)} \approx 1.06\)
\[
\begin{aligned}
P(X = 2 \text{ or } 3) &= P(X = 2) + P(X = 3) \\
&= \binom{20}{2}(0.06)^2(0.94)^{18} + \binom{20}{3}(0.06)^3(0.94)^{17} \\
&\approx 0.2246 + 0.0860 = 0.3106
\end{aligned}
\]
Report
Conclusion Interpret your results in context.
Memo
Re: Blood Drive
In groups of 20 randomly selected blood donors, we’d expect to find an average of 1.2 universal donors, with a standard deviation of 1.06. About 31% of the time, we’d expect to find exactly 2 or 3 universal donors among the 20 people.
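One way to make the "about 31% of the time" statement concrete is to simulate many groups of 20 donors and count how often 2 or 3 of them are universal donors. The sketch below does that alongside the exact Binomial calculation. The seed, the number of simulated groups, and all names are arbitrary choices for illustration, and the simulated proportion will vary slightly from run to run if the seed is changed.

```python
import math
import random

N_DONORS = 20
P_UNIVERSAL = 0.06  # proportion of O-negative donors, from the example

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Exact probability of exactly 2 or 3 universal donors among 20.
exact = binom_pmf(2, N_DONORS, P_UNIVERSAL) + binom_pmf(3, N_DONORS, P_UNIVERSAL)

# Simulation: each donor is an independent Bernoulli trial.
rng = random.Random(2014)   # arbitrary seed so the run is repeatable
n_groups = 50_000           # arbitrary number of simulated groups
hits = 0
for _ in range(n_groups):
    universal = sum(rng.random() < P_UNIVERSAL for _ in range(N_DONORS))
    if universal in (2, 3):
        hits += 1
simulated = hits / n_groups

print(f"Exact P(X = 2 or 3):       {exact:.4f}")
print(f"Simulated long-run share:  {simulated:.4f}")
```

The simulated share should land close to the exact value, which is what "about 31% of the time" means in the memo: a long-run proportion over many groups of 20 donors.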
The Poisson Model
Simeon Denis Poisson was a French mathematician interested in rare events. He originally derived his model to approximate the Binomial model when the probability of a success, p, is very small and the number of trials, n, is very large. Poisson’s contribution was providing a simple approximation to find that probability. When you see the formula, however, you won’t necessarily see the connection to the Binomial.
Not all discrete events can be modeled as Bernoulli trials. Sometimes we’re interested simply in the number of events that occur over a given interval of time or space. For example, we might want to model the number of customers arriving in our store in the next ten minutes, the number of visitors to our website in the next minute, or the number of defects that occur in a computer monitor of a certain size. In cases like these, the number of occurrences can be modeled by a Poisson model. The Poisson’s parameter, the mean of the distribution, is usually denoted by \(\lambda\).
Poisson Probability Model for Occurrences: Poisson(\(\lambda\))
\(\lambda\) = mean number of occurrences
X = number of occurrences
\(P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}\)
Expected value: \(E(X) = \lambda\)
Standard deviation: \(\mathrm{SD}(X) = \sqrt{\lambda}\)
W. S. Gosset, the quality control chemist at the Guinness brewery in the early 20th century who developed the methods of Chapters 11 and 12, was one of the first to use the Poisson in industry. He used it to model and predict the number of yeast cells so he’d know how much to add to the stock. The Poisson is a good model to consider whenever your data consist of counts of occurrences. It requires only that the events be independent and that the mean number of occurrences stays constant.
For example, data show an average of about 4 hits per minute to a small business website during the afternoon hours from 1:00 to 5:00 p.m. We can use the Poisson model to find the probability of any number of hits arriving. For example, if we let X be the number of hits arriving in the next minute, then \(P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} = \frac{e^{-4}4^x}{x!}\), using the given average rate of 4 per minute. So, the probability of no hits during the next minute would be \(P(X = 0) = \frac{e^{-4}4^0}{0!} = e^{-4} = 0.0183\). (The constant e is the base of the natural logarithms and is approximately 2.71828.)
One interesting and useful feature of the Poisson model is that it scales according to the interval size. For example, suppose we want to know the probability of no hits to our website in the next 30 seconds. Since the mean rate is 4 hits per minute, it’s 2 hits per 30 seconds, so we can use the model with \(\lambda = 2\) instead. If we let Y be the number of hits arriving in the next 30 seconds, then:
\[ P(Y = 0) = \frac{e^{-2}2^0}{0!} = e^{-2} = 0.1353. \]
(Recall that 0! = 1.) The Poisson model has been used to model phenomena such as customer arrivals, hot streaks in sports, and disease clusters. Whenever or wherever rare events happen closely together, people want to know whether the occurrence happened by chance or whether an underlying change caused the unusual occurrence. The Poisson model can be used to find the probability of the occurrence and can be the basis for making the judgment.
e and Compound Interest
The constant e equals 2.7182818… (to 7 decimal places). One of the places e originally turned up was in calculating how much money you’d earn if you could get interest compounded more often. If you earn 100% per year simple interest, at the end of the year, you’d have twice as much money as when you started. But if the interest were compounded and paid at the end of every month, each month you’d earn 1/12 of 100% interest. At the year’s end you’d have \((1 + 1/12)^{12} = 2.613\) times as much instead of 2. If the interest were paid every day, you’d get \((1 + 1/365)^{365} = 2.715\) times as much. If the interest were paid every second, you’d get \((1 + 1/3153600)^{3153600} = 2.7182812\) times as much. This is where e shows up. If you could get the interest compounded continually, you’d get e times as much. In other words, as n gets large, the limit of \((1 + 1/n)^n = e\). This unexpected result was discovered by Jacob Bernoulli in 1683.
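Returning to the website-hits example, the Poisson probabilities and the way the model rescales with the interval length can be sketched as follows. The function name `poisson_pmf` and the variable names are our own; the rate of 4 hits per minute is the one given in the text.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

rate_per_minute = 4  # average hits per minute, from the text

# Probability of no hits in the next minute.
p_none_minute = poisson_pmf(0, rate_per_minute)

# Rescale the mean to a 30-second interval: half the time, half the mean.
lam_30_seconds = rate_per_minute * 30 / 60
p_none_30_seconds = poisson_pmf(0, lam_30_seconds)

print(f"P(no hits in 1 minute)   = {p_none_minute:.4f}")
print(f"P(no hits in 30 seconds) = {p_none_30_seconds:.4f}")

# A few more probabilities for the one-minute model.
for hits in range(1, 6):
    print(f"P(X = {hits}) = {poisson_pmf(hits, rate_per_minute):.4f}")
```

Only the mean changes when the interval changes; the same formula is reused with the rescaled value of the mean.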
Just Checking
Roper Worldwide reports that they are able to contact 76% of the randomly selected households drawn for a telephone survey.
2 Explain why these phone calls can be considered Bernoulli trials.
3 Which of the models of this chapter (Geometric, Binomial, or Poisson) would you use to model the number of successful contacts from a list of 1000 sampled households?
4 Roper also reports that even after they contacted a household, only 38% of the contacts agreed to be interviewed. So the probability of getting a completed interview from a randomly selected household is only 0.29 (38% of 76%). Which of the models of this chapter would you use to model the number of households Roper has to call before they get the first completed interview?
For Example
Probability models
A venture capital firm has a list of potential investors who have previously invested in new technologies. On average, these investors invest about 5% of the time. A new client of the firm is interested in finding investors for a mobile phone application that enables financial transactions, an application that is finding increasing acceptance in much of the developing world. An analyst at the firm starts calling potential investors.
Questions
1. What is the probability that the first person she calls will want to invest?
2. What is the probability that none of the first five people she calls will be interested?
3. How many people will she have to call until the probability of finding someone interested is at least 0.50?
4. How many investors will she have to call, on average, to find someone interested?
5. If she calls 10 investors, what is the probability that exactly 2 of them will be interested?
6. What assumptions are you making to answer these questions?
Answers
1. Each investor has a 5% or 1/20 chance of wanting to invest, so the chance that the first person she calls is interested is 1/20.
2. \( P(\text{first one not interested}) = 1 - 1/20 = 19/20 \). Assuming the trials are independent, \( P(\text{none are interested}) = P(\text{1st not interested}) \times P(\text{2nd not interested}) \times \cdots \times P(\text{5th not interested}) = (19/20)^5 = 0.774 \).
3. By trial and error, \( (19/20)^{13} = 0.513 \) and \( (19/20)^{14} = 0.488 \), so she would need to call 14 people to have the probability of no one interested drop below 0.50, therefore making the probability that someone is interested greater than 0.50.
4. This uses a Geometric model. Let X = number of people she calls until the first interested person. \( E(X) = 1/p = 1/(1/20) = 20 \) people.
5. Using the Binomial model, let Y = number of people interested in 10 calls, then \( P(Y = 2) = \binom{10}{2} p^2 (1 - p)^8 = \frac{10 \times 9}{2} (1/20)^2 (19/20)^8 = 0.0746 \).
6. We are assuming that the trials are independent and that the probability of being interested in investing is the same for all potential investors.
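The arithmetic in these answers can be checked with a short script. The following is a minimal sketch in Python using only the standard library; the variable names are ours, not the book's, and it relies on the same assumptions as answer 6 (independent calls, constant p = 1/20).

```python
from math import comb

p = 1 / 20  # chance that any one investor is interested (from the example)

# Answer 2: probability that none of the first five people is interested
p_none_of_5 = (1 - p) ** 5

# Answer 3: smallest number of calls n with P(at least one interested) >= 0.50
n = 1
while 1 - (1 - p) ** n < 0.50:
    n += 1

# Answer 4: Geometric model, expected number of calls until the first success
expected_calls = 1 / p

# Answer 5: Binomial model, exactly 2 interested out of 10 calls
p_two_of_10 = comb(10, 2) * p**2 * (1 - p) ** 8

print(round(p_none_of_5, 3))   # 0.774
print(n)                       # 14
print(round(p_two_of_10, 4))   # 0.0746
```

The loop for answer 3 simply automates the "trial and error" described in the text: it stops at the first n for which the chance of finding at least one interested investor reaches 0.50.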
What Can Go Wrong?
• Probability models are still just models. Models can be useful, but they are not reality. Think about the assumptions behind your models. Question probabilities as you would data.
• If the model is wrong, so is everything else. Before you try to find the mean or standard deviation of a random variable, check to make sure the probability model is reasonable. As a start, the probabilities should all be between 0 and 1 and they should add up to 1. If not, you may have calculated a probability incorrectly or left out a value of the random variable.
• Watch out for variables that aren’t independent. You can add expected values of any two random variables, but you can only add variances of independent random variables. Suppose a survey includes questions about the number of hours of sleep people get each night and also the number of hours they are awake each day. From their answers, we find the mean and standard deviation of hours asleep and hours awake. The expected total must be 24 hours; after all, people are either asleep or awake. The means still add just fine. Since all the totals are exactly 24 hours, however, the standard deviation of the total will be 0. We can’t add variances here because the number of hours you’re awake depends on the number of hours you’re asleep. Be sure to check for independence before adding variances.
• Don’t write independent instances of a random variable with notation that looks like they are the same variables. Make sure you write each instance as a different random variable. Just because each random variable describes a similar situation doesn’t mean that each random outcome will be the same. These are random variables, not the variables you saw in Algebra. Write X1 + X2 + X3 rather than X + X + X.
• Don’t forget: Variances of independent random variables add. Standard deviations don’t.
• Don’t forget: Variances of independent random variables add, even when you’re looking at the difference between them.
• Be sure you have Bernoulli trials. Be sure to check the requirements first: two possible outcomes per trial (“success” and “failure”), a constant probability of success, and independence. Remember that the 10% Condition provides a reasonable substitute for independence.
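The sleep example above can be made concrete with a small calculation. This is a sketch only: the distribution of hours asleep is hypothetical (the text gives no numbers), and the helper names `mean` and `variance` are ours.

```python
# Hypothetical model for hours asleep; values and probabilities are made up.
sleep_model = {6: 0.2, 7: 0.5, 8: 0.3}

def mean(model):
    """Expected value of a discrete model given as {value: probability}."""
    return sum(x * p for x, p in model.items())

def variance(model):
    """Expected squared deviation from the mean."""
    mu = mean(model)
    return sum((x - mu) ** 2 * p for x, p in model.items())

# Hours awake are completely determined by hours asleep: awake = 24 - asleep.
awake_model = {24 - x: p for x, p in sleep_model.items()}

# The total is always exactly 24 hours, so its model has a single value.
total_model = {24: 1.0}

mean_total = mean(sleep_model) + mean(awake_model)          # means still add
naive_var = variance(sleep_model) + variance(awake_model)   # invalid: not independent
true_var = variance(total_model)                            # the total never varies

print(mean_total, naive_var, true_var)
```

Adding the two variances gives a positive number, while the variance of the total is 0. The gap is the covariance term that the independence assumption silently drops.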
Ethics in Action
Kurt Williams was about to open a new SEP IRA account and was interested in exploring various investment options. Although he had some ideas about how to invest his money, Kurt thought it best to seek the advice of a professional, so he made an appointment with Keith Klingman, a financial advisor at James, Morgan, and Edwards, LLC. Prior to their first meeting, Kurt told Keith that he preferred to keep his investments simple and wished to allocate his money to only two funds. Also, he mentioned that while he was willing to take on some risk to yield higher returns, he was concerned about taking on too much risk given the recent volatility in the markets.
After their conversation, Keith began to prepare for their first meeting. Because Kurt was interested in investing his SEP IRA money in only two funds, Keith decided to compile figures on the expected annual return and standard deviation (a measure of risk) for a potential SEP IRA account consisting of different combinations of two funds. If X and Y represent the annual returns for two different funds, Keith knew he could represent the annual return for any combination of funds as \( aX + (1 - a)Y \), where a is the fraction of funds Kurt will allocate to X. Keith calculated the expected annual return using the formula \( E(aX + (1 - a)Y) = aE(X) + (1 - a)E(Y) \). Keith knew that this formula would be true for all funds X and Y even if their performances were correlated. To find the variance of the combined investment he calculated \( Var(aX + (1 - a)Y) = a^2 Var(X) + (1 - a)^2 Var(Y) \). Keith knew that the variance calculation assumed that the two funds were independent, but he figured that the formula was close enough even if the funds’ performances were correlated, and he wanted to keep the presentation to Kurt simple.
Keith presented a variety of combinations of funds and allocations to Kurt. Because some equity funds delivered the best expected return, Keith advised Kurt to put all his money in two equity funds (funds that also generated higher brokerage fees) rather than allocating any money to a simple fixed income fund. Kurt was surprised to see that even under various market conditions, all the equity fund combinations seemed fairly safe in terms of volatility as evidenced by the fairly low standard deviations of the combined funds, and Keith assured him that these scenarios were realistic.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
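To see how much Keith's shortcut can matter, the sketch below compares the standard deviation of a two-fund mix with and without the covariance term. It uses the general rule Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y) from this chapter, applied to aX and (1 − a)Y. All numbers (the two standard deviations, the correlation of 0.8, the 50/50 split) are hypothetical, and `portfolio_sd` is an illustrative helper name, not something from the text.

```python
from math import sqrt

def portfolio_sd(a, sd_x, sd_y, corr=0.0):
    """SD of aX + (1 - a)Y when X and Y have the given SDs and correlation.

    Uses Cov(X, Y) = corr * SD(X) * SD(Y); corr = 0 reproduces the
    independent-funds formula used in the scenario.
    """
    cov = corr * sd_x * sd_y
    var = a**2 * sd_x**2 + (1 - a) ** 2 * sd_y**2 + 2 * a * (1 - a) * cov
    return sqrt(var)

# Hypothetical equity funds: annual-return SDs of 20% and 25%, split 50/50.
sd_if_independent = portfolio_sd(0.5, 0.20, 0.25)            # about 0.160
sd_if_correlated = portfolio_sd(0.5, 0.20, 0.25, corr=0.8)   # about 0.214

print(round(sd_if_independent, 3), round(sd_if_correlated, 3))
```

Under these made-up inputs, treating positively correlated funds as independent understates the risk by roughly a quarter, which is one way the combinations could look "fairly safe" on paper.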
What Have We Learned?
Learning Objectives
Understand how probability models relate values to probabilities.
• For discrete random variables, probability models assign a probability to each possible outcome.
Know how to find the mean, or expected value, of a discrete probability model from \( \mu = \sum x P(X = x) \) and the standard deviation from \( \sigma = \sqrt{\sum (x - \mu)^2 P(x)} \).
Foresee the consequences of shifting and scaling random variables, specifically
\( E(X \pm c) = E(X) \pm c \)          \( E(aX) = aE(X) \)
\( Var(X \pm c) = Var(X) \)            \( Var(aX) = a^2 Var(X) \)
\( SD(X \pm c) = SD(X) \)              \( SD(aX) = |a| SD(X) \)
Understand that when adding or subtracting random variables the expected values add or subtract as well: \( E(X \pm Y) = E(X) \pm E(Y) \). However, when adding or subtracting independent random variables, the variances add: \( Var(X \pm Y) = Var(X) + Var(Y) \).
Be able to explain the properties and parameters of the Uniform, the Binomial, the Geometric, and the Poisson distributions.
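The mean, standard deviation, and shifting and scaling rules listed above can be verified numerically. The sketch below assumes a discrete model stored as a Python dictionary of {value: probability}; the model itself and the constants a and c are made up for illustration.

```python
from math import sqrt

def expected_value(model):
    """mu = sum of x * P(X = x) over all values x."""
    return sum(x * p for x, p in model.items())

def std_dev(model):
    """sigma = square root of the sum of (x - mu)^2 * P(x)."""
    mu = expected_value(model)
    return sqrt(sum((x - mu) ** 2 * p for x, p in model.items()))

# Hypothetical probability model (not from the text).
model = {0: 0.25, 1: 0.5, 2: 0.25}
assert abs(sum(model.values()) - 1) < 1e-12  # probabilities must add to 1

# Scale by a and shift by c; a is negative to show why SD uses |a|.
a, c = -3, 10
transformed = {a * x + c: p for x, p in model.items()}

print(expected_value(transformed), a * expected_value(model) + c)  # both 7.0
print(std_dev(transformed), abs(a) * std_dev(model))               # equal up to rounding
```

Shifting by c moves the mean but leaves the spread alone, while scaling by a multiplies the standard deviation by |a|, whatever the sign of a.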
Terms

Addition Rule for Expected Values of Random Variables
\( E(X \pm Y) = E(X) \pm E(Y) \)

Addition Rule for Variances of (Independent) Random Variables
(Pythagorean Theorem of Statistics) If X and Y are independent: \( Var(X \pm Y) = Var(X) + Var(Y) \), and \( SD(X \pm Y) = \sqrt{Var(X) + Var(Y)} \).

Bernoulli trials
A sequence of n trials are called Bernoulli trials if:
1. There are exactly two possible outcomes (usually denoted success and failure).
2. The probability of success is constant.
3. The trials are independent.

Binomial probability model
A Binomial model is appropriate for a random variable that counts the number of successes in a series of Bernoulli trials.

Changing a random variable by a constant
\( E(X \pm c) = E(X) \pm c \)          \( E(aX) = aE(X) \)
\( Var(X \pm c) = Var(X) \)            \( Var(aX) = a^2 Var(X) \)
\( SD(X \pm c) = SD(X) \)              \( SD(aX) = |a| SD(X) \)

Continuous random variable
A random variable that can take on any value (possibly bounded on one or both sides).

Covariance
The covariance of random variables X and Y is \( Cov(X, Y) = E((X - \mu)(Y - \nu)) \) where \( \mu = E(X) \) and \( \nu = E(Y) \). In general (no need to assume independence) \( Var(X \pm Y) = Var(X) + Var(Y) \pm 2Cov(X, Y) \).

Discrete random variable
A random variable that can take one of a finite number⁵ of distinct outcomes.

Expected value
The expected value of a random variable is its theoretical long-run average value, the center of its model. Denoted \( \mu \) or \( E(X) \), it is found (if the random variable is discrete) by summing the products of variable values and probabilities: \( \mu = E(X) = \sum x P(x) \).

Geometric probability model
A model appropriate for a random variable that counts the number of Bernoulli trials until the first success.

Parameter
A numerically valued attribute of a model, such as the values of \( \mu \) and \( \sigma \) representing the mean and standard deviation.

Poisson model
A discrete model often used to model the number of arrivals of events such as customers arriving in a queue or calls arriving into a call center.

Probability model
A function that associates a probability P with each value of a discrete random variable X, denoted \( P(X = x) \), or with any interval of values of a continuous random variable.

Random variable
Assumes any of several different values as a result of some random event. Random variables are denoted by a capital letter, such as X.

Standard deviation of a random variable
Describes the spread in the model and is the square root of the variance.

Uniform model, Uniform distribution
For a discrete uniform distribution over a set of n values, each value has probability 1/n.

Variance of a random variable
The variance of a random variable is the expected value of the squared deviations from the mean. For discrete random variables, it can be calculated as: \( \sigma^2 = Var(X) = \sum (x - \mu)^2 P(x) \).

⁵ Technically, there could be an infinite number of outcomes as long as they’re countable. Essentially, that means we can imagine listing them all in order, like the counting numbers 1, 2, 3, 4, 5, . . .
Technology Help: Random Variables and Probability Models
Most statistics packages (and graphics calculators) offer functions that compute probabilities for various probability models. The important differences among these functions are in what they are named and the order of their arguments. In these functions, “pdf” stands for “probability density function,” which is what we’ve been calling a probability model. The letters “cdf” stand for “cumulative distribution function,” the technical term when we want to accumulate probabilities over a range of values. These technical terms show up in many of the function names. Many packages allow the computation of a probability given a value based on a given distribution and also the calculation of a value based on the probability.
For example, in Excel, Binomdist(x, n, prob, cumulative) computes Binomial probabilities. If cumulative is set to false, the calculation is only for one value of x.

Excel
The following commands can be used to calculate Binomial and Poisson distribution probabilities. In Excel, the value for “cumulative” will give either a cdf (cumulative = TRUE) or a pdf (cumulative = FALSE). The commands can either be typed directly into a cell or into the function bar at the top of the screen.

JMP
• Create a new data table: File > New > New Data Table.
• Right click on the header Column 1 and select Formula.
• Select Discrete Probability.
• Select any of:
  • Binomial Probability (prob, n, k) for the pdf
  • Binomial Distribution (prob, n, k) for the cdf
  • Poisson Probability (λ) for the pdf
  • Poisson Distribution (λ) for the cdf
• Click OK twice. The probability will be displayed in the first row of Column 1.

Minitab
• Choose Probability Distributions from the Calc menu.
• Choose Binomial from the Probability Distributions submenu.
• To calculate the probability of getting x successes in n trials, choose Probability.
• To calculate the probability of getting x or fewer successes among n trials, choose Cumulative Probability.
• For Poisson, choose Poisson from the Probability Distributions submenu.

SPSS
• In Data View, type values of the parameters for the desired distribution in the first row. For example, for a binomial PDF, type the number of successes in the first column, first row; trials in second column, first row; and probability of success in the third column, first row.
• Choose Transform: Compute Variable.
• Type a name for the variable that will contain the result.
• In the box under “Numeric Expression”, type the desired probability to be calculated, using the labels for the variables where the parameter values are stored:
  • PDF.BINOM(x, n, prob)
  • CDF.BINOM(x, n, prob)
  • PDF.Poisson(x, mean)
  • CDF.Poisson(x, mean)
• Click OK and the probability will be calculated and stored in the next column. Adjust the column width to show more decimals; the value will be rounded to 2 decimal places by default.
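For readers without one of these packages, the same pdf and cdf values can be computed directly from the model formulas. The following Python sketch is not part of the book's technology list; it uses only the standard library, and the function names (`binom_pmf`, `binom_cdf`, `poisson_pmf`, `poisson_cdf`) are our own, chosen to mirror the pdf/cdf distinction described above.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(exactly k successes in n Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(k or fewer successes in n trials): accumulate the pmf from 0 to k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson model with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(k, lam):
    """P(k or fewer events): accumulate the pmf from 0 to k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

# Check against the chapter's investor example: exactly 2 interested in 10 calls.
print(round(binom_pmf(2, 10, 0.05), 4))  # 0.0746
```

As with the packages above, the thing to watch is argument order: these helpers take the count first, whereas the JMP functions listed take the probability first.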
Brief Case
Investment Options
A young entrepreneur has just raised $30,000 from investors, and she would like to invest it while she continues her fund-raising in hopes of starting her company one year from now. She wants to do due diligence and understand the risk of each of her investment options. After speaking with her colleagues in finance, she believes that she has three choices: (1) she can purchase a $30,000 certificate of deposit (CD); (2) she can invest in a mutual fund with a balanced portfolio; or (3) she can invest in a growth stock that has a greater potential payback but also has greater volatility. Each of her options will yield a different payback on her $30,000, depending on the state of the economy.
During the next year, she knows that the CD yields a constant annual percentage rate, regardless of the state of the economy. If she invests in a balanced mutual fund, she estimates that she will earn as much as 12% if the economy remains strong, but could possibly lose as much as 4% if the economy takes a downturn. Finally, if she invests all $30,000 in a growth stock, experienced investors tell her that she can earn as much as 40% in a strong economy, but may lose as much as 40% in a poor economy.
Estimating these returns, along with the likelihood of a strong economy, is challenging. Therefore, a “sensitivity analysis” is often conducted, where figures are computed using a range of values for each of the uncertain parameters in the problem. Following this advice, this investor decides to compute measures for a range of interest rates for CDs, a range of returns for the mutual fund, and a range of returns for the growth stock. In addition, the likelihood of a strong economy is unknown, so she will vary these probabilities as well. Assume that the probability of a strong economy over the next year is 0.3, 0.5, or 0.7.
To help this investor make an informed decision, evaluate the expected value and volatility of each of her investments using the following ranges of rates of growth:
CD: Look up the current annual rate for the return on a 3-year CD and use this value ±0.5%.
Mutual Fund: Use values of 8%, 10%, and 12% for a strong economy and values of 0%, -2%, and -4% for a weak economy.
Growth Stock: Use values of 10%, 25%, and 40% in a strong economy and values of -10%, -25%, and -40% in a weak economy.
Discuss the expected returns and uncertainty of each of the alternative investment options for this investor in each of the scenarios you analyzed. Be sure to compare the volatility of each of her options.
Exercises

Section 6.1
1. A company’s employee database includes data on whether or not the employee includes a dependent child in his or her health insurance.
a) Is this variable discrete or continuous?
b) What are the possible values it can take on?
2. The database also, of course, includes each employee’s compensation.
a) Is this variable discrete or continuous?
b) What are the possible values it can take on?
3. Suppose that the probabilities of a customer purchasing 0, 1, or 2 books at a book store are 0.5, 0.3, and 0.2, respectively. What is the expected number of books a customer will purchase?
4. A day trader buys an option on a stock that will return $100 profit if the stock goes up today and lose $400 if it goes down. If the trader thinks there is a 75% chance that the stock will go up,
a) What is her expected value of the option’s profit?
b) What do you think of this option?

Section 6.2
5. Find the standard deviation of the book purchases in Exercise 3.
6. Find the standard deviation of the day trader’s option value in Exercise 4.
7. An orthodontist has three financing packages, and each has a different service charge. He estimates that 30% of patients use the first plan, which has a $10 finance charge; 50% use the second plan, which has a $20 finance charge; and 20% use the third plan, which has a $30 finance charge.
a) Find the expected value of the service charge.
b) Find the standard deviation of the service charge.
8. A motor home sales department has created three plans for purchasing a new or used motor home for leisure to increase potential sales of its fleet. They estimate that 20% will choose plan 1, which includes no down payment with a 10-year finance option; 40% will choose plan 2, which includes a 20% down payment with a 7-year finance option; and 40% will choose plan 3, which includes a 40% down payment and a 5-year finance option.
a) Find the expected value of the type of down payment potential customers will need.
b) Find the standard deviation of the type of down payment potential customers will need.

Section 6.3
9. Given independent random variables, X and Y, with means and standard deviations as shown, find the mean and standard deviation of each of the variables in parts a to d.
     Mean   SD
X    10     2
Y    20     5
a) 3X
b) Y + 6
c) X + Y
d) X - Y
10. Given independent random variables, X and Y, with means and standard deviations as shown, find the mean and standard deviation of each of the variables in parts a to d.
     Mean   SD
X    80     12
Y    12     3
a) X - 20
b) 0.5Y
c) X + Y
d) X - Y
11. A broker has calculated the expected values of two different financial instruments X and Y. Suppose that E(X) = $100, E(Y) = $90, SD(X) = $12, and SD(Y) = $8. Find each of the following.
a) E(X + 10) and SD(X + 10)
b) E(5Y) and SD(5Y)
c) E(X + Y) and SD(X + Y)
d) What assumption must you make in part c?
12. A shipping company delivers certain types of computer sets to local and international clients. From previous data, it is noted that 2% of its deliveries will break during shipment. If the company ships two computer sets separately to undisclosed addresses, find the probability that both computers will not arrive safely. Comment on any assumptions you have made.

Section 6.4
13. Which of these situations fit the conditions for using Bernoulli trials? Explain.
a) You are rolling 5 dice and need to get at least two 6s to win the game.
b) We record the distribution of home states of customers visiting our website.
c) A committee consisting of 11 men and 8 women selects a delegation of 4 to attend a professional meeting at random. What is the probability they choose all women?
d) A study (softwaresecure.typepad.com/multiple_choice/2007/05/cheat_cheat_nev.html) found that 56% of M.B.A. students admit to cheating. A business school dean surveys all the students in the graduating class and gets responses that admit to cheating from 250 of 481 students.
14. At a border entry point between two different states, a computerized system is used to randomly check whether a car should be stopped for inspection. If the chance that any vehicle is stopped for inspection is 15%, can we use a Bernoulli model to model whether your car will be stopped for inspection? Check all of the conditions.

Section 6.5
15. At many airports, a traveler entering the U.S. is sent randomly to one of several stations where his passport and visa are checked. If each of the 6 stations is equally likely, can the probabilities of which station a traveler will be sent to be modeled with a Uniform model?
16. Through the career services office, you have arranged preliminary interviews at four companies for summer jobs. Each company will either ask you to come to their site for a follow-up interview or not. Let X be the random variable equal to the total number of follow-up interviews that you might have.
a) List all the possible values of X.
b) Is the random variable discrete or continuous?
c) Do you think a uniform distribution might be appropriate as a model for this random variable? Explain briefly.
17. The U.S. Census Bureau’s 2007 Survey of Business Owners showed that 28.7% of all non-farm businesses are owned by women. You are phoning local businesses and assume that the national percentage is true in your area. You wonder how many calls you will have to make before you find one owned by a woman. What probability model should you use? (Specify the parameters as well.)
18. As in Exercise 17, you are phoning local businesses. You call three firms. What is the probability that all three are owned by women?
19. A manufacturer of clothing knows that the probability of a button flaw (broken, sewed on incorrectly, or missing) is 0.002. An inspector examines 50 shirts in an hour, each with 6 buttons. Using a Poisson probability model:
a) What is the probability that she finds no button flaws?
b) What is the probability that she finds at least one?
20. Replacing the buttons with snaps increases the probability of a flaw to 0.003, but the inspector can check 70 shirts an hour (still with 6 snaps each). Now what is the probability she finds no snap flaws?

Chapter Exercises
21. New website. You have just launched the website for your company that sells nutritional products online. Suppose X = the number of different pages that a customer hits during a visit to the website.
a) Assuming that there are n different pages in total on your website, what are the possible values that this random variable may take on?
b) Is the random variable discrete or continuous?
22. New website, part 2. For the website described in Exercise 21, let Y = the total time (in minutes) that a customer spends during a visit to the website.
a) What are the possible values of this random variable?
b) Is the random variable discrete or continuous?
23. Repairs. The probability model below describes the number of repair calls that an appliance repair shop may receive during an hour.
Repair Calls   0     1     2     3
Probability    0.1   0.3   0.4   0.2
a) How many calls should the shop expect per hour?
b) What is the standard deviation?
24. Software company. A small software company will bid on a major contract. It anticipates a profit of $50,000 if it gets it, but thinks there is only a 30% chance of that happening.
a) What’s the expected profit?
b) Find the standard deviation for the profit.
25. Commuting to work. A commuter must pass through five traffic lights on her way to work and will have to stop at each one that is red. After keeping a record for several months, she developed the following probability model for the number of red lights she hits:
X = # of Red   0      1      2      3      4      5
P(X = x)       0.05   0.25   0.35   0.15   0.15   0.05
a) How many red lights should she expect to hit each day?
b) What’s the standard deviation?
26. Defects. A consumer organization inspecting new cars found that many had appearance defects (dents, scratches, paint chips, etc.). While none had more than three of these defects, 7% had three, 11% had two, and 21% had one defect.
a) Find the expected number of appearance defects in a new car.
b) What is the standard deviation?
27. Cricket tournament. A cricket goods shop was asked to sponsor the local team in two tournaments. They claim the probability that the team will win the first tournament is 0.4. If the team wins the first tournament, they estimate the probability of also winning the second is 0.6. They guess that if the team loses the first tournament, the probability that it will win the second is 0.3.
a) According to their estimates, are the two tournaments independent? Explain your answer.
b) What’s the probability that they lose both tournaments?
c) What’s the probability they win both tournaments?
d) Let random variable X be the number of tournaments they win. Find the probability model for X.
e) What are the expected value and standard deviation of X?
28. Contracts. Your company bids for two contracts. You believe the probability that you get contract #1 is 0.8. If you get contract #1, the probability that you also get contract #2 will be 0.2, and if you do not get contract #1, the probability that you get contract #2 will be 0.3.
a) Are the outcomes of the two contract bids independent? Explain.
b) Find the probability you get both contracts.
c) Find the probability you get neither contract.
d) Let X be the number of contracts you get. Find the probability model for X.
e) Find the expected value and standard deviation of X.
29. Battery recall. A company has discovered that a recent batch of batteries had manufacturing flaws, and has issued a recall. You have 10 batteries covered by the recall, and 3 are dead. You choose 2 batteries at random from your package of 10.
a) Has the assumption of independence been met? Explain.
b) Create a probability model for the number of good batteries chosen.
c) What’s the expected number of good batteries?
d) What’s the standard deviation?
30. Dormitory accommodation. The housing manager of dormitory accommodation for exchange students believes that the mean number of dormitories in need of some form of repair after each semester is 0.6 per block of 20 dormitories on the campus, with a standard deviation of 0.5. He is responsible for 3 such blocks of 20 dormitories each.
a) How many repairs after the semester does he expect to get?
b) What’s the standard deviation?
c) Is it necessary to assume the blocks are independent? Why?
31. Commuting, part 2. A commuter finds that she waits an average of 14.8 seconds at each of five stoplights, with a standard deviation of 9.2 seconds. Find the mean and the standard deviation of the total amount of time she waits at all five lights. What, if anything, did you assume?
32. Defective pixels. For warranty purposes, analysts want to model the number of defects on a screen of the new tablet they are manufacturing. Let X = the number of defective pixels per screen. If X can be modeled by:
X = # of Defective Pixels   1      2      3       4 or more
P(X = x)                    0.95   0.04   0.008   0.002
a) What is the expected number of defective pixels per screen?
b) What is the standard deviation of the number of defective pixels per screen?
c) What is the expected number of defective pixels in the next 100 screens?
d) What is the standard deviation of the number of defective pixels in the next 100 screens?
33. Repair calls. Suppose that the appliance shop in Exercise 23 plans an 8-hour day.
a) Find the mean and standard deviation of the number of repair calls they should expect in a day.
b) What assumption did you make about the repair calls?
c) Use the mean and standard deviation to describe what a typical 8-hour day will be like.
d) At the end of a day, a worker comments “Boy, I’m tired. Today was sure unusually busy!” How many repair calls would justify such an observation?
34. Casino. At a casino, people play the slot machines in hopes of hitting the jackpot, but most of the time, they lose their money. A certain machine pays out an average of $0.92 (for every dollar played), with a standard deviation of $120.
a) Why is the standard deviation so large?
b) If a gambler plays 5 times, what are the mean and standard deviation of the casino’s profit?
c) If gamblers play this machine 1000 times in a day, what are the mean and standard deviation of the casino’s profit?
35. Skate board sale. A sports shop plans to offer 2 specially priced skate board models at a sidewalk sale. The basic model will return a profit of $120 and the sports model $150. Past experience indicates that sales of the basic model will have a mean of 5.4 skate boards with a standard deviation of 1.2, and sales of the sports model will have a mean of 3.2 skate boards with a standard deviation of 0.8. The cost of setting up for the sidewalk sale is $200.
a) Define random variables and use them to express the shop’s net profit.
b) What’s the mean of the net profit?
c) What’s the standard deviation of the net profit?
d) Do you need to make any assumptions in calculating the mean? How about the standard deviation?
36. Farmers’ market. A farmer has 100 lb of apples and 50 lb of potatoes for sale. The market price for apples (per pound) each day is a random variable with a mean of 0.5 dollars and a standard deviation of 0.2 dollars. Similarly, for a pound of potatoes, the mean price is 0.3 dollars and the standard deviation is 0.1 dollars. It also costs him 2 dollars to bring all the apples and potatoes to the market. The market is busy with eager shoppers, so we can assume that he’ll be able to sell all of each type of produce at that day’s price.
a) Define your random variables, and use them to express the farmer’s net income.
b) Find the mean of the net income.
c) Find the standard deviation of the net income.
d) Do you need to make any assumptions in calculating the mean? How about the standard deviation?
37. Cancelled flights. Mary is deciding whether to book the cheaper flight home from college after her final exams, but she’s unsure when her last exam will be. She thinks there is only a 20% chance that the exam will be scheduled after the last day she can get a seat on the cheaper flight. If it is and she has to cancel the flight, she will lose $150. If she can take the cheaper flight, she will save $100.
a) If she books the cheaper flight, what can she expect to gain, on average?
b) What is the standard deviation?
38. Day trading. An option to buy a stock is priced at $200. If the stock closes above 30 on May 15, the option will be worth $1000. If it closes below 20, the option will be worth nothing, and if it closes between 20 and 30 (inclusively), the option will be worth $200. A trader thinks there is a 50% chance that the stock will close in the 20–30 range, a 20% chance that it will close above 30, and a 30% chance that it will fall below 20 on May 15.
a) How much does she expect to gain?
b) What is the standard deviation of her gain?
c) Should she buy the stock option? Discuss the pros and cons in terms of your answers to (a) and (b).
39. eBay. An Australian collector purchased a quantity of cricket cards and is going to sell them on eBay. He has 19 Ashes Cricket Card sets. In recent auctions, the mean selling price of such cards has been $12.11, with a standard deviation of $1.38. He also has 13 Australian Champion Cricketers signed index cards which have had a mean selling price of $10.19, with a standard deviation of $0.77. His insertion fee will be $0.55 on each item, and the closing fee will be 8.75% of the selling price. He assumes all will sell without having to be relisted. a) Define your random variables, and use them to create a random variable for the collector’s net revenue. b) Find the mean (expected value) of the net revenue. c) Find the standard deviation of the net revenue. d) Do you have to assume independence for the sales on eBay? Explain. 40. Real estate. A real-estate broker in the university town of Maastricht, in the Netherlands, purchased 3 two-bedroom houses in a depressed market for a combined cost of $500,000. He expects the cleaning and repair costs on each house to average $50,000 with a standard deviation of $5,000. When he sells them, after subtracting taxes and other closing costs, he expects to realize an average of $225,000 per house, with a standard deviation of $6,000. a) Define your random variables, and use them to create a random variable for the broker’s net profit. b) Find the mean (expected value) of the net profit. c) Find the standard deviation of the net profit. d) Do you have to assume independence for the repairs and sale prices of the houses? Explain. 41. Bernoulli. Can we use probability models based on Bernoulli trials to investigate the following situations? Explain. a) Each week a doctor rolls a single die to determine which of his six office staff members gets the preferred parking space. b) A medical research lab has samples of blood collected from 120 different individuals. 
How likely is it that the majority of them are Type A blood, given that Type A is found in 43% of the population? c) From a workforce of 13 men and 23 women, all five promotions go to men. How likely is that, if promotions are based on qualifications rather than gender? d) We poll 500 of the 3000 stockholders to see how likely it is that the proposed budget will pass. e) A company realizes that about 10% of its packages are not being sealed properly. In a case of 24 packages, how likely is it that more than 3 are unsealed? 42. Bernoulli, part 2. Can we use probability models based on Bernoulli trials to investigate the following situations? Explain. a) You survey 500 potential customers to determine their color preference. b) A manufacturer recalls a doll because about 3% have buttons that are not properly attached. Customers return
37 of these dolls to the local toy store. How likely are they to find any buttons not properly attached? c) A city council of 11 Republicans and 8 Democrats picks a committee of 4 at random. How likely are they to choose all Democrats? d) An executive reads that 74% of employees in his industry are dissatisfied with their jobs. How many dissatisfied employees can he expect to find among the 481 employees in his company? 43. Closing sales. A salesman normally makes a sale (closes) on 80% of his presentations. Assuming the presentations are independent, find the probability of each of the following. a) He fails to close for the first time on his fifth attempt. b) He closes his first presentation on his fourth attempt. c) The first presentation he closes will be on his second attempt. d) The first presentation he closes will be on one of his first three attempts. 44. Computer chip manufacturer. Suppose a computer chip manufacturer rejects 2% of the chips produced because they fail presale testing. Assuming the bad chips are independent, find the probability of each of the following. a) The fifth chip they test is the first bad one they find. b) They find a bad one within the first 10 they examine. c) The first bad chip they find will be the fourth one they test. d) The first bad chip they find will be one of the first three they test. 45. Side effects. Researchers testing a new medication find that 7% of users have side effects. To how many patients would a doctor expect to prescribe the medication before finding the first one who has side effects? 46. Credit cards. College students are a major target for advertisements for credit cards. At a university, 65% of students surveyed said they had opened a new credit card account within the past year. If that percentage is accurate, how many students would you expect to survey before finding one who had not opened a new account in the past year? 47. Missing pixels. 
A company that manufactures large LCD screens knows that not all pixels on their screen light, even if they spend great care when making them. In a sheet 6 ft by 10 ft (72 in. by 120 in.) that will be cut into smaller screens, they find an average of 4.7 blank pixels. They believe that the occurrences of blank pixels are independent. Their warranty policy states that they will replace any screen sold that shows more than 2 blank pixels. a) What is the mean number of blank pixels per square foot? b) What is the standard deviation of blank pixels per square foot? c) What is the probability that a 2 ft by 3 ft screen will have at least one defect?
d) What is the probability that a 2 ft by 3 ft screen will be replaced because it has too many defects? 48. Bean bags. Cellophane that is going to be formed into bags for items such as dried beans or bird seed is passed over a light sensor to test if the alignment is correct before it passes through the heating units that seal the edges. Small adjustments can be made by the machine automatically. But if the alignment is too bad, the process is stopped and an operator has to manually adjust it. These misalignment stops occur randomly and independently. On one line, the average number of stops is 52 per 8-hour shift. a) What is the mean number of stops per hour? b) What is the standard deviation of stops per hour? 49. Hurricane insurance. An insurance company needs to assess the risks associated with providing hurricane insurance. During the 22 years from 1990 through 2011, Florida was hit by 27 major hurricanes (level 3 and above). If hurricanes are independent and the mean has not changed, what is the probability of having a year in Florida with each of the following? a) No hits? b) Exactly 1 hit? c) More than 1 hit? 50. Hurricane insurance, part 2. During the 18 years from 1995 through 2012, there were 144 hurricanes in the Atlantic basin. Assume that hurricanes are independent and the mean has not changed. a) What is the mean number of major hurricanes per year? b) What is the standard deviation of the annual frequency of major hurricanes? c) What is the probability of having a year with no major hurricanes? d) What is the probability of going three years in a row without a major hurricane? 51. Lefties. A manufacturer of game controllers is concerned that their controller may be difficult for left-handed users. They set out to find lefties to test. About 13% of the population is left-handed. If they select a sample of five customers at random in their stores, what is the probability of each of these outcomes? a) The first lefty is the fifth person chosen.
b) There are some lefties among the 5 people. c) The first lefty is the second or third person. d) There are exactly 3 lefties in the group. e) There are at least 3 lefties in the group. f) There are no more than 3 lefties in the group. 52. Arrows. An Olympic archer is able to hit the bull’s-eye 80% of the time. Assume each shot is independent of the others. If she shoots 6 arrows, what’s the probability of each of the following results? a) Her first bull’s-eye comes on the third arrow. b) She misses the bull’s-eye at least once.
c) Her first bull’s-eye comes on the fourth or fifth arrow. d) She gets exactly 4 bull’s-eyes. e) She gets at least 4 bull’s-eyes. f) She gets at most 4 bull’s-eyes. 53. Satisfaction survey. An internet provider wants to contact customers in a particular telephone exchange to see how satisfied they are with the improved download speed the company has provided. All numbers are in the 62 exchange, so there are 10,000 possible numbers from 62-0000 to 62-9999. If they select the numbers with equal probability: a) What distribution would they use to model the selection? b) What is the probability the number selected will be an even number? c) What is the probability the number selected will end in 000? 54. Manufacturing quality. In an effort to check the quality of their cell phones, a manufacturing manager decides to take a random sample of 10 cell phones from yesterday’s production run, which produced cell phones with serial numbers ranging (according to when they were produced) from 43005000 to 43005999. If each of the 1000 phones is equally likely to be selected: a) What distribution would they use to model the selection? b) What is the probability that a randomly selected cell phone will be one of the last 100 to be produced? c) What is the probability that the first cell phone selected is either from the last 200 to be produced or from the first 50 to be produced? d) What is the probability that the first two cell phones are both from the last 100 to be produced? 55. Web visitors. A website manager has noticed that during the evening hours, about 3 people per minute check out from their shopping cart and make an online purchase. She believes that each purchase is independent of the others and wants to model the number of purchases per minute. a) What model might you suggest to model the number of purchases per minute? b) What is the probability that in any 1 minute at least one purchase is made?
c) What is the probability that no one makes a purchase in the next 2 minutes? 56. Quality control. The manufacturer in Exercise 54 has noticed that the number of faulty cell phones in a production run of cell phones is usually small and that the quality of one day’s run seems to have no bearing on the next day. a) What model might you use to model the number of faulty cell phones produced in one day? b) If the mean number of faulty cell phones is 2 per day, what is the probability that no faulty cell phones will be produced tomorrow? c) If the mean number of faulty cell phones is 2 per day, what is the probability that 3 or more faulty cell phones were produced in today’s run?
57. Lefties, redux. Consider our group of 5 people from Exercise 51. a) How many lefties do you expect? b) With what standard deviation? c) If we keep picking people until we find a lefty, how long do you expect it will take? 58. More arrows. Consider our archer from Exercise 52. a) How many bull’s-eyes do you expect her to get? b) With what standard deviation? c) If she keeps shooting arrows until she hits the bull’s-eye, how long do you expect it will take? 59. Still more lefties. Suppose we choose 12 people instead of the 5 chosen in Exercise 57. a) Find the mean and standard deviation of the number of right-handers in the group. b) What’s the probability that they’re not all right-handed? c) What’s the probability that there are no more than 10 righties? d) What’s the probability that there are exactly 6 of each? e) What’s the probability that the majority is right-handed?
60. Still more arrows. Suppose the archer from Exercise 58 shoots 10 arrows. a) Find the mean and standard deviation of the number of bull’s-eyes she may get. b) What’s the probability that she never misses? c) What’s the probability that there are no more than 8 bull’s-eyes? d) What’s the probability that there are exactly 8 bull’s-eyes? e) What’s the probability that she hits the bull’s-eye more often than she misses?
Just Checking Answers
1 a) 100 + 100 = 200 seconds
b) $\sqrt{50^2 + 50^2} = 70.7$ seconds
c) The times for the two customers are independent.
2 There are two outcomes (contact, no contact), the probability of contact stays constant at 0.76, and random calls should be independent.
3 Binomial
4 Geometric
7
The Normal and Other Continuous Distributions
The NYSE
The New York Stock Exchange (NYSE) was founded in 1792 by 24 stockbrokers who signed an agreement under a buttonwood tree on Wall Street in New York. The first offices were in a rented room at 40 Wall Street. In the 1830s traders who were not part of the Exchange did business in the street. They were called “curbstone brokers.” It was the curbstone brokers who first made markets in gold and oil stocks and, after the Civil War, in small industrial companies such as the emerging steel, textile, and chemical industries. By 1903 the New York Stock Exchange was established at its current home at 18 Broad Street. The curbstone brokers finally moved indoors in 1921 to a building on Greenwich Street in lower Manhattan. In 1953 the curb market changed its name to the American Stock Exchange. In 1993 the American Stock Exchange pioneered the market for derivatives by introducing the first exchange-traded fund, Standard & Poor’s Depositary Receipts (SPDRs). The NYSE Euronext holding company was created in 2007 as a combination of the NYSE Group, Inc., and Euronext N.V. And in 2008, NYSE Euronext merged with the American Stock Exchange. The combined exchange is the world’s largest and most liquid exchange group.
CHAPTER 7 The Normal and Other Continuous Distributions
7.1 The Standard Deviation as a Ruler
WHO: Months
WHAT: CAPE10 values for the NYSE
WHEN: 1880 through early 2013
WHY: Investment guidance
Investors have always sought ways to help them decide when to buy and when to sell. Such measures have become increasingly sophisticated. But all rely on identifying when the stock market is in an unusual state—either unusually undervalued (buy!) or unusually overvalued (sell!). One such measure is the Cyclically Adjusted Price/Earnings Ratio (CAPE10) developed by Yale professor Robert Shiller. The CAPE10 is based on the standard Price/Earnings (P/E) ratio of stocks, but designed to smooth out short-term fluctuations by “cyclically adjusting” them. The CAPE10 has been as low as 4.78, in 1920, and as high as 44.20, in late 1999. The long-term average CAPE10 (since year 1881) is 16.47. Investors who follow the CAPE10 use the metric to signal times to buy and sell. One mutual fund strategy buys only when the CAPE10 is 33% lower than the long-term average and sells (or “goes into cash”) when the CAPE10 is 50% higher than the long-term average. Between January 1, 1971, and October 23, 2009, this strategy would have outperformed such standard measures as the Wilshire 5000 in both average return and volatility, but it is important to note that the strategy would have been completely in cash from just before the stock market crash of 1987 all the way to March of 2009! Shiller popularized the strategy in his book Irrational Exuberance.
Figure 7.1 shows a time series plot of the CAPE10 values for the New York Stock Exchange from 1880 until the beginning of 2013. Generally, the CAPE10 hovers around 15. But occasionally, it can take a large excursion. One such time was in 1999 and 2000, when the CAPE10 exceeded 40. But was this just a random peak or were these values really extraordinary?
Figure 7.1 CAPE10 values for the NYSE from 1880 to 2013. [Time series plot: CAPE10 (vertical axis) against Date (horizontal axis).]
To answer this question, we can look at the overall distribution of CAPE10 values. Figure 7.2 shows a histogram of the same values. Now we don’t see patterns over time, but we may be able to make a better judgment of whether values are extraordinary. Overall, the main body of the distribution looks unimodal and reasonably symmetric. But then there’s a tail of values that trails off to the high end. How can we assess how extraordinary they are? Investors follow a wide variety of measures that record various aspects of stocks, bonds, and other investments. They are usually particularly interested in identifying times when these measures are extraordinary because those often represent times of increased risk or opportunity. But these are quantitative values, not categories. How can we characterize the behavior of a random variable that can take on any value in a range of values? The distributions of Chapter 6 won’t provide the tools we need, but many of the basic concepts still work. The random variables we need are continuous.
Figure 7.2 The distribution of the CAPE10 values shown in Figure 7.1. [Histogram: # of Months against CAPE10.]
We saw in Chapter 3 that z-scores provide a standard way to compare values. In a sense, we use the standard deviation as a ruler, asking how many standard deviations a value is from the mean. That’s what a z-score reports: the number of standard deviations away from the mean. We can convert the CAPE10 values to z-scores by subtracting their mean (16.47) and dividing by their standard deviation (6.55). Figure 7.3 shows the resulting distribution.
Figure 7.3 The CAPE10 values as z-scores. [Histogram: # of Months against CAPE10 z-scores.]
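As a small illustration of the conversion just described, the sketch below standardizes the two extreme CAPE10 values quoted earlier in the section, using the mean (16.47) and standard deviation (6.55) given in the text. The function name `z_score` is ours, chosen only for this example.

```python
# Summary values quoted in the text for the CAPE10 series.
CAPE10_MEAN = 16.47
CAPE10_SD = 6.55


def z_score(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / sd


# The record low (1920) and record high (late 1999) mentioned in the text.
z_low = z_score(4.78, CAPE10_MEAN, CAPE10_SD)
z_high = z_score(44.20, CAPE10_MEAN, CAPE10_SD)

print(f"low:  z = {z_low:.2f}")
print(f"high: z = {z_high:.2f}")
```

The high value comes out more than four standard deviations above the mean, while the low value is less than two standard deviations below it.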
It’s easy to see that the z-scores have the same distribution as the original values, but now we can also see that the largest of them is above 4. How extraordinary is it for a value to be four standard deviations away from the mean? Fortunately, there’s a fact about unimodal, symmetric distributions that can guide us.1
The 68–95–99.7 Rule
In a unimodal, symmetric distribution, about 68% of the values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7%—almost all—fall within 3 standard deviations of the mean. Calling this rule the 68–95–99.7 Rule provides a mnemonic for these three values.2
1 All of the CAPE10 values in the right tail occurred after 1993. Until that time the distribution of CAPE10 values was quite symmetric and clearly unimodal.
2 This rule is also called the “Empirical Rule” because it originally was observed without any proof. It was first published by Abraham de Moivre in 1733, 75 years before the underlying reason for it—which we’re about to see—was known.
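The three percentages in the rule can be checked numerically if we assume a bell-shaped (Normal) model, which is the model the chapter goes on to introduce. The sketch below uses `statistics.NormalDist` from Python's standard library (Python 3.8 or later); that tool is our choice for illustration and is not something the text relies on.

```python
from statistics import NormalDist

# Standard Normal model: mean 0, standard deviation 1 (an assumption here).
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that a standard Normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

Rounded to whole or tenth percentages, the three results match the 68%, 95%, and 99.7% of the rule.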
Figure 7.4 The 68–95–99.7 Rule tells us how much of most unimodal, symmetric models is found within one, two, or three standard deviations of the mean. [Bell-shaped curve with the central 68%, 95%, and 99.7% regions marked between $-3\sigma$ and $3\sigma$.]
For Example An extraordinary day for the Dow?
After the financial crisis of 2007/2008, the Dow Jones Industrial Average (DJIA) improved from a low of 7278 on March 20, 2009, to new records over 15,000 just 4 years later. But on August 8, 2011, the Dow dropped 634.8 points. Although that wasn’t the most ever lost in a day, it sent shock waves through the financial community. During the year from mid-2011 to mid-2012, the mean daily change in the DJIA was 1.87, with a standard deviation of 155.28 points. A histogram of day-to-day changes in the DJIA looked like this:
[Histogram: # of Days against Change in DJIA, from about −600 to 400.]
Question Use the 68–95–99.7 Rule to characterize how extraordinary the change on August 8, 2011 was. Is the rule appropriate?
Answer The histogram is unimodal and symmetric, so the 68–95–99.7 Rule is an appropriate model. The z-score corresponding to the August 8 change is
$$z = \frac{-634.8 - 1.87}{155.28} = -4.10.$$
A z-score more than 3 standard deviations below the mean will occur with a probability of less than 0.0015, or about once every 3 years for daily values. A z-score beyond 4 in magnitude is even less likely. This was a truly extraordinary event.
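The arithmetic in this answer can be retraced with a few lines of code. The sketch below assumes a Normal model for the tail probability and, to turn that probability into "years," assumes roughly 252 trading days per year; neither assumption is stated in the example itself.

```python
from statistics import NormalDist

# Figures quoted in the example.
mean_change = 1.87
sd_change = 155.28
drop = -634.8  # change in the DJIA on August 8, 2011

z = (drop - mean_change) / sd_change

# Lower-tail probabilities under a standard Normal model (an assumption).
p_below_minus_3 = NormalDist().cdf(-3)
p_below_z = NormalDist().cdf(z)

# Assumed number of trading days in a year (not from the text).
TRADING_DAYS_PER_YEAR = 252
years_between = 1 / p_below_minus_3 / TRADING_DAYS_PER_YEAR

print(f"z = {z:.2f}")
print(f"P(z < -3) = {p_below_minus_3:.5f}")
print(f"P(z < {z:.2f}) = {p_below_z:.6f}")
print(f"a day below -3 SD about once every {years_between:.1f} years")
```

Under these assumptions a drop of three standard deviations turns up roughly once every three years of trading, and a drop as large as the one observed is far rarer still.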
7.2 The Normal Distribution
“All models are wrong—but some are useful.” —George Box, famous statistician
The 68–95–99.7 Rule is useful in describing how unusual a z-score is. But often in business we want a more precise answer than one of these three values. To say more about how big we expect a z-score to be, we need to model the data’s distribution. There is no universal standard for z-scores, but there is a model that shows up over and over in Statistics. You’ve probably heard of “bell-shaped curves.” Statisticians call them Normal (or Gaussian) distributions. Normal distributions are appropriate models for distributions whose shapes are unimodal and roughly symmetric. There is a Normal distribution for every possible combination of mean and standard deviation. We write $N(\mu, \sigma)$ to represent a Normal distribution with a mean of $\mu$ and a standard deviation of $\sigma$. We use Greek symbols here because this mean and standard deviation are parameters of the model, not summaries based on data. We can compute z-scores based on this model by using the parameters $\mu$ and $\sigma$. We still call these standardized values z-scores. We write
$$z = \frac{y - \mu}{\sigma}.$$
Standardized values have mean 0 and standard deviation 1, so by doing this to our values, we’ll need only one model—the model $N(0, 1)$. The Normal distribution with mean 0 and standard deviation 1 is called the standard Normal distribution (or the standard Normal model). But be careful. You shouldn’t use a Normal model for just any data set. Remember that standardizing won’t change the shape of the distribution. If the distribution is not unimodal and symmetric to begin with, standardizing won’t make it Normal.
Notation Alert $N(\mu, \sigma)$ always denotes a Normal. The $\mu$, pronounced “mew,” is the Greek letter for “m,” and always represents the mean in a model. The $\sigma$, sigma, is the lowercase Greek letter for “s,” and always represents the standard deviation in a model.
Is Normal Normal? Don’t be misled. The name “Normal” doesn’t mean that these are the usual shapes for histograms. The name follows a tradition of positive thinking in Mathematics and Statistics in which functions, equations, and relationships that are easy to work with or have other nice properties are called “normal,” “common,” “regular,” “natural,” or similar terms. It’s as if by calling them ordinary, we could make them actually occur more often and make our lives simpler.
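The warning that standardizing does not change a distribution's shape can be seen in a short experiment. The sketch below standardizes a small, made-up, right-skewed data set and compares a simple skewness measure (the average cubed z-score) before and after; the data values and helper names are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert each value to a z-score using the data's own mean and SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


# Hypothetical right-skewed data, invented for this illustration.
data = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]
z_scores = standardize(data)

print(f"mean of z-scores: {mean(z_scores):.3f}")
print(f"SD of z-scores:   {pstdev(z_scores):.3f}")
print(f"skewness before:  {skewness(data):.3f}")
print(f"skewness after:   {skewness(z_scores):.3f}")
```

The z-scores have mean 0 and standard deviation 1, but the skewness is the same before and after, so a Normal model would still be a poor fit for data like these.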
Just Checking
1 Your Accounting teacher has announced that the lower of your two tests will be dropped. You got a 90 on test 1 and an 80 on test 2. You’re all set to drop the 80 until she announces that she grades “on a curve.” She standardized the scores in order to decide which is the lower one. If the mean on the first test was 88 with a standard deviation of 4 and the mean on the second was 75 with a standard deviation of 5,
a) Which one will be dropped?
b) Does this seem “fair”?
The Normal distribution differs from the discrete probability distributions we saw in Chapter 6 because now, the random variable can take on any value. So, we need a continuous random variable. For any continuous random variable, the distribution of its probability can be shown with a curve called the probability density function (pdf), usually denoted as $f(x)$. The curve we use to work with the Normal distribution is called the Normal probability density function. The probability density function (pdf) doesn’t give the probability directly as the probability models for discrete random variables did. Instead the pdf gives the probability from the area below its curve. For the standard Normal, shown in Figure 7.5, we can see that the area below the curve between −1 and 1 is about 68%, which is where the 68–95–99.7 Rule comes from.
Figure 7.5 The standard Normal density function (with mean 0 and standard deviation 1). The probability of finding a z-score in any interval is the area over that interval under the curve. For example, the probability that the z-score falls between −1 and 1 is about 68%, which can be seen approximately from the density function or found more precisely from a table or technology. [Plot: Density (0.0 to 0.4) against z-score (−3 to 3).]
Is the Standard Normal a Standard? Yes. We call it the “Standard Normal” because it models standardized values. It is also a “standard” because this is the particular Normal model that we almost always use.
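To make the idea of "probability as area under the density curve" concrete, the sketch below evaluates the Normal density and approximates the area between −1 and 1 with the trapezoid rule. The density formula used here is the standard one for a Normal model; it is not written out in the text, and the function names are ours.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the Normal density curve at x (standard formula, not from the text)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_pdf(a, b, steps=10_000):
    """Trapezoid-rule approximation to the area under the standard Normal pdf on [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a) + normal_pdf(b))
    for i in range(1, steps):
        total += normal_pdf(a + i * width)
    return total * width


print(f"height at 0:        {normal_pdf(0):.4f}")
print(f"area from -1 to 1:  {area_under_pdf(-1, 1):.4f}")
```

The height of the curve at 0 is a density of about 0.4, not a probability; only the area, about 0.68 here, is a probability.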
It’s important to remember that the probability density function $f(x)$ isn’t equal to $P(X = x)$. In fact, for a continuous random variable, X, $P(X = x)$ is 0 for every value of x! That may seem strange at first, but since the probability is the area under the curve over an interval, as the interval gets smaller and smaller, the probability does too. Finally, when the interval is just a point, there is no area—and no probability. (See the box below.)
How Can Every Value Have Probability 0?
We can find a probability for any interval of z-scores. But the probability for a single z-score is zero. How can that be? Let’s look at the standard Normal random variable, Z. We could find (from a table, website, or computer program) that the probability that Z lies between 0 and 1 is 0.3413.
[Standard Normal density plot: Density against Z.]
That’s the area under the Normal pdf (in red) between the values 0 and 1. So, what’s the probability that Z is between 0 and 1/10?
[Standard Normal density plot: Density against Z.]
That area is only 0.0398. What is the chance then that Z will fall between 0 and 1/100? There’s not much area—the probability is only 0.0040. If we kept going, the probability would keep getting smaller. The probability that Z is between 0 and 1/100,000 is less than 0.0001.
[Standard Normal density plot: Density against Z.]
So, what’s the probability that Z is exactly 0? Well, there’s no area under the curve right at x = 0, so the probability is 0. It’s only intervals that have positive probability, but that’s OK. In real life we never mean exactly 0.0000000000 or any other value. If you say “exactly 164 pounds,” you might really mean between 163.5 and 164.5 pounds or even between 163.99 and 164.01 pounds, but realistically not 164.000000000… pounds.
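The shrinking-interval argument in this box can be reproduced numerically. The sketch below uses `statistics.NormalDist` from Python's standard library (our choice of tool, not the text's) to compute the probability that Z falls between 0 and each of the interval widths mentioned above.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1


def prob_between_zero_and(width):
    """P(0 < Z < width) for a standard Normal random variable Z."""
    return Z.cdf(width) - Z.cdf(0)


for width in (1, 1 / 10, 1 / 100, 1 / 100_000):
    print(f"P(0 < Z < {width:g}) = {prob_between_zero_and(width):.7f}")
```

Each time the interval shrinks by a factor of 10, the probability shrinks by roughly the same factor, heading toward 0 as the interval closes down to a single point.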
Finding Normal Percentiles
Finding the probability that a value is at least 1 SD above the mean is easy. We know that 68% of the values lie within 1 SD of the mean, so 32% lie farther away. Since the Normal distribution is symmetric, half of those 32% (or 16%) are more than 1 SD above the mean. But what if we want to know the percentage of observations that fall more than 1.8 SD above the mean? We already know that no more than 16% of observations have z-scores above 1. By similar reasoning, no more than 2.5% of the observations have a z-score above 2. Can we be more precise with our answer than “between 16% and 2.5%”?
Figure 7.6 A table of Normal percentiles (Table Z in Appendix B) lets us find the percentage of individuals in a standard Normal distribution falling below any specified z-score value. [Standard Normal curve with z = 1.80 marked, alongside this excerpt from the table:]
z      .00      .01
1.7    0.9554   0.9564
1.8    0.9641   0.9649
1.9    0.9713   0.9719
With a z-score we can use the standard Normal distribution to find the probabilities we’re interested in. These days, we can find probabilities associated with z-scores using technology such as calculators, statistical software, and websites. We can also look up these values in a table of Normal percentiles.3 Tables use the standard Normal distribution, so we’ll have to convert our data to z-scores before using the table. Our value 1.8 SD above the mean is a z-score of 1.80. To use a table, as shown in Figure 7.6, find the z-score by looking down the left column for the first two digits (1.8) and across the top row for the third digit, 0. The table gives the percentile as 0.9641. That means that 96.4% of the z-scores are less than 1.80. Since the total area is always 1, and 1 − 0.9641 = 0.0359, we know that only 3.6% of all observations from a Normal distribution have z-scores higher than 1.80.
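For readers using technology rather than the table, the lookup for z = 1.80 might be sketched as follows. `statistics.NormalDist` is part of Python's standard library (3.8 or later); using it here is our illustration, not a method prescribed by the text.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

below = standard_normal.cdf(1.80)  # area to the left of z = 1.80
above = 1 - below                  # total area is 1, so the rest lies above

print(f"P(z < 1.80) = {below:.4f}")
print(f"P(z > 1.80) = {above:.4f}")
```

The two printed values agree with the table entry and with the subtraction carried out in the paragraph above.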
For Example GMAT scores and the Normal model
The Graduate Management Admission Test (GMAT) has scores from 200 to 800. Scores are supposed to follow a distribution that is roughly unimodal and symmetric and is designed to have an overall mean of 500 and a standard deviation of 100. In any one year, the mean and standard deviation may differ from these target values by a small amount, but we can use these values as good overall approximations.
Question Suppose you earned a 600 on your GMAT test. From that information and the 68–95–99.7 Rule, where do you stand among all students who took the GMAT?
Answer Because we’re told that the distribution is unimodal and symmetric, we can approximate the distribution with a Normal model. We are also told the scores have a mean of 500 and an SD of 100. So, we’ll use a $N(500, 100)$. It’s good practice at this point to draw the distribution. Find the score whose percentile you want to know and locate it on the picture. When you finish the calculation, you should check to make sure that it’s a reasonable percentile from the picture.
[Normal model sketch: GMAT scores from 200 to 800, centered at 500.]
3 See Table Z in Appendix B. Many calculators and statistics computer packages do this as well.
A score of 600 is 1 SD above the mean. That corresponds to one of the points in the 68–95–99.7 Rule. About 32% (100% − 68%) of those who took the test were more than one standard deviation from the mean, but only half of those were on the high side. So about 16% (half of 32%) of the test scores were better than 600.
For Example More GMAT scores
Question Assuming the GMAT scores are nearly Normal with $N(500, 100)$, what proportion of GMAT scores falls between 450 and 600?
Answer The first step is to find the z-scores associated with each value. Standardizing the scores we are given, we find that for 600, $z = (600 - 500)/100 = 1.0$ and for 450, $z = (450 - 500)/100 = -0.50$. Then, we can label the axis below the picture either in the original values or the z-scores or even use both scales as the following picture shows.
[Normal model sketch with two scales, z-scores from −3 to 3 and GMAT scores from 200 to 800; the area between z = −0.5 and z = 1.0 is labeled 0.533.]
From Table Z, we find that the area for $z \le 1.0$ is 0.8413, which means that 84.13% of scores fall below 1.0, and the area for $z \le -0.50$ is 0.3085, which means that 30.85% of the values fall below −0.5, so the proportion of z-scores between them is 84.13% − 30.85% = 53.28%. So, the Normal model estimates that about 53.3% of GMAT scores fall between 450 and 600.
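The same "area between two values" calculation can be sketched in code by subtracting two cumulative areas. The example assumes the $N(500, 100)$ model from the question and uses `statistics.NormalDist` from Python's standard library as a stand-in for Table Z.

```python
from statistics import NormalDist

# Normal model for GMAT scores, as assumed in the example.
gmat = NormalDist(mu=500, sigma=100)

p_below_600 = gmat.cdf(600)  # area to the left of 600 (z = 1.0)
p_below_450 = gmat.cdf(450)  # area to the left of 450 (z = -0.5)
p_between = p_below_600 - p_below_450

print(f"P(score < 600) = {p_below_600:.4f}")
print(f"P(score < 450) = {p_below_450:.4f}")
print(f"P(450 < score < 600) = {p_between:.4f}")
```

Working directly in GMAT points gives the same answer as converting to z-scores first, because the model's cumulative area function does the standardizing internally.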
Finding areas from z-scores is the simplest way to work with the Normal distribution. Sometimes we start with areas and need to work backward to find the corresponding z-score or even the original data value. For instance, what z-score represents the first quartile, Q1, in a Normal distribution? In our first set of examples, we knew the z-score and used the table or technology to find the percentile. Now we want to find the cut point for the 25th percentile. Make a picture, shading the leftmost 25% of the area. Look in Table Z for an area of 0.2500. The exact area is not there, but 0.2514 is the closest number. That shows up in the table with -0.6 in the left margin and 0.07 in the top margin. The z-score for Q1, then, is approximately z = -0.67. Computers and calculators can determine the cut point more precisely (and more easily).4
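Working backward from an area to a z-score is an inverse lookup. As a sketch of how technology handles it, the code below uses the `inv_cdf` method of `statistics.NormalDist` (Python standard library, 3.8 or later) to find the first quartile of the standard Normal model and then checks the table value quoted above.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# z-score that cuts off the leftmost 25% of the area.
q1_z = standard_normal.inv_cdf(0.25)

# Area below the rounded table value z = -0.67, for comparison with Table Z.
area_at_table_value = standard_normal.cdf(-0.67)

print(f"z for Q1 = {q1_z:.4f}")
print(f"area below z = -0.67: {area_at_table_value:.4f}")
```

The more precise cut point differs from the table's −0.67 only in the third decimal place.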
For Example An exclusive MBA program
Question Suppose an MBA program says it admits only people with GMAT scores among the top 10%. How high a GMAT score does it take to be eligible?
4 We’ll often use those more precise values in our examples. If you’re finding the values from the table you may not get exactly the same number to all decimal places as your classmate who’s using a computer package.
Answer The program takes the top 10%, so their cutoff score is the 90th percentile. Draw an approximate picture like this one.
[Normal model sketch with two scales, z-scores from −3 to 3 and GMAT scores from 200 to 800; the top 10% of the area is shaded. Alongside it is this excerpt from Table Z:]
z      0.07     0.08     0.09
1.0    0.8577   0.8599   0.8621
1.1    0.8790   0.8810   0.8830
1.2    0.8980   0.8997   0.9015
1.3    0.9147   0.9162   0.9177
1.4    0.9292   0.9306   0.9319
From our picture we can see that the z-value is between 1 and 1.5 (if we’ve judged 10% of the area correctly), and so the cutoff score is between 600 and 650 or so. Using technology, you may be able to select the 10% area and find the z-value directly. Using a table, such as Table Z, locate 0.90 (or as close to it as you can; here 0.8997 is closer than 0.9015) in the interior of the table and find the corresponding z-score (see table above). Here the 1.2 is in the left margin, and the 0.08 is in the margin above the entry. Putting them together gives 1.28. Now, convert the z-score back to the original units. A z-score of 1.28 is 1.28 standard deviations above the mean. Since the standard deviation is 100, that’s 128 GMAT points. The cutoff is 128 points above the mean of 500, or 628. Because the program wants GMAT scores in the top 10%, the cutoff is 628. (Actually since GMAT scores are reported only in multiples of 10, you’d have to score at least a 630.)
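The steps of this answer (find the 90th-percentile z-score, convert back to GMAT points, then round up to a reportable score) can be sketched as below. The code assumes the $N(500, 100)$ model and uses `statistics.NormalDist` from Python's standard library; the variable names are ours.

```python
import math
from statistics import NormalDist

MEAN, SD = 500, 100  # GMAT model parameters from the example

# z-score with 90% of the area below it (top 10% above it).
z_cutoff = NormalDist().inv_cdf(0.90)

# Convert the z-score back to the original units.
score_cutoff = MEAN + z_cutoff * SD

# GMAT scores are reported in multiples of 10, so round up to the next one.
reported_cutoff = math.ceil(score_cutoff / 10) * 10

print(f"z cutoff = {z_cutoff:.2f}")
print(f"score cutoff = {score_cutoff:.0f}")
print(f"lowest reportable score that qualifies = {reported_cutoff}")
```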
Guided Example Cereal Company
A cereal manufacturer has a machine that fills the boxes. Boxes are labeled “16 oz,” so the company wants to have that much cereal in each box. But since no packaging process is perfect, there will be minor variations. If the machine is set at exactly 16 oz and the Normal distribution applies (or at least the distribution is roughly symmetric), then about half of the boxes will be underweight, making consumers unhappy and exposing the company to bad publicity and possible lawsuits. To prevent underweight boxes, the manufacturer has to set the mean a little higher than 16.0 oz. Based on their experience with the packaging machine, the company believes that the amount of cereal in the boxes fits a Normal distribution with a standard deviation of 0.2 oz. The manufacturer decides to set the machine to put an average of 16.3 oz in each box. Let’s use that model to answer a series of questions about these cereal boxes.
Question 1: What fraction of the boxes will be underweight?
Plan
Setup State the variable and the objective.
The variable is weight of cereal in a box. We want to determine what fraction of the boxes risk being underweight.
Model Check to see if a Normal distribution is appropriate.
We have no data, so we cannot make a histogram. But we are told that the company believes the distribution of weights from the machine is Normal.
Specify which Normal distribution to use.
We use an N(16.3, 0.2) model.
Do
Mechanics Make a graph of this Normal distribution. Locate the value you’re interested in on the picture, label it, and shade the appropriate region.
[Figure: N(16.3, 0.2) curve with the axis marked from 15.7 to 16.9 oz; the region below 16.0 oz is shaded.]
Estimate from the picture the percentage of boxes that are underweight. (This will be useful later to check that your answer makes sense.)
(It looks like a low percentage, maybe less than 10%.)
Convert your cutoff value into a z-score.
We want to know what fraction of the boxes will weigh less than 16 oz.
$z = \frac{y - \mu}{\sigma} = \frac{16 - 16.3}{0.2} = -1.50$
Look up the area in the Normal table, or use technology.
$P(y < 16) = P(z < -1.50) = 0.0668$
Report
Conclusion State your conclusion in the context of the problem.
We estimate that approximately 6.7% of the boxes will contain less than 16 oz of cereal.
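The Question 1 calculation can be checked with software. The sketch below assumes Python's standard-library `statistics.NormalDist`; the names `boxes`, `z`, and `fraction_underweight` are illustrative only.

```python
from statistics import NormalDist

# Model from the text: fill weights follow N(16.3, 0.2), in ounces.
boxes = NormalDist(mu=16.3, sigma=0.2)

# z-score of the 16 oz label weight.
z = (16 - boxes.mean) / boxes.stdev

# Area below 16 oz, i.e. the fraction of underweight boxes.
fraction_underweight = boxes.cdf(16)

print(f"z = {z:.2f}, P(y < 16) = {fraction_underweight:.4f}")
```

The result should match the table value of about 0.0668, and it agrees with the rough estimate from the picture (less than 10%).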
Question 2: The company’s lawyers say that 6.7% is too high. They insist that no more than 4% of the boxes can be underweight. So the company needs to set the machine to put a little more cereal in each box. What mean setting do they need?
Plan
Setup State the variable and the objective.
The variable is weight of cereal in a box. We want to determine a setting for the machine.
Model Check to see if a Normal model is appropriate.
We have no data, so we cannot make a histogram. But we are told that a Normal model applies.
Specify which Normal distribution to use. This time you are not given a value for the mean!
We don’t know $\mu$, the mean amount of cereal. The standard deviation for this machine is 0.2 oz. The model, then, is $N(\mu, 0.2)$.
We found out earlier that setting the machine to $\mu = 16.3$ oz made 6.7% of the boxes too light. We’ll need to raise the mean a bit to reduce this fraction.
We are told that no more than 4% of the boxes can be below 16 oz.
Do
Mechanics Make a graph of this Normal distribution. Center it at $\mu$ (since you don’t know the mean) and shade the region below 16 oz.
[Figure: Normal curve centered at $\mu$ with the region below 16 oz shaded.]
Using the Normal table, a calculator, or software, find the z-score that cuts off the lowest 4%.
The z-score that has 0.04 area to the left of it is z = -1.75.
Use this information to find $\mu$. It’s located 1.75 standard deviations to the right of 16.
Since 16 must be 1.75 standard deviations below the mean, we need to set the mean at $16 + 1.75 \times 0.2 = 16.35$.
Report
Conclusion State your conclusion in the context of the problem.
The company must set the machine to average 16.35 oz of cereal per box.
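Question 2 runs the Normal model backwards: start from an area, find the z-score, then solve for the mean. A minimal sketch, assuming Python's `statistics.NormalDist` and using our own names (`limit`, `max_underweight`, `required_mean`):

```python
from statistics import NormalDist

sigma = 0.2             # standard deviation of the machine, in oz (from the text)
limit = 16.0            # label weight, in oz
max_underweight = 0.04  # at most 4% of boxes may fall below the limit

# z-score with 4% of the standard Normal area to its left (about -1.75).
z = NormalDist().inv_cdf(max_underweight)

# The limit must sit |z| standard deviations below the mean.
required_mean = limit - z * sigma

print(f"z = {z:.2f}, required mean = {required_mean:.2f} oz")
```

Software keeps more decimal places for z than the table does, so the mean it reports may differ slightly from 16.35 before rounding.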
Question 3: The company president vetoes that plan, saying the company should give away less free cereal, not more. Her goal is to set the machine no higher than 16.2 oz and still have only 4% underweight boxes. The only way to accomplish this is to reduce the standard deviation. What standard deviation must the company achieve, and what does that mean about the machine?
Plan
Setup State the variable and the objective.
The variable is weight of cereal in a box. We want to determine the necessary standard deviation to have only 4% of boxes underweight.
Model Check that a Normal model is appropriate.
The company believes that the weights are described by a Normal distribution.
Specify which Normal distribution to use. This time you don’t know $\sigma$.
Now we know the mean, but we don’t know the standard deviation. The model is therefore $N(16.2, \sigma)$.
We know the new standard deviation must be less than 0.2 oz.
Do
Mechanics Make a graph of this Normal distribution. Center it at 16.2, and shade the area you’re interested in. We want 4% of the area to the left of 16 oz.
[Figure: Normal curve centered at 16.2 oz with the region below 16 oz shaded.]
Find the z-score that cuts off the lowest 4%. Solve for $\sigma$. (Note that we need 16 to be 1.75 $\sigma$’s below 16.2, so $1.75\sigma$ must be 0.2 oz. You could just start with that equation.)
We already know that the z-score with 4% below it is z = -1.75.
$z = \frac{y - \mu}{\sigma}$
$-1.75 = \frac{16 - 16.2}{\sigma}$
$1.75\sigma = 0.2$
$\sigma = 0.114$
Report
Conclusion State your conclusion in the context of the problem. As we expected, the standard deviation is lower than before; actually, quite a bit lower.
The company must get the machine to box cereal with a standard deviation of only 0.114 oz. This means the machine must be more consistent (by nearly a factor of 2) in filling the boxes.
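The same inverse lookup solves Question 3, this time for the standard deviation. The sketch assumes Python's `statistics.NormalDist`; `target_mean`, `limit`, and `required_sigma` are names chosen for illustration.

```python
from statistics import NormalDist

target_mean = 16.2  # highest mean setting the president will accept, in oz
limit = 16.0        # label weight, in oz

# z-score that leaves 4% of the area below it (about -1.75).
z = NormalDist().inv_cdf(0.04)

# Rearranging z = (limit - mean) / sigma to solve for sigma.
required_sigma = (limit - target_mean) / z

print(f"required standard deviation = {required_sigma:.3f} oz")
```

Comparing `required_sigma` with the current 0.2 oz gives the "nearly a factor of 2" improvement in consistency mentioned in the conclusion.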
Just Checking
2 As a group, the Dutch are among the tallest people in the world. The average Dutch man is 184 cm tall, just over 6 feet (and the average Dutch woman is 170.8 cm tall, just over 5′7″). If a Normal model is appropriate and the standard deviation for men is about 8 cm, what percentage of all Dutch men will be over 2 meters (6′6″) tall?
3 Suppose it takes you 20 minutes, on average, to drive to work, with a standard deviation of 2 minutes. Suppose a Normal model is appropriate for the distributions of driving times. a) How often will you arrive at work in less than 22 minutes? b) How often will it take you more than 24 minutes? c) Do you think the distribution of your driving times is unimodal and symmetric? d) What does this say about the accuracy of your prediction? Explain.
7.3 Normal Probability Plots
Before using a Normal model you should check that the data follow a distribution that is at least close to Normal. You can check that the histogram is unimodal and symmetric, but there is also a specialized graphical display that can help you to decide whether the Normal model is appropriate: the Normal probability plot. If the distribution of the data is roughly Normal, the plot is roughly a diagonal straight line. Deviations from a straight line indicate that the distribution is not Normal. This plot is usually able to show deviations from Normality more clearly than the corresponding histogram, but it’s usually easier to understand how a distribution fails to be Normal by looking at its histogram. Normal probability plots are difficult to make by hand, but are provided by most statistics software.
Some data on a car’s fuel efficiency provide an example of data that are nearly Normal. The overall pattern of the Normal probability plot is straight. The two trailing low values correspond to the values in the histogram that trail off the low end. They’re not quite in line with the rest of the data set. The Normal probability plot shows us that they’re a bit lower than we’d expect of the lowest two values in a Normal distribution.
Figure 7.7 Histogram and Normal probability plot for gas mileage (mpg) recorded for a Nissan Maxima. The vertical axes are the same, so each dot on the probability plot would fall into the bar on the histogram immediately to its left. [Axes: mpg (14 to 29) against Normal Scores (-1.25 to 2.50).]
By contrast, the Normal probability plot of a sample of men’s Weights in Figure 7.8 from a study of lifestyle and health is far from straight. The weights are skewed to the high end, and the plot is curved. We’d conclude from these pictures that approximations using the Normal model for these data would not be very accurate.
Figure 7.8 Histogram and Normal probability plot for men’s weights. Note how a skewed distribution corresponds to a bent probability plot. [Axes: Weights (150 to 300) against Normal Scores (-2 to 2).]
For Example
Using a Normal probability plot
A Normal probability plot of the CAPE10 prices from page 239 looks like this:
[Figure: Normal probability plot of CAPE (10 to 40) against Normal Scores (-2 to 2).]
Question What does this plot say about the distribution of the CAPE10 scores?
Answer The bent shape of the probability plot indicates a deviation from Normality. The upward bend is because the distribution is skewed to the high end. The “kink” in that bend suggests a collection of values that don’t continue that skewness consistently. We should probably not use a Normal model for these data.
How Does a Normal Probability Plot Work?
Figure 7.9 shows a Normal probability plot for 100 fuel efficiency measures for a car. The smallest of these has a z-score of -3.16. The Normal model can tell us what value to expect for the smallest z-score in a batch of 100 if a Normal model were appropriate. That turns out to be -2.58. So our first data value is smaller than we would expect from the Normal. We can continue this and ask a similar question for each value. For example, the 14th-smallest fuel efficiency has a z-score of almost exactly -1, and that’s just what we should expect (-1.1 to be exact). The easiest way to make the comparison, of course, is to graph it.⁵ If our observed values look like a sample from a Normal model, then the probability plot stretches out in a straight line from lower left to upper right. But if our values deviate from what we’d expect, the plot will bend or have jumps in it.
The values we’d expect from a Normal model are called Normal scores, or sometimes nscores. You can’t easily look them up in the table, so probability plots are best made with technology and not by hand. The best advice on using Normal probability plots is to see whether they are straight. If so, then your data look like data from a Normal model. If not, make a histogram to understand how they differ from the model.
Figure 7.9 A Normal probability plot lines up the sorted data values against the Normal scores that we’d expect for a sample of that size. The straighter the line, the closer the data are to a Normal model. The mileage data look quite Normal. [Axes: mpg (14 to 29) against Normal Scores (-1.25 to 2.50).]
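To make the idea of Normal scores concrete, here is a small sketch of how software might compute them. The text does not say which formula its software uses; the plotting position (i - 0.5)/n below is our assumption, chosen because it is a common convention and because it reproduces the two values quoted above for a batch of 100. The function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist

def normal_scores(n):
    """Expected z-scores for the sorted values in a sample of size n.

    Assumption: the i-th smallest value is matched to the standard Normal
    quantile at (i - 0.5) / n. Other software may use a different formula.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

scores = normal_scores(100)
print(round(scores[0], 2))   # Normal score for the smallest of 100 values
print(round(scores[13], 2))  # Normal score for the 14th-smallest value
```

Plotting the sorted data against `scores` gives the Normal probability plot; a roughly straight line suggests the data look Normal.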
⁵ Sometimes the Normal probability plot switches the two axes, putting the data on the x-axis and the z-scores on the y-axis.
7.4 The Distribution of Sums of Normals
Another reason Normal models show up so often is that they have some special properties. An important one is that the sum or difference of two independent Normal random variables is also Normal.
A company manufactures small stereo systems. At the end of the production line, the stereos are packaged and prepared for shipping. Stage 1 of this process is called “packing.” Workers must collect all the system components (a main unit, two speakers, a power cord, an antenna, and some wires), put each in plastic bags, and then place everything inside a protective form. The packed form then moves on to Stage 2, called “boxing,” in which workers place the form and a packet of instructions in a cardboard box and then close, seal, and label the box for shipping. The company says that times required for the packing stage are unimodal and symmetric and can be described by a Normal distribution with a mean of 9 minutes and standard deviation of 1.5 minutes. (See Figure 7.10.) The times for the boxing stage can also be modeled as Normal, with a mean of 6 minutes and standard deviation of 1 minute.
Figure 7.10 The Normal model for the packing stage with a mean of 9 minutes and standard deviation of 1.5 minutes. [Axes: Density (0.0 to 0.20) against time (4 to 14).]
The company is interested in the total time that it takes to get a system through both packing and boxing, so they want to model the sum of the two random variables. Fortunately, the special property that adding independent Normals yields another Normal allows us to apply our knowledge of Normal probabilities to questions about the sum or difference of independent random variables. To use this property of Normals, we’ll need to check two assumptions: that the variables are independent and that they can be modeled by the Normal distribution.
Guided Example
Packaging Stereos Consider the company that manufactures and ships small stereo systems that we discussed previously. If the time required to pack the stereos can be described by a Normal distribution, with a mean of 9 minutes and standard deviation of 1.5 minutes, and the times for the boxing stage can also be modeled as Normal, with a mean of 6 minutes and standard deviation of 1 minute, what is the probability that packing an order of two systems takes over 20 minutes? What percentage of the stereo systems takes longer to pack than to box?
Question 1: What is the probability that packing an order of two systems takes more than 20 minutes?
Plan
Setup State the problem.
We want to estimate the probability that packing an order of two systems takes more than 20 minutes.
Variables Define your random variables. Write an appropriate equation for the variables you need.
Let $P_1$ = time for packing the first system
$P_2$ = time for packing the second system
$T$ = total time to pack two systems
$T = P_1 + P_2$
Think about the model assumptions.
✓ Normal Model Assumption. We are told that packing times are well modeled by a Normal model, and we know that the sum of two Normal random variables is also Normal.
✓ Independence Assumption. There is no reason to think that the packing time for one system would affect the packing time for the next, so we can reasonably assume the two are independent.
Do
Mechanics Find the expected value. (Expected values always add.)
$E(T) = E(P_1 + P_2) = E(P_1) + E(P_2) = 9 + 9 = 18$ minutes
Find the variance. For sums of independent random variables, variances add. (In general, we don’t need the variables to be Normal for this to be true, just independent.)
Since the times are independent,
$Var(T) = Var(P_1 + P_2) = Var(P_1) + Var(P_2) = 1.5^2 + 1.5^2$
$Var(T) = 4.50$
Find the standard deviation.
$SD(T) = \sqrt{4.50} \approx 2.12$ minutes
Now we use the fact that both random variables follow Normal distributions to say that their sum is also Normal.
We can model the time, T, with a N(18, 2.12) model.
Sketch a picture of the Normal distribution for the total time, shading the region representing over 20 minutes.
[Figure: N(18, 2.12) curve with the region above 20 minutes (z = 0.94) shaded.]
Find the z-score for 20 minutes. Use technology or a table to find the probability.
$z = \frac{20 - 18}{2.12} = 0.94$
$P(T > 20) = P(z > 0.94) = 0.1736$
Report
Conclusion Interpret your result in context.
Memo Re: Computer systems packing Using past history to build a model, we find slightly more than a 17% chance that it will take more than 20 minutes to pack an order of two stereo systems.
Question 2: What percentage of stereo systems take longer to pack than to box?
Plan
Setup State the question.
We want to estimate the percentage of the stereo systems that takes longer to pack than to box.
Variables Define your random variables.
Let P = time for packing a system
B = time for boxing a system
D = difference in times to pack and box a system
Write an appropriate equation.
D = P - B
What are we trying to find? Notice that we can tell which of two quantities is greater by subtracting and asking whether the difference is positive or negative.
A system that takes longer to pack than to box will have $P > B$, and so D will be positive. We want to find $P(D > 0)$.
Remember to think about the assumptions.
✓ Normal Model Assumption. We are told that both random variables are well modeled by Normal distributions, and we know that the difference of two Normal random variables is also Normal.
✓ Independence Assumption. There is no reason to think that the packing time for a system will affect its boxing time, so we can reasonably assume the two are independent.
Do
Mechanics Find the expected value.
$E(D) = E(P - B) = E(P) - E(B) = 9 - 6 = 3$ minutes
For the difference of independent random variables, the variance is the sum of the individual variances.
Since the times are independent,
$Var(D) = Var(P - B) = Var(P) + Var(B) = 1.5^2 + 1^2$
$Var(D) = 3.25$
Find the standard deviation.
$SD(D) = \sqrt{3.25} \approx 1.80$ minutes
State what model you will use.
We can model D with N(3, 1.80).
Sketch a picture of the Normal distribution for the difference in times and shade the region representing a difference greater than zero.
[Figure: N(3, 1.80) curve with the region above 0 (z = -1.67) shaded.]
Find the z-score. Then use a table or technology to find the probability.
$z = \frac{0 - 3}{1.80} = -1.67$
$P(D > 0) = P(z > -1.67) = 0.9525$
Report
Conclusion Interpret your result in context.
Memo Re: Computer systems packing In our second analysis, we found that just over 95% of all the stereo systems will require more time for packing than for boxing.
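Both questions in this Guided Example can be reproduced with software. The sketch below assumes Python's standard-library `statistics.NormalDist`, whose `+` and `-` operators combine two distributions as if the underlying variables were independent, so the variances add in both cases. All variable names are ours.

```python
from statistics import NormalDist

# Models from the text, in minutes.
first_pack = NormalDist(mu=9, sigma=1.5)   # packing time, first system
second_pack = NormalDist(mu=9, sigma=1.5)  # packing time, second system
boxing = NormalDist(mu=6, sigma=1)         # boxing time for one system

# Question 1: total time to pack two systems, T = P1 + P2.
total = first_pack + second_pack
p_over_20 = 1 - total.cdf(20)

# Question 2: difference between packing and boxing times, D = P - B.
diff = first_pack - boxing
p_pack_longer = 1 - diff.cdf(0)

print(f"T ~ N({total.mean:.0f}, {total.stdev:.2f}), P(T > 20) = {p_over_20:.3f}")
print(f"D ~ N({diff.mean:.0f}, {diff.stdev:.2f}), P(D > 0) = {p_pack_longer:.3f}")
```

Note that adding two separate packing-time variables is not the same as doubling one of them: `2 * first_pack` would have standard deviation 3, not 2.12. The probabilities may differ from the table-based answers in the third decimal place because the z-scores are not rounded to two decimals first.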
7.5 The Normal Approximation for the Binomial
In the previous chapter we modeled the number of successes of a series of trials with a Binomial. Suppose we send out 1000 flyers advertising a free cup of coffee at our new cafe and we think that the probability that someone will come is about 0.10. We might want to know the chance that at least 120 people will come to claim their coffee. We could use the binomial to calculate that with n = 1000 and p = 0.10. We know that the probability that exactly 120 people will come is $\binom{1000}{120} \times (0.10)^{120} \times (0.90)^{880}$ (about 0.005). But that’s not the answer. We want to know the probability that at least 120 will show up, so we have to calculate a probability for 121, 122, 123, … and all the way up to 1000. There must be a better way. And there is. The Normal distribution can approximate the Binomial. The Binomial model for our cafe has mean $np = 100$ and standard deviation $\sqrt{npq} \approx 9.5$. We might just try to approximate its distribution with a Normal distribution using the same mean and standard deviation. Remarkably enough, that turns out to be a very good approximation. Using that mean and standard deviation, we can find the probability:
$P(X \ge 120) = P\left(z \ge \frac{120 - 100}{9.5}\right) \approx P(z \ge 2.11) \approx 0.0174$
Recall That This Notation: $\binom{1000}{120}$ means “1000 choose 120”. We first saw this notation in Chapter 6 on page 221. Look back there if you need a reminder.
There seems to be only about a 1.7% chance that at least 120 people will show up. (Adding up all 881 probabilities using the Binomial agrees with this to 3 decimal places!)
We can’t always use a Normal distribution to make estimates of Binomial probabilities. The success of the approximation depends on the sample size. Suppose we are searching for a prize in cereal boxes, where the probability of finding a prize is 20%. If we buy five boxes, the actual Binomial probabilities that we get 0, 1, 2, 3, 4, or 5 prizes are 33%, 41%, 20%, 5%, 1%, and 0.03%, respectively. The histogram just below shows that this probability model is skewed. We shouldn’t try to estimate these probabilities by using a Normal model.
[Figure: histogram of the Binomial probabilities for the number of prizes (0 to 5) in five boxes.]
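The two situations above, the cafe flyers and the five cereal boxes, can be explored numerically. This is a sketch under our own assumptions: it uses only Python's standard library, the helper `binomial_pmf` is a name we made up, and it reports the exact Binomial tail alongside the Normal approximation so you can compare the two values yourself.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Exact Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Cafe example: n = 1000 flyers, p = 0.10, at least 120 customers.
n, p = 1000, 0.10
mu = n * p
sigma = sqrt(n * p * (1 - p))
normal_tail = 1 - NormalDist(mu, sigma).cdf(120)
exact_tail = sum(binomial_pmf(k, n, p) for k in range(120, n + 1))
print(f"Normal approximation: {normal_tail:.4f}")
print(f"Exact Binomial sum:   {exact_tail:.4f}")

# Cereal example: 5 boxes, 20% chance of a prize in each.
prizes = [binomial_pmf(k, 5, 0.2) for k in range(6)]
for k, prob in enumerate(prizes):
    print(f"P({k} prizes) = {prob:.4f}")
```

The prize probabilities pile up at 0 and 1 and trail off to the right, which is the skewness that makes a Normal model a poor choice for so few trials.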
But if we open 50 boxes of this cereal and count the number of prizes we find, we’ll get the histogram below. It is centered at $np = 50(0.2) = 10$ prizes, as expected, and it appears to be fairly symmetric around that center.
[Figure: histogram of the number of prizes found in 50 boxes; the horizontal axis runs from 5 to 20.]
A Normal distribution is a close enough approximation to the Binomial only for a large enough number of trials. And what we mean by “large enough” depends on the probability of success. We’d need a larger sample if the probability of success were very low (or very high). It turns out that a Normal distribution works pretty well if we expect to see at least 10 successes and 10 failures. We can check the Success/Failure Condition.
Success/Failure Condition: A Binomial model is approximately Normal if we expect at least 10 successes and 10 failures: $np \ge 10$ and $nq \ge 10$.
Why 10? Well, actually it’s 9, as revealed in the following Math Box.
Math Box
Why Check $np \ge 10$?
It’s easy to see where the magic number 10 comes from. You just need to remember how Normal models work. The problem is that a Normal model extends infinitely in both directions. But a Binomial model must have between 0 and n successes, so if we use a Normal to approximate a Binomial, we have to cut off its tails. That’s not very important if the center of the Normal model is so far from 0 and n that the lost tails have only a negligible area. More than three standard deviations should do it because a Normal model has little probability past that. So the mean needs to be at least 3 standard deviations from 0 and at least 3 standard deviations from n. Let’s look at the 0 end.
We require: $\mu - 3\sigma > 0$
Or, in other words: $\mu > 3\sigma$
For a Binomial that’s: $np > 3\sqrt{npq}$
Squaring yields: $n^2p^2 > 9npq$
Now simplify: $np > 9q$
Since $q \le 1$, we require: $np > 9$
For simplicity we usually demand that np (and nq for the other tail) be at least 10 to use the Normal approximation, which gives the Success/Failure Condition.⁶
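The Success/Failure Condition is easy to automate. The following is a minimal sketch; the function name `success_failure_ok` and its `threshold` parameter are our own, and the three calls simply re-check examples already used in this section.

```python
def success_failure_ok(n, p, threshold=10):
    """Return True if both expected successes (np) and failures (nq) reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

cafe = success_failure_ok(1000, 0.10)        # 100 successes, 900 failures expected
five_boxes = success_failure_ok(5, 0.20)     # only 1 success expected
fifty_boxes = success_failure_ok(50, 0.20)   # 10 successes, 40 failures expected

print(cafe, five_boxes, fifty_boxes)
```

A check like this only says whether the approximation is reasonable to try; it does not measure how accurate the resulting probability is.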
*The Continuity Correction
When we use a continuous model to model a set of discrete events, we may need to make an adjustment called the continuity correction. We approximated the Binomial distribution (50, 0.2) with a Normal distribution. But what does the Normal distribution say about the probability that X = 10? Every specific value in the Normal probability model has probability 0. That’s not the answer we want.
[Figure: histogram of the Binomial(50, 0.2) model; the horizontal axis runs from 5 to 20.]
Because X is really discrete, it takes on the exact values 0, 1, 2, … , 50, each with positive probability. The histogram holds the secret to the correction. Look at the bin corresponding to X = 10 in the histogram. It goes from 9.5 to 10.5. What we really want is to find the area under the Normal curve between 9.5 and 10.5. So when we use the Normal distribution to approximate discrete events, we go halfway to the next value on the left and/or the right. We approximate $P(X = 10)$ by finding $P(9.5 \le X \le 10.5)$. For a Binomial(50, 0.2), $\mu = 10$ and $\sigma = 2.83$. So
$P(9.5 \le X \le 10.5) \approx P\left(\frac{9.5 - 10}{2.83} \le z \le \frac{10.5 - 10}{2.83}\right) = P(-0.177 \le z \le 0.177) = 0.1405$
By comparison, the exact Binomial probability is 0.1398.
⁶ Looking at the final step, we see that we need $np > 9$ in the worst case, when q (or p) is near 1, making the Binomial model quite skewed. When q and p are near 0.5 (for example, between 0.4 and 0.6) the Binomial model is nearly symmetric, and $np > 5$ ought to be safe enough. Although we’ll always check for 10 expected successes and failures, keep in mind that for values of p near 0.5, we can be somewhat more forgiving.
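The continuity correction can be verified directly. The sketch assumes Python's standard library (`math.comb` and `statistics.NormalDist`); the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.2
mu = n * p                     # 10 expected prizes
sigma = sqrt(n * p * (1 - p))  # about 2.83

# Continuity correction: treat X = 10 as the interval from 9.5 to 10.5.
approx = NormalDist(mu, sigma)
p_normal = approx.cdf(10.5) - approx.cdf(9.5)

# Exact Binomial probability of exactly 10 prizes, for comparison.
p_exact = comb(n, 10) * p**10 * (1 - p) ** (n - 10)

print(f"Normal with continuity correction: {p_normal:.4f}")
print(f"Exact Binomial:                    {p_exact:.4f}")
```

The Normal value may differ from the hand calculation in the fourth decimal place because software does not round the z-scores before looking up the area.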
For Example
Using the Normal distribution
Some LCD panels have stuck or “dead” pixels that have defective transistors and are permanently unlit. If a panel has too many dead pixels, it must be rejected. A manufacturer knows that, when the production line is working correctly, the probability of rejecting a panel is 0.07.
Questions
1. How many screens do they expect to reject in a day’s production run of 500 screens? What is the standard deviation?
2. If they reject 40 screens today, is that a large enough number that they should be concerned that something may have gone wrong with the production line?
3. In the past week of 5 days of production, they’ve rejected 200 screens, an average of 40 per day. Should that raise concerns?
Answers
1. $\mu = 0.07 \times 500 = 35$ is the expected number of rejects.
$\sigma = \sqrt{npq} = \sqrt{500 \times 0.07 \times 0.93} = 5.7$
2. $P(X \ge 40) = P\left(z \ge \frac{40 - 35}{5.7}\right) = P(z \ge 0.877) \approx 0.19$, not an extraordinarily large number of rejects.
3. Using the Normal approximation: $\mu = 0.07 \times 2500 = 175$ and $\sigma = \sqrt{2500 \times 0.07 \times 0.93} = 12.757$.
$P(X \ge 200) = P\left(z \ge \frac{200 - 175}{12.757}\right) = P(z \ge 1.96) \approx 0.025$
Yes, this seems to be a number of rejects that would occur by chance rarely if nothing were wrong.
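The three answers can be recomputed in one pass. This is a sketch only: it assumes Python's standard library, and the helper `normal_tail_at_least` is a hypothetical name. Like the worked answers, it does not apply a continuity correction.

```python
from math import sqrt
from statistics import NormalDist

def normal_tail_at_least(x, n, p):
    """Normal approximation to P(X >= x) for a Binomial(n, p), without continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return mu, sigma, 1 - NormalDist(mu, sigma).cdf(x)

# One day: 500 screens, rejection probability 0.07, 40 rejects observed.
mu_day, sigma_day, p_day = normal_tail_at_least(40, 500, 0.07)

# One week: 5 days of 500 screens each, 200 rejects observed.
mu_week, sigma_week, p_week = normal_tail_at_least(200, 2500, 0.07)

print(f"day:  mean {mu_day:.0f}, SD {sigma_day:.1f}, P(X >= 40)  = {p_day:.2f}")
print(f"week: mean {mu_week:.0f}, SD {sigma_week:.3f}, P(X >= 200) = {p_week:.3f}")
```

The same average of 40 rejects per day is unremarkable for one day but unusual when sustained over a week, because the standard deviation of the weekly total grows more slowly than its mean.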
7.6 Other Continuous Random Variables
Many phenomena in business can be modeled by continuous random variables. The Normal model is important, but it is only one of many different models. Entire courses are devoted to studying which models work well in different situations, but we’ll introduce just two others that are commonly used: the uniform and the exponential.
The Uniform Distribution
We’ve already seen the discrete version of the uniform probability model. A continuous uniform shares the principle that all events should be equally likely, but with a continuous distribution we can’t talk about the probability of a particular value because each value has probability zero. Instead, for a continuous random variable X, we say that the probability that X lies in any interval depends only on the length of that interval. Not surprisingly the density function of a continuous uniform random variable looks flat. It can be defined by the formula
$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$
Figure 7.11 The density function of a continuous uniform random variable on the interval from a to b. [Axes: f(x) against x; the density is flat at height 1/(b - a) between a and b.]
From Figure 7.11, it’s easy to see that the probability that X lies in any interval between a and b is the same as any other interval of the same length. In fact, the probability is just the ratio of the length of the interval to the total length: b - a. In other words: For values c and d ($c \le d$) both within the interval [a, b]:
$P(c \le X \le d) = \frac{d - c}{b - a}$
As an example, suppose you arrive at a bus stop and want to model how long you’ll wait for the next bus. The sign says that busses arrive about every 20 minutes, but no other information is given. You might assume that the arrival is equally likely to be anywhere in the next 20 minutes, and so the density function would be
$f(x) = \begin{cases} \frac{1}{20} & \text{if } 0 \le x \le 20 \\ 0 & \text{otherwise} \end{cases}$
and would look as shown in Figure 7.12.
Figure 7.12 The density function of a continuous uniform random variable on the interval [0, 20]. Notice that the mean (the balancing point) of the distribution is at 10 minutes and that the area of the box is 1. [Axes: f(x) (0 to 0.10) against x (0 to 20).]
Just as the mean of a data distribution is the balancing point of a histogram, the mean of any continuous random variable is the balancing point of the density function. Looking at Figure 7.12, we can see that the balancing point is halfway between the end points at 10 minutes. In general, the expected value is:
$E(X) = \frac{a + b}{2}$
for a uniform distribution on the interval (a, b). With a = 0 and b = 20, the expected value would be 10 minutes. The variance and standard deviation are less intuitive:
$Var(X) = \frac{(b - a)^2}{12}; \quad SD(X) = \sqrt{\frac{(b - a)^2}{12}}$.
Using these formulas, our bus wait will have an expected value of 10 minutes with a standard deviation of $\sqrt{\frac{(20 - 0)^2}{12}} = 5.77$ minutes.
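The uniform formulas translate directly into code. In the sketch below the function names are our own, and the question "what is the chance of waiting 5 minutes or less?" is a hypothetical illustration that does not appear in the text.

```python
from math import sqrt

def uniform_prob(c, d, a, b):
    """P(c <= X <= d) for X uniform on [a, b], assuming a <= c <= d <= b."""
    return (d - c) / (b - a)

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_sd(a, b):
    return sqrt((b - a) ** 2 / 12)

# Bus example from the text: waiting time uniform on [0, 20] minutes.
a, b = 0, 20
mean_wait = uniform_mean(a, b)
sd_wait = uniform_sd(a, b)

# Hypothetical question: chance of waiting 5 minutes or less.
p_short_wait = uniform_prob(0, 5, a, b)

print(f"mean = {mean_wait:.0f} min, SD = {sd_wait:.2f} min, P(wait <= 5) = {p_short_wait:.2f}")
```

Because the density is flat, the probability depends only on the length of the interval: any 5-minute window inside [0, 20] has the same probability.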
The Exponential Model
We saw in Chapter 6 that the Poisson distribution is a good model for the arrival, or occurrence, of events. We found, for example, the probability that x visits to our website will occur within the next minute. The exponential distribution with parameter $\lambda$ can be used to model the time between those events. Its density function has the form:
$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ and $\lambda > 0$
The use of the parameter $\lambda$ again is not coincidental. It highlights the relationship between the exponential and the Poisson.
Figure 7.13 The exponential density function with $\lambda = 1$. [Axes: f(x) (0.0 to 1.0) against x (0 to 5).]
If a discrete random variable can be modeled by a Poisson model with rate $\lambda$, then the times between those events can be modeled by an exponential model with the same parameter $\lambda$. The mean of the exponential is $1/\lambda$. The inverse relationship between the two means makes intuitive sense. If $\lambda$ increases and we expect more hits per minute, then the expected time between hits should go down. The standard deviation of an exponential random variable is $1/\lambda$. Like any continuous random variable, probabilities of an exponential random variable can be found only through the density function. Fortunately, the area under the exponential density between any two values, s and t ($s \le t$), has a particularly easy form:
$P(s \le X \le t) = e^{-\lambda s} - e^{-\lambda t}$.
In particular, by setting s to be 0, we can find the probability that the waiting time will be less than t from $P(X \le t) = P(0 \le X \le t) = e^{-\lambda \cdot 0} - e^{-\lambda t} = 1 - e^{-\lambda t}$. The function $P(X \le t) = F(t)$ is called the cumulative distribution function (cdf) of the random variable X. If arrivals of hits to our website can be well modeled by a Poisson with $\lambda = 4$/minute, then the probability that we’ll have to wait less than 20 seconds (1/3 of a minute) for the next hit is $F(1/3) = P(0 \le X \le 1/3) = 1 - e^{-4/3} = 0.736$. That seems about right. Arrivals are coming about every 15 seconds on average, so we shouldn’t be surprised that nearly 75% of the time we won’t have to wait more than 20 seconds for the next hit.
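The exponential probabilities above need nothing more than the exponential function. A minimal sketch, with function names of our own choosing:

```python
from math import exp

def exponential_interval(s, t, lam):
    """P(s <= X <= t) for an exponential model with rate lam, assuming 0 <= s <= t."""
    return exp(-lam * s) - exp(-lam * t)

def exponential_cdf(t, lam):
    """F(t) = P(X <= t) = 1 - e^(-lam * t)."""
    return 1 - exp(-lam * t)

lam = 4                      # hits per minute, from the website example
mean_wait = 1 / lam          # expected minutes between hits (15 seconds)
p_within_20s = exponential_cdf(1 / 3, lam)

print(f"mean wait = {mean_wait * 60:.0f} s, P(wait <= 20 s) = {p_within_20s:.3f}")
```

Keep the time units consistent with the rate: since `lam` is measured per minute, 20 seconds has to be entered as 1/3 of a minute.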
What Can Go Wrong?
• Probability models are still just models. Models can be useful, but they are not reality. Think about the assumptions behind your models. Question probabilities as you would data.
• Don’t assume everything’s Normal. Just because a random variable is continuous or you happen to know a mean and standard deviation doesn’t mean that a Normal model will be useful. You must think about whether the Normality Assumption is justified. Using a Normal model when it really does not apply will lead to wrong answers and misleading conclusions. A sample of CEOs has a mean total compensation of $10,307,311.87 with a standard deviation of $17,964,615.16. Using the Normal model rule, we should expect about 68% of the CEOs to have compensations between -$7,657,303.29 and $28,271,927.03. In fact, more than 90% of the CEOs have annual compensations in this range. What went wrong? The distribution is skewed, not symmetric. Using the 68–95–99.7 Rule for data like these will lead to silly results.
[Figure: histogram of Annual Compensation ($) for the CEOs, from 0 to 230,000,000, with # of CEOs (0 to 250) on the vertical axis; the distribution is strongly skewed to the high end.]
• Don’t use the Normal approximation with small n. To use a Normal approximation in place of a Binomial model, there must be at least 10 expected successes and 10 expected failures.
Ethics in Action
G
reen River Army Depot’s main business is the repair and refurbishment of electronics, mainly satellite and communication systems, in partnership with the Department of Defense (DOD). Recently, DOD has put a great deal of effort into continuous quality improvement, focusing on the length of time it takes to complete a project. Dave Smith, head of the Productivity and Quality Improvement (PQI) directorate is responsible for (among other things) facilitating lean improvement events throughout the Depot. These events bring together cross-functional teams with the goal of streamlining processes. PQI staff guide these teams in value stream mapping, identifying non–value added activities, and redesigning processes to eliminate waste. Dave is concerned that his group may not meet the new standards for quality which have specified that 97% of all projects must be completed within 60 days of their start dates. In preparation for a meeting with the Depot commander, Dave decides to review data on lean improvement events. He finds that of the past 137 projects, 30% went beyond 60 days. The commander suggests that Dave be “creative” in his presentation of the statistics to make it look like the group is actually in compliance with the DOD.
The completion times are very skewed to the high end, which doesn’t surprise Dave in the least. After all, no project can take less than 0 time, but a few always seem to go on for a much longer time than planned. He is pleased to find that the average completion time is only 40 days and that the standard deviation is 10 days. Dave knows that a Normal model is a poor representation of the completion times because of the skewness, but decides to use a N(40,10) model. With this model, only about 2.5% of the projects would be expected to take more than 60 days! He explains his model to his supervisor, who is pleased with the results. Even though they know the data don’t fit the model well, they also know that the DOD analysts are very familiar with Normal models and will be pleased to know that they are in compliance using it.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
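The gap between Dave's model and his data can be made concrete. The following is a minimal sketch in Python (an assumption; the case itself names no software), using only the N(40, 10) model and the 30% figure given in the scenario.

```python
from statistics import NormalDist

model = NormalDist(mu=40, sigma=10)     # Dave's N(40, 10) model, in days
p_over_60_model = 1 - model.cdf(60)     # tail area beyond 60 days (2 SDs above the mean)
p_over_60_observed = 0.30               # share of the past 137 projects, from the case

print(f"Model says:  {p_over_60_model:.3f}")
print(f"Data showed: {p_over_60_observed:.3f}")
```

The exact tail area is a little under the 2.5% that the 68–95–99.7 Rule suggests, and either figure is far from what the data show.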
What Have We Learned? Learning Objectives
Recognize Normally distributed data by making a histogram and checking whether it is unimodal, symmetric, and bell-shaped, or by making a Normal probability plot using technology and checking whether the plot is roughly a straight line.
• The Normal model is a distribution that will be important for much of the rest of this course.
• Before using a Normal model, we should check that our data are plausibly from a Normally distributed population.
• A Normal probability plot provides evidence that the data are Normally distributed if it is linear.
Understand how to use the Normal model to judge whether a value is extreme.
• Standardize values to make z-scores and obtain a standard scale. Then refer to a standard Normal distribution.
• Use the 68–95–99.7 Rule as a rule-of-thumb to judge whether a value is extreme.
Know how to refer to tables or technology to find the probability of a value randomly selected from a Normal model falling in any interval.
• Know how to perform calculations about Normally distributed values and probabilities.
Recognize when independent random Normal quantities are being added or subtracted.
• The sum or difference will also follow a Normal model.
• The variance of the sum or difference will be the sum of the individual variances.
• The mean of the sum or difference will be the sum or difference, respectively, of the means.
Recognize when other continuous probability distributions are appropriate models.
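The rules listed above for adding or subtracting independent Normal quantities can be sketched as follows. This is an illustration in Python, not a method from the text; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def combine_independent(mu_x, sd_x, mu_y, sd_y, difference=False):
    """Normal model for X + Y (or X - Y) when X and Y are independent Normals."""
    mean = mu_x - mu_y if difference else mu_x + mu_y
    sd = sqrt(sd_x**2 + sd_y**2)   # variances add for sums AND for differences
    return NormalDist(mean, sd)

# Hypothetical values: X ~ N(30, 3) and Y ~ N(20, 4)
total = combine_independent(30, 3, 20, 4)
gap = combine_independent(30, 3, 20, 4, difference=True)
print(total.mean, total.stdev)
print(gap.mean, gap.stdev)
```

Note that the standard deviations themselves are never added; only the variances are.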
Terms
68–95–99.7 Rule (or Empirical Rule)
In a Normal model, 68% of values fall within one standard deviation of the mean, 95% fall within two standard deviations of the mean, and 99.7% fall within three standard deviations of the mean. This is also approximately true for most unimodal, symmetric distributions.
Continuous random variable
A random variable that can take any numeric value within a range of values. The range may be infinite or bounded at either or both ends.
Cumulative distribution function (cdf)
A function for a continuous probability model that gives the probability of all values below a given value.
Exponential Distribution
A continuous distribution appropriate for modeling the times between events whose occurrences follow a Poisson model.
Normal Distribution
A unimodal, symmetric, “bell-shaped” distribution that appears throughout Statistics.
Normal percentile
The Normal percentile corresponding to a z-score gives the percentage of values in a standard Normal distribution found at that z-score or below.
Normal probability plot
A display to help assess whether a distribution of data is approximately Normal. If the plot is nearly straight, the data satisfy the Nearly Normal Condition.
Probability Density Function (pdf )
A function for any continuous probability model that gives the probability of a random value falling between any two values as the area under the pdf between those two values.
Standard Normal model or Standard Normal distribution
A Normal model, N(μ, σ), with mean μ = 0 and standard deviation σ = 1.
Uniform Distribution
A continuous distribution that assigns a probability to any range of values (between 0 and 1) proportional to the difference between the values.
Technology Help: Probability Calculations and Plots
The best way to tell whether your data can be modeled well by a Normal model is to make a picture or two. We've already talked about making histograms. Normal probability plots are almost never made by hand because the values of the Normal scores are tricky to find. But most statistics software can make Normal plots, though various packages call the same plot by different names and array the information differently.
Excel Excel offers a “Normal probability plot” as part of the Regression command in the Data Analysis extension, but (as of this writing) it is not a correct Normal probability plot and should not be used.
As discussed in Chapter 6, functions that calculate probabilities for continuous probability distributions will calculate either the pdf (“probability density function,” what we've been calling a probability model) or the cdf (“cumulative distribution function,” which accumulates probabilities over a range of values). These technical terms show up in many of the function names. Excel uses the “cumulative” argument of these functions to determine whether you want an accumulated probability as your result (cumulative = true; this is the cdf) or the height of the density curve at the given value (cumulative = false; this is the pdf).
To calculate Continuous Distribution Probabilities in Excel:
Note that the commands here are for Excel 2013. These functions are available in earlier versions of Excel with similar commands (e.g., the pre-2010 Excel command for NORM.DIST was NORMDIST). In general, the functions ending in DIST will calculate a probability given a value from the distribution and the INV functions will calculate a value given a probability. When using the function bar or typing into a cell, Excel will search the functions to find what matches the typed characters, and this can be used to find the proper function.
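For readers without Excel, the same three kinds of calculation (density, cumulative probability, and inverse) can be sketched with Python's standard library. This is an analogy offered as an assumption, not part of the Excel instructions; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()              # standard Normal model: mean 0, SD 1

height = z.pdf(1.0)           # height of the density curve at 1 (the pdf)
below = z.cdf(1.0)            # probability of a value at or below 1 (the cdf)
cutoff = z.inv_cdf(0.975)     # value with 97.5% of the model below it (an inverse)

print(f"pdf at 1:       {height:.4f}")
print(f"cdf at 1:       {below:.4f}")
print(f"97.5% cutoff:   {cutoff:.4f}")
```

The cdf and the inverse undo each other: feeding `below` back into `inv_cdf` returns 1.0, up to rounding.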
JMP
To make a “Normal Quantile Plot” in JMP,
• Make a histogram using Distributions from the Analyze menu.
• Click on the drop-down menu next to the variable name.
• Choose Normal Quantile Plot from the drop-down menu.
• JMP opens the plot next to the histogram.
Comments
JMP places the ordered data on the vertical axis and the Normal scores on the horizontal axis. The vertical axis aligns with the histogram's axis, a useful feature.
Minitab
To make a “Normal Probability Plot” in MINITAB,
• Choose Probability Plot from the Graph menu.
• Select “Single” for the type of plot. Click OK.
• Enter the name of the variable in the “Graph variables” box. Click OK.
Comments
MINITAB places the ordered data on the horizontal axis and the Normal scores on the vertical axis.
SPSS
To make a Normal “P-P plot” in SPSS,
• Choose Descriptives > P-P Plots from the Analyze menu.
• Select the variable to be displayed and add to “Variable”.
• Make sure that “Normal” is selected under “Test Distribution”. Leave all other defaults set.
Comments
SPSS places the ordered data on the horizontal axis and the Normal scores on the vertical axis.
XLStat
XLStat can make Normal probability plots (XLStat calls these Q-Q plots):
• Select Visualizing data, and then Univariate plots.
• On the General tab, click the Quantitative data box and then select the data on your worksheet.
• Click OK.
• If prompted, click Continue.
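To see what these packages compute behind the scenes, here is a rough sketch of how the pairs in a Normal probability plot might be built. It is written in Python with hypothetical data, and it assumes one common formula for the Normal scores (Blom's plotting positions); each package may use a slightly different formula.

```python
from statistics import NormalDist

def normal_scores(n):
    """Approximate Normal scores for n ordered values.

    Uses the plotting positions (i - 0.375) / (n + 0.25), which is an
    assumption here; software packages differ in the exact formula.
    """
    z = NormalDist()
    return [z.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

data = [12.1, 9.8, 10.4, 11.3, 10.9, 9.5, 10.0, 11.8]   # hypothetical values
ordered = sorted(data)
scores = normal_scores(len(ordered))

# Each (Normal score, data value) pair is one point on the plot
for score, value in zip(scores, ordered):
    print(f"{score:6.2f}  {value:5.1f}")
```

If the printed pairs fall close to a straight line when plotted, a Normal model is plausible for the data.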
Brief Case
Price/Earnings and Stock Value The CAPE10 index is based on the Price/Earnings (P/E) ratios of stocks. We can examine the P/E ratios without applying the smoothing techniques used to find the CAPE10. The file CAPE10 holds the data, giving dates, various economic variables, CAPE10 values, and P/E values. Examine the P/E values. Split the data into two periods: 1870–1989 and 1990 to the present. Would you judge that a Normal model would be appropriate for those values from the 1880s through the 1980s? Explain (and show the plots you made.) Now consider the more recent P/E values in this context. Do you think they have been extreme? What years, if any, appear to be particularly problematic? Explain.
Exercises Normal model calculations can be performed using a variety of technology or with the tables in Appendix B. Different methods may yield slightly different results.
Section 7.1
1. An incoming MBA student took placement exams in economics and mathematics. In economics, she scored 82 and in math 86. The overall results on the economics exam had a mean of 72 and a standard deviation of 8, while the mean math score was 68, with a standard deviation of 12. On which exam did she do better compared with the other students?
2. The first Statistics exam had a mean of 65 and a standard deviation of 10 points; the second had a mean of 80 and a standard deviation of 5 points. Derrick scored an 80 on both tests. Julie scored a 70 on the first test and a 90 on the second. They both totaled 160 points on the two exams, but Julie claims that her total is better. Explain.
3. Your company’s Human Resources department administers a test of “Executive Aptitude.” They report test grades as z-scores, and you got a score of 2.20. What does this mean?
4. After examining a child at his 2-year checkup, the boy’s pediatrician said that the z-score for his height relative to American 2-year-olds was -1.88. Write a sentence to explain to the parents what that means.
5. Your company will admit to the executive training program only people who score in the top 3% on the executive aptitude test discussed in Exercise 3.
a) With your z-score of 2.20, did you make the cut?
b) What do you need to assume about test scores to find your answer in part a?
6. The pediatrician in Exercise 4 explains to the parents that the most extreme 5% of cases often require special treatment or attention.
a) Does this child fall into that group?
b) What do you need to assume about the heights of 2-year-olds to find your answer to part a?
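Comparisons like the one in Exercise 1 come down to standardizing each value. A minimal sketch, assuming Python and a hypothetical helper name, using the scores, means, and standard deviations stated in Exercise 1:

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

econ_z = z_score(82, 72, 8)     # economics exam
math_z = z_score(86, 68, 12)    # mathematics exam

print(f"Economics z = {econ_z:.2f}")
print(f"Math z      = {math_z:.2f}")
```

Whichever z-score is larger marks the exam on which the student stood out more relative to the other test takers.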
Section 7.2
7. The Environmental Protection Agency (EPA) fuel economy estimates for automobiles suggest a mean of 24.8 mpg and a standard deviation of 6.2 mpg for highway driving. Assume that a Normal model can be applied.
a) Draw the model for auto fuel economy. Clearly label it, showing what the 68–95–99.7 Rule predicts about miles per gallon.
b) In what interval would you expect the central 68% of autos to be found?
c) About what percent of autos should get more than 31 mpg?
d) About what percent of cars should get between 31 and 37.2 mpg?
e) Describe the gas mileage of the worst 2.5% of all cars.
8. Some IQ tests are standardized to a Normal model with a mean of 100 and a standard deviation of 16.
a) Draw the model for these IQ scores. Clearly label it, showing what the 68–95–99.7 Rule predicts about the scores.
b) In what interval would you expect the central 95% of IQ scores to be found?
c) About what percent of people should have IQ scores above 116?
d) About what percent of people should have IQ scores between 68 and 84?
e) About what percent of people should have IQ scores above 132?
9. Assuming a standard Normal model, what is the probability for each of the following cases? Be sure to draw a picture first.
a) z > -1.5
b) z < 1.75
c) -2 < z < 1.35
d) z > 0.35
10. What percent of a standard Normal model is found in each region? Draw a picture first.
a) z > -2.05
b) z < -0.33
c) 1.2 < z < 1.8
d) z < 1.28
11. In a standard Normal model, what value(s) of z cut(s) off the region described? Don’t forget to draw a picture.
a) The highest 20%
b) The highest 75%
c) The lowest 5%
d) The middle 90%
12. In a standard Normal model, what value(s) of z cut(s) off the region described? Don’t forget to draw a picture.
a) The lowest 20%
b) The highest 15%
c) The highest 20%
d) The middle 50%
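Region and cutoff questions like those in Exercises 9 through 12 follow two patterns: subtract cumulative probabilities to get an area, or invert a cumulative probability to get a z-value. The sketch below, in Python with hypothetical function names, illustrates both patterns on regions chosen to match the 68–95–99.7 Rule rather than any exercise.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal model

def prob_between(low, high):
    """Area under the standard Normal curve between two z-values."""
    return Z.cdf(high) - Z.cdf(low)

def prob_above(z):
    """Area to the right of z."""
    return 1 - Z.cdf(z)

def middle_cutoffs(fraction):
    """z-values that bound the middle `fraction` of the model."""
    tail = (1 - fraction) / 2
    return Z.inv_cdf(tail), Z.inv_cdf(1 - tail)

print(f"P(-1 < z < 1) = {prob_between(-1, 1):.4f}")
print(f"P(z > 2)      = {prob_above(2):.4f}")
low, high = middle_cutoffs(0.95)
print(f"Middle 95% lies between z = {low:.2f} and z = {high:.2f}")
```

Drawing the picture first, as the exercises ask, is the easiest way to decide which of the three functions applies.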
Section 7.3
13. Speeds of cars were measured as they passed one point on a road to study whether traffic speed controls were needed. Here’s a histogram and normal probability plot of the measured speeds. Is a Normal model appropriate for these data? Explain.
[Figure: histogram of Speed (mph) against # of cars, and Normal probability plot of Speed (mph) against Nscores.]
14. Has the Consumer Price Index (CPI) fluctuated around its mean according to a Normal model? Here are some displays. Is a Normal model appropriate for these data? Explain.
[Figure: histogram of CPI, and Normal probability plot of CPI against Nscores.]
Section 7.4
15. For a new type of tire, a NASCAR team found the average distance a set of tires would run during a race is 168 miles, with a standard deviation of 14 miles. Assume that tire mileage is independent and follows a Normal model.
a) If the team plans to change tires twice during a 500-mile race, what is the expected value and standard deviation of miles remaining after two changes?
b) What is the probability they won’t have to change tires a third time (and use a fourth set of tires) before the end of a 500-mile race?
16. In the 4 × 100 medley relay event, four swimmers swim 100 yards, each using a different stroke. A college team preparing for the conference championship looks at the times their swimmers have posted and creates a model based on the following assumptions:
• The swimmers’ performances are independent.
• Each swimmer’s times follow a Normal model.
• The means and standard deviations of the times (in seconds) are as shown here.
Swimmer            Mean    SD
1 (backstroke)     50.72   0.24
2 (breaststroke)   55.51   0.22
3 (butterfly)      49.43   0.25
4 (freestyle)      44.91   0.21
a) What are the mean and standard deviation for the relay team’s total time in this event?
b) The team’s best time so far this season was 3:19.48. (That’s 199.48 seconds.) What is the probability that they will beat that time in the next event?
Section 7.5
17. Because many passengers who make reservations do not show up, airlines often overbook flights (sell more tickets than there are seats). A Boeing 767-400ER holds 245 passengers. If the airline believes the rate of passenger no-shows is 5% and sells 255 tickets, is it likely they will not have enough seats and someone will get bumped?
a) Use the Normal model to approximate the Binomial to determine the probability of at least 246 passengers showing up.
b) Should the airline change the number of tickets they sell for this flight? Explain.
18. Shortly after the introduction of the Belgian euro coin, newspapers around the world published articles claiming the coin is biased. The stories were based on reports that someone had spun the coin 250 times and gotten 140 heads. That’s 56% heads.
a) Use the Normal model to approximate the Binomial to determine the probability of spinning a fair coin 250 times and getting at least 140 heads.
b) Do you think this is evidence that spinning a Belgian euro is unfair? Would you be willing to use it at the beginning of a sports event? Explain.
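The Normal approximation used in Exercises 17 and 18 can be sketched in a few lines. This is a hedged illustration in Python: the function name is hypothetical, the example numbers are made up rather than taken from the exercises, and no continuity correction is applied, so results may differ slightly from methods that use one.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_at_least(k, n, p):
    """Approximate P(X >= k) for X ~ Binomial(n, p) using a Normal model.

    Mean is n*p and SD is sqrt(n*p*(1-p)). No continuity correction is used.
    """
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("Need at least 10 expected successes and 10 expected failures")
    model = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return 1 - model.cdf(k)

# Hypothetical example: at least 60 successes in 100 trials with p = 0.5
approx = normal_approx_at_least(60, 100, 0.5)
print(f"{approx:.4f}")
```

The guard at the top enforces the rule of thumb from the chapter; with too few expected successes or failures, the function refuses to approximate.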
Section 7.6
19. A cable provider wants to contact customers in a particular telephone exchange to see how satisfied they are with the new digital TV service the company has provided. All numbers are in the 452 exchange, so there are 10,000 possible numbers from 452-0000 to 452-9999. If they select the numbers with equal probability:
a) What distribution would they use to model the selection?
b) The new business “incubator” was assigned the 200 numbers between 452-2500 and 452-2699, but these businesses don’t subscribe to digital TV. What is the probability that the randomly selected number will be for an incubator business?
c) Numbers above 9000 were only released for domestic use last year, so they went to newly constructed residences. What is the probability that a randomly selected number will be one of these?
20. In an effort to check the quality of their cell phones, a manufacturing manager decides to take a random sample of 10 cell phones from yesterday’s production run, which produced cell phones with serial numbers ranging (according to when they were produced) from 43005000 to 43005999. If each of the 1000 phones is equally likely to be selected:
a) What distribution would they use to model the selection?
b) What is the probability that a randomly selected cell phone will be one of the last 100 to be produced?
c) What is the probability that the first cell phone selected is either from the last 200 to be produced or from the first 50 to be produced?
21. Lifetimes of electronic components can often be modeled by an exponential model. Suppose quality control engineers want to model the lifetime of a hard drive to have a mean lifetime of 3 years.
a) What value of λ should they use?
b) With this model, what would the probability be that a hard drive lasts 5 years or less?
22. Suppose occurrences of sales on a small company’s website are well modeled by a Poisson model with λ = 5/hour.
a) If a sale just occurred, what is the expected waiting time until the next sale?
b) What is the probability that the next sale will happen in the next 6 minutes?
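The uniform and exponential calculations in this section follow two short formulas: for a uniform model, probability is the length of the interval divided by the length of the whole range; for an exponential model with rate λ, the mean waiting time is 1/λ and P(T ≤ t) = 1 − e^(−λt). A minimal Python sketch with hypothetical function names and made-up numbers (not the exercise values):

```python
from math import exp

def uniform_prob(low, high, a, b):
    """P(low <= X <= high) for a continuous uniform model on [a, b]."""
    return (high - low) / (b - a)

def exponential_cdf(t, rate):
    """P(T <= t) for an exponential model with the given rate (lambda)."""
    return 1 - exp(-rate * t)

# Hypothetical examples, for illustration only
print(uniform_prob(2, 5, 0, 10))               # interval of length 3 out of 10
rate = 2.0                                     # 2 events per hour
print(f"mean wait = {1 / rate} hours")         # expected waiting time is 1 / lambda
print(f"{exponential_cdf(0.5, rate):.4f}")     # chance the wait is at most half an hour
```

Keep the time units of `t` and `rate` consistent (for example, hours with events per hour) before applying the exponential formula.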
Chapter Exercises
For Exercises 23–30, use the 68–95–99.7 Rule to approximate the probabilities rather than using technology to find the values more precisely. Answers given for probabilities or percentages from Exercise 31 on assume that a calculator or software has been used. Answers found from using Z-tables may vary slightly.
23. Mutual fund returns 2013. In the first quarter of 2013, a group of domestic equity mutual funds had a mean return of 6.2% with a standard deviation of 1.8%. If a Normal model can be used to model them, what percent of the funds would you expect to be in each region? Be sure to draw a picture first.
a) Returns of 8.0% or more
b) Returns of 6.2% or less
c) Returns between 2.6% and 9.8%
d) Returns of more than 11.6%
24. Human resource testing. Although controversial and the subject of some recent lawsuits (e.g., Satchell et al. vs. FedEx Express), some human resource departments administer standard IQ tests to all employees. The Stanford-Binet test scores are well modeled by a Normal model with mean 100 and standard deviation 16. If the applicant pool is well modeled by this distribution, a randomly selected applicant would have what probability of scoring in the following regions?
a) 100 or below
b) Above 148
c) Between 84 and 116
d) Above 132
25. Mutual funds, again. From the mutual funds in Exercise 23 with quarterly returns that are well modeled by a Normal model with a mean of 6.2% and a standard deviation of 1.8%, find the cutoff return value(s) that would separate the
a) highest 50%.
b) highest 16%.
c) lowest 2.5%.
d) middle 68%.
26. Human resource testing, again. For the IQ test administered by human resources and discussed in Exercise 24, what cutoff value would separate the
a) lowest 0.15% of all applicants?
b) lowest 16%?
c) middle 95%?
d) highest 2.5%?
27. Currency exchange rates. The daily exchange rates for the five-year period 2008 to 2013 between the euro (EUR) and the British pound (GBP) can be modeled by a Normal distribution with mean 1.19 euros (to pounds) and standard deviation 0.043 euros. Given this model, what is the probability that on a randomly selected day during this period, the pound was worth
a) less than 1.19 euros?
b) more than 1.233 euros?
c) less than 1.104 euros?
d) Which would be more unusual, a day on which the pound was worth less than 1.126 euros or more than 1.298 euros?
28. Stock prices. For the 300 trading days from January 11, 2012 to March 22, 2013, the daily closing price of IBM stock (in $) is well modeled by a Normal model with mean $197.92 and standard deviation $7.16. According to this model, what is the probability that on a randomly selected day in this period the stock price closed
a) above $205.08?
b) below $212.24?
c) between $183.60 and $205.08?
d) Which would be more unusual, a day on which the stock price closed above $206 or below $180?
29. Currency exchange rates, again. For the model of the EUR/GBP exchange rate discussed in Exercise 27, what would the cutoff rates be that would separate the
a) highest 16% of EUR/GBP rates?
b) lowest 50%?
c) middle 95%?
d) lowest 2.5%?
30. Stock prices, again. According to the model in Exercise 28, what cutoff value of price would separate the
a) lowest 16% of the days?
b) highest 0.15%?
c) middle 68%?
d) highest 50%?
31. Mutual fund probabilities. According to the Normal model N(0.062, 0.018) describing mutual fund returns in the 1st quarter of 2013 in Exercise 23, what percent of this group of funds would you expect to have a return
a) over 6.8%?
b) between 0% and 7.6%?
c) more than 1%?
d) less than 0%?
32. Normal IQs. Based on the Normal model N(100, 16) describing IQ scores from Exercise 24, what percent of applicants would you expect to have scores
a) over 80?
b) under 90?
c) between 112 and 132?
d) over 125?
33. Mutual funds, once more. Based on the model N(0.062, 0.018) for quarterly returns from Exercise 23, what are the cutoff values for the
a) highest 10% of these funds?
b) lowest 20%?
c) middle 40%?
d) highest 80%?
34. More IQs. In the Normal model N(100, 16) for IQ scores from Exercise 24, what cutoff value bounds the
a) highest 5% of all IQs?
b) lowest 30% of the IQs?
c) middle 80% of the IQs?
d) lowest 90% of all IQs?
35. Mutual funds, finis. Consider the Normal model N(0.062, 0.018) for returns of mutual funds in Exercise 23 one last time.
a) What value represents the 40th percentile of these returns?
b) What value represents the 99th percentile?
c) What’s the IQR of the quarterly returns for this group of funds?
36. IQs, finis. Consider the IQ model N(100, 16) one last time.
a) What IQ represents the 15th percentile?
b) What IQ represents the 98th percentile?
c) What’s the IQR of the IQs?
37. Parameters. Every Normal model is defined by its parameters, the mean and the standard deviation. For each model described here, find the missing parameter. As always, start by drawing a picture.
a) μ = 20, 45% above 30; σ = ?
b) μ = 88, 2% below 50; σ = ?
c) σ = 5, 80% below 100; μ = ?
d) σ = 15.6, 10% above 17.2; μ = ?
38. Parameters, again. Every Normal model is defined by its parameters, the mean and the standard deviation. For each model described here, find the missing parameter. Don’t forget to draw a picture.
a) μ = 1250, 35% below 1200; σ = ?
b) μ = 0.64, 12% above 0.70; σ = ?
c) σ = 0.5, 90% above 10.0; μ = ?
d) σ = 220, 3% below 202; μ = ?
39. SAT or ACT? Each year thousands of high school students take either the SAT or ACT, standardized tests used in the college admissions process. Combined SAT scores can go as high as 1600, while the maximum ACT composite score is 36. Since the two exams use very different scales, comparisons of performance are difficult. (A convenient rule of thumb is SAT = 40 × ACT + 150; that is, multiply an ACT score by 40 and add 150 points to estimate the equivalent SAT score.) Assume that one year the combined SAT can be modeled by N(1000, 200) and the ACT can be modeled by N(27, 3). If an applicant to a university has taken the SAT and scored 1260 and another student has taken the ACT and scored 33, compare these students’ scores using z-values. Which one has a higher relative score? Explain.
40. Economics. Anna, a business major, took final exams in both Microeconomics and Macroeconomics and scored 83 on both. Her roommate Megan, also taking both courses, scored 77 on the Micro exam and 95 on the Macro exam. Overall, student scores on the Micro exam had a mean of 81 and a standard deviation of 5, and the Macro scores had a mean of 74 and a standard deviation of 15. Which student’s overall performance was better? Explain.
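Missing-parameter problems such as Exercises 37 and 38 above rearrange the z-score formula z = (x − μ)/σ once the z-value for the stated percentage is known. The sketch below is one possible way to organize that in Python; the function names and the example numbers are hypothetical and are not taken from the exercises.

```python
from statistics import NormalDist

Z = NormalDist()   # standard Normal model

def missing_sd(mean, value, fraction_below):
    """Solve z = (value - mean) / sd for sd, where z has `fraction_below` of the model below it.

    Not usable when fraction_below is 0.5, because then z = 0.
    """
    z = Z.inv_cdf(fraction_below)
    return (value - mean) / z

def missing_mean(sd, value, fraction_below):
    """Solve z = (value - mean) / sd for the mean."""
    z = Z.inv_cdf(fraction_below)
    return value - z * sd

# Hypothetical example: mean 100 with 97.5% of values below 130
print(f"sd   = {missing_sd(100, 130, 0.975):.2f}")
# Hypothetical example: sd 5 with 16% of values below 45
print(f"mean = {missing_mean(5, 45, 0.16):.2f}")
```

When a problem states a percentage above a value, convert it to the fraction below (one minus the fraction above) before calling either function.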
41. Low job satisfaction. Suppose that job satisfaction scores can be modeled with N(100, 12). Human resource departments of corporations are generally concerned if the job satisfaction drops below a certain score. What score would you consider to be unusually low? Explain.
42. Low return. Exercise 23 proposes modeling quarterly returns of a group of mutual funds with N(0.062, 0.018). The manager of this group of funds would like to flag any fund whose return is unusually low for a quarter. What level of return would you consider to be unusually low? Explain.
43. Management survey. A survey of 200 middle managers showed a distribution of the number of hours of exercise they participated in per week with a mean of 3.66 hours and a standard deviation of 4.93 hours.
a) According to the Normal model, what percent of managers will exercise fewer than one standard deviation below the mean number of hours?
b) For these data, what does that mean? Explain.
c) Explain the problem in using the Normal model for these data.
44. Progress rate. Grade Point Average (GPA) and Progress Rate (PR) are two variables of crucial importance to students and schools. While each class’s GPA data is typically bell shaped, the PR data is quite different. Here is a histogram and summary statistics from a large class of business students in a European university, where PR is determined as the share of the 60 yearly European Credit Transfer and Accumulation System (ECTS) points achieved by any student.
[Figure: histogram of PR; horizontal axis running from -5 to 60.]

100.0%   maximum    60
99.5%               60
97.5%               60
90.0%               60
75.0%    quartile   60
50.0%    median     53.5
25.0%    quartile   17
10.0%               6.5
2.5%                0
0.5%                0
0.0%     minimum    0

Mean             40.596317
Std Dev          22.290204
Std Err Mean     0.6849613
Upper 95% Mean   41.940354
Lower 95% Mean   39.25228
N                1059
a) Which is a better summary of the typical PR, the mean or the median? Explain.
b) Which is a better summary of the spread, the IQR or the standard deviation? Explain.
c) From a Normal model, about what percentage of students are within one standard deviation of the mean PR?
d) What percentage of students actually are within one standard deviation of the mean?
e) Explain the problem in using the Normal model for these data.
45. Drug company. Manufacturing and selling drugs that claim to reduce an individual’s cholesterol level is big business. A company would like to market their drug to women if their cholesterol is in the top 15%. Assume the cholesterol levels of adult American women can be described by a Normal model with a mean of 188 mg/dL and a standard deviation of 24.
a) Draw and label the Normal model.
b) What percent of adult women do you expect to have cholesterol levels over 200 mg/dL?
c) What percent of adult women do you expect to have cholesterol levels between 150 and 170 mg/dL?
d) Estimate the interquartile range of the cholesterol levels.
e) Above what value are the highest 15% of women’s cholesterol levels?
46. Tire company. A tire manufacturer believes that the tread life of its snow tires can be described by a Normal model with a mean of 32,000 miles and a standard deviation of 2500 miles.
a) If you buy a set of these tires, would it be reasonable for you to hope that they’ll last 40,000 miles? Explain.
b) Approximately what fraction of these tires can be expected to last less than 30,000 miles?
c) Approximately what fraction of these tires can be expected to last between 30,000 and 35,000 miles?
d) Estimate the IQR for these data.
e) In planning a marketing strategy, a local tire dealer wants to offer a refund to any customer whose tires fail to last a certain number of miles. However, the dealer does not want to take too big a risk. If the dealer is willing to give refunds to no more than 1 of every 25 customers, for what mileage can he guarantee these tires to last?
47. Claims. Two companies make batteries for cell phone manufacturers. One company claims a mean life span of 2 years, while the other company claims a mean life span of 2.5 years (assuming average use of minutes/month for the cell phone).
a) Explain why you would also like to know the standard deviations of the battery life spans before deciding which brand to buy.
b) Suppose those standard deviations are 1.5 months for the first company and 9 months for the second company. Does this change your opinion of the batteries? Explain.
48. Car speeds. The police department of a major city needs to update its budget. For this purpose, they need to understand the variation in their fines collected from motorists for speeding. As a sample, they recorded the speeds of cars driving past a location with a 20 mph speed limit, a place that in the past has been known for producing fines. The mean of 100 readings was 23.84 mph, with a standard deviation of 3.56 mph. (The police actually recorded every car for a two-month period. These are 100 representative readings.)
a) How many standard deviations from the mean would a car going the speed limit be?
b) Which would be more unusual, a car traveling 34 mph or one going 10 mph?
49. CEOs. A business publication recently released a study on the total number of years of experience in industry among CEOs. The mean is provided in the article, but not the standard deviation. Is the standard deviation most likely to be 6 months, 6 years, or 16 years? Explain which standard deviation is correct and why.
50. Stocks. A newsletter for investors recently reported that the average stock price for a blue chip stock over the past 12 months was $72. No standard deviation was given. Is the standard deviation more likely to be $6, $16, or $60? Explain.
51. Cereal. The amount of cereal that can be poured into a small bowl varies with a mean of 1.5 ounces and a standard deviation of 0.3 ounces. A large bowl holds a mean of 2.5 ounces with a standard deviation of 0.4 ounces. You open a new box of cereal and pour one large and one small bowl.
a) How much more cereal do you expect to be in the large bowl?
b) What’s the standard deviation of this difference?
c) If the difference follows a Normal model, what’s the probability the small bowl contains more cereal than the large one?
d) What are the mean and standard deviation of the total amount of cereal in the two bowls?
e) If the total follows a Normal model, what’s the probability you poured out more than 4.5 ounces of cereal in the two bowls together?
f) The amount of cereal the manufacturer puts in the boxes is a random variable with a mean of 16.3 ounces and a standard deviation of 0.2 ounces. Find the expected amount of cereal left in the box and the standard deviation.
52. Pets. The American Veterinary Association claims that the annual cost of medical care for dogs averages $100, with a standard deviation of $30, and for cats averages $120, with a standard deviation of $35.
a) What’s the expected difference in the cost of medical care for dogs and cats?
b) What’s the standard deviation of that difference?
c) If the costs can be described by Normal models, what’s the probability that medical expenses are higher for someone’s dog than for her cat?
d) What concerns do you have?
53. More cereal. In Exercise 51 we poured a large and a small bowl of cereal from a box. Suppose the amount of cereal that the manufacturer puts in the boxes is a random variable with mean 16.2 ounces and standard deviation 0.1 ounces.
a) Find the expected amount of cereal left in the box.
b) What’s the standard deviation?
c) If the weight of the remaining cereal can be described by a Normal model, what’s the probability that the box still contains more than 13 ounces?
54. More pets. You’re thinking about getting two dogs and a cat. Assume that annual veterinary expenses are independent and have a Normal model with the means and standard deviations described in Exercise 52.
a) Define appropriate variables and express the total annual veterinary costs you may have.
b) Describe the model for this total cost. Be sure to specify its name, expected value, and standard deviation.
c) What’s the probability that your total expenses will exceed $400?
55. Bikes. Bicycles arrive at a bike shop in boxes. Before they can be sold, they must be unpacked, assembled, and tuned (lubricated, adjusted, etc.). Based on past experience, the shop manager makes the following assumptions about how long this may take:
• The times for each setup phase are independent.
• The times for each phase follow a Normal model.
• The means and standard deviations of the times (in minutes) are as shown:
Phase       Mean   SD
Unpacking   3.5    0.7
Assembly    21.8   2.4
Tuning      12.3   2.7
a) What are the mean and standard deviation for the total bicycle setup time?
b) A customer decides to buy a bike like one of the display models but wants a different color. The shop has one, still in the box. The manager says they can have it ready in half an hour. Do you think the bike will be set up and ready to go as promised? Explain.
56. Bike sale. The bicycle shop in Exercise 55 estimates using current labor costs that unpacking a bike costs $0.82 on average with a standard deviation of $0.16. Assembly costs $8.00 on average with a standard deviation of $0.88 and tuning costs $4.10 with a standard deviation of $0.90. Because the costs are directly related to the times, you can use the same assumptions as in Exercise 55.
CHAPTER 7 The Normal and Other Continuous Distributions
a) Define your random variables, and use them to express the total cost of the bike setup. b) Find the mean setup cost. c) Find the standard deviation of the setup cost. d) If the next shipment is 40 bikes, what is the probability that the total setup cost will be less than $500?
57. Coffee and doughnuts. At a certain coffee shop, all the customers buy a cup of coffee; some also buy a doughnut. The shop owner believes that the number of cups he sells each day is normally distributed with a mean of 320 cups and a standard deviation of 20 cups. He also believes that the number of doughnuts he sells each day is independent of the coffee sales and is normally distributed with a mean of 150 doughnuts and a standard deviation of 12. a) The shop is open every day but Sunday. Assuming day-to-day sales are independent, what’s the probability he’ll sell more than 2000 cups of coffee in a week? b) If he makes a profit of 50 cents on each cup of coffee and 40 cents on each doughnut, can he reasonably expect to have a day’s profit of over $300? Explain. c) What’s the probability that on any given day he’ll sell a doughnut to more than half of his coffee customers?
58. Weightlifting. The Atlas BodyBuilding Company (ABC) sells “starter sets” of barbells that consist of one bar, two 20-pound weights, and four 5-pound weights. The bars weigh an average of 10 pounds with a standard deviation of 0.25 pounds. The weights average the specified amounts, but the standard deviations are 0.2 pounds for the 20-pounders and 0.1 pounds for the 5-pounders. We can assume that all the weights are normally distributed. a) ABC ships these starter sets to customers in two boxes: The bar goes in one box and the six weights go in another. What’s the probability that the total weight in that second box exceeds 60.5 pounds? Define your variables clearly and state any assumptions you make. b) It costs ABC $0.40 per pound to ship the box containing the weights. Because it’s an odd-shaped package, though, shipping the bar costs $0.50 a pound plus a $6.00 surcharge. Find the mean and standard deviation of the company’s total cost for shipping a starter set. c) Suppose a customer puts a 20-pound weight at one end of the bar and the four 5-pound weights at the other end. Although he expects the two ends to weigh the same, they might differ slightly. What’s the probability the difference is more than a quarter of a pound?
59. Lefties. A lecture hall has 200 seats with folding arm tablets, 30 of which are designed for left-handers. The typical size of classes that meet there is 188, and we can assume that about 13% of students are left-handed. Use a Normal approximation to find the probability that a right-handed student in one of these classes is forced to use a lefty arm tablet.
60. Seatbelts. Police estimate that 80% of drivers wear their seatbelts. They set up a safety roadblock, stopping cars to check for seatbelt use. If they stop 120 cars, what’s the probability they find at least 20 drivers not wearing their seatbelt? Use a Normal approximation.
61. Rickets. Vitamin D is essential for strong, healthy bones. Although the bone disease rickets was largely eliminated in England during the 1950s, some people there are concerned that this generation of children is at increased risk because they are more likely to watch TV or play computer games than spend time outdoors. Recent research indicated that about 20% of British children are deficient in vitamin D. A company that sells vitamin D supplements tests 320 elementary school children in one area of the country. Use a Normal approximation to find the probability that no more than 50 of them have vitamin D deficiency.
62. Tennis. A tennis player has taken a special course to improve her serving. She thinks that individual serves are independent of each other. She has been able to make a successful first serve 70% of the time. Use a Normal approximation to find the probability she’ll make at least 65 of her first serves out of the 80 she serves in her next match if her success percentage has not changed.
63. Pipeline defects. Maintenance engineers are responsible for the proper functioning of a 20 km long gas pipeline. If there is a pipeline failure, they need to physically inspect the whole stretch to discover the fracture. If X is how far the fracture is from them, X can be modeled as a uniform random variable on the interval from 0 to 20 km. a) What is the probability that the fracture is found within the first stretch of 5 km? b) What is the probability that the fracture is found only after inspecting 16 km of the pipeline?
64. Quitting time. My employee seems to leave work anytime between 5 PM and 6 PM, uniformly. a) What is the probability he will still be at work at 5:45 PM? b) What is the probability he will still be at work at 5:45 PM every day this week (M-F)? c) What did you assume to calculate b?
65. Web visitors. A website manager has noticed that during the evening hours, about 3 people per minute check out from their shopping cart and make an online purchase. She believes that each purchase is independent of the others. a) What model might you suggest to model the number of purchases per minute? b) What model would you use to model the time between events? c) What is the mean time between purchases? d) What is the probability that the time to the next purchase will be between 1 and 2 minutes?
66. Information desk. The arrival rate at a university library information desk is about 5 per hour, with an apparent lack of any relationship between arrivals in consecutive hours. a) What model might you use to model the number of arrivals at this desk per hour? b) What model would you use to model the time between arrivals at the information desk? c) What would the probability be that the time to the next arrival at the desk is 15 minutes or less? d) What is the mean time between arrivals?
Just Checking Answers
1 a) On the first test, the mean is 88 and the SD is 4, so $z = (90 - 88)/4 = 0.5$. On the second test, the mean is 75 and the SD is 5, so $z = (80 - 75)/5 = 1.0$. The first test has the lower z-score, so it is the one that will be dropped.
b) The second test is 1 standard deviation above the mean, farther away than the first test, so it’s the better score relative to the class.
2 The mean is 184 centimeters, with a standard deviation of 8 centimeters. 2 meters is 200 centimeters, which is 2 standard deviations above the mean. We expect 2.28% of the men to be above 2 meters.
3 a) We know that 68% of the time we’ll be within 1 standard deviation (2 min) of 20. So 32% of the time we’ll arrive in less than 18 or more than 22 minutes. Half of those times (16%) will be greater than 22 minutes, so 84% will be less than 22 minutes.
b) 24 minutes is 2 standard deviations above the mean. From Table Z we find that 2.28% of the times will be more than 24 minutes.
c) Traffic incidents may occasionally increase the time it takes to get to school, so the driving times may be skewed to the right, and there may be outliers.
d) If so, the Normal model would not be appropriate and the percentages we predict would not be accurate.
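The z-scores and tail percentages in these answers can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist` in place of Table Z. The helper name `z_score` and the variable names are ours, and the calculation assumes the Normal model holds exactly.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd

# Answer 1: each test score compared with its own class distribution
z_first = z_score(90, 88, 4)
z_second = z_score(80, 75, 5)

# Answers 2 and 3b: share of a Normal model more than 2 SDs above the mean
upper_tail_2sd = 1 - NormalDist().cdf(2)

# Answer 3a: share of trips shorter than 22 minutes (mean 20, SD 2)
below_22 = NormalDist(mu=20, sigma=2).cdf(22)

print(z_first, z_second)
print(round(100 * upper_tail_2sd, 2), "percent above 2 SDs")
print(round(100 * below_22, 1), "percent below 22 minutes")
```

The exact Normal calculation for answer 3a gives a value slightly above the 84% obtained from the 68% rule of thumb, because that rule is itself a rounded figure.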
8
Surveys and Sampling
Roper Polls
Public opinion polls are a relatively new phenomenon. In 1948, as a result of telephone surveys of likely voters, all of the major organizations (Gallup, Roper, and Crossley) consistently predicted, throughout the summer and into the fall, that Thomas Dewey would defeat Harry Truman in the November presidential election. By October the results seemed so clear that Fortune magazine declared, “Due to the overwhelming evidence, Fortune and Mr. Roper plan no further detailed reports on change of opinion in the forthcoming presidential campaign . . . .” Of course, Harry Truman went on to win the 1948 election, and the picture of Truman in the early morning after the election holding up the Chicago Tribune (printed the night before), with its headline declaring Dewey the winner, has become legend.
The public’s faith in opinion polls plummeted after the election, but Elmo Roper vigorously defended the pollsters. Roper was a principal and founder of one of the first market research firms, Cherington, Wood, and Roper, and director of the Fortune Survey, which was the first national poll to use scientific sampling techniques. He argued that rather than abandoning polling, business leaders should learn what had gone wrong in the 1948 polls so that market research could be improved. His frank admission of the mistakes made in those polls helped to restore confidence in polling as a business tool.
For the rest of his career, Roper split his efforts between two projects, commercial polling and public opinion. He established the Roper Center for Public Opinion Research at Williams College as a place to house public opinion archives, convincing fellow polling leaders Gallup and Crossley to participate as well. Now located at the University of Connecticut, the Roper Center is one of the world’s leading archives of social science data. Roper’s market research efforts started as Roper Research Associates and later became the Roper Organization, which was acquired in 2005 by GfK. Founded in Germany in 1934 as the Gesellschaft für Konsumforschung (literally, “Society for Consumption Research”), GfK now stands for “growth from knowledge.” It is the fourth largest international market research organization, with over 130 companies in 70 countries and more than 7700 employees worldwide.
GfK Roper Consulting conducts a yearly, global study to examine cultural, economic, and social information that may be crucial to companies doing business worldwide. These companies use the information provided by GfK Roper to help make marketing and advertising decisions in different markets around the world. How do the researchers at GfK Roper know that the responses they get reflect the real attitudes of consumers? After all, they don’t ask everyone, but they don’t want to limit their conclusions to just the people they surveyed. Generalizing from the data at hand to the world at large is something that market researchers, investors, and pollsters do every day. To do it wisely, they need three fundamental ideas.
8.1
Three Ideas of Sampling Idea 1: Sample—Examine a Part of the Whole
The W’s and Sampling The population we are interested in is usually determined by the why of our study. The participants or cases in the sample we draw will be the who. When and how we draw the sample may depend on what is practical.
We’d like to know about an entire collection of individuals, called a population, but examining all of them is usually impractical, if not impossible. So we settle for examining a smaller group of individuals—a sample—selected from the population. For the Roper researchers the population of interest is the entire world, but it’s not practical, cost-effective, or feasible to survey everyone. So they examine a sample selected from the population. We take samples all the time. For example, if a restaurant chef wants to be sure that the vegetable soup she’s cooking is up to her standards, she’ll taste a spoonful. She doesn’t need to consume the whole pot. She can trust that the taste will represent the flavor of the population—the entire pot. The idea of tasting is that a small sample, if selected properly, can represent the larger population. Sampling is common in many aspects of business practice. For example, auditors may sample some records rather than reading through all of them. Manufacturers monitor quality by testing a small sample off the line. The GfK Roper Reports® Worldwide poll is an example of a sample survey, designed to ask questions of a small group of people in the hope of learning something about the entire population. Most likely, you’ve never been selected to be part of a national opinion poll. That’s true of most people. So how can the pollsters claim that a sample represents the entire population? As we’ll see, a representative sample can often provide a good idea of what the entire population is like. But the sample must be selected with care. Selecting a sample to represent the population fairly is easy in theory, but in practice, it’s more difficult than it sounds. For example, a sample may fail to represent part of the population. If a retail business samples customers as they come in the door, they may be missing an important part of their potential customer population—those who choose to shop elsewhere. Samples that over- or underemphasize
some characteristics of the population are said to be biased. When a sample is biased, the summary characteristics of a sample differ from the corresponding characteristics of the population it is trying to represent, so they can produce misleading information. Conclusions based on biased samples are inherently flawed. There is usually no way to fix bias after the sample is drawn and no way to salvage useful information from it. To make the sample as representative as possible, the best strategy is to select individuals for the sample at random. This may seem almost careless at first, but, as we will see, it is essential.
Idea 2: Randomize Think back to our soup example. Suppose the chef adds some salt to the pot (the population). If she samples from the top before stirring, she’ll get the misleading idea that the soup is salty. If she samples from the bottom, she’ll get the equally misleading idea that it’s bland. But by stirring the soup, she’ll make each spoonful a more random sample of the soup, distributing the salt throughout the pot, and making each taste more typical of the saltiness of the whole pot. Deliberate randomization is one of the great tools of Statistics. Randomization can also protect against factors that you aren’t aware of. Suppose, while the chef isn’t looking, an assistant adds a handful of peas to the soup. The peas sink to the bottom of the pot, mixing with the other vegetables. Stirring in the salt also randomizes the peas throughout the pot, making the sample taste more typical of the overall pot even though the chef didn’t know the peas were there. So randomizing protects us by giving us a representative sample even for effects we were unaware of. For a survey, we select participants at random, and this helps us represent all the features of our population, making sure that on average the sample looks like the rest of the population. The essential feature of randomness is that the selection is “fair.” We have discussed many facets of randomness in Chapter 5, and we can use some of those concepts here. What makes the sample fair is that each participant has an equal chance to be selected. • Why not match the sample to the population? Rather than randomizing, we could try to design a sample to include every possible, relevant characteristic: income level, age, political affiliation, marital status, number of children, place of residence, etc. But we can’t possibly think of all the things that might be important. Even if we could, we wouldn’t be able to match our sample to the population for all these characteristics. 
How well can a sample represent the population from which it was selected? Here’s an example using the database of the Paralyzed Veterans of America, a philanthropic organization with a donor list of about 3.5 million people. We’ve taken two samples, each of 8000 individuals at random from the population. Table 8.1 shows how the means and proportions match up on seven variables.
Table 8.1 Means and proportions for seven variables from two samples of size 8000 from the Paralyzed Veterans of America data. We drew these samples using Microsoft Excel’s RAND function (Excel 2013), but you can use almost any statistics software to draw similar random samples. The fact that the summaries of the variables from these two samples are so similar gives us confidence that either one would be representative of the entire population.
The two samples match closely in every category. You can see how well randomizing has stirred the population. We didn’t preselect the samples for these variables, but randomizing has matched the results closely. The two samples don’t vary much from each other, so we can assume that they don’t differ much from the rest of the population either.
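The way randomizing "stirs" a population can be imitated with a small simulation. The sketch below does not use the Paralyzed Veterans of America data. It builds a hypothetical population of 350,000 ages (an invented stand-in, smaller than the real donor list so that it runs quickly), draws two random samples of 8000 without replacement, and compares their means with the population mean.

```python
import random
from statistics import fmean

random.seed(8)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in population of ages (not the PVA donor data)
population = [random.gauss(60, 15) for _ in range(350_000)]

# random.sample draws without replacement, giving every individual
# the same chance of being selected
sample_a = random.sample(population, 8000)
sample_b = random.sample(population, 8000)

print("population mean:", round(fmean(population), 2))
print("sample A mean:  ", round(fmean(sample_a), 2))
print("sample B mean:  ", round(fmean(sample_b), 2))
```

Under these assumptions the two sample means should land within a fraction of a year of each other and of the population mean, which is the same pattern the two PVA samples show.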
Idea 3: The Sample Size Is What Matters You probably weren’t surprised by the idea that a sample can represent the whole. And the idea of sampling randomly to make the sample fair makes sense too. But the third important idea of sampling often surprises people. The third idea is that the size of the sample determines what we can conclude from the data regardless of the size of the population. Many people think that to provide a good representation of the population, the sample must be a large percentage, or fraction, of the population, but in fact all that matters is the size of the sample. The size of the population doesn’t matter at all.1 A random sample of 100 students in a college represents the student body just about as well as a random sample of 100 voters represents the entire electorate of the United States. This is perhaps the most surprising idea in designing surveys. Think about the pot of soup again. The chef is probably making a large pot of soup. But she doesn’t need a really big spoon to decide how the soup tastes. She’ll get the same information from an ordinary spoonful no matter how large the pot—as long as the pot is sufficiently stirred. That’s what randomness does for us. What fraction of the population you sample doesn’t matter. It’s the sample size itself that’s important. This idea is of key importance to the design of any sample survey, because it determines the balance between how well the survey can measure the population and how much the survey costs. How big a sample do you need? That depends on what you’re estimating, but too small a sample won’t be representative of the population. To get an idea of what’s really in the soup, you need a large enough taste to be a representative sample from the pot, including, say, a selection of the vegetables. For a survey that tries to find the proportion of the population falling into a category, you’ll usually need at least several hundred respondents.2 • What do the professionals do? 
How do professional polling and market research companies do their work? The most common polling method today is to contact respondents by telephone. Computers generate random telephone numbers for telephone exchanges known to include residential customers, so pollsters can contact people with unlisted phone numbers. The person who answers the phone will be invited to respond to the survey, if that person qualifies. (For example, only adults are usually surveyed, and the respondent usually must live at the residence phoned.) If the person answering doesn’t qualify, the caller will ask for an appropriate alternative. When they conduct the interview, the pollsters often list possible responses (such as product names) in randomized orders to avoid biases that might favor the first name on the list. Do these methods work? The Pew Research Center for the People and the Press reports on survey completion rates about every three years. Pew reports that by 2012 a telephone survey could contact about 62% of households whose phone numbers had been randomly generated. However, only 14% of those contacts yielded an interview, amounting to only 9% of the households originally sampled. Nevertheless, Pew concludes that “telephone surveys that include landlines and cell phones and are weighted to match the demographic composition of the population continue to provide accurate data on most political, social and economic measures.” (www.people-press.org/2012/05/15/assessing-the-representativeness-of-public-opinion-surveys/)
1 Well, that’s not exactly true. If the population is smaller than about 10 times the size of the sample, it can matter. It doesn’t matter whenever, as usual, our sample is a very small fraction of the population.
2 Chapter 9 gives the details behind this statement and shows how to decide on a sample size for a survey.
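The Pew percentages quoted above are linked by simple multiplication: the share of sampled households that end up interviewed is the contact rate times the share of contacts that yield an interview. A minimal check, with variable names of our own choosing:

```python
contact_rate = 0.62      # sampled households that were reached
cooperation_rate = 0.14  # contacted households that gave an interview

# Share of all sampled households that completed an interview
response_rate = contact_rate * cooperation_rate
print(f"{response_rate:.1%}")
```

The product rounds to the 9% figure that Pew reports.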
A Census—Does It Make Sense? Why bother determining the right sample size? If you plan to open a store in a new community, why draw a sample of residents to understand their interests and needs? Wouldn’t it be better to just include everyone and make the “sample” be the entire population? Such a special sample is called a census. Although a census would appear to provide the best possible information about the population, there are a number of reasons why it might not. First, it can be difficult to complete a census. There always seem to be some individuals who are hard to locate or hard to measure. Do you really need to contact the folks away on vacation when you collect your data? How about those with no telephone or mailing address? The cost of locating the last few cases may far exceed the budget. It can also be just plain impractical to take a census. The quality control manager for Hostess® Twinkies® doesn’t want to taste all the Twinkies on the production line to determine their quality. Aside from the fact that nobody could eat that many Twinkies, it would defeat their purpose: there would be none left to sell. Second, the population you’re studying may change. For example, in any human population, babies are born, people travel, and folks die during the time it takes to complete the census. News events and advertising campaigns can cause sudden shifts in opinions and preferences. A sample, surveyed in a shorter time frame, may actually generate more accurate information. Finally, taking a census can be cumbersome. A census usually requires a team of pollsters and the cooperation of the population. Even with both, it’s almost impossible to avoid errors. Because it tries to count everyone, the U.S. Census records too many college students. Many are included both by their families and in a report filed by their schools. Errors of this sort, of both under- and overcounting can be found throughout the U.S. Census.
For Example
Identifying sampling terms
A nonprofit organization has taken over the historic State Theater and hopes to preserve it with a combination of attractive shows and fundraising. The organization has asked a team of students to help them design a survey to better understand the customer base likely to purchase tickets. Fortunately, the theater’s computerized ticket system records contact and some demographic information for ticket purchasers, and that database of 7345 customers is available.
Questions What is the population of interest? What would a census be in this case? Would it be practical?
Answers The population is all potential ticket purchasers. A census would have to reach all potential purchasers. We don’t know who they are or have any way to contact them, so a census would not be practical.
8.2
Statistic Any quantity that we calculate from data could be called a “statistic.” But in practice, we usually obtain a statistic from a sample and use it to estimate a population parameter.
Parameter Population model parameters are not just unknown—usually they are unknowable. We take a sample and use the sample statistics to estimate them.
Populations and Parameters GfK Roper Reports Worldwide reports that 60.5% of people over 50 worry about food safety, but only 43.7% of teens do. What does this claim mean? We can be sure the Roper researchers didn’t take a census. So they can’t possibly know exactly what percentage of teenagers worry about food safety. So what does “43.7%” mean? To generalize from a sample to the world at large, we need a model of reality. Such a model doesn’t need to be complete or perfect. Just as a model of an airplane in a wind tunnel can tell engineers what they need to know about aerodynamics even though it doesn’t include every rivet of the actual plane, models of data can give us summaries that we can learn from and use even though they don’t fit each data value exactly. It’s important to remember that they’re only models of reality and not reality itself. But without models, what we can learn about the world at large is limited to only what we can say about the data we have at hand. Models use mathematics to represent reality. We call the key numbers in those models parameters. Sometimes a parameter used in a model for a population is called (redundantly) a population parameter. But let’s not forget about the data. We use the data to try to estimate values for the population parameters. Any summary found from the data is a statistic. Those statistics that estimate population parameters are particularly interesting. Sometimes—and especially when we match statistics with the parameters they estimate—we use the term sample statistic. We draw samples because we can’t work with the entire population. We hope that the statistics we compute from the sample will estimate the corresponding parameters accurately. A sample that does this is said to be representative.
Just Checking
1 Various claims are often made for surveys. Why is each of the following claims not correct?
a) It is always better to take a census than to draw a sample.
b) Stopping customers as they are leaving a restaurant is a good way to sample opinions about the quality of the food.
c) We drew a sample of 100 from the 3000 students in a school. To get the same level of precision for a town of 30,000 residents, we’ll need a sample of 1000.
d) A poll taken at a popular website (www.statsisfun.org) garnered 12,357 responses. The majority of respondents said they enjoy doing Statistics. With a sample size that large, we can be sure that most Americans feel this way.
e) The true percentage of all Americans who enjoy Statistics is called a “population statistic.”
8.3
Common Sampling Designs We’ve said that every individual in the population should have an equal chance of being selected in a sample. That makes the sample fair, but it’s not quite enough to ensure that the sample is representative. Consider, for example, a market analyst who samples customers by drawing at random from product registration forms, half of which arrived by mail and half by online registration. She flips a coin. If it comes up heads, she’ll draw 100 mail returns; tails, she’ll draw 100 electronic returns. Each customer has an equal chance of being selected, but if tech-savvy customers are different, then the samples are hardly representative.
Simple Random Sample (SRS) To make the sample representative, we must ensure that our sampling method gives each combination of individuals an equal chance as well. A sample drawn in this way is called a simple random sample, usually abbreviated SRS. An SRS is the sampling method on which the theory of working with sampled data is based and thus the standard against which we measure other sampling methods.
Sampling Errors vs. Bias Referring to sample-to-sample variability as sampling error makes it sound like it’s some kind of mistake. It’s not. We understand that samples will vary, so “sampling errors” are to be expected. It’s bias we must strive to avoid. Bias means our sampling method distorts our view of the population. Of course, bias leads to mistakes. Even more insidious, bias introduces errors that we cannot correct with subsequent analysis.
We’d like to select from the population, but often we don’t have a list of all the individuals in the population. The list we actually draw from is called a sampling frame. A store may want to survey all its regular customers. But it can’t draw a sample from the population of all regular customers, because it doesn’t have such a list. The store may have a list of customers who have registered as “frequent shoppers.” That list can be the sampling frame from which the store can draw its sample. Of course, whenever the sampling frame and the population differ (as they almost always will), we must deal with the differences. Are the opinions of those who registered as frequent shoppers different from the rest of the regular shoppers? What about customers who used to be regulars but haven’t shopped there recently? The answers to questions like these about the sampling frame may depend on the purpose of the survey and may impact the conclusions that one can draw. Once we have a sampling frame, we need to randomize it so we can choose an SRS. Fortunately, random numbers are readily available these days in spreadsheets, statistics programs, and even on the Internet. Before this technology existed, people used to literally draw numbers out of a hat to randomize. But now, the easiest way to randomize your sampling frame is to match it with a parallel list of random numbers and then sort the random numbers, carrying along the cases so that they get “shuffled” into random order. Then you can just pick cases off the top of the randomized list until you have enough for your sample. Samples drawn at random generally differ one from another. If we were to repeat the sampling process, a new draw of random numbers would select different people for our sample. These differences would lead to different values for the variables we measure. We call these sample-to-sample differences sampling variability. Sometimes they are called sampling error even though no error has taken place. 
Surprisingly, sampling variability isn’t a problem; it’s an opportunity. If different samples from a population vary little from each other, then most likely the underlying population harbors little variation. If the samples show much sampling variability, the underlying population probably varies a lot. In the coming chapters, we’ll spend much time and attention working with sampling variability to better understand what we are trying to measure.
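The shuffling procedure described above (pair each case with a random number, sort on the random numbers, and take cases off the top) can be sketched in a few lines of Python. The function name `simple_random_sample` and the frame of numbered customer records are illustrative assumptions, not part of the text.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw an SRS of size n by sorting the frame on random keys."""
    rng = random.Random(seed)
    keyed = [(rng.random(), case) for case in frame]  # parallel list of random numbers
    keyed.sort(key=lambda pair: pair[0])              # sort on the random column
    return [case for _, case in keyed[:n]]            # pick cases off the top

# Hypothetical sampling frame: 7345 customer ID numbers
frame = list(range(1, 7346))
sample = simple_random_sample(frame, 200, seed=1)
print(len(sample), len(set(sample)))
```

Taking more rows from the top of the shuffled list extends the sample without disturbing the cases already chosen. Python's built-in `random.sample(frame, n)` draws an equivalent sample in one call.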
For Example
Choosing a random sample
Continuing the State Theater example, the student consultants select 200 ticket buyers at random from the database. First, the State Theater database is placed in a spreadsheet. Next, to draw random numbers, the students use the Excel command RAND(). (They type =RAND() in the top cell of a column next to the data and then use Fill Down to populate the column down to the bottom.) They then sort the spreadsheet to put the random column in order and select ticket buyers from the top of the randomized spreadsheet until they complete 200 interviews. This makes it easy to select more respondents when (as always happens) some of the people they select can’t be reached by telephone or decline to participate.
A Different Answer Every Time? The RAND() function in Excel can take you by surprise. Every time the spreadsheet reopens, you get a new column of random numbers. But don’t worry. Once you’ve shuffled the rows, you can ignore the new numbers. The order you got by shuffling won’t keep changing. (Image created in Microsoft Excel 2013.)
Questions What is the sampling frame? If the customer database held 30,000 records instead of 7345, how much larger a sample would we need to get the same information? If we then draw a different sample of 200 customers and obtain different answers to the questions on the survey, how do we refer to these differences?
Answers The sampling frame is the customer database. The size of the sample is all that matters, not the size of the population. We would need a sample of 200. The differences in the responses from one sample to another are called sampling error, or sampling variability.
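One way to check the claim that population size hardly matters is to compute the standard error of a sample proportion with the finite population correction. The sketch below uses the standard formula for a simple random sample drawn without replacement. The function name `se_proportion` and the choice of p = 0.5 are assumptions made for illustration.

```python
import math

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion from an SRS of size n.

    If a population size N is given, apply the finite population
    correction sqrt((N - n) / (N - 1)).
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Same sample size, very different population sizes (None = effectively infinite).
for N in (7345, 30000, None):
    print(N, round(se_proportion(0.5, 200, N), 4))
```

With n = 200, the standard error is roughly 0.035 in every case, so a database of 30,000 records calls for no larger a sample than one of 7345.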
CHAPTER 8 Surveys and Sampling
Simple random sampling is not the only fair way to sample. More complicated designs may save time or money or avert sampling problems. All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.
Stratified Sampling
Strata or Clusters? We create strata by dividing the population into groups of similar individuals so that each stratum is different from the others. (For example, we often stratify by age, race, or sex.) By contrast, we create clusters that all look pretty much alike, each representing the wide variety of individuals seen in the population.
Designs that are used to sample from large populations—especially populations residing across large areas—are often more complicated than simple random samples. Sometimes we slice the population into homogeneous groups, called strata, and then use simple random sampling within each stratum, combining the results at the end. This is called stratified random sampling. Why would we want to stratify? Suppose we want to survey how shoppers feel about a potential new anchor store at a large suburban mall. The shopper population is 60% women and 40% men, and we suspect that men and women have different views on their choice of anchor stores. If we use simple random sampling to select 100 people for the survey, we could end up with 70 men and 30 women or 35 men and 65 women. Our resulting estimates of the attractiveness of a new anchor store could vary widely. To help reduce this sampling variability, we can force a representative balance, selecting 40 men at random and 60 women at random. This would guarantee that the proportions of men and women within our sample match the proportions in the population, and that should make such samples more accurate in representing population opinion. You can imagine that stratifying by race, income, age, and other characteristics can be helpful, depending on the purpose of the survey. When we use a sampling method that restricts by strata, additional samples are more like one another, so statistics calculated for the sampled values will vary less from one sample to another. This reduced sampling variability is the most important benefit of stratifying, but the analysis of data sampled with these designs is beyond the scope of our book.
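A small simulation can illustrate how forcing the 60/40 balance reduces sampling variability. Everything numeric below is an assumption made for the sketch: a hypothetical population of 6000 women and 4000 men, with 90% of women and 10% of men in favor of the new anchor store. The function names are likewise invented here.

```python
import random
import statistics

def srs_estimate(population, n, rng):
    """Proportion in favor from a simple random sample of size n."""
    return sum(rng.sample(population, n)) / n

def stratified_estimate(strata, allocation, rng):
    """Proportion in favor when each stratum contributes a fixed number of draws."""
    picked = []
    for name, k in allocation.items():
        picked.extend(rng.sample(strata[name], k))
    return sum(picked) / len(picked)

rng = random.Random(8)
# Hypothetical opinions: 1 = in favor, 0 = not in favor.
strata = {
    "women": [1] * 5400 + [0] * 600,   # 6000 women, 90% in favor (assumed)
    "men": [1] * 400 + [0] * 3600,     # 4000 men, 10% in favor (assumed)
}
population = strata["women"] + strata["men"]

srs_estimates = [srs_estimate(population, 100, rng) for _ in range(2000)]
strat_estimates = [
    stratified_estimate(strata, {"women": 60, "men": 40}, rng)
    for _ in range(2000)
]
print("SRS spread:       ", statistics.stdev(srs_estimates))
print("Stratified spread:", statistics.stdev(strat_estimates))
```

Because the allocation (60 and 40) matches the population proportions, the plain average of the combined sample estimates the population proportion. The stratified estimates should cluster more tightly than the SRS estimates, and the gain is largest when the strata differ sharply, as they do in this invented population.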
Cluster and Multistage Sampling Sometimes dividing the sample into homogeneous strata isn’t practical, and even simple random sampling may be difficult. For example, suppose we wanted to assess the reading level of a product instruction manual based on the length of the sentences. Simple random sampling could be awkward; we’d have to number each sentence and then find, for example, the 576th sentence or the 2482nd sentence, and so on. Doesn’t sound like much fun, does it? We could make our task much easier by picking a few pages at random and then counting the lengths of the sentences on those pages. That’s easier than picking individual sentences and works if we believe that the pages are all reasonably similar to one another in terms of reading level. Splitting the population in this way into parts or clusters that each represent the population can make sampling more practical. We select one or a few clusters at random and perform a census within each of them. This sampling design is called cluster sampling. If each cluster fairly represents the population, cluster sampling will generate an unbiased sample. What’s the difference between cluster sampling and stratified sampling? We stratify to ensure that our sample represents different groups in the population, and sample randomly within each stratum. This reduces the sample-to-sample variability. Strata are homogeneous, but differ from one another. By contrast, clusters are more or less alike, each heterogeneous and resembling the overall population. We cluster to save money or even to make the study practical. Sometimes we use a variety of sampling methods together. In trying to assess the reading level of our instruction manual, we might worry that the “quick start”
Common Sampling Designs
instructions are easy to read, but the “troubleshooting” chapter is more difficult. If so, we’d want to avoid samples that selected heavily from any one chapter. To guarantee a fair mix of sections, we could randomly choose one section from each chapter of the manual. Then we would randomly select a few pages from each of those sections. If altogether that made too many sentences, we might select a few sentences at random from each of the chosen pages. So, what is our sampling strategy? First we stratify by the chapter of the manual and randomly choose a section to represent each stratum. Within each selected section, we choose pages as clusters. Finally, we consider an SRS of sentences within each cluster. Sampling schemes that combine several methods are called multistage samples. Most surveys conducted by professional polling organizations and market research firms use some combination of stratified and cluster sampling as well as simple random samples.
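The multistage strategy just described (stratify by chapter, pick one section per chapter, treat pages as clusters, then take an SRS of sentences) might be sketched as follows. The nested-dictionary layout of the manual, the function name `multistage_sample`, and the toy data are all assumptions made for illustration.

```python
import random

def multistage_sample(manual, pages_per_section, sentences_per_page, seed=None):
    """Multistage sample of sentences.

    manual is assumed to look like {chapter: {section: {page: [sentence, ...]}}}.
    """
    rng = random.Random(seed)
    chosen = []
    for chapter in sorted(manual):                    # every chapter is a stratum
        sections = manual[chapter]
        section = rng.choice(sorted(sections))        # one random section per stratum
        pages = sections[section]
        n_pages = min(pages_per_section, len(pages))
        for page in rng.sample(sorted(pages), n_pages):   # pages act as clusters
            sentences = pages[page]
            k = min(sentences_per_page, len(sentences))
            chosen.extend(rng.sample(sentences, k))   # SRS of sentences within a page
    return chosen

# Toy manual: 4 chapters, 3 sections each, 5 pages per section, 12 sentences per page.
manual = {
    f"ch{c}": {
        f"sec{s}": {
            p: [f"ch{c}-sec{s}-p{p}-s{i}" for i in range(12)]
            for p in range(1, 6)
        }
        for s in range(1, 4)
    }
    for c in range(1, 5)
}
picked = multistage_sample(manual, pages_per_section=2, sentences_per_page=3, seed=3)
```

Each chapter is guaranteed to contribute the same number of sentences, which is the point of stratifying first. The page and sentence stages only reduce the amount of counting.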
For Example
Identifying more complex designs
The theater board wants to encourage people to come from out of town to attend theater events. They know that, in general, about 40% of ticket buyers are from out of town. These customers often purchase dinner at a local restaurant or stay overnight in a local inn, generating business for the town. The board hopes this information will encourage local businesses to advertise in the theater program, so they want to be sure out-of-town customers are represented in the samples. The database includes ZIP codes. The student consultants decide to sample 80 ticket buyers from ZIP codes outside the town and 120 from the town’s ZIP code.
Questions What kind of sampling scheme are they using to replace the simple random sample?
What are the advantages of selecting 80 out of town and 120 local customers?
Answers A stratified sample, consisting of a sample of 80 out-of-town customers and a sample of 120 local customers.
By stratifying, they can guarantee that 40% of the sample is from out of town, reflecting the overall proportions among ticket buyers. If out-of-town customers differ in important ways from local ticket buyers, a stratified sample will reduce the variation in the estimates for each group so that the combined estimates can be more precise.
Systematic Samples
Sometimes we draw a sample by selecting individuals systematically. For example, a systematic sample might select every tenth person on an alphabetical list of employees. To make sure our sample is random, we still must start the systematic selection with a randomly selected individual, not necessarily the first person on the list. When there is no reason to believe that the order of the list could be associated in any way with the responses measured, systematic sampling can give a representative sample. Systematic sampling can be much less expensive than true random sampling. When you use a systematic sample, you should justify the assumption that the systematic method is not associated with any of the measured variables.
Think about the reading level sampling example again. Suppose we have chosen a section of the manual at random, then three pages at random from that section, and now we want to select a sample of 10 sentences from the 73 sentences found on those pages. Instead of numbering each sentence so we can pick a simple random sample, it would be easier to sample systematically. A quick calculation shows 73/10 = 7.3, so we can get our sample by picking every seventh sentence
on the page. But where should you start? At random, of course. We’ve accounted for 10 × 7 = 70 of the sentences, so we’ll throw the extra three into the starting group and choose a sentence at random from the first 10. Then we pick every seventh sentence after that and record its length.
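The arithmetic of the systematic scheme can be written out as a short sketch. The function name `systematic_sample` is an assumption, and the sentences are stood in for by the numbers 1 to 73.

```python
import random

def systematic_sample(items, n, seed=None):
    """Pick n items at a fixed interval, starting from a random early position."""
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    rng = random.Random(seed)
    step = len(items) // n                   # 73 // 10 = 7
    leftover = len(items) - step * n         # 73 - 70 = 3 extra sentences
    start = rng.randrange(step + leftover)   # random start among the first 10 (0-based)
    return [items[start + i * step] for i in range(n)]

sentences = list(range(1, 74))               # sentence numbers 1..73
chosen = systematic_sample(sentences, 10, seed=5)
print(chosen)
```

The largest possible start is the 10th sentence, and nine further steps of seven reach sentence 73 at most, so the selection never runs off the end of the list.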
Just Checking
2 We need to survey a random sample of the 300 passengers on a flight from San Francisco to Tokyo. Name each sampling method described.
a) Pick every tenth passenger as people board the plane.
b) From the boarding list, randomly choose five people flying first class and 25 of the other passengers.
c) Randomly generate 30 seat numbers and survey the passengers who sit there.
d) Randomly select a seat position (right window, right center, right aisle, etc.) and survey all the passengers sitting in those seats.
Guided Example
Market Demand Survey In a course at a business school in the United States, the students form business teams, propose a new product, and use seed money to launch a business to sell the product on campus. Before committing funds for the business, each team must complete the following assignment: “Conduct a survey to determine the potential market demand on campus for the product you are proposing to sell.” Suppose your team’s product is a 500-piece jigsaw puzzle of the map of your college campus. Design a marketing survey and discuss the important issues to consider.
Plan
Setup State the goals and objectives of the survey.
Our team designed a study to find out how likely students at our school are to buy our proposed product: a 500-piece jigsaw puzzle of the map of our college campus.
Population and Parameters Identify the population to be studied and the associated sampling frame. What are the parameters of interest?
The population studied will be students at our school. We have obtained a list of all students currently enrolled to use as the sampling frame. The parameter of interest is the proportion of students likely to buy this product. We’ll also collect some demographic information about the respondents.
Sampling Plan Specify the sampling method and the planned sample size, n. Specify how the sample was actually drawn. What is the sampling frame? The description should, if possible, be complete enough to allow someone to replicate the procedure, drawing another sample from the same population in the same manner. A good description of the procedure is essential, even if it could never practically be repeated. The question you ask is important, so state the wording of the question clearly. Be sure that the question is useful in helping you with the overall goal of the survey.
We will select a simple random sample of 200 students. The sampling frame is the master list of students we obtained from the registrar. We decided against stratifying by sex or class because we thought that students were all more or less alike in their likely interest in our product.
We will ask the students we contact: Do you solve jigsaw puzzles for fun? Then we will show them a prototype puzzle and ask: If this puzzle sold for $10, would you purchase one? We will also record the respondent’s sex and class.
Do
Sampling Practice Specify when, where, and how the sampling will be performed. Specify any other details of your survey, such as how respondents were contacted, any incentives that were offered to encourage them to respond, how nonrespondents were treated, and so on.
The survey will be administered in the middle of the fall semester during October. We have a master list of registered students, which we will randomize by matching it with random numbers from www.random.org and sorting on the random numbers, carrying the names. We will contact selected students by phone or e-mail and arrange to meet with them. If a student is unwilling to participate, the next name from the randomized list will be substituted until a sample of 200 participants is found. We will meet with students in an office set aside for this purpose so that each will see the puzzle under similar conditions.
Report
Summary and Conclusion This report should include a discussion of all the elements needed to design the study. It’s good practice to discuss any special circumstances or other issues that may need attention.
Memo Re: Survey plans Our team’s plans for the puzzle market survey call for a simple random sample of students. Because subjects need to be shown the prototype puzzle, we must arrange to meet with selected participants. We have arranged an office for that purpose. We will also collect demographic information so we can determine whether there is in fact a difference in interest level among classes or between men and women.
The Real Sample
What’s the Sample? The population we want to study is determined by asking why. When we design a survey, we use the term “sample” to refer to the individuals selected, from whom we hope to obtain responses. Unfortunately, the real sample is just those we can reach to obtain responses: the who of the study. These are slightly different uses of the same term sample. The context usually makes clear which we mean, but it’s important to realize that the difference between the two samples could undermine even a well-designed study.
We have been discussing sampling in a somewhat idealized setting. In the real world, things can be a bit messier. Here are some things to consider. The population may not be as well-defined as it seems. For example, if a company wants the opinions of a typical mall “shopper,” who should they sample? Should they only ask shoppers carrying a purchase? Should they include people eating at the food court? How about teenagers just hanging out in the mall? Even when the population is clear, it may not be possible to establish an appropriate sampling frame. Usually, the practical sampling frame is not the group you really want to know about. For example, election polls want to sample from those who will actually vote in the next election—a group that is particularly tricky to identify before election day. The sampling frame limits what your survey can find out. Then there’s your target sample. These are the individuals selected according to your sample design for whom you intend to measure responses. You’re not likely to get responses from all of them. (“I know it’s dinner time, but I’m sure you wouldn’t mind answering a few questions. It’ll only take 20 minutes or so. Oh, you’re busy?”) Nonresponse is a problem in many surveys. Sample designs are usually about the target sample. But in the real world, you won’t get responses from everyone your design selects. So in reality, your sample
consists of the actual respondents. These are the individuals about whom you do get data and can draw conclusions. Unfortunately, they might not be representative of either the sampling frame or the population. At each step, the group we can study may be constrained further. The who of our study keeps changing, and each constraint can introduce biases. A careful study should address the question of how well each group matches the population of interest. The who in an SRS is the population of interest from which we’ve drawn a representative sample. That’s not always true for other kinds of samples. When people (or committees!) decide on a survey, they often fail to think through the important questions about who are the who of the study and whether they are the individuals from whom the answers would be interesting or have meaningful business consequences. This is a key step in performing a survey and should not be overlooked.
Calvin & Hobbes © 1993 Watterson. Distributed by Universal Uclick. Reprinted with permission. All rights reserved.
8.4 The Valid Survey
It isn’t sufficient to draw a sample and start asking questions. You want to feel confident your survey can yield the information you need about the population you are interested in. We want a valid survey. To help ensure a valid survey, you need to ask four questions:
• What do I want to know?
• Who are the right respondents?
• What are the right questions?
• What will be done with the results?
These questions may seem obvious, but there are a number of specific pitfalls to avoid: Know what you want to know. Far too often, decision makers decide to perform a survey without any clear idea of what they hope to learn. Before considering a survey, you must be clear about what you hope to learn and what population you want to learn about. If you don’t know that, you can’t even judge whether you have a valid survey. The survey instrument—the questionnaire itself—can be a source of errors. Perhaps the most common error is to ask unnecessary questions. The longer the survey, the fewer people will complete it, leading to greater nonresponse bias. For each question on your survey, you should ask yourself whether you really want to know this and know what you would do with the responses if you had them. If you don’t have a good use for the answer to a question, don’t ask it. Use the right sampling frame. A valid survey obtains responses from appropriate respondents. Be sure you have a suitable sampling frame. Have you identified
the population of interest and sampled from it appropriately? A company looking to expand its base might survey customers who returned warranty registration cards (after all, that’s a readily available sampling frame), but if the company wants to know how to make its product more attractive, it needs to survey customers who rejected its product in favor of a competitor’s product. This is the population that can tell the company what about its product needs to change to capture a larger market share. The errors in the presidential election polls of 1948 were likely due to the use of telephone samples in an era when telephones were not affordable for the less affluent, who were the folks most likely to vote for Truman. It is equally important to be sure that your respondents actually know the information you hope to discover. Your customers may not know much about the competing products, so asking them to compare your product with others may not yield useful information.
Ask specific rather than general questions. It is better to be specific. “Do you usually recall TV commercials?” won’t be as useful as “How many TV commercials can you recall from last night?” or, better yet, “Please describe for me all the TV commercials you can recall from your viewing last night.”
Watch for biases. Even with the right sampling frame, you must beware of bias in your sample. If customers who purchase more expensive items are less likely to respond to your survey, this can lead to nonresponse bias. Although you can’t expect all mailed surveys to be returned, if those individuals who don’t respond have common characteristics, your sample will no longer represent the population you hope to learn about. Surveys in which respondents volunteer to participate, such as online surveys, suffer from voluntary response bias. Individuals with the strongest feelings on either side of an issue are more likely to respond; those who don’t care may not bother.
Be careful with question phrasing.
Questions must be carefully worded. A respondent may not understand the question, or may not understand the question the way the researcher intended it. For example, “Does anyone in your family own a Ford truck?” leaves the term “family” unclear. Does it include only spouses and children or parents and siblings, or do in-laws and second cousins count too? A question like “Was your Twinkie fresh?” might be interpreted quite differently by different people.
Be careful with answer phrasing. Respondents and survey-takers may also provide inaccurate responses, especially when questions are politically or sociologically sensitive. This also applies when the question does not take into account all possible answers, such as a true-false or multiple-choice question to which there may be other answers. Or the respondent may not know the correct answer to the question on the survey. In 1948, there were four major candidates for President,3 but some survey respondents might not have been able to name them all. A survey question that just asked “Who do you plan to vote for?” might have underrepresented the less prominent candidates. And one that just asked “What do you think of Wallace?” might yield inaccurate results from voters who simply didn’t know who he was. We refer to inaccurate responses (intentional or unintentional) as measurement errors. One way to cut down on measurement errors is to provide a range of possible responses. But be sure to phrase them in neutral terms.
The best way to protect a survey from measurement errors is to perform a pilot test. In a pilot test, a small sample is drawn from the sampling frame, and a draft form of the survey instrument is administered. A pilot test can point out flaws in the instrument. For example, during a staff cutback at one of our schools, a researcher surveyed faculty members to ask how they felt about the reduction in staff support. The scale ran from “It’s a good idea” to “I’m very unhappy.” Fortunately, a pilot study showed that everyone was very unhappy or worse. The scale was re-tuned to run from “unhappy” to “ready to quit.”
3 Harry Truman, Thomas Dewey, Strom Thurmond, and Henry Wallace.
For Example
Survey design
A nonprofit organization has enlisted some student consultants to help design a fundraising survey. The student consultants suggest to the board of directors that they may want to rethink their survey plans. They point out that there are differences among the population, the sampling frame, the target sample contacted by telephone, and the actual sample.
Question How are the population, sampling frame, target sample, and sample likely to differ? Answer The population is all potential ticket buyers. The sampling frame consists of only those who have previously purchased tickets. Anyone who wasn’t attracted to previous productions wouldn’t be surveyed. That could keep the board from learning of ways to make the theater’s offering more attractive to those who hadn’t purchased tickets before. The target sample is those selected from the database who can be contacted by telephone. Those with unlisted numbers or who had declined to give their phone number can’t be contacted. It may be more difficult to contact those with caller ID. The actual sample will be those previous customers selected at random from the database who can be reached by telephone and who agree to complete the survey.
8.5 How to Sample Badly
Bad sample designs yield worthless data. Many of the most convenient forms of sampling can be seriously biased. And there is no way to correct for the bias from a bad sample. So it’s wise to pay attention to sample design and to beware of reports based on poor samples.
Voluntary Response Sample One of the most common dangerous sampling methods is the voluntary response sample. In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted. This method is used by call-in shows, 900 numbers, Internet polls, and letters written to members of Congress. Voluntary response samples are almost always biased, and so conclusions drawn from them are almost always wrong. It’s often hard to define the sampling frame of a voluntary response study. Practically, the frames are groups such as Internet users who frequent a particular website or viewers of a particular TV show. But those sampling frames don’t correspond to the population you are likely to be interested in. Even if the sampling frame is of interest, voluntary response samples are often biased toward those with strong opinions or those who are strongly motivated— and especially from those with strong negative opinions. A request that travelers who have used the local airport visit a survey site to report on their experiences is much more likely to hear from those who had long waits, cancelled flights, and lost luggage than from those whose flights were on time and carefree. The resulting voluntary response bias invalidates the survey.
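A rough simulation can show why inviting more people does not rescue a voluntary response survey. All of the numbers below are invented for illustration: we assume 20% of travelers had a bad experience, that 30% of those travelers choose to respond, and that only 5% of satisfied travelers do. The function name `voluntary_poll` is also hypothetical.

```python
import random

def voluntary_poll(n_invited, p_bad=0.20, respond_if_bad=0.30,
                   respond_if_good=0.05, seed=None):
    """Share of respondents reporting a bad experience in a voluntary survey."""
    rng = random.Random(seed)
    bad_responses = 0
    total_responses = 0
    for _ in range(n_invited):
        had_bad_trip = rng.random() < p_bad
        # Unhappy travelers are assumed to be far more motivated to respond.
        p_respond = respond_if_bad if had_bad_trip else respond_if_good
        if rng.random() < p_respond:
            total_responses += 1
            bad_responses += had_bad_trip
    if total_responses == 0:
        return float("nan")
    return bad_responses / total_responses

for invited in (1_000, 10_000, 100_000):
    print(invited, round(voluntary_poll(invited, seed=1), 3))
```

Under these assumptions the respondents report bad experiences at a rate near 0.06 / (0.06 + 0.04) = 0.6, three times the assumed true rate of 0.2. The estimate settles near the wrong value as the number invited grows, so a larger voluntary response sample only becomes more precisely wrong.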
Convenience Sampling Another sampling method that doesn’t work is convenience sampling. As the name suggests, in convenience sampling we simply include the individuals who are convenient. Unfortunately, this group may not be representative of the population. A survey of 437 potential home buyers in Orange County, California, found, among other things, that all but 2 percent of the buyers have at least one computer at home, and 62 percent have two or more. Of those with a computer, 99 percent are connected to the Internet (Jennifer Hieger, “Portrait of Homebuyer Household: 2 Kids and a PC,” Orange County Register, July 27, 2001). Later in the article, we learn that the survey was conducted via the Internet. That was a convenient way to collect data and surely easier than drawing a simple random sample, but perhaps home builders shouldn’t conclude from this study that every family has a computer and an Internet connection. Many surveys conducted at shopping malls suffer from the same problem. People in shopping malls are not necessarily representative of the population of interest. Mall shoppers tend to be more affluent and include a larger percentage of teenagers and retirees than the population at large. To make matters worse, survey interviewers tend to select individuals who look “safe,” or easy to interview. Convenience sampling is not just a problem for beginners. In fact, convenience sampling is a widespread problem in the business world. When a company wants to find out what people think about its products or services, it may turn to the easiest people to sample: its own customers. But the company will never learn how those who don’t buy its product feel about it.
Internet Surveys Internet convenience surveys are often worthless. As voluntary response surveys, they have no well-defined sampling frame (all those who use the Internet and visit their site?) and thus report no useful information. Do not use them.
[Figure: a mock Internet poll asking “Do you use the Internet?” with the choices “Click here for yes” and “Click here for no.”]
Bad Sampling Frame? An SRS from an incomplete sampling frame introduces bias because the individuals included may differ from the ones not in the frame. It may be easier to sample workers from a single site, but if a company has many sites and they differ in worker satisfaction, training, or job descriptions, the resulting sample can be biased. There is serious concern among professional pollsters that the increasing numbers of people who can be reached only by cell phone may bias telephone-based market research and polling.
Undercoverage Many survey designs suffer from undercoverage, in which some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population. Undercoverage can arise for a number of reasons, but it’s always a potential source of bias. Are people who use answering machines to screen callers (and are thus less available to blind calls from market researchers) different from other customers in their purchasing preferences?
For Example
Common mistakes in survey design
A board member proposes that rather than telephoning past customers, they simply post someone at the door to ask theater goers their opinions. Another suggests that it would be even easier to post a questionnaire on the theater website and invite responses there. A third suggests that rather than working with random numbers, they simply phone every 200th person on the list of past customers.
Question Identify the three methods proposed and explain what strengths and weaknesses they have.
Answer Questioning customers at the door would be a convenience sample. It would be cheap and fast but is likely to be biased by the nature and quality of the particular performance where the survey takes place. Inviting responses on the website would be a voluntary response sample. Only customers who frequented the website and decided to respond would be surveyed. This might, for example, underrepresent older customers or those without home Internet access. Sampling every 200th name from the customer list would be a systematic sample. It is slightly easier than randomizing. If the order of names on the list is unrelated to any questions asked, then this might be an acceptable method. But if, for example, the list is kept in the order of first purchases (when a customer’s name and information were added to the database), then there might be a relationship between opinions and location on the list.
What Can Go Wrong?
• Nonrespondents. No survey succeeds in getting responses from everyone. The problem is that those who don’t respond may differ from those who do. And if they differ on just the variables we care about, the lack of response will bias the results. Rather than sending out a large number of surveys for which the response rate will be low, it is often better to design a smaller, randomized survey for which you have the resources to ensure a high response rate.
• Long, dull surveys. Surveys that are too long are more likely to be refused, reducing the response rate and biasing all the results. Keep it short.
• Response bias. Response bias includes the tendency of respondents to tailor their responses to please the interviewer and the consequences of slanted question wording.
• Push polls. Push polls, which masquerade as surveys, present one side of an issue before asking a question. For example, a question like
Would the fact that the new store that just opened by the mall sells mostly goods made overseas by workers in sweatshop conditions influence your decision to shop there rather than in the downtown store that features American-made products?
is designed not to gather information, but to spread ill-will toward the new store.
The Wizard of Id © 2001 John L. Hart/Distributed by Creators Syndicate. Reprinted with permission. All rights reserved.
How to Think about Biases
• Look for biases in any survey. If you design a survey of your own, ask someone else to help look for biases that may not be obvious to you. Do this before you collect your data. There’s no way to recover from a biased sample or a survey that asks biased questions. A bigger sample size for a biased study just gives you a bigger useless study. A really big sample gives you a really big useless study.
• Spend your time and resources reducing biases. No other use of resources is as worthwhile as reducing the biases.
• If you possibly can, pretest or pilot your survey. Administer the survey in the exact form that you intend to use it to a small sample drawn from the population you intend to sample. Look for misunderstandings, misinterpretation, confusion, or other possible biases. Then redesign your survey instrument.
• Always report your sampling methods in detail. Others may be able to detect biases where you did not expect to find them.
Ethics in Action
The Lackawax River Group is interested in applying for state funds to continue their restoration and conservation of the Lackawax River, a river that has been polluted from years of industry and agricultural discharge. While they have managed to gain significant support for their cause through education and community involvement, the executive committee is now interested in presenting the state with more compelling evidence. They decided to survey local residents regarding their attitudes toward the proposed expansion of the river restoration and conservation project. With limited time and money (the deadline for the grant application was fast approaching), the executive committee was delighted that one of its members, Harry Greentree, volunteered to undertake the project. Harry owned a local organic food store and agreed to have a sample of his shoppers interviewed during the next one-week period. One committee member questioned whether a representative sample of residents could be found in this way, but the other members of the committee thought that the customers of Harry’s store were likely to be just the kind of well-informed residents whose opinions they wanted to hear. The only instruction the committee decided to give was that the shoppers be selected in a systematic fashion, for instance, by interviewing every fifth person who entered the store. Harry had no problem with this request and was eager to help the Lackawax River Group.
• Identify the ethical dilemma in this scenario.
• What are the undesirable consequences?
• Propose an ethical solution that considers the welfare of all stakeholders.
What Have We Learned? Learning Objectives
Know the three ideas of sampling.
• Examine a part of the whole: A sample can give information about the population.
• Randomize to make the sample representative.
• The sample size is what matters. It’s the size of the sample, and not its fraction of the larger population, that determines the precision of the statistics it yields.
Be able to draw a Simple Random Sample (SRS) using a table of random digits or a list of random numbers from technology or an Internet site.
• In a simple random sample (SRS), every possible group of n individuals has an equal chance of being our sample.
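As a small illustration of this definition, the sketch below draws an SRS with Python's standard library and counts how many distinct groups of n individuals could have been chosen. The frame of 20 ID numbers, the sample size of 5, and the seed are made-up values used only for illustration.

```python
import math
import random

# Hypothetical sampling frame: ID numbers 1..20 (illustrative only).
frame = list(range(1, 21))
n = 5

# Number of distinct groups of n individuals; in an SRS each group
# has the same chance of being the sample.
possible_samples = math.comb(len(frame), n)

# random.sample draws n distinct items, that is, without replacement.
rng = random.Random(2014)  # seed fixed only so the run is repeatable
srs = rng.sample(frame, n)

print(possible_samples)
print(sorted(srs))
```

With 20 individuals and n = 5 there are 15,504 possible samples, which is why a table of random digits or software is used rather than listing the groups by hand.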
Know the definitions of other sampling methods:
• Stratified samples can reduce sampling variability by identifying homogeneous subgroups and then randomly sampling within each.
• Cluster samples randomly select among heterogeneous subgroups that each resemble the population at large, making our sampling tasks more manageable.
• Systematic samples can work in some situations and are often the least expensive method of sampling. But we still want to start them randomly.
• Multistage samples combine several random sampling methods.
Identify and avoid causes of bias.
• Nonresponse bias can arise when sampled individuals will not or cannot respond.
• Response bias arises when respondents’ answers might be affected by external influences, such as question wording or interviewer behavior.
• Voluntary response samples are almost always biased and should be avoided and distrusted.
• Convenience samples are likely to be flawed for similar reasons.
• Undercoverage occurs when individuals from a subgroup of the population are selected less often than they should be.
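The stratified, cluster, and systematic designs named in the objectives above can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the 12-person population, the stratum and cluster labels, and the function names are hypothetical and are not taken from the chapter.

```python
import random

rng = random.Random(8)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 people, each tagged with a stratum and a cluster.
population = [
    {"id": i, "stratum": "A" if i < 6 else "B", "cluster": i % 3}
    for i in range(12)
]

def stratified_sample(pop, per_stratum):
    """Draw an SRS of size per_stratum within each stratum."""
    chosen = []
    for label in sorted({p["stratum"] for p in pop}):
        members = [p for p in pop if p["stratum"] == label]
        chosen.extend(rng.sample(members, per_stratum))
    return chosen

def cluster_sample(pop, n_clusters):
    """Pick clusters at random, then take a census of each chosen cluster."""
    labels = sorted({p["cluster"] for p in pop})
    picked = set(rng.sample(labels, n_clusters))
    return [p for p in pop if p["cluster"] in picked]

def systematic_sample(pop, step):
    """Take every step-th individual, starting from a random position."""
    start = rng.randrange(step)
    return pop[start::step]

strat = stratified_sample(population, 2)   # 2 from each of the 2 strata
clus = cluster_sample(population, 1)       # everyone in 1 random cluster
syst = systematic_sample(population, 4)    # every 4th person, random start
```

A multistage design would chain these pieces, for example by choosing clusters at random and then drawing an SRS within each chosen cluster instead of taking a census.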
Terms
Bias
Any systematic failure of a sampling method to represent its population.
Census
An attempt to collect data on the entire population of interest.
Cluster
A subset of a population aggregated into larger sampling units. These units, chosen for reasons of cost or practicality, are often natural groups thought to be representative of the population.
Cluster sampling
A sampling design in which groups, or clusters, representative of the population are chosen at random and a census is then taken of each.
Convenience sampling
A sample that consists of individuals who are conveniently available.
Measurement error
Any inaccuracy in a response, from any source, whether intentional or unintentional.
Multistage sample
A sampling scheme that combines several sampling methods.
Nonresponse bias
Bias introduced to a sample when a large fraction of those sampled fails to respond.
Parameter
A numerically valued attribute of a model for a population. We rarely expect to know the value of a parameter, but we do hope to estimate it from sampled data.
Pilot test
A small trial run of a study to check that the methods of the study are sound.
Population
The entire group of individuals or instances about whom we hope to learn.
Population parameter
A numerically valued attribute of a model for a population.
Randomization
A defense against bias in the sample selection process, in which each individual is given a fair, random chance of selection.
Representative sample
A sample from which the statistics computed accurately reflect the corresponding population parameters.
Response bias
Anything in a survey design that influences responses.
Sample
A subset of a population, examined in hope of learning about the population.
Sample size
The number of individuals in a sample.
Sample survey
A study that asks questions of a sample drawn from some population in the hope of learning something about the entire population.
Sampling frame
A list of individuals from which the sample is drawn. Individuals in the population of interest but who are not in the sampling frame cannot be included in any sample.
Sampling variability (or sampling error)
The natural tendency of randomly drawn samples to differ, one from another.
Simple random sample (SRS)
A sample in which each set of n elements in the population has an equal chance of selection.
Statistic, sample statistic
A value calculated for sampled data, particularly one that corresponds to, and thus estimates, a population parameter. The term “sample statistic” is sometimes used, usually to parallel the corresponding term “population parameter.”
Strata
Subsets of a population that are internally homogeneous but may differ one from another.
Stratified random sample
A sampling design in which the population is divided into several homogeneous subpopulations, or strata, and random samples are then drawn from each stratum.
Systematic sample
A sample drawn by selecting individuals systematically from a sampling frame.
Voluntary response bias
Bias introduced to a sample when individuals can choose on their own whether to participate in the sample.
Voluntary response sample
A sample in which a large group of individuals are invited to respond and decide individually whether or not to participate. Voluntary response samples are generally worthless.
Undercoverage
A sampling scheme that biases the sample in a way that gives a part of the population less representation than it has in the population.
Technology Help: Random Sampling
Computer-generated random numbers are usually quite good enough for drawing random samples. But there is little reason not to use the truly random values available on the Internet.
Here’s a convenient way to draw an SRS of a specified size using a computer-based sampling frame. The sampling frame can be a list of names or identification numbers arrayed, for example, as a column in a spreadsheet, statistics program, or database:
1. Generate random numbers of enough digits so that each exceeds the size of the sampling frame list by several digits. This makes duplication unlikely. (For example, in Excel, use the RAND function described in detail in Technology Help, Chapter 5 to fill a column with random numbers between 0 and 1. With many digits they will almost surely be unique.)
2. Assign the random numbers arbitrarily to individuals in the sampling frame list. For example, put them in an adjacent column.
3. Sort the list of random numbers, carrying along the sampling frame list.
4. Now the first n values in the sorted sampling frame column are an SRS of n values from the entire sampling frame.
Most statistics packages also offer commands to sample from your data, but you should be careful to see that they do what you intend.
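The four steps above can be mirrored outside a spreadsheet. The following is a minimal Python sketch, assuming a made-up frame of 50 ID strings; the function name `srs_by_sorting` and the seed are hypothetical choices for illustration.

```python
import random

def srs_by_sorting(frame, n, rng=None):
    """Attach a random number to each entry, sort by it, keep the first n."""
    if rng is None:
        rng = random.Random()
    # Steps 1 and 2: generate a random number for each individual in the frame.
    keyed = [(rng.random(), item) for item in frame]
    # Step 3: sort by the random numbers, carrying the frame entries along.
    keyed.sort(key=lambda pair: pair[0])
    # Step 4: the first n entries of the sorted frame form the sample.
    return [item for _, item in keyed[:n]]

frame = [f"ID{k:03d}" for k in range(1, 51)]  # hypothetical frame of 50 IDs
sample = srs_by_sorting(frame, 10, random.Random(1))
print(sample)
```

Because every individual appears exactly once in the sorted list, this approach samples without replacement, which is usually what is wanted when drawing from a population.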
Excel
To generate random numbers in Excel:
• Choose Data + Data Analysis + Random Number Generation. (Note: the Data Analysis add-in must be installed.)
• In the Random Number Generation window, fill in
  • Number of variables = number of columns of random numbers.
  • Number of random numbers = number of rows of random numbers.
• Select a distribution from the drop-down menu. Parameters for your selected distribution will appear below.
• Enter the minimum and maximum bounds for the random numbers. This will be the minimum and maximum of the random numbers generated.
• A list of random numbers will be generated in a new worksheet. The example shown here resulted from parameters of 1 to 100.
• Format cells to obtain values desired.
To sample from a column of data in Excel:
• Choose Data + Data Analysis + Sampling.
• Type in or select the cell range containing the data. If this column has a title, place a check in the box marked “Labels”.
• Next to Random, indicate the “number of Samples” desired; this is actually the sample size, n.
• Finally, choose a location for the selected sample.
Warning: Excel samples with replacement. This is probably not the sampling method you want for drawing a sample from a population. The method given above using externally generated random numbers may be more appropriate.

JMP
To generate a list of random numbers in JMP:
• Create a New Data Table.
• Choose Rows + Add Rows and enter the number of random numbers desired.
• Right-click on the top of Column 1.
• Choose Formula...
• Under Functions (grouped) choose Random + Random Uniform.
• Click OK; format data as desired.
To sample from a variable in JMP:
• Select the column of data to sample from by clicking the top of the column.
• Select Tables + Subset.
• Choose Random – sample size: and enter the desired sample size, n.
• Choose Selected columns.
• Fill in a name for the table where the sample will be stored. A table will be created containing the random sample.

Minitab
To generate a list of random numbers in Minitab:
• Choose Calc + Random Data + Uniform.
• Enter the number of rows.
• Select the column where the random numbers will be stored.
• Click OK.
To sample from a variable in Minitab:
• Name a column in the data that will contain the sample; this column will be blank.
• Choose Calc + Random Data + Sample From Columns.
• Enter the number of rows to sample. This is the sample size, n.
• Indicate the column from which to select the data under “From Columns”.
• Indicate the column in which the sampled data should be placed under “Store Samples In”.
• Minitab samples without replacement. To sample with replacement, check the box specifying that alternative.
• Click OK.

SPSS
To generate a list of random numbers in SPSS:
• Open a new dataset and assign a new variable with numbers 1 to n where n is the number of random numbers that will be generated.
• In the Transform menu, choose Compute Variable...
• Assign a name to the target variable.
• Under Numeric Expression, type RV.UNIFORM(min,max), where min = the lowest value of the variable and max = the highest value of the variable. For example, RV.UNIFORM(0, 1) will give you random numbers between 0 and 1.
To select a random sample in SPSS:
• From Data menu, choose Select Cases.
• Choose Random sample of cases.
• Click the Sample button to select either a percentage of cases or a number of cases.
• Select the desired output.
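The warning in the Excel instructions about sampling with replacement can be made concrete with a short Python sketch. The ten-value column and the seed are made up for illustration; `choices` and `sample` are the standard-library calls for drawing with and without replacement, respectively.

```python
import random

rng = random.Random(290)    # fixed seed so the run is repeatable
data = list(range(1, 11))   # hypothetical column of 10 values

# With replacement: the same value may be drawn more than once.
with_repl = rng.choices(data, k=10)

# Without replacement: each value can appear at most once.
without_repl = rng.sample(data, k=10)

print(len(set(with_repl)), len(set(without_repl)))
```

Drawing 10 values without replacement from 10 simply reorders the column, whereas drawing with replacement will typically repeat some values and miss others. Whichever package is used, it is worth checking which of the two behaviors its sampling command follows.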
Brief Case
Market Survey Research
You are part of a marketing team that needs to research the potential of a new product. Your team decides to e-mail an interactive survey to a random sample of consumers. Write a short questionnaire that will generate the information you need about the new product. Select a sample of 200 using an SRS from your sampling frame. Discuss how you will collect the data and how the responses will help your market research.
The GfK Roper Reports Worldwide Survey
GfK Roper Consulting conducts market research for multinational companies who want to understand attitudes in different countries so they can market and advertise more effectively to different cultures. Every year they conduct a poll worldwide, which asks hundreds of questions of people in approximately 30 different countries. Respondents are asked a variety of questions about food. Some of the questions are simply yes/no (agree/disagree) questions:
Please tell me whether you agree or disagree with each of these statements about your appearance: (Agree = 1; Disagree = 2; Don’t know = 9).
The way you look affects the way you feel.
I am very interested in new skin care breakthroughs.
People who don’t care about their appearance don’t care about themselves.
Other questions are asked on a 5-point scale (Please tell me the extent to which you disagree or agree with it using the following scale: Disagree completely = 1; Disagree somewhat = 2; Neither disagree nor agree = 3; Agree somewhat = 4; Agree completely = 5; Don’t know = 9). Examples of such questions include:
I read labels carefully to find out about ingredients, fat content, and/or calories.
I try to avoid eating fast food.
When it comes to food I’m always on the lookout for something new.
Think about designing a survey on such a global scale:
• What is the population of interest?
• Why might it be difficult to select an SRS from this sampling frame?
• What are some potential sources of bias?
• Why might it be difficult to ensure a representative number of men and women and all age groups in some countries?
• What might be a reasonable sampling frame?
Exercises Section 8.1 1. Indicate whether each statement below is true or false. If false, explain why. a) We can eliminate sampling error by selecting an unbiased sample. b) Randomization helps to ensure that our sample is representative. c) Sampling error refers to sample-to-sample differences and is also known as sampling variability. d) It is better to try to match the characteristics of the sample to the population rather than relying on randomization. 2. Indicate whether each statement below is true or false. If false, explain why. a) To get a representative sample, you must sample a large fraction of the population. b) Using modern methods, it is best to select a representative subset of a population systematically.
c) A census is the only true representative sample. d) A random sample of 100 students from a school with 2000 students has the same precision as a random sample of 100 from a school with 20,000 students.
Section 8.2
3. An environmental advocacy group is interested in the perceptions of farmers about global climate change. Specifically, they wish to determine the percentage of organic farmers who are concerned that climate change will affect their crop yields. They use an alphabetized list of members of the Northeast Organic Farming Association (www.nofa.org), a nonprofit organization of over 5000 members with chapters in Connecticut, Massachusetts, New Hampshire, New Jersey, New York, Rhode Island, and Vermont. They use Excel to generate a randomly shuffled list of the members. They then select members to contact from this list until they have succeeded in contacting 150 members.
a) What is the population?
b) What is the sampling frame?
c) What is the population parameter of interest?
d) What sampling method is used?
4. A movie theatre company is interested in the opinions of their frequent customers about the recently installed online ticketing system. Specifically, they want to know what proportion of them plan to use the new ticketing system. They took a random sample of 15,000 customers from their database and sent them an SMS message with a request to fill out a survey in exchange for a free ticket to see a movie of their choice.
a) What is the population?
b) What is the sampling frame?
c) What is the population parameter of interest?
d) What is the sampling method used?
Section 8.3 5. As discussed in the chapter, GfK Roper Consulting conducts a global consumer survey to help multinational companies understand different consumer attitudes throughout the world. In India, the researchers interviewed 1000 people aged 13–65 (www.gfkamerica.com). Their sample is designed so that they get 500 males and 500 females. a) Are they using a simple random sample? How do you know? b) What kind of design do you think they are using? 6. For their class project, a group of Business students decides to survey the student body to assess opinions about a proposed new student coffee shop to judge how successful it might be. Their sample of 200 contained 50 first-year students, 50 sophomores, 50 juniors, and 50 seniors. a) Do you think the group was using an SRS? Why? b) What kind of sampling design do you think they used? 7. The environmental advocacy group from Exercise 3 that was interested in gauging perceptions about climate change among organic farmers has decided to use a different method to sample. Instead of randomly selecting members from a shuffled list, they listed the members in alphabetical order and took every tenth member until they succeeded in contacting 150 members. What kind of sampling method have they used? 8. The airline company from Exercise 4, interested in the opinions of their frequent flyer customers about their proposed new routes, has decided that different types of customers might have different opinions. Of their customers, 50% are silver-level, 30% are blue, and 20% are red. They first compile separate lists of silver, blue, and red members and then randomly select 5000 silver members, 3000 blue members, and 2000 red members to e-mail. What kind of sampling method have they used?
For Exercises 9 and 10, identify the following if possible. (If not, say why.) a) The population b) The population parameter of interest c) The sampling frame d) The sample e) The sampling method, including whether or not randomization was used f) Any potential sources of bias you can detect and any problems you see in generalizing to the population of interest 9. A survey company emailed a questionnaire to the directors of major hotel chains across the country, and received responses from 35% of them. The respondents reported that they did not think the recent economic slowdown had any impact on their level of operations. 10. A question posted on a university’s website asked potential new students whether or not the university should include health insurance plans in their student fees.
Section 8.4 11. An intern for the environmental group in Exercise 3 has decided to make the survey process simpler by calling 150 of the members who attended the recent symposium on coping with climate change that was recently held in Burlington, VT. He has all the phone numbers, so it will be easy to contact them. He will start calling members from the top of the list, which was generated as the members enrolled for the symposium. He has written a script to read to them that follows, “As we learned in Burlington, climate change is a serious problem for farmers. Given the evidence of impact on crops, do you agree that the government should be doing more to fight global warming?” a) What is the population of interest? b) What is the sampling frame? c) Point out any problems you see either with the sampling procedure and/or the survey itself. What are the potential impacts of these problems? 12. The airline company in Exercise 4 has realized that some of its customers don’t have e-mail or don’t read it regularly. They decide to restrict the mailing only to customers who have recently registered for a “Win a trip to Miami” contest, figuring that those with Internet access are more likely to read and to respond to their e-mail. They send an e-mail with the following message: “Did you know that National Airlines has just spent over $3 million refurbishing our brand new hub in Miami? By answering the following question, you may be eligible to win $1000 worth of coupons that can be spent in any of the fabulous restaurants or shops in the Miami airport. Might
you possibly think of traveling to Miami in the next six months on your way to one of your destinations?” a) What is the population? b) What is the sampling frame? c) Point out any problems you see either with the sampling procedure and/or the survey itself. What are the potential impacts of these problems? 13. An intern is working for Pacific TV (PTV), a small cable and Internet provider, and has proposed some questions that might be used in the survey to assess whether customers are willing to pay $50 for a new service. Question 1: If PTV offered state-of-the-art, high-speed Internet service for $50 per month, would you subscribe to that service? Question 2: Would you find $50 per month—less than the cost of a daily cappuccino—an appropriate price for high-speed Internet service? a) Do you think these are appropriately worded questions? Why or why not? b) Which one has more neutral wording? Explain. 14. Here are more proposed survey questions for the survey in Exercise 13: Question 3: Do you find that the slow speed of DSL Internet access reduces your enjoyment of web services? Question 4: Given the growing importance of high-speed Internet access for your children’s education, would you subscribe to such a service if it were offered? a) Do you think these are appropriately worded questions? Why or why not? b) Suggest a question with better wording.
Section 8.5 15. Indicate whether each statement below is true or false. If false, explain why. a) A local television news program that asks viewers to call in and give their opinion on an issue typically results in a biased voluntary response sample. b) Convenience samples are generally representative of the population. c) Measurement error is the same as sampling error. d) A pilot test can be useful for identifying poorly worded questions on a survey. 16. Indicate whether each statement below is true or false. If false, explain why. a) Asking viewers to call into an 800 number is a good way to produce a representative sample.
b) When writing a survey, it’s a good idea to include as many questions as possible to ensure efficiency and to lower costs. c) A recent poll on a website was valid because the sample size was over 1,000,000 respondents. d) Malls are not necessarily good places to conduct surveys because people who frequent malls may not be representative of the population at large. 17. For your marketing class, you’d like to take a survey from a sample of all the Catholic Church members in your city to assess the market for a DVD about Pope Francis’s first year as pope. A list of churches shows 17 Catholic churches within the city limits. Rather than try to obtain a list of all members of all these churches, you decide to pick 3 churches at random. For those churches, you’ll ask to get a list of all current members and contact 100 members at random. a) What kind of design have you used? b) What could go wrong with the design that you have proposed? 18. PIRSA Fisheries, based in South Australia, plans to study the recreational fishing around Goolwa Beach. To do that, they decide to randomly select five fishing boats at the end of a randomly chosen fishing day and count the numbers and types of all the fish on those boats. a) What kind of design have they used? b) What could go wrong with the design that they have proposed?
Chapter Exercises 19. Software licenses. The website www.gamefaqs.com asked, as their question of the day to which visitors to the site were invited to respond, “Do you ever read the end-user license agreements when installing software or games?” Of the 98,574 respondents, 63.47% said they never read those agreements—a fact that software manufacturers might find important. a) What kind of sample was this? b) How much confidence would you place in using 63.47% as an estimate of the fraction of people who don’t read software licenses? 20. Drugs in baseball. Major League Baseball, responding to concerns about their “brand,” tests players to see whether they are using performance-enhancing drugs. Officials select a team at random, and a drug-testing crew shows up unannounced to test all 40 players on the team. Each testing day can be considered a study of drug use in Major League Baseball. a) What kind of sample is this? b) Is that choice appropriate?
21. Pew. Pew Research Center publishes polls on issues important in the news and about global life at its website, www.pewinternet.org. At the end of a report about a survey you can find paragraphs such as this one:
Country: Brazil; Sample design: Multi-stage cluster sample stratified by Brazil’s five regions and size of municipality; Mode: Face-to-face adults 18 plus; Languages: Portuguese; Fieldwork dates: March 4 – April 21, 2013; Sample size: 960; Margin of Error: ±4.1 percentage points; Representative: Adult population.
a) Explain the multi-stage design applied in terms of regions and municipalities.
b) What sampling frame might have been used?
22. Defining the survey. At its website (www.gallup.com) the Gallup World Poll reports results of surveys conducted in various places around the world. At the end of one of these reports about the reliability of electric power in Africa, they describe their methods, including explanations such as the following:
Results are based on face-to-face interviews with 1,000 adults, aged 15 and older, conducted in 2010 in Botswana, Burkina Faso, Cameroon, Central African Republic, Chad, Ghana, Kenya, Liberia, Mali, Niger, Nigeria, Senegal, Sierra Leone, South Africa, Tanzania, Uganda, and Zimbabwe. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error ranges from ±3.4 percentage points to ±4.0 percentage points. The margin of error reflects the influence of data weighting. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.⁴
a) Gallup is interested in the opinions of Africans. What kind of survey design are they using?
b) Some of the countries surveyed have large populations. (South Africa is estimated to have over 50 million people.) Some are quite small. (Zimbabwe has fewer than 13,000,000 people.) Nonetheless, Gallup sampled 1000 adults in each country. How does this affect the precision of its estimates for these countries?
⁴ Copyright © 2011 Gallup Inc. All rights reserved. The content is used with permission; however, Gallup retains all rights of republication.
23–30. Survey details. For the following reports about statistical studies, identify the following items (if possible). If you can’t tell, then say so; this often happens when we read about a survey.
a) The population
b) The population parameter of interest
c) The sampling frame
d) The sample
e) The sampling method, including whether or not randomization was employed
f) Any potential sources of bias you can detect and any problems you see in generalizing to the population of interest
23. Global Views on Morality. The 2013 Pew Research Center’s Global Attitudes Project asked 1000 adult respondents in 40 countries what they thought about eight moral issues, such as premarital sex and alcohol use. 24. Global warming. The Gallup Poll interviewed 1022 randomly selected U.S. adults aged 18 and older, March 7–10, 2013. Gallup reports that when asked whether respondents thought that global warming was due primarily to human activities, 57% of respondents said it was. 25. At the bar. Researchers waited outside a bar they had randomly selected from a list of such establishments. They stopped every tenth person who came out of the bar and asked whether he or she thought drinking and driving was a serious problem. 26. Election poll. Hoping to learn what issues may resonate with voters in the coming election, the campaign director for a mayoral candidate selects one block at random from each of the city’s election districts. Staff members go there and interview all the residents they can find. 27. Toxic waste. The Environmental Protection Agency took soil samples at 16 locations near a former industrial waste dump and checked each for evidence of toxic chemicals. They found no elevated levels of any harmful substances. 28. Housing discrimination. Inspectors send trained “renters” of various races and ethnic backgrounds, and of both sexes to inquire about renting randomly assigned advertised apartments. They look for evidence that landlords deny access illegally based on race, sex, or ethnic background. 29. Quality control. A company packaging snack foods maintains quality control by randomly selecting 10 cases from each day’s production and weighing the bags. Then they open one bag from each case and inspect the contents. 30. Contaminated milk. Dairy inspectors visit farms unannounced and take samples of the milk to test for contamination. 
If the milk is found to contain dirt, antibiotics, or other foreign matter, the milk will be destroyed and the farm is considered to be contaminated pending further testing.
31. Bradley effect. The Bradley effect theory posits that polls are skewed when voters give inaccurate polling responses because they fear that, by stating their true preference, they will open themselves to criticism of racial or ethnic motivation. Members of the public may feel under pressure to provide an answer that is deemed to be more publicly acceptable, or ‘politically correct’, but they vote according to their true preference. Is the Bradley effect an example of bias, or of sampling error?
32. Indian polls. In the 2014 elections in India, no fewer than eleven opinion poll agencies were active, with surveys published and broadcast by leading magazines and
news channels. However, predictions of polling agencies demonstrate considerable variability. Is this more likely to be a result of bias, or sampling error? Explain. 33. Cable company market research. A local cable TV company, Pacific TV (PTV), with customers in 15 towns is considering offering high-speed Internet service on its cable lines. Before launching the new service they want to find out whether customers would pay the $75 per month that they plan to charge. An intern has prepared several alternative plans for assessing customer demand. For each, indicate what kind of sampling strategy is involved and what (if any) biases might result. a) Put a big ad in the newspaper asking people to log their opinions on the PTV website. b) Randomly select one of the towns and contact every cable subscriber by phone. c) Send a survey to each customer and ask them to fill it out and return it. d) Randomly select 20 customers from each town. Send them a survey, and follow up with a phone call if they do not return the survey within a week. 34. Cable company market research, part 2. Four new sampling strategies have been proposed to help PTV determine whether enough cable subscribers are likely to purchase high-speed Internet service. For each, indicate what kind of sampling strategy is involved and what (if any) biases might result. a) Run a poll on the local TV news, asking people to dial one of two phone numbers to indicate whether they would be interested. b) Hold a meeting in each of the 15 towns, and tally the opinions expressed by those who attend the meetings. c) Randomly select one street in each town and contact each of the households on that street. d) Go through the company’s customer records, selecting every 40th subscriber. Send employees to those homes to interview the people chosen. 35. Amusement park riders. An amusement park has opened a new roller coaster. It is so popular that people are waiting for up to three hours for a two-minute ride. 
Concerned about how patrons (who paid a large amount to enter the park and ride on the rides) feel about this, they survey every tenth person in line for the roller coaster, starting from a randomly selected individual. a) What kind of sample is this? b) Is it likely to be representative? c) What is the sampling frame? 36. Playground. Some people have been complaining that the children’s playground at a municipal park is too small and is in need of repair. Managers of the park decide to survey city residents to see if they believe the playground should be rebuilt. They hand out questionnaires to parents
who bring children to the park. Describe possible biases in this sample. 37. Another ride. The survey of patrons waiting in line for the roller coaster in Exercise 35 asks whether they think it is worthwhile to wait a long time for the ride and whether they’d like the amusement park to install still more roller coasters. What biases might cause a problem for this survey? 38. Playground bias. The survey described in Exercise 36 asked, Many people believe this playground is too small and in need of repair. Do you think the playground should be repaired and expanded even if that means imposing an entrance fee to the park? Describe two ways this question may lead to response bias. 39. (Possibly) Biased questions. Examine each of the following questions for possible bias. If you think the question is biased, indicate how and propose a better question. a) Should companies that pollute the environment be compelled to pay the costs of cleanup? b) Should a company enforce a strict dress code? 40. More possibly biased questions. Examine each of the following questions for possible bias. If you think the question is biased, indicate how and propose a better question. a) Do you think that price or quality is more important in selecting a tablet computer? b) Given humanity’s great tradition of exploration, do you favor continued funding for space flights? 41. Phone surveys. Anytime we conduct a survey, we must take care to avoid undercoverage. Suppose we plan to select 500 names from the city phone book, call their homes between noon and 4 p.m., and interview whoever answers, anticipating contacts with at least 200 people. a) Why is it difficult to use a simple random sample here? b) Describe a more convenient, but still random, sampling strategy. c) What kinds of households are likely to be included in the eventual sample of opinion? Who will be excluded? 
d) Suppose, instead, that we continue calling each number, perhaps in the morning or evening, until an adult is contacted and interviewed. How does this improve the sampling design? e) Random-digit dialing machines can generate the phone calls for us. How would this improve our design? Is anyone still excluded? 42. Cell phone survey. What about drawing a random sample only from cell phone exchanges? Discuss the advantages and disadvantages of such a sampling method compared with surveying randomly generated telephone numbers from non–cell phone exchanges. Do you think these
advantages and disadvantages have changed over time? How do you expect they’ll change in the future? 43. Change. How much change do you have on you right now? Go ahead, count it. a) How much change do you have? b) Suppose you check on your change every day for a week as you head for lunch and average the results. What parameter would this average estimate? c) Suppose you ask 10 friends to average their change every day for a week, and you average those 10 measurements. What is the population now? What parameter would this average estimate? d) Do you think these 10 average change amounts are likely to be representative of the population of change amounts in your class? In your college? In the country? Why or why not? 44. Fuel economy. Occasionally, when I fill my car with gas, I figure out how many miles per gallon my car got. I wrote down those results after six fill-ups in the past few months. Overall, it appears my car gets 28.8 miles per gallon. a) What statistic have I calculated? b) What is the parameter I’m trying to estimate? c) How might my results be biased? d) When the Environmental Protection Agency (EPA) checks a car like mine to predict its fuel economy, what parameter is it trying to estimate? 45. Accounting. Between quarterly audits, a company likes to check on its accounting procedures to address any problems before they become serious. The accounting staff processes payments on about 120 orders each day. The next day, the supervisor rechecks 10 of the transactions to be sure they were processed properly. a) Propose a sampling strategy for the supervisor. b) How would you modify that strategy if the company makes both wholesale and retail sales, requiring different bookkeeping procedures? 46. Happy workers? A manufacturing company employs 14 project managers, 48 foremen, and 377 laborers. 
In an effort to keep informed about any possible sources of employee discontent, management wants to conduct job satisfaction interviews with a simple random sample of employees every month. a) Do you see any danger of bias in the company’s plan? Explain. b) How might you select a simple random sample? c) Why do you think a simple random sample might not provide the best estimate of the parameters the company wants to estimate? d) Propose a better sampling strategy.
e) Listed below are the last names of the project managers. Use random numbers to select two people to be interviewed. Be sure to explain your method carefully.
Barrett, Bowman, Chen, DeLara, DeRoos, Grigorov, Maceli, Mulvaney, Pagliarulo, Rosica, Smithson, Tadros, Williams, Yamamoto
47. Quality control. Sammy’s Salsa, a small local company, produces 20 cases of salsa a day. Each case contains 12 jars and is imprinted with a code indicating the date and batch number. To help maintain consistency, at the end of each day, Sammy selects three bottles of salsa, weighs the contents, and tastes the product. Help Sammy select the sample jars. Today’s cases are coded 07N61 through 07N80. a) Carefully explain your sampling strategy. b) Show how to use random numbers to pick the three jars for testing. c) Did you use a simple random sample? Explain. 48. Fish quality. Concerned about reports of discolored scales on fish caught downstream from a newly sited chemical plant, scientists set up a field station in a shoreline public park. For one week they asked fishermen there to bring any fish they caught to the field station for a brief inspection. At the end of the week, the scientists said that 18% of the 234 fish that were submitted for inspection displayed the discoloration. From this information, can the researchers estimate what proportion of fish in the river have discolored scales? Explain. 49. Sampling methods. Consider each of these situations. Do you think the proposed sampling method is appropriate? Explain. a) We want to know what percentage of local doctors accept Medicaid patients. We call the offices of 50 doctors randomly selected from local Yellow Pages listings. b) We want to know what percentage of local businesses anticipate hiring additional employees in the upcoming month. We randomly select a page in the Yellow Pages and call every business listed there. 50. More sampling methods. Consider each of these situations. Do you think the proposed sampling method is appropriate? Explain. a) We want to know if business leaders in the community support the development of an “incubator” site at a vacant lot on the edge of town. We spend a day phoning local businesses in the phone book to ask whether they’d sign a petition. 
b) We want to know if travelers at the local airport are satisfied with the food available there. We go to the airport on a busy day and interview every tenth person in line in the food court.
Just Checking Answers
1 a) It can be hard to reach all members of a population, and it can take so long that circumstances change, affecting the responses. A well-designed sample is often a better choice.
b) This sample is probably biased: people who didn’t like the food at the restaurant might not choose to eat there.
c) No, only the sample size matters, not the fraction of the overall population.
d) Students who frequent this website might be more enthusiastic about Statistics than the overall population of Statistics students. A large sample cannot compensate for bias.
e) It’s the population “parameter.” “Statistics” describe samples.
2 a) systematic
b) stratified
c) simple
d) cluster
9
Sampling Distributions and Confidence Intervals for Proportions
Marketing Credit Cards: The MBNA Story
When Delaware substantially raised its interest rate ceiling in 1981, banks and other lending institutions rushed to establish corporate headquarters there. One of these was the Maryland Bank National Association, which established a credit card branch in Delaware using the acronym MBNA. Starting in 1982 with 250 employees in a vacant supermarket in Ogletown, Delaware, MBNA grew explosively in the next two decades. One of the reasons for this growth was MBNA’s use of affinity groups: issuing cards endorsed by alumni associations, sports teams, interest groups, and labor unions, among others. MBNA sold the idea to these groups by letting them share a small percentage of the profit. By 2006, MBNA had become Delaware’s largest private employer. At its peak, MBNA had more than 50 million cardholders and had outstanding credit card loans of $82.1 billion, making MBNA the third-largest U.S. credit card bank. “In American corporate history, I doubt there are many companies that burned as brightly, for such a short period of time, as MBNA,” said Rep. Mike Castle, R-Del.¹ MBNA was bought by Bank of America in 2005 for $35 billion. Bank of America kept the brand briefly before issuing all cards under its own name in 2007.

¹ Delaware News Online, January 1, 2006.
Unlike the early days of the credit card industry when MBNA established itself, the environment today is intensely competitive, with companies constantly looking for ways to attract new customers and to maximize the profitability of the customers they already have. Many of the large companies have millions of customers, so instead of trying out a new idea with all their customers, they almost always conduct a pilot study or trial first, conducting a survey or an experiment on a sample of their customers.

Credit card companies make money on their cards in three ways: they earn a percentage of every transaction, they charge interest on balances that are not paid in full, and they collect fees (yearly fees, late fees, etc.). To generate all three types of revenue, the marketing departments of credit card banks constantly seek ways to encourage customers to increase the use of their cards.

A marketing specialist at one company has an idea of offering double air miles to their customers with an airline-affiliated card if they increase their spending by at least $800 in the month following the offer. Of course, offering double miles is not free. The company has to pay the airline for the added miles they give away. Her finance department tells her that if 20% of all customers increase spending by $800 then, based on past behavior, the double miles offer will be profitable. Unfortunately, she can’t know what all customers will do until it’s too late. So, she decides to send the offer to a random sample of 1000 customers. In that sample, she finds that 211 (21.1%) of the cardholders increase their spending by more than the required $800.

Is that good enough? Could another sample of 1000 different people show 19.5%? If results vary from sample to sample how can we make good decisions? Variation like this is sometimes called sampling error even though no error has been committed. A better name for this variation that you’d expect to see from sample to sample might be sampling variability. Even though we can’t control this variability we can predict exactly how much different proportions will vary from sample to sample. This will enable us to make sound business decisions based on a single sample.

WHO Cardholders of a bank’s credit card
WHAT Proportion of cardholders who increase their spending by at least $800 in the subsequent month
WHEN Now
WHERE United States
WHY To predict costs and benefits of a program offer

9.1 The Distribution of Sample Proportions

Notation Alert We use p for the proportion in the population and $\hat{p}$ for the observed proportion in a sample. We’ll also use q for the proportion of failures ($q = 1 - p$), and $\hat{q}$ for its observed value, just to simplify some formulas.

Our marketing manager can’t know p, the actual proportion of all cardholders who will increase their spending by more than $800. Her experiment with 1000 cardholders provides only one sample proportion, $\hat{p} = \frac{211}{1000} = 0.211$. So how can this single experiment provide useful information? If we could see how proportions vary across all possible samples that she could have taken, that might help us make decisions about what a single experiment can tell us. One way to do that is to simulate lots of samples of the same size using the same population proportion. Here’s a histogram of 10,000 sample proportions, each for a random sample of size 1000, using p = 0.2 as the true proportion:

Figure 9.1 A histogram of 10,000 samples of size 1000 with a true proportion of 0.20. Most of the samples have proportions between 0.175 and 0.225 and nearly all have proportions between 0.16 and 0.24. (Horizontal axis: sample proportions, marked from 0.16 to 0.24; vertical axis: number of samples, marked 500, 1000, 1500.)
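The simulation behind Figure 9.1 can be sketched in a few lines. This is a minimal illustration rather than the authors' program: the function name `simulate_sample_proportions`, the fixed seed, and the reduced count of 2,000 samples (instead of the 10,000 in the figure, to keep a pure-Python run quick) are all assumptions made here.

```python
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Each cardholder is a "success" with probability p, independently.
        successes = sum(rng.random() < p for _ in range(n))
        proportions.append(successes / n)
    return proportions


# Assumption: 2,000 samples instead of the figure's 10,000, for speed.
props = simulate_sample_proportions(p=0.2, n=1000, num_samples=2000)
print("mean of sample proportions:", round(statistics.mean(props), 4))
print("SD of sample proportions:  ", round(statistics.stdev(props), 4))
print("smallest and largest:      ", min(props), max(props))
```

The printed mean should land close to 0.20, and the spread should be on the order of one or two percentage points, which is the pattern the histogram shows.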
Imagine We see only the sample we actually drew. If we imagine the results of all the other possible samples we could have drawn (by modeling or simulating them), we can learn more.
The Sampling Distribution for a Proportion We have now answered the question raised at the start of the chapter. To discover how variable a sample proportion is, we need to know the proportion and the size of the sample. That’s all.
Effect of Sample Size Because n is in the denominator of $SD(\hat{p})$, the larger the sample, the smaller the standard deviation. We need a small standard deviation to make sound business decisions, but larger samples cost more. That tension is a fundamental issue in Statistics.
What can we see about this distribution? First, not every sample has a sample proportion equal to 0.2. We know that sample proportions vary, and this distribution shows that variation. For example, we can see that sample proportions bigger than 0.24 and smaller than 0.16 are rare, and that most of the sample proportions are between 0.18 and 0.22. From the 10,000 samples, we can also compute the standard deviation of these sample proportions to see how much they vary. In this simulation the standard deviation is 0.0126, or 1.26%. The 68–95–99.7 Rule tells us to expect 95% of the sample proportions to be within 2 × 0.0126 of the mean (which is 0.20). That is, we expect 95% of the sample proportions to be in the interval (0.175, 0.225) and 99.7% within 3 × 0.0126 of 0.20 (0.162, 0.238). This matches the histogram pretty well.

The histogram in Figure 9.1 shows a simulation of the sampling distribution of $\hat{p}$. The theoretical sampling distribution is the distribution of all the sample proportions that would arise from all possible samples of the same size with a constant probability of a “success.”

We actually didn’t need to simulate this sampling distribution. We know, from Chapter 7, that the number of successes can be modeled by a Binomial, which, in turn, can be modeled by a Normal distribution as long as np and nq are large enough. The sample proportion is just the number of successes, X, divided by n, so if the distribution of X is Normal, the distribution of $\hat{p}$ should have the same shape. And we know that it should be centered at the true proportion p and have standard deviation $\sqrt{\frac{pq}{n}}$. Let’s see how well that matches our simulation:

$\sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.2)(0.8)}{1000}} = 0.0126$.

Pretty good! With a sample size this large, the Normal approximation really works well. So, we can say that the sampling distribution of a sample proportion from a sample of size n with true proportion p is Normal with mean p and standard deviation $\sqrt{\frac{pq}{n}}$. Here’s a picture of that sampling distribution model:
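Figure 9.2 A Normal model centered at p with a standard deviation of $\sqrt{\frac{pq}{n}}$ is a good model for a collection of proportions found for many random samples of size n from a population with success probability p. (Horizontal axis marks: $p - 3\sqrt{\frac{pq}{n}}$, $p - 2\sqrt{\frac{pq}{n}}$, $p - 1\sqrt{\frac{pq}{n}}$, $p$, $p + 1\sqrt{\frac{pq}{n}}$, $p + 2\sqrt{\frac{pq}{n}}$, $p + 3\sqrt{\frac{pq}{n}}$.)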
The Sampling Distribution Model for a Proportion Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with mean $\mu(\hat{p}) = p$ and standard deviation $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$.
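The standard deviation in the model is a one-line calculation, and putting it in a function makes the effect of sample size easy to see. The sketch below uses the chapter's values p = 0.2 and n = 1000; the function name `sd_of_sample_proportion` and the comparison sample size of 4000 are illustrative choices, not from the text.

```python
import math


def sd_of_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of p-hat: sqrt(p * q / n)."""
    q = 1 - p
    return math.sqrt(p * q / n)


sd_1000 = sd_of_sample_proportion(0.2, 1000)
sd_4000 = sd_of_sample_proportion(0.2, 4000)  # assumed comparison: four times the sample

print(round(sd_1000, 4))  # the chapter reports 0.0126 for p = 0.2, n = 1000
print(round(sd_4000, 4))  # half of the value above

# Interval expected to hold about 95% of sample proportions (68-95-99.7 Rule).
print(round(0.2 - 2 * sd_1000, 3), round(0.2 + 2 * sd_1000, 3))
```

Because n sits under a square root, quadrupling the sample size only halves the standard deviation, which is one way to see why extra precision gets expensive.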
For Example
The distribution of a sample proportion
A supermarket has installed “self-checkout” stations that allow customers to scan and bag their own groceries. These are popular, but because customers occasionally encounter a problem, a staff member must be available to help out. The manager wants
to estimate what proportion of customers need help so that he can optimize the number of self-check stations per staff member. He collects data from the stations for 30 days, recording the proportion of customers on each day that need help and makes a histogram of the observed proportions.
Questions
1. If the proportion needing help is independent from day to day, what shape would you expect his histogram to follow? 2. Is the assumption of independence reasonable?
Answers
1. Normal, centered at the true proportion. 2. Possibly not. For example, shoppers on weekends might be less experienced than regular weekday shoppers and would then need more help.
Just Checking
1 You want to poll a random sample of 100 shopping mall customers about whether they like the proposed location for the new coffee shop on the third floor, with a panoramic view of the food court. Of course, you’ll get just one number, your sample proportion, $\hat{p}$. But if you imagined all the possible samples of 100 customers you could draw and imagined the histogram of all the sample proportions from these samples, what shape would it have?
2 Where would the center of that histogram be?
3 If you think that about half the customers are in favor of the plan, what would the standard deviation of the sample proportions be?
How Good Is the Normal Model?
Figure 9.3 Proportions from samples of size 2 can take on only three possible values. A Normal model does not work well here. (Histogram: horizontal axis marked 0.0, 0.5, 1.0; vertical axis marked 500, 1000, 1500.)
We’ve seen that the sampling distribution of proportions follows the 68–95–99.7 Rule well. But do all sample proportions really work like this? Stop and think for a minute about what we’re claiming. We’ve said that if we draw repeated random samples of the same size, n, from some population and measure the proportion, $\hat{p}$, we get for each sample, then the collection of these proportions will pile up around the underlying population proportion, p, in such a way that a histogram of the sample proportions can be modeled well by a Normal model.

There must be a catch. Suppose the samples were of size 2, for example. Then the only possible numbers of successes could be 0, 1, or 2, and the proportion values would be 0, 0.5, and 1. There’s no way the histogram could ever look like a Normal model with only three possible values for the variable (Figure 9.3).

Well, there is a slight catch. The claim is only approximately true. But, the model becomes a better and better representation of the distribution of the sample proportions as the sample size gets bigger.² That’s one reason we require np and nq to be at least 10. But the distributions of proportions from samples of the size you’re likely to see in business do have histograms that are remarkably close to a Normal model.
For Example
Sampling distribution for proportions
Time Warner provides cable, phone, and Internet services to customers, some of whom subscribe to “packages” including several services. Nationwide, suppose that 30% of their customers are “package subscribers” and subscribe to all three types of service. A local representative in Phoenix, Arizona, wonders if the proportion in his region is the same as the national proportion.
² Formally, we say the claim is true in the limit as the sample size (n) grows.
Questions If the same proportion holds in his region and he takes a survey of 100 customers at random from his subscriber list: 1. What proportion of customers would you expect to be package subscribers? 2. What is the standard deviation of the sample proportion? 3. What shape would you expect the sampling distribution of the proportion to have? 4. Would you be surprised to find out that in a sample of 100, 49 of the customers are package subscribers? Explain. What might account for this high percentage?
Answers
1. Because 30% of customers nationwide are package subscribers, we would expect the same for the sample proportion.
2. The standard deviation is $SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.3)(0.7)}{100}} = 0.046$.
3. Normal.
4. 49 customers results in a sample proportion of 0.49. The mean is 0.30 with a standard deviation of 0.046. This sample proportion is more than 4 standard deviations higher than the mean: $\frac{0.49 - 0.30}{0.046} = 4.13$. It would be very unusual to find such a large proportion in a random sample. Either it is a very unusual sample, or the proportion in his region is not the same as the national average.
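Answers 2 and 4 can be checked with a short calculation. The sketch below uses Python's standard-library `NormalDist` to attach a tail probability to the z-score, which the answer itself does not report; the variable names are illustrative. Note that carrying the unrounded standard deviation gives a z-score of roughly 4.15, slightly different from the 4.13 obtained with the rounded value 0.046.

```python
import math
from statistics import NormalDist

p, n = 0.30, 100      # national proportion and sample size from the example
observed = 49 / 100   # 49 package subscribers among 100 sampled customers

sd = math.sqrt(p * (1 - p) / n)       # SD of the sample proportion
z = (observed - p) / sd               # distance from the mean in SD units
upper_tail = 1 - NormalDist().cdf(z)  # Normal-model chance of a proportion this high or higher

print("SD:", round(sd, 3))
print("z-score:", round(z, 2))
print("upper-tail probability:", upper_tail)
```

The tail probability is tiny, which is the numerical version of "it would be very unusual to find such a large proportion in a random sample."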
Assumptions and Conditions
Most models are useful only when specific assumptions are true. In the case of the model for the distribution of sample proportions, there are two assumptions:

Independence Assumption: The sampled values must be independent of each other.
Sample Size Assumption: The sample size, n, must be large enough.

Of course, the best we can do with assumptions is to think about whether they are likely to be true, and we should do so. However, we often can check corresponding conditions that provide information about the assumptions as well. Think about the Independence Assumption and check the following corresponding conditions before using the Normal model to model the distribution of sample proportions:

Randomization Condition: If your data come from an experiment, subjects should have been randomly assigned to treatments. If you have a survey, your sample should be a simple random sample of the population. If some other sampling design was used, be sure the sampling method was not biased and that the data are representative of the population.
10% Condition: If sampling has not been made with replacement (that is, returning each sampled individual to the population before drawing the next individual), then the sample size, n, should be no larger than 10% of the population. If it is, you must adjust the size of the confidence interval with methods more advanced than those found in this book.
Success/Failure Condition: The Success/Failure condition says that the sample size must be big enough so that both the number of “successes,” np, and the number of “failures,” nq, are expected to be at least 10.³ Expressed without the symbols, this condition just says that we need to expect at least 10 successes and at least 10 failures to have enough data for sound conclusions. For the bank’s credit card promotion example, we labeled as a “success” a cardholder who increases monthly spending by at least $800 during the trial. The bank observed 211 successes and 789 failures. Both are at least 10, so there are certainly enough successes and enough failures for the condition to be satisfied.⁴

These two conditions seem to contradict each other. The Success/Failure condition wants a big sample size. How big depends on p. If p is near 0.5, we need a sample of only 20 or so. If p is only 0.01, however, we’d need 1000. But the 10% condition says that the sample size can’t be too large a fraction of the population. Fortunately, the tension between them isn’t usually a problem in practice. Often, as in polls that sample from all U.S. adults, or industrial samples from a day’s production, the populations are much larger than 10 times the sample size.

³ We saw where the 10 came from in the Math Box on page 254.
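The two numeric conditions lend themselves to a quick check before the Normal model is used. The sketch below is a minimal helper under stated assumptions: the name `check_conditions` and the dictionary it returns are invented for illustration, and the population size of 1,000,000 in the first call is a hypothetical stand-in for "millions of customers." The Randomization Condition is a judgment about how the data were collected and cannot be checked by arithmetic.

```python
def check_conditions(n, p, population_size=None):
    """Report expected counts and whether the numeric conditions hold.

    population_size is optional; when omitted, the 10% Condition is
    returned as None (not checked).
    """
    expected_successes = n * p
    expected_failures = n * (1 - p)
    success_failure_ok = expected_successes >= 10 and expected_failures >= 10

    ten_percent_ok = None
    if population_size is not None:
        ten_percent_ok = n <= 0.10 * population_size

    return {
        "expected_successes": expected_successes,
        "expected_failures": expected_failures,
        "success_failure_ok": success_failure_ok,
        "ten_percent_ok": ten_percent_ok,
    }


# Credit card trial: n = 1000, p = 0.2; the population size is a hypothetical figure.
card_trial = check_conditions(1000, 0.2, population_size=1_000_000)
# A small sample with p = 0.3 expects only about 6 successes.
small_sample = check_conditions(20, 0.3)

print(card_trial)
print(small_sample)
```

A helper like this only reports on the arithmetic; whether the sampled values are plausibly independent still has to be argued from the context.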
For Example
Assumptions and conditions for sample proportions
The analyst conducting the Time Warner survey says that, unfortunately, only 20 of the customers he tried to contact actually responded, but that of those 20, 8 are package subscribers.
Questions
1. If the proportion of package subscribers in his region is 0.30, how many package subscribers, on average, would you expect in a sample of 20? 2. Would you expect the shape of the sampling distribution of the proportion to be Normal? Explain.
Answers
1. You would expect 0.30 * 20 = 6 package subscribers. 2. No. Because 6 is less than 10, we should be cautious in using the Normal as a model for the sampling distribution of proportions. (The number of observed successes, 8, is also less than 10.)
Guided Example
Foreclosures An analyst at a home loan lender was looking at a package of 90 mortgages that the company had recently purchased in central California. The analyst was aware that in that region about 13% of the homeowners with current mortgages will default on their loans in the next year and the houses will go into foreclosure. In deciding to buy the collection of mortgages, the finance department assumed that no more than 15 of the mortgages would go into default. Any amount above that will result in losses for the company. In the package of 90 mortgages, what’s the probability that there will be more than 15 foreclosures?
Plan
Setup State the objective of the study.
We want to find the probability that in a group of 90 mortgages, more than 15 will default. Since 15 out of 90 is 16.7%, we need the probability of finding more than 16.7% defaults out of a sample of 90, if the proportion of defaults is 13%.
⁴ The Success/Failure condition is about the number of successes and failures we expect, but if the number of successes and failures that occurred is ≥ 10, then you can use that.
Model
Check the conditions.
✓ Independence Assumption. If the mortgages come from a wide geographical area, one homeowner defaulting should not affect the probability that another does. However, if the mortgages come from the same neighborhood(s), the independence assumption may fail and our estimates of the default probabilities may be wrong.
✓ Randomization Condition. For the question asked, these 90 mortgages in the package can be considered as a random sample of mortgages in the region. If there are too many failures, we may doubt that they are a representative sample.
✓ 10% Condition. The 90 mortgages are less than 10% of the population.
✓ Success/Failure Condition.
$np = 90(0.13) = 11.7 \ge 10$
$nq = 90(0.87) = 78.3 \ge 10$

State the parameters and the sampling distribution model.
The population proportion is p = 0.13. The conditions are satisfied, so we’ll model the sampling distribution of $\hat{p}$ with a Normal model, with mean 0.13 and standard deviation

$SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.13)(0.87)}{90}} \approx 0.035$.

Our model for $\hat{p}$ is N(0.13, 0.035). We want to find $P(\hat{p} > 0.167)$.

Plot
Make a picture. Sketch the model and shade the area we’re interested in, in this case the area to the right of 16.7%.
[Figure: Normal model N(0.13, 0.035) with axis marks 0.025 (−3s), 0.060 (−2s), 0.095 (−1s), 0.130 (p), 0.165 (1s), 0.200 (2s), 0.235 (3s). The area to the right of 0.167 is shaded and labeled 0.145.]

Do
Mechanics
Use the standard deviation as a ruler to find the z-score of the cutoff proportion. Find the resulting probability from a table, a computer program, or a calculator.

$z = \frac{\hat{p} - p}{SD(\hat{p})} = \frac{0.167 - 0.13}{0.035} = 1.06$

$P(\hat{p} > 0.167) = P(z > 1.06) = 0.1446$

Report
Conclusion
Interpret the probability in the context of the question.
Memo Re: Mortgage defaults Assuming that the 90 mortgages we recently purchased are a random sample of mortgages in this region, there is about a 14.5% chance that we will exceed the 15 foreclosures that Finance has determined as the break-even point.
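The Mechanics step can be reproduced with a computer program instead of a table. The sketch below does the calculation three ways: with the example's rounded inputs (0.167 and 0.035), with unrounded inputs, and, as a cross-check that is not part of the Guided Example, with the exact Binomial probability of 16 or more defaults out of 90. Variable names are illustrative. The unrounded Normal calculation comes out near 15% rather than 14.5%; the difference is due to rounding of the inputs, not to a different method.

```python
import math
from statistics import NormalDist

p, n, cutoff_count = 0.13, 90, 15  # values from the Guided Example

sd = math.sqrt(p * (1 - p) / n)    # unrounded SD of the sample proportion
cutoff = cutoff_count / n          # 15 of 90, about 0.167

# Normal model with the example's rounded inputs.
tail_rounded = 1 - NormalDist(mu=0.13, sigma=0.035).cdf(0.167)

# Same Normal model without intermediate rounding.
tail_unrounded = 1 - NormalDist(mu=p, sigma=sd).cdf(cutoff)

# Cross-check (an addition, not from the example): exact Binomial P(X >= 16).
tail_binomial = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k)
    for k in range(cutoff_count + 1, n + 1)
)

print("Normal, rounded inputs:  ", round(tail_rounded, 4))
print("Normal, unrounded inputs:", round(tail_unrounded, 4))
print("Exact Binomial:          ", round(tail_binomial, 4))
```

With np only 11.7, the sample is close to the edge of the Success/Failure Condition, so some gap between the Normal approximation and the exact Binomial figure is to be expected.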
9.2 A Confidence Interval for a Proportion

To plan their inventory and production needs, businesses use a variety of forecasts about the economy. One important attribute is consumer confidence in the overall economy. Tracking changes in consumer confidence over time can help businesses gauge whether the demand for their products is on an upswing or about to experience a downturn. The Gallup Poll periodically asks a random sample of U.S. adults whether they think economic conditions are getting better, getting worse, or staying about the same. When Gallup polled 3559 respondents in April 2013 (during the week ending April 21), only 1495 thought economic conditions in the United States were getting better, a sample proportion of $\hat{p} = 1495/3559 = 42\%$. We (and Gallup) hope that this observed proportion is close to the population proportion, p, but we know that a second sample of 3559 adults wouldn’t have a sample proportion of exactly 42.0%. In fact, Gallup did sample another group of adults just a few days later and found a slightly different sample proportion.

Notation Alert Remember that $\hat{p}$ is our sample estimate of the true proportion p. Recall also that q is just shorthand for $1 - p$, and $\hat{q} = 1 - \hat{p}$.

What can we say about consumer confidence in the entire population when the proportion that we measure keeps bouncing around from sample to sample? That’s where the sampling distribution model can help. By knowing how much they vary and the shape of their distribution, we’ll get a clearer idea of where the true proportion might be and how much we know about it.

So, what do we know about our sampling distribution model? We know that it’s centered at the true proportion, p, of all U.S. adults who think the economy is improving. But we don’t know p. It probably isn’t 42.0%. That’s the $\hat{p}$ from our sample. What we do know is that the sampling distribution model of $\hat{p}$ is centered at p, and we know that the standard deviation of the sampling distribution is $\sqrt{\frac{pq}{n}}$. We also know that the shape of the sampling distribution is approximately Normal, when the sample is large enough.

This is all fine in the model world, but we need to solve problems in the real world. In the real world, we don’t know p. (If we did, we wouldn’t have bothered to take a sample.) And so, we don’t know $\sqrt{pq/n}$ either. But, we’ll do the best we can and estimate it by using $\sqrt{\hat{p}\hat{q}/n}$. That may not seem like a big deal, but it gets a special name. Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error (SE). Using $\hat{p}$, we find the standard error:

$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(0.42)(1 - 0.42)}{3559}} = 0.008$
Now, we use that to draw our best guess of the sampling distribution for the true proportion who think the economy is getting better as shown in Figure 9.4.
Figure 9.4 The sampling distribution of sample proportions from samples of size 3559 is centered at the true proportion, p, with a standard deviation of 0.008. (Horizontal axis marks: p − 0.024, p − 0.016, p − 0.008, p, p + 0.008, p + 0.016, p + 0.024.)
Because the sampling distribution is Normal, we expect that about 68% of all samples of 3559 U.S. adults taken in April 2013 would have sample proportions within 1 standard deviation of p. And about 95% of all these samples will have proportions within p ± 2 SEs. But where is our sample proportion in this picture? And what value does p really have? We still don’t know! We do know that for 95% of random samples, $\hat{p}$ will be no more than 2 SEs away from p.

So here’s the key to using sampling distributions. Let’s reverse it and look at it from $\hat{p}$’s point of view. If I’m $\hat{p}$, there’s a 95% chance that p is no more than 2 SEs away from me. If I reach out 2 SEs, or 2 × 0.008, away from me on both sides, I’m 95% sure that p will be within my grasp.
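The arithmetic behind "reaching out 2 SEs" can be sketched in a few lines of Python. This is only an illustration using the Gallup counts quoted above; the function name `standard_error` is our own choice, not something defined in the text.

```python
from math import sqrt


def standard_error(p_hat: float, n: int) -> float:
    """Estimated standard deviation of the sampling distribution of p-hat."""
    return sqrt(p_hat * (1 - p_hat) / n)


# Gallup poll from the text: 1495 of 3559 respondents said "getting better".
n = 3559
p_hat = 1495 / n
se = standard_error(p_hat, n)

# Reach out 2 SEs on either side of p-hat (the 68-95-99.7 Rule).
lower = p_hat - 2 * se
upper = p_hat + 2 * se

print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}")
print(f"approximate 95% interval: ({lower:.4f}, {upper:.4f})")
```

Because the hand calculation rounds the SE to 0.008 before doubling it, the unrounded endpoints printed here can differ from the hand-computed ones in the last reported digit.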
Figure 9.5 Reaching out 2 SEs on either side of $\hat{p}$ makes us 95% confident we’ll trap the true proportion, p. (Cartoon: “ACME p-trap: Guaranteed* to capture p. *with 95% confidence,” spanning $\hat{p} - 2\,SE$ to $\hat{p} + 2\,SE$.)
What Can We Say about a Proportion?

So what can we really say about p? Of course, I’m not sure that my interval catches p. And I don’t know its true value, but I can state a probability that I’ve covered the true value in an interval. Here’s a list of things we’d like to be able to say and the reasons we can’t say most of them:

1. “42.0% of all U.S. adults thought the economy was improving.” It would be nice to be able to make absolute statements about population values with certainty, but we just don’t have enough information to do that. There’s no way to be sure that the population proportion is the same as the sample proportion; in fact, it almost certainly isn’t. Observations vary. Another sample would yield a different sample proportion.
2. “It is probably true that 42.0% of all U.S. adults thought the economy was improving.” No. In fact, we can be pretty sure that whatever the true proportion is, it’s not exactly 42.0%, so the statement is not true.
3. “We don’t know exactly what proportion of U.S. adults thought the economy was improving, but we know that it’s within the interval 42.0% ± 2 × 0.8%. That is, it’s between 40.4% and 43.6%.” This is getting closer, but we still can’t be certain. We can’t know for sure that the true proportion is in this interval, or in any particular range.
4. “We don’t know exactly what proportion of U.S. adults thought the economy was improving, but the interval from 40.4% to 43.6% probably contains the true proportion.” Close! Now, we’ve fudged twice: first by giving an interval and second by admitting that we only think the interval “probably” contains the true value.
“Far better an approximate answer to the right question, … than an exact answer to the wrong question.” —John W. Tukey
That last statement is true, but it’s a bit wishy-washy. We can tighten it up by quantifying what we mean by “probably.” We saw that 95% of the time when we reach out 2 SEs from $\hat{p}$, we capture p, so we can be 95% confident that this is one of those times. After putting a number on the probability that this interval covers the true proportion, we’ve given our best guess of where the parameter is and how certain we are that it’s within some range.

5. “We are 95% confident that between 40.4% and 43.6% of U.S. adults thought the economy was improving.” Statements like this are called confidence intervals. They don’t tell us everything we might want to know, but they’re the best we can do.

Each confidence interval discussed in this book has a name. You’ll see many different kinds of confidence intervals in the following chapters. Some will be about more than one sample, some will be about statistics other than proportions, and some
will use models other than the Normal. The interval calculated and interpreted here is an example of a one-proportion z-interval.⁵ We’ll lay out the formal definition in the next few pages.
For Example
Finding a 95% confidence interval for a proportion
The Chamber of Commerce of a mid-sized city has supported a proposal to change the zoning laws for a new part of town. The new regulations would allow for mixed commercial and residential development. The vote on the measure is scheduled for three weeks from today, and the president of the Chamber of Commerce is concerned that they may not have the majority of votes that they will need to pass the measure. She commissions a survey that asks likely voters if they plan to vote for the measure. Of the 516 people selected at random from likely voters, 289 said they would likely vote for the measure.
Questions
a. Find a 95% confidence interval for the true proportion of voters who will vote for the measure. (Use the 68–95–99.7% Rule.) b. What would you report to the president of the Chamber of Commerce?
Answers

a. $\hat{p} = \frac{289}{516} = 0.56$. So, $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(0.56)(0.44)}{516}} = 0.022$. A 95% confidence interval for p can be found from $\hat{p} \pm 2\,SE(\hat{p}) = 0.56 \pm 2(0.022) = (0.516, 0.604)$, or 51.6% to 60.4%.

b. We are 95% confident that the true proportion of voters who plan to vote for the measure is between 51.6% and 60.4%. This assumes that the sample we have is representative of all likely voters.
What Does “95% Confidence” Really Mean?

What do we mean when we say we have 95% confidence that our interval contains the true proportion? Formally, what we mean is that “95% of samples of this size will produce confidence intervals that capture the true proportion.” This is correct but a little long-winded, so we sometimes say “we are 95% confident that the true proportion lies in our interval.” Our uncertainty is about whether the particular sample we have at hand is one of the successful ones or one of the 5% that fail to produce an interval that captures the true value.

In this chapter, we have seen how proportions vary from sample to sample. If other pollsters had selected their own samples of adults, they would have found some who thought the economy was getting better, but each sample proportion would almost certainly differ from ours. When they each tried to estimate the true proportion, they’d center their confidence intervals at the proportions they observed in their own samples. Each would have ended up with a different interval.

Figure 9.6 shows the confidence intervals produced by simulating 20 samples. The purple dots are the simulated proportions of adults in each sample who thought the economy was improving, and the orange segments show the confidence intervals found for each simulated sample. The green line represents the true percentage of adults who thought the economy was improving. You can see that most of the simulated confidence intervals include the true value, but one missed. (Note that it is the intervals that vary from sample to sample; the green line doesn’t move.)

Of course, a huge number of possible samples could be drawn, each with its own sample proportion. This simulation approximates just some of them. Each sample can be used to make a confidence interval. That’s a large pile of possible confidence intervals, and ours is just one of those in the pile. Did our confidence
⁵ In fact, this confidence interval is so standard for a single proportion that you may see it simply called a “confidence interval for the proportion.”
Figure 9.6 The horizontal green line shows the true proportion of people in April 2013 who thought the economy was improving. Most of the 20 simulated samples shown here produced 95% confidence intervals that captured the true value, but one missed. (Vertical axis: Proportion.)
interval “work”? We can never be sure because we’ll never know the true proportion of all U.S. adults who thought in April 2013 that the economy was improving. However, the Normal model assures us that 95% of the intervals in the pile are winners, covering the true value, and only 5%, on average, miss the target. That’s why we’re 95% confident that our interval is a winner. The statements we made about what all U.S. adults thought about the economy were possible because we used a Normal model for the sampling distribution. But is that model appropriate? As we’ve seen, all statistical models make assumptions. If those assumptions are not true, the model might be inappropriate, and our conclusions based on it may be wrong. Because the confidence interval is built on the Normal model for the sampling distribution, the assumptions and conditions are the same as those we discussed in section 9.1. But, because they are so important, we’ll go over them again. You can never be certain that an assumption is true, but you can decide intelligently whether it is reasonable. When you have data, you can often decide whether an assumption is plausible by checking a related condition in the data. However, you’ll want to make a statement about the world at large, not just about the data. So the assumptions you make are not just about how the data look, but about how representative they are. Here are the assumptions and the corresponding conditions to check before creating (or believing) a confidence interval about a proportion.
Independence Assumption

You first need to think about whether the independence assumption is plausible. You can look for reasons to suspect that it fails. You might wonder whether there is any reason to believe that the data values somehow affect each other. (For example, might any of the adults in the sample be related?) This condition depends on your knowledge of the situation. It’s not one you can check by looking at the data. However, now that you have data, there are two conditions that you can check:

• Randomization Condition: Were the data sampled at random or generated from a properly randomized experiment? Proper randomization can help ensure independence.
• 10% Condition: Samples are almost always drawn without replacement. Usually, you’d like to have as large a sample as you can. But if you sample from a small population, the probability of success may be different for the last few individuals you draw than it was for the first few. For example, if most of the women have already been sampled, the chance of drawing a woman from the remaining population is lower. If the sample exceeds 10% of the population, you will have to adjust the margin of error with methods more advanced than those found in this book. But if less than 10% of the population is sampled, it is safe to proceed without adjustment.
Sample Size Assumption

The model we use for inference is based on the Normal model. So, the sample must be large enough for the Normal sampling model to be appropriate. It turns out that we need more data when the proportion is close to either extreme (0 or 1). This requirement is easy to check with the following condition:

• Success/Failure Condition: We must expect our sample to contain at least 10 “successes” and at least 10 “failures.” Recall that by tradition we arbitrarily label one alternative (usually the outcome being counted) as a “success” even if it’s something bad. The other alternative is then a “failure.” So we check that both $n\hat{p} \ge 10$ and $n\hat{q} \ge 10$.
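The two conditions that involve only counts can be sketched as a small helper. This is an illustrative sketch, not a substitute for judgment: the Randomization Condition and the Independence Assumption depend on how the data were collected and cannot be verified by arithmetic. The function name `check_conditions` and the population figure used below are assumptions made for the example.

```python
from typing import Optional


def check_conditions(successes: int, n: int,
                     population_size: Optional[int] = None) -> dict:
    """Check the count-based conditions for a one-proportion interval.

    Returns True/False per condition; None means the condition
    cannot be checked from the information supplied.
    """
    failures = n - successes
    return {
        # Success/Failure Condition: at least 10 of each outcome.
        "success_failure": successes >= 10 and failures >= 10,
        # 10% Condition: the sample is less than 10% of the population.
        "ten_percent": (None if population_size is None
                        else n < 0.10 * population_size),
    }


# Gallup counts from the text; the population size is a rough,
# assumed figure used only to illustrate the 10% Condition.
result = check_conditions(successes=1495, n=3559,
                          population_size=200_000_000)
print(result)
```

Returning `None` when no population size is supplied is a deliberate choice here: it keeps "not checked" distinct from "failed."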
For Example
Assumptions and conditions for a confidence interval for proportions
We previously reported a confidence interval to the president of the Chamber of Commerce.
Question Were the assumptions and conditions for making this interval satisfied?

Answer Because the sample was randomized, we assume that the responses of the people surveyed were independent, so the randomization condition is met. We assume that 516 people represent fewer than 10% of the likely voters in the town, so the 10% condition is met. Because 289 people said they were likely to vote for the measure and thus 227 said they were not, both counts are much larger than 10, so the Success/Failure condition is also met. All the conditions to make a confidence interval for the proportion appear to have been satisfied.
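Returning to the repeated-sampling reading of “95% confidence” described earlier in this section, a small simulation in the spirit of Figure 9.6 can make the idea concrete. The sketch below assumes a “true” proportion of 0.42 purely for the purposes of the simulation (in practice p is unknown), and the seed, the number of trials, and the function name `interval_covers` are arbitrary choices of ours.

```python
import random
from math import sqrt


def interval_covers(p_true: float, n: int, rng: random.Random,
                    z: float = 2.0) -> bool:
    """Draw one sample of size n and report whether p-hat +/- z*SE traps p_true."""
    successes = sum(rng.random() < p_true for _ in range(n))
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se <= p_true <= p_hat + z * se


rng = random.Random(2013)  # fixed seed so the run is repeatable
p_true = 0.42              # assumed for the simulation only
n = 3559                   # sample size from the Gallup example
trials = 1000

hits = sum(interval_covers(p_true, n, rng) for _ in range(trials))
coverage = hits / trials
print(f"{hits} of {trials} intervals captured p (coverage = {coverage:.3f})")
```

The observed coverage should land near 95%, though it will wobble from run to run with a different seed, just as the intervals themselves do.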
9.3 Margin of Error: Certainty vs. Precision

We’ve just claimed that at a certain confidence level we’ve captured the true proportion of all U.S. adults who thought the economy was improving in April 2013. Our confidence interval stretched out the same distance on either side of the estimated proportion with the form: $\hat{p} \pm 2\,SE(\hat{p})$.
Confidence Intervals

We’ll see many confidence intervals in this book. All have the form: estimate ± ME. For proportions at 95% confidence: $ME \approx 2\,SE(\hat{p})$.
The extent of that interval on either side of $\hat{p}$ is called the margin of error (ME). In general, confidence intervals look like this: estimate ± ME.

The margin of error for our 95% confidence interval was 2 SEs. What if we wanted to be more confident? To be more confident, we’d need to capture p more often, and to do that, we’d need to make the interval wider. For example, if we want to be 99.7% confident, the margin of error will have to be 3 SEs.

The more confident we want to be, the larger the margin of error must be. We can be 100% confident that any proportion is between 0% and 100%, but that’s not very useful. Or we could give a narrow confidence interval, say, from 41.98% to 42.02%. But we couldn’t be very confident about a statement this precise. Every confidence interval is a balance between certainty and precision.

The tension between certainty and precision is always there. There is no simple answer to the conflict. Fortunately, in most cases we can be both sufficiently
Figure 9.7 Reaching out 3 SEs on either side of $\hat{p}$ makes us 99.7% confident we’ll trap the true proportion p. Compare the width of this interval with the interval in Figure 9.5. (Cartoon: “NEW!! IMPROVED!! ACME p-trap: Guaranteed* to capture p. *Now with 99.7% confidence!”, spanning $\hat{p} - 3\,SE$ to $\hat{p} + 3\,SE$.)
certain and sufficiently precise to make useful statements. The choice of confidence level is somewhat arbitrary, but you must choose the level yourself. The data can’t do it for you. The most commonly chosen confidence levels are 90%, 95%, and 99%, but any percentage can be used. (In practice, though, using something like 92.9% or 97.2% might be viewed with suspicion.)
Garfield © 1999 Jim Davis/Distributed by Universal Uclick. Reprinted with permission. All rights reserved.
Critical Values

Notation Alert

We put an asterisk on a letter to indicate a critical value. We usually use “z” when we talk about Normal models, so z* is always a critical value from a Normal model. Some common confidence levels and their associated critical values:

CI     z*
90%    1.645
95%    1.960
99%    2.576
In our opening example, our margin of error was 2 SEs, which produced a 95% confidence interval. To change the confidence level, we’ll need to change the number of SEs to correspond to the new level. A wider confidence interval means more confidence. For any confidence level, the number of SEs we must stretch out on either side of $\hat{p}$ is called the critical value. Because it is based on the Normal model, we denote it z*. For any confidence level, we can find the corresponding critical value from a computer, a calculator, or a Normal probability table, such as Table Z in the back of the book.

For a 95% confidence interval, the precise critical value is z* = 1.96. That is, 95% of a Normal model is found within ±1.96 standard deviations of the mean. We’ve been using z* = 2 from the 68–95–99.7 Rule because 2 is very close to 1.96 and is easier to remember. Usually, the difference is negligible, but if you want to be precise, use 1.96.⁶

Suppose we could be satisfied with 90% confidence. What critical value would we need? We can use a smaller margin of error. Our greater precision is offset by our acceptance of being wrong more often (that is, having a confidence interval that misses the true value). Specifically, for a 90% confidence interval, the critical value is only 1.645 because for a Normal model, 90% of the values are within 1.645 standard deviations from the mean (Figure 9.8). By contrast, suppose your
⁶ It’s been suggested that since 1.96 is both an unusual value and so important in Statistics, you can recognize someone who’s had a Statistics course by just saying “1.96” and seeing whether they react.
boss demands more confidence. If she wants an interval in which she can have 99% confidence, she’ll need to include values within 2.576 standard deviations, creating a wider confidence interval.

Figure 9.8 For a 90% confidence interval, the critical value is 1.645 because for a Normal model, 90% of the values fall within 1.645 standard deviations of the mean. (Standard Normal curve over z from −3 to 3, with the central area of 0.9 shaded between −1.645 and 1.645.)
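Looking up a critical value “from a computer” can be sketched with Python’s standard library. This is one possible approach, assuming Python 3.8 or later for `statistics.NormalDist`; the helper name `critical_value` is our own.

```python
from statistics import NormalDist


def critical_value(confidence: float) -> float:
    """Return z* such that the central `confidence` area of a standard
    Normal model lies within +/- z* standard deviations of the mean."""
    tail = (1 - confidence) / 2          # area left in each tail
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {critical_value(level):.3f}")
```

The same helper answers questions about less common levels, such as 98%, without a table.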
One-Proportion z-Interval

When the conditions are met, we are ready to find the confidence interval for the population proportion, p. The confidence interval is $\hat{p} \pm z^* \times SE(\hat{p})$, where the standard deviation of the proportion is estimated by $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$.
For Example
Finding confidence intervals for proportions with different levels of confidence
The president of the Chamber of Commerce is worried that 95% confidence is too low and wants a 99% confidence interval.
Question Find a 99% confidence interval. Would you reassure her that the measure will pass? Explain.
Answer In the earlier example, we used 2 as the value of z* for 95% confidence. A more precise value would be 1.96 for 95% confidence. For 99% confidence, the critical z-value is 2.576. So, a 99% confidence interval for the true proportion is

$$\hat{p} \pm 2.576\,SE(\hat{p}) = 0.56 \pm 2.576(0.022) = (0.503, 0.617)$$

The confidence interval is now wider: 50.3% to 61.7%.
The Chamber of Commerce needs at least 50% for the vote to pass. At a 99% confidence level, it looks as if the measure will pass. However, we must assume that the sample is representative of the voters in the actual election and that people vote in the election as they said they will when they took the survey.
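The effect of the confidence level on the Chamber of Commerce interval can be sketched as follows. This is a hedged illustration: the function name `one_proportion_z_interval` is ours, and the code carries full precision rather than rounding $\hat{p}$ and the SE first, so the endpoints may differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_interval(successes: int, n: int, confidence: float):
    """Return (lower, upper) for p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * se
    return p_hat - margin, p_hat + margin


# Survey from the example: 289 of 516 likely voters favor the measure.
low95, high95 = one_proportion_z_interval(289, 516, 0.95)
low99, high99 = one_proportion_z_interval(289, 516, 0.99)

print(f"95% interval: ({low95:.3f}, {high95:.3f})")
print(f"99% interval: ({low99:.3f}, {high99:.3f})")
print("99% interval stays above 0.50:", low99 > 0.50)
```

Asking for more confidence widens the interval, yet in this example the lower end still sits just above 50%.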
Guided Example
Public Opinion

In March of 2013, workers in the greeting card company Edit66, based in the southern French town of Cabestany, took their bosses hostage. Company chiefs Paul Denis and Merthus Bezemer had informed employees who were to be laid off that they would not receive the severance pay to which they were legally entitled. The workers refused to allow their bosses to leave the premises. The town’s mayor, Jean Vila, supported the action. (www.english.rfi.fr/economy/20130329-greeetings-card-workers-kidnap-bosses-over-unpaid-layoff-pay)

There had been a number of similar “bossnappings” in France in 2009. Incidents occurred at SONY, 3M, and Caterpillar plants in France. A poll taken by Le Parisien in April 2009 found 45% of the French “supportive” of such action. A similar poll taken by Paris Match, April 2–3, 2009, found 30% “approving” and 63% “understanding” or “sympathetic” of the action. Only 7% condemned the practice of “bossnapping.”

The Paris Match poll was based on a random representative sample of 1010 adults. What can we conclude about the proportion of all French adults who sympathize with (without supporting outright) the practice of bossnapping?

To answer this question, we’ll build a confidence interval for the proportion of all French adults who sympathize with the practice of bossnapping. As with other procedures, there are three steps to building and summarizing a confidence interval for proportions: Plan, Do, and Report.

WHO    Adults in France
WHAT   Proportion who sympathize with the practice of bossnapping
WHEN   April 2–3, 2009
WHERE  France
HOW    1010 adults were randomly sampled by the French Institute of Public Opinion (l’Ifop) for the magazine Paris Match
WHY    To investigate public opinion of bossnapping

Plan

Setup State the context of the question. Identify the parameter you wish to estimate. Identify the population about which you wish to make statements. Choose and state a confidence level.

We want to find an interval that is likely with 95% confidence to contain the true proportion, p, of French adults who sympathize with the practice of bossnapping. We have a random sample of 1010 French adults, with a sample proportion of 63%.

Model Think about the assumptions and check the conditions to decide whether we can use the Normal model.

✓ Independence Assumption: A French polling agency, l’Ifop, phoned a random sample of French adults. It is unlikely that any respondent influenced another.
✓ Randomization Condition: l’Ifop drew a random sample from all French adults. We don’t have details of their randomization but assume that we can trust it.
✓ 10% Condition: Although sampling was necessarily without replacement, there are many more French adults than were sampled. The sample is certainly less than 10% of the population.
✓ Success/Failure Condition: $n\hat{p} = 1010 \times 0.63 = 636 \ge 10$ and $n\hat{q} = 1010 \times 0.37 = 374 \ge 10$, so the sample is large enough.

State the sampling distribution model for the statistic. Choose your method.

The conditions are satisfied, so I can use a Normal model to find a one-proportion z-interval.

Do

Mechanics Construct the confidence interval. First, find the standard error. (Remember: It’s called the “standard error” because we don’t know p and have to use $\hat{p}$ instead.) Next, find the margin of error. We could informally use 2 for our critical value, but 1.96 is more accurate.⁷

$n = 1010$, $\hat{p} = 0.63$, so

$$SE(\hat{p}) = \sqrt{\frac{0.63 \times 0.37}{1010}} = 0.015$$

Because the sampling model is Normal, for a 95% confidence interval, the critical value z* = 1.96. The margin of error is:

$$ME = z^* \times SE(\hat{p}) = 1.96 \times 0.015 = 0.029$$

Write the confidence interval. Check that the interval is plausible. We may not have a strong expectation for the center, but the width of the interval depends primarily on the sample size, especially when the estimated proportion is near 0.5.

So the 95% confidence interval is: $0.63 \pm 0.029$, or $(0.601, 0.659)$.

The confidence interval covers a range of about plus or minus 3%. That’s about the width we might expect for a sample size of about 1000 (when $\hat{p}$ is reasonably close to 0.5).

Report

Conclusion Interpret the confidence interval in the proper context. We’re 95% confident that our interval captured the true proportion.

Memo
Re: Bossnapping survey
The polling agency l’Ifop surveyed 1010 French adults and asked whether they approved, were sympathetic to, or disapproved of recent bossnapping actions. Although we can’t know the true proportion of French adults who were sympathetic (without supporting outright), based on this survey we can be 95% confident that between 60.1% and 65.9% of all French adults were. Because this is an ongoing concern, we may want to repeat the survey to obtain more current data. We may also want to keep these results in mind for future corporate public relations.
Just Checking

Think some more about the 95% confidence interval we just created in the guided example for the proportion of French adults who were sympathetic to bossnapping.

4. If we wanted to be 98% confident, would our confidence interval need to be wider or narrower?
5. Our margin of error was about ±3%. If we wanted to reduce it to ±2% without increasing the sample size, would our level of confidence be higher or lower?
6. If the organization had polled more people, would the interval’s margin of error have likely been larger or smaller?
9.4 Choosing the Sample Size

Every confidence interval must balance precision (the width of the interval) against confidence. Although it is good to be precise and comforting to be confident, there is a trade-off between the two. A confidence interval that says that the percentage is between 10% and 90% wouldn’t be of much use, although you could be quite confident that it covered the true proportion. An interval from 43% to 44% is reassuringly precise, but not if it carries a confidence level of 35%. It’s a
⁷ If you are following along on your calculator and not rounding off (as we have done for this example), you’ll get SE = 0.0151944 and a ME of 0.0297804.
rare study that reports confidence levels lower than 80%. Levels of 95% or 99% are more common.

The time to decide whether the margin of error is small enough to be useful is when you design your study. Don’t wait until you compute your confidence interval. To get a narrower interval without giving up confidence, you need to have less variability in your sample proportion. How can you do that? Choose a larger sample.

Consider a company planning to offer a new service to their customers. Product managers want to estimate the proportion of customers who are likely to purchase this new service to within 3% with 95% confidence. How large a sample do they need? Let’s look at the margin of error:

$$ME = z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$0.03 = 1.96 \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

What $\hat{p}$ Should We Use?

Often you’ll have an estimate of the population proportion based on experience or perhaps on a previous study. If so, use that value as $\hat{p}$ in calculating what size sample you need. If not, the cautious approach is to use $\hat{p} = 0.5$. That will determine the largest sample necessary regardless of the true proportion. It’s the worst case scenario.

They want to find n, the sample size. To find n, they need a value for $\hat{p}$. They don’t know $\hat{p}$ because they don’t have a sample yet, but they can probably guess a value. The worst case (the value that makes the SD, and therefore n, largest) is 0.50, so if they use that value for $\hat{p}$, they’ll certainly be safe. The company’s equation, then, is:

$$0.03 = 1.96 \sqrt{\frac{(0.5)(0.5)}{n}}$$

To solve for n, just multiply both sides of the equation by $\sqrt{n}$ and divide by 0.03:

$$0.03\sqrt{n} = 1.96\sqrt{(0.5)(0.5)}$$

$$\sqrt{n} = \frac{1.96\sqrt{(0.5)(0.5)}}{0.03} \approx 32.67$$

Then square the result to find n:

$$n \approx (32.67)^2 \approx 1067.1$$
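The algebra above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `required_sample_size` and its parameter names are ours, the critical value is computed rather than typed in as 1.96, and the result is rounded up because a fraction of a respondent is not possible.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin_of_error: float,
                         confidence: float = 0.95,
                         p_guess: float = 0.5) -> int:
    """Smallest n with z* x sqrt(p(1-p)/n) <= margin_of_error.

    p_guess = 0.5 is the cautious worst case; pass a prior
    estimate instead when one is available.
    """
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_exact = z_star ** 2 * p_guess * (1 - p_guess) / margin_of_error ** 2
    return ceil(n_exact)


# The company's target: within 3% at 95% confidence, worst case p = 0.5.
n_worst_case = required_sample_size(0.03)
print("worst case (p = 0.5):", n_worst_case)

# Hypothetical prior estimate of 20%, used purely for illustration.
n_with_prior = required_sample_size(0.03, p_guess=0.20)
print("with prior guess p = 0.2:", n_with_prior)
```

A prior estimate away from 0.5 lowers the required n, which is why 0.5 is described as the safe choice when nothing better is known.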
That method will probably give a value with a fraction. To be safe, always round up. The company will need at least 1068 respondents to keep the margin of error as small as 3% with a confidence level of 95%. Unfortunately, bigger samples cost more money and require more effort. Because the standard error declines only with the square root of the sample size, to cut the standard error (and thus the ME) in half, you must quadruple the sample size. Generally, a margin of error of 5% or less is acceptable, but different circumstances call for different standards. The size of the margin of error may be a marketing decision or one determined by the amount of financial risk you (or the company) are willing to accept. Drawing a large sample to get a smaller ME, however, can run into trouble. It takes time to survey 2400 people, and a survey that extends over a week or more may be trying to hit a target that moves during the time of the survey. A news event or new product announcement can change opinions in the middle of the survey process. Keep in mind that the sample size for a survey is the number of respondents, not the number of people to whom questionnaires were sent or whose phone numbers were dialed. Also keep in mind that a low response rate turns any study essentially into a voluntary response study, which is of little value for inferring population values. It’s almost always better to spend resources on increasing the
Why 1000? Public opinion polls often use a sample size of 1000, which gives an ME of about 3% (at 95% confidence) when p is near 0.5