Chapter 1 Preface

As a typical graduate student who rarely had enough money, I struggled to find ways to get my hands on the textbooks I needed for my courses without paying full price. Old editions, used copies, photocopies, interlibrary loan services: you name it, I tried it. After I became an instructor, and I suspect this is true for most of us who teach anything, I found myself always on the hunt for a text that covered the material I needed to teach in a clear and thorough way. At the same time, the bogey of textbook prices is never far away from the next course on the roster. To make matters worse, statistics textbooks written for public administration/public affairs students are not only few and far between but also of varying quality, ridiculously priced, and often insufficient for the classroom. This has been an exasperating situation for over two decades, with no signs of self-correcting. It was thus less serendipity and more growing frustration with the state of affairs that led to the creation of this text. Given that, in this age of crushing student debt, the world is replete with open-source software for data analysis and desktop publishing, why not curate a textbook tailored not only to my instructional needs but above all to my students’ needs? Leaning on the wonderful resources created by the teams behind RStudio and R, by Jared Lander (his wonderful "R for Everyone: Advanced Analytics and Graphics" showed me what a quality, aesthetically pleasing product might resemble), and by the thousands of R users across the globe ever willing to share their coding knowledge, I have assembled my course material into a work that I hope is useful beyond just the pocketbook. My goal is to deliver a quality product, and so all suggestions for substantive, technical, and stylistic improvements are very much welcome.

1.1 Data Analysis and Public Affairs

In the early decades of the twentieth century the field of public administration was a leader, as much in the area of intellectual thinking as in research on governmental processes and outcomes. Since the 1960s, however, public administration has been seen as a poorer cousin of its once twin, its cachet dwindling in the social and behavioral sciences, largely because of the feeling that the scholarship it produces is less rigorous than that of other disciplines. There may be some truth to this notion, but intellectual debates and scholarly machismo are not our goal here. Rather, the task before us is to fill a more crucial gap in the real world of public affairs: data-savvy public servants. Most of you, like thousands of your peers across the nation, no doubt groaned when you saw that one of your required courses was a research methods class. Maybe you remembered the mandatory statistics course you took as an undergraduate, and that familiar old dread swept over you in a flash. I am here to banish those ghosts for good.

This is a very applied course in data analysis, one in which you will learn how to use data in the way best suited to answering a specific question. Maybe the question is about weighing the evidence in a lawsuit your city is facing over racial bias in hiring. Or your agency is curious to know whether the public information campaign it has been broadcasting to promote healthy living is having any impact on the health of the citizens. Maybe you work for an economic development agency and need to track trends in unemployment rates. Whatever the question before us, invariably there are data that can be used to find reasonable answers. To get at these answers, however, you need to know three things:

  1. How do I gather the data?
  2. How should I analyze the data I gathered?
  3. What are the strengths and limitations of my analysis?

These will be our guiding questions throughout this book. Answering them seems fairly easy, but it requires hard work and a lot of hands-on practice. After all, data analysis requires us to understand the syntax of statistics, its vocabulary, and, in all cases, a crowd of Greek symbols squeezed into what seems to be a mysterious mathematical formula.

1.2 The Chapters That Follow

The chapters that follow are sequenced in a way designed to promote learning. We start in Chapter 1 with the fundamentals: samples and how they differ from populations; how the way we measure some attribute or phenomenon has implications for the analysis that can be done (for example, how we measure a survey respondent’s sex differs from how we measure the hours of physical activity he or she spends per week); the tricky business of establishing cause and effect; and other such foundational principles.

In Chapter 2 we move on to understanding the many interesting ways in which we can explore patterns in our data, both graphically and with simple tables. In the process we will also learn best practices in data visualization – how should you build an effective graph? Issues of human cognition and visual perception are more important than we tend to recognize.

Chapter 3 revolves around measures of central tendency and variability. That is, we learn about the three ways we can measure and discuss the average, the typical (the mean, the median, and the mode) and the variability around these averages (the range, the interquartile range, the variance, and the standard deviation). We also fold into our data visualization toolbox a very powerful graphic called the box plot, which relies on the five-number summary to tell us a lot about the shape of our distribution.
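As a small preview of these summaries, here is a minimal sketch in Python using only the standard library. The data values (weekly hours of physical activity for seven respondents) and the function name `summarize` are made up purely for illustration. Note also that software packages compute quartiles in slightly different ways; the "inclusive" method used here is only one of several conventions.

```python
import statistics

def summarize(data):
    """Return common measures of central tendency and variability."""
    values = sorted(data)
    # Quartiles via the "inclusive" method; other conventions give
    # slightly different answers for small samples.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": values[-1] - values[0],
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "sd": statistics.stdev(values),           # sample standard deviation
        # Five-number summary: the ingredients of a box plot.
        "five_number": (values[0], q1, q2, q3, values[-1]),
    }

# Hypothetical data: hours of physical activity per week.
hours = [2, 3, 3, 5, 7, 8, 10]
summary = summarize(hours)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The five-number summary (minimum, first quartile, median, third quartile, maximum) is exactly what a box plot draws, which is why it says so much about the shape of a distribution.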

Probability theory, one of the most notorious subjects in statistics, is our focus in Chapter 4. It is a tricky subject, but unless we wrestle with the nuances of probability theory everything that follows will make little to no sense. It is as powerful as it is difficult, which is why game shows rely on it and Jeopardy champions and poker stars need to master it. It is also the key to understanding why an event such as the winning pick-3 number in NYC on the first anniversary of the 9/11 terrorist attacks being 9-1-1 was not a rare event. If you understand probability you can be the life of the party by predicting that at least two people in a gathering of 30 share the same birthday, a bet you will win roughly 70 percent of the time.
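The birthday claim can be checked with a few lines of code. The sketch below is in Python and rests on simplifying assumptions that are mine, not a full treatment: 365 equally likely birthdays, no leap years, and people whose birthdays are independent of one another. The function name `prob_shared_birthday` is hypothetical. The idea is to compute the chance that everyone has a different birthday and subtract it from 1.

```python
def prob_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday.

    Assumes `days` equally likely, independent birthdays (no leap years).
    """
    prob_all_distinct = 1.0
    for k in range(people):
        # The (k + 1)-th person must avoid the k birthdays already taken.
        prob_all_distinct *= (days - k) / days
    return 1.0 - prob_all_distinct

for group_size in (10, 23, 30, 50):
    print(group_size, round(prob_shared_birthday(group_size), 3))
```

Under these assumptions a gathering of 30 gives a probability of about 0.71, and the chance already passes one half with only 23 people.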

In Chapter 5 we discuss the logic of hypothesis testing, a method of formalizing and testing our suspicions about whether a program has had an impact, whether there is a gender bias in hiring, etc. This is a tightly specified method, with little room for doing things differently; when we get to Chapter 5 I will explain why there is no flexibility. One of the biggest surprises many students encounter occurs in this chapter: no matter the strength of the evidence, there is always a possibility that we could be drawing the wrong conclusion from our data analysis.

Chapters 6 through 10 lead us through the world of inferential statistics, the process of analyzing the sample data in hand and extrapolating our conclusions to the population the sample represents. We start quite simply, looking at how to determine if the differences between two groups are statistically significant (whatever that means). Since the world often cannot be broken up into two groups (men and women, for example) but instead must be studied as it exists in reality (for example, the Ohio Department of Education classifies public school districts into eight mutually exclusive categories), we also learn how to analyze multiple groups in a coherent manner. We then move on to the mother of most statistical analyses today: regression analysis. This is the most exciting and useful portion of the course because we are finally able to accommodate real-world complexity in our calculations. For example, if you wanted to predict the number of highway fatalities on a particular stretch of I-70, you would have to account for many things (visibility, traffic density, traffic speed, road conditions, driver intoxication, time of day, and so on). Indeed, much of the noise about predictive analytics, health analytics, data mining, etc. revolves around one form or another of regression analysis.

1.3 Keys to Learning Data Analysis

I have already emphasized that statistics has its own language, and in that sense learning statistics is no different from learning a foreign language. You cannot master a foreign language simply by cracking open a book for an hour a week or going to the weekly class. If it were that simple we would all be linguists, but we are not. The ones who master a foreign language (or at least learn enough to impress the servers at their neighborhood Spanish/Italian/French restaurant) are those who practice as much as they can. That is the approach I recommend to you if you want to learn data analysis.

To encourage practice, each chapter concludes with a number of practice problems. These are designed to reinforce learning and I expect you to try to solve them. Answer keys are provided with fully worked-out calculations so that if you do make a mistake you can see where and how you went wrong. You should do a few problems before you tackle the assignment for the week; this puts you in the best position to complete the assignment not only correctly but also with minimal frustration and in less time. Of course, each chapter also has several worked examples per key concept/calculation, so there is no shortage of learning-by-doing opportunities.

This book may not work for everybody. Some people need to approach the same material from multiple vantage points before things click. That is perfectly fine, and if you are one such individual, I encourage you to also look at the several thousand videos, blogs, free books, and papers on the internet. Some of these materials have been curated and hotlinked in this book, while others are listed in the Bibliography for this text; use them.