Journey analysis, Part 3: Can we predict who will flake?

This post follows (as you might expect) Part one and Part two of Journey analysis, in which we examine both /who/ registers for Journey and /when/ they register. Ultimately, though, we want to know not just how many people will *sign up* for an event, but also how many will actually *attend*. As it turns out, this proportion was much lower than we expected for Journey. About 53% of the registrants actually showed up. 47% flaked.

Note: we couldn’t match up all the waivers with registrations, so we’re missing about 6% of the people who showed up. In this analysis I assume these are missing completely at random and don’t affect the results, but if you run the numbers yourself and it looks like fewer than 53% showed up, that’s why.

I know this is the Bay Area, which has a pretty strong reputation for flaking, and this /is/ a free event, so people may not feel as invested in actually attending, but I didn’t realize the rate would be that high. In DC, we’ve actually seen the opposite: /more/ people show up than pre-register. For this Journey, we cut off registration at 3400 to make sure that we would have enough materials for everyone who registered, and we ended up with a ton of extra because of all the people who didn’t show. That’s the key lesson for next time: here, pre-registration rates are much higher than attendance rates.

But what we’d really like to know is: can we figure out what that rate is likely to be /beforehand/? Can we determine what factors indicate that someone is more likely to flake, so we can predict our actual flake rate for the event?

I really, *really* wanted to have a great finding for this. Something like figuring out that people who live in the Marina flake for Journey more often, or that people living in the Mission are more reliable. Sadly, a geographic predisposition to reliability or flakiness doesn’t seem to be borne out by the data. But I can at least use this as a good example for how to do rate comparisons.

Now, we could take the naive approach: for each zip code, just compute what percentage of the people who signed up actually showed up. Let’s call this the inverse flake ratio, or IFR. If we plot that, it looks like this:

[Figure: naive_ifr_inkscape (naive IFR by zip code, mapped)]
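
(For anyone following along in R, here’s a minimal sketch of that naive computation. I’m assuming a registrations data frame with zip and attended columns, the same layout as the regression further down, so the column names are a guess.)

# Naive IFR: fraction of registrants from each zip code who attended.
# (Sketch only; assumes 'registrations' has 'zip' and 'attended' columns.)
shows <- tapply(registrations$attended, registrations$zip, sum)
n     <- tapply(registrations$attended, registrations$zip, length)
naive_ifr <- data.frame(zip = names(n),
                        n   = as.vector(n),
                        ifr = as.vector(shows / n))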

So, for example, 94012 (Burlingame): 0% IFR. 100% flaky. Or 94119 (SoMa): 100% IFR. 0% flaky. Really tells you something about those people, right?

Well. Not really. Because, for one thing, that’s a pretty minimal sample size: only 1 person signed up from each of those zip codes. And based on just that one person, we’re judging how flaky the entire zip code is. We can do better by estimating how uncertain we are about the true rate: when few people have signed up from a zip code, we’re a lot less confident about what its rate is. The next step would be to model each zip code as a binomial, using an independent uninformative beta distribution as a prior for each one. But we can do even better, using a hierarchical Bayesian model.

Basically what we’re saying is that there’s some distribution of IFR among different zip codes. And while we don’t know what that distribution /is/, we’re going to assume it’s approximately a Beta Distribution. Roughly, most zip codes are going to have a similar rate, with more extreme rates being less likely. It’s a way of modeling the intuition that the IFR isn’t likely to vary widely from zip code to zip code, but instead is going to vary around a more common value. As it turns out, based on this data, our distribution on zip codes’ IFR is going to look like this:

[Figure: ifr_distribution (the fitted distribution of IFR across zip codes)]

What this indicates is that there’s a relatively broad range of possible IFR across zip codes. It’s not exactly uniform — it’s more likely to have an IFR close to 0.5 than either 0 or 1 — but it is fairly broad. We don’t have a strong expectation that all IFRs are going to come from a narrow range.
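
(If you want to play with this in R, here’s a rough sketch continuing from the naive computation above. It’s an empirical-Bayes approximation rather than the full hierarchical model behind these plots: fit a single beta distribution across zip codes by maximizing the beta-binomial likelihood, then shrink each zip code’s rate toward it.)

# Fit Beta(alpha, beta) across zip codes by maximizing the beta-binomial
# marginal likelihood (the binomial coefficient is dropped, since it
# doesn't depend on alpha or beta).
loglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])    # keep alpha, beta positive
  sum(lbeta(a + shows, b + (n - shows)) - lbeta(a, b))
}
fit <- optim(c(0, 0), function(p) -loglik(p))
a <- exp(fit$par[1]); b <- exp(fit$par[2])

# Each zip code's posterior is Beta(a + shows, b + misses); its mean is
# the shrunken IFR estimate, pulled toward the overall rate a / (a + b).
naive_ifr$ifr_shrunk <- as.vector((a + shows) / (a + b + n))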

So then for each individual zip code, we’re going to end up with a distribution over the possible IFRs. And we can compare that distribution to see how likely it is that people from one zip code are more or less reliable than the Bay Area in general. Geographically, our less naive map looks like this:

[Figure: hierarchical_bayes_geo_inkscape (hierarchical-Bayes IFR estimates by zip code, mapped)]

But it’s still a single number per zip code, when really we should have a range of uncertainty around each one. So, for a more informative (though less geographic) visualization, we can plot the 95% credible interval on the IFR for each zip code, with the x-axis being the number of registrations from that zip code. Each vertical line is the range of values we expect that zip code’s “true” IFR might take. The red lines are upper and lower bounds on what we expect the Bay Area’s overall mean IFR to be.

[Figure: hierarchical_bayes_ifr (95% intervals on each zip code’s IFR versus number of registrations)]
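
(Continuing that empirical-Bayes sketch, the per-zip-code intervals are just quantiles of the posterior beta distributions; again, this approximates what the plot shows rather than reproducing the original model.)

# 95% posterior intervals on each zip code's IFR.
naive_ifr$lower <- qbeta(0.025, a + shows, b + (n - shows))
naive_ifr$upper <- qbeta(0.975, a + shows, b + (n - shows))
# A zip code only stands out if its interval misses the overall rate,
# which is roughly a / (a + b).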

That’s a little disappointing: the uncertainty in the IFR estimates is wide enough that we don’t see a significant difference in flakiness between zip codes. All of those vertical lines cross the red lines of the Bay Area’s overall IFR. That means we can’t really make any disparaging /or/ complimentary remarks about any particular area. Perhaps that’s for the best.

We also see that the Bay Area’s overall IFR is likely below 50%, which means that, among people who sign up for Journey, fewer than 50% are expected to actually show up. That’s pretty important to remember when planning free events here. I’m very curious to try repeating this in other cities and see the comparison.

Still, let’s press on a little further. Can we try to model this somewhat differently, using some other covariates to try to predict flakiness? Maybe if we use the distance of the zip code from the Journey start, we’d see some predictive pattern?

We can do a quick check on this by using a GLM (generalized linear model) to predict whether a registrant will show up, based on other factors we know about them. Let’s go back to our original dataset and fit a logistic regression model to see whether any of the following are significant factors in whether a registered person will actually show up for Journey:

glm(formula = attended ~ signup_timestamp + distance + age +
    is_facebook + is_google + is_twitter, family = "binomial",
    data = registrations)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)      -1.274e+02  5.178e+01  -2.460   0.0139 *
signup_timestamp  9.243e-08  3.744e-08   2.469   0.0136 *
distance         -4.400e-04  2.260e-03  -0.195   0.8456
age              -5.107e-03  5.714e-03  -0.894   0.3714
is_facebookTRUE  -4.590e-01  4.438e-01  -1.034   0.3010
is_googleTRUE    -3.575e-01  4.436e-01  -0.806   0.4203
is_twitterTRUE   -7.112e-01  4.590e-01  -1.550   0.1213

It doesn’t look like distance is actually a significant factor…or how old they are…or what signup method was used. (Although I must note that this model is only really looking for linear relationships in the log-odds, and there are other relationships which could be found. While I’m not going any deeper into it now, I’d certainly be curious to hear about any other investigations, and may come back to try some other machine learning techniques in the future.) However, there /does/ seem to be a slight effect where those who sign up closer to the event are more likely to attend. We can look into that a little more using a moving average:

[Figure: moving_attendance (moving average of attendance rate by signup time)]
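
(To reproduce something like this, a centered moving average over registrants ordered by signup time is enough. The window of 101 registrants is arbitrary, and the column names are the same assumptions as before.)

# Moving average of attendance rate over signup time.
ord    <- order(registrations$signup_timestamp)
att    <- as.numeric(registrations$attended[ord])
window <- 101                               # registrants per window (arbitrary)
moving <- stats::filter(att, rep(1 / window, window), sides = 2)
plot(registrations$signup_timestamp[ord], moving, type = "l",
     xlab = "signup time", ylab = "attendance rate (moving average)")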

That’s noisy, but we can see a general upward trend. Using MARS (multivariate adaptive regression splines), we can generalize that into three sections: initial signups, mid-rate signups, and last-minute signups:

[Figure: mars_time (MARS fit of attendance rate versus signup time)]
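
(The earth package is the usual MARS implementation in R; a sketch of this kind of fit, not necessarily the exact one used here, looks something like this:)

library(earth)   # MARS implementation
mars_fit <- earth(attended ~ signup_timestamp, data = registrations,
                  glm = list(family = binomial))
summary(mars_fit)   # the hinge knots are what split signups into sections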

We can use that to get a better prediction of our actual attendance, given signups (similar to part 2). Again, to actually generalize this, we’d want data from a number of Journeys. Have I mentioned that I want more data? I always want more data.

For the next Journey, I’m currently leaning towards leaving registration open up to the time of the event, but also letting people know our current projections: how many people we expect to ultimately register, and how many we expect to actually attend. We’d just let people know that the materials are only for people with a signed waiver, and are first-come, first-served. Having an overall idea of the range of signup-to-attendance rates is going to be pretty valuable in predicting attendance, especially if we’re considering hosting Journeys of different sizes.

You can download the code and csv files needed here: journey_parts_3_and_4.zip.

Join us next time for the final section, where we look at how different zip codes sign up for and attend Journey!
