## Wednesday, 25 July 2012

### Stay classy, unauthorised migrant

I still haven't decided whether I should shave and keep only my moustache, in preparation for my trip to the city of Ron Burgundy (I probably won't).

I'm really curious about San Diego as I've heard mixed opinions about it and I'm really intrigued by its close proximity to the border. My talk at the JSM is on Sunday (which is only the second day of the conference), so hopefully I'll have some time to check it out.

I'll talk about a model we've developed to estimate some selected characteristics of a difficult-to-define population (eg one including unauthorised migrants). Because it is virtually impossible to get a complete sampling frame (since we normally don't have a list of the people who don't want to be listed...), simple random sampling is not very effective and so we need an alternative method to get reasonable estimates.

The basic idea is to characterise the sampling units in terms of a set of $K$ aggregation places (we called them "centres") with which they are associated, in the sense that they often(-ish) visit them. In the case of unauthorised migrants, examples include ethnic shops, restaurants, bars, etc $-$ places where you imagine you may find the sampling units when they are out and about.

The sample is then obtained by randomly selecting $n_k$ subjects from each centre $k=1,\ldots,K$. The required information (eg age, sex, marital status, etc) is collected in the survey, whether the respondent is an authorised migrant or not (I'll sweep missing data issues under the rug, for now). The estimates of the relevant characteristics based on this sample are a biased representation of the population (if only because we cannot be sure that we've picked up all the centres).

But in addition to the main variables, we ask each subject whether they are also associated with any of the other $K-1$ centres. Using the profile of each respondent (ie the set of centres with which they are associated), we can suitably re-proportion the sample to obtain reasonable results. Crucially, the weights used to re-proportion the sample depend on the (subjective) importance score associated with each of the $K$ centres in the analysis.
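To make the re-proportioning idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the exact formula of the model: I assume each respondent's weight is inversely proportional to $\sum_k u_{ik}\, n_k / f_k$, where $u_{ik}$ is 1 if respondent $i$ is associated with centre $k$ (and 0 otherwise) and $f_k$ is the importance score of centre $k$. The names `centre_weights` and `weighted_proportion` are made up for the example.

```python
def centre_weights(profiles, n_per_centre, importance):
    """Normalised weights, one per respondent.

    profiles     : list of 0/1 lists, profiles[i][k] = 1 if respondent i
                   is associated with centre k
    n_per_centre : number of subjects sampled in each centre (n_k)
    importance   : importance score of each centre (f_k), all > 0

    Assumed form (for illustration only): weight_i is proportional to
    1 / sum_k profiles[i][k] * n_k / f_k.
    """
    raw = []
    for profile in profiles:
        propensity = sum(
            u * n / f for u, n, f in zip(profile, n_per_centre, importance)
        )
        if propensity <= 0:
            raise ValueError("each respondent needs at least one centre")
        raw.append(1.0 / propensity)
    total = sum(raw)
    return [w / total for w in raw]


def weighted_proportion(indicator, weights):
    """Re-proportioned estimate of a 0/1 characteristic (weights sum to 1)."""
    return sum(y * w for y, w in zip(indicator, weights))


# Three respondents, two centres: the second respondent attends both centres,
# so they are easier to pick up and get the smallest weight.
profiles = [[1, 0], [1, 1], [0, 1]]
weights = centre_weights(profiles, n_per_centre=[2, 1], importance=[0.5, 0.5])
```

The point of the sketch is just the direction of the correction: people who can be found in many (or heavily sampled) centres are down-weighted, and the importance scores control by how much.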

The method has been applied to real data. Earlier today, though, I was running some simulations in which I pretended to know the whole population (both the individual profiles and the other demographic characteristics of interest, which of course is not possible with real data), to check how good the estimates are for different sample sizes.
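A toy version of this kind of check could look like the sketch below. Everything in it is made up for illustration (three centres, the attendance probabilities, a 0/1 characteristic that is more common among people attending the first centre, and the same assumed inverse-propensity weights $1/\sum_k u_{ik}\, n_k / f_k$); it is not the simulation I actually ran. Because the population is simulated, the "true" importance of each centre can be computed directly as the share of people attending it.

```python
import random


def simulate_population(size, attend_prob, rng):
    """Made-up population: a list of (profile, y) pairs."""
    population = []
    for _ in range(size):
        profile = [1 if rng.random() < p else 0 for p in attend_prob]
        if sum(profile) == 0:
            # Everyone attends at least one centre (no coverage problem here)
            profile[rng.randrange(len(profile))] = 1
        # Toy characteristic, more common among those attending centre 0
        y = 1 if rng.random() < (0.8 if profile[0] else 0.2) else 0
        population.append((profile, y))
    return population


def draw_sample(population, n_per_centre, rng):
    """Select n_k people at random among those attending each centre k."""
    sample = []
    for k, n_k in enumerate(n_per_centre):
        attendees = [person for person in population if person[0][k] == 1]
        sample.extend(rng.sample(attendees, n_k))
    return sample


def estimates(sample, n_per_centre, importance):
    """Return the (unweighted, re-proportioned) estimates of the mean of y."""
    naive = sum(y for _, y in sample) / len(sample)
    weights = [
        1.0 / sum(u * n / f for u, n, f in zip(profile, n_per_centre, importance))
        for profile, _ in sample
    ]
    weighted = sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)
    return naive, weighted


def run(n_k, replicates=50, seed=1):
    """Mean absolute error of both estimators for a given per-centre sample size."""
    rng = random.Random(seed)
    population = simulate_population(5000, [0.1, 0.5, 0.6], rng)
    n_centres = 3
    truth = sum(y for _, y in population) / len(population)
    # Known only because the population is simulated
    importance = [
        sum(profile[k] for profile, _ in population) / len(population)
        for k in range(n_centres)
    ]
    n_per_centre = [n_k] * n_centres
    err_naive = err_weighted = 0.0
    for _ in range(replicates):
        sample = draw_sample(population, n_per_centre, rng)
        naive, weighted = estimates(sample, n_per_centre, importance)
        err_naive += abs(naive - truth)
        err_weighted += abs(weighted - truth)
    return truth, err_naive / replicates, err_weighted / replicates


if __name__ == "__main__":
    for n_k in (25, 50, 100, 200):
        truth, err_naive, err_weighted = run(n_k)
        print(n_k, round(truth, 3), round(err_naive, 3), round(err_weighted, 3))
```

In this toy set-up the unweighted estimate over-represents the people who attend the first centre, so comparing the two error columns across sample sizes gives a rough feel for how much the re-proportioning helps.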

I haven't finished yet (and by the way: that's more like it. To get a presentation done with so much time to spare is really not me!), but it all seems to be working. I'll post something if I manage to produce some nice graphs.