32.3 Target shooting

Here is a general problem; suppose that we a set UU and a set SUS\subseteq U, and that we would like to estimate what proportion of elements of SS are in UU.

Of course, if we know the values of |U||U| and |S||S| then the answer is just

R:=|S||U|R:=\frac{|S|}{|U|} (32.2)

but oftentimes we don’t, or to compute the actual sets would be impractical. One idea is to sample elements from UU and check if they are in SS, something like this

  • randomly sample u1,u2,,uNu_{1},u_{2},...,u_{N} from UU (uniformly and independently)

  • calculate aN=(1/N)×1iNI(ui)a_{N}=(1/N)\times\sum_{1\leq i\leq N}I(u_{i}) where I(u)=1I(u)=1 if uSu\in S and I(u)=0I(u)=0 otherwise.

clearly aNa_{N} gives us a rough idea of what RR (which above we defined as |S||U|\frac{|S|}{|U|}) is. But just how good?

Let ε>0\varepsilon>0. Let XX be the output of our algorithm, and we seek to derive a bound on

Pr[|XR|ε|S|/|U|]\Pr[|X-R|\geq\varepsilon|S|/|U|] (32.3)

that is, the probability that XX is more than a factor of ε\varepsilon away from the true value of |S|/|U||S|/|U|.

Note that our algorithm XX is equivalent to

(1/N)×1iNI(ui)(1/N)\times\sum_{1\leq i\leq N}I(u_{i}) (32.4)

and as each I(ui)Bern(R)I(u_{i})\sim\text{Bern}(R), we know that

1iNI(ui)B(N,|S|/|U|)\sum_{1\leq i\leq N}I(u_{i})\sim B(N,|S|/|U|) (32.5)

and therefore we can use a Chernoff bound (as we have a binomially distributed discrete random variable). It follows that

E[X]=1NN|S||U|=|S||U|E[X]=\frac{1}{N}N\frac{|S|}{|U|}=\frac{|S|}{|U|} (32.6)

and therefore

Pr[|XR|ε|S|/|U|]\displaystyle\Pr[|X-R|\geq\varepsilon|S|/|U|] =Pr[|XE[X]|εE[X]\displaystyle=\Pr[|X-E[X]|\geq\varepsilon E[X] (32.7)

then we can let Z=NXZ=NX which means that ZB(N,|S|/|U|)Z\sim B(N,|S|/|U|) and is slightly more pleasant to work with. This means that

Pr[|XE[X]|εE[X]\displaystyle\Pr[|X-E[X]|\geq\varepsilon E[X] =Pr[|ZE[Z]|εE[Z]] (multiplying by N)\displaystyle=\Pr[|Z-E[Z]|\geq\varepsilon E[Z]]\text{ (multiplying by $N$)} (32.9)
=Pr[{ZE[Z]εE[Z]}{ZE[Z]E[Z]}]\displaystyle=\Pr[\{Z-E[Z]\geq\varepsilon E[Z]\}\cup\{Z-E[Z]\leq-E[Z]\}] (32.10)
Pr[Z(1+ε)E[Z]]+Pr[Z(1ε)E[Z]]\displaystyle\leq\Pr[Z\geq(1+\varepsilon)E[Z]]+\Pr[Z\leq(1-\varepsilon)E[Z]] (32.11)
2×exp(13ε2E[Z])\displaystyle\leq 2\times\exp\left(-\frac{1}{3}\varepsilon^{2}E[Z]\right) (32.12)
=2×exp(13ε2N|S||U|)\displaystyle=2\times\exp\left(-\frac{1}{3}\varepsilon^{2}N\frac{|S|}{|U|}\right) (32.13)

and through our choice of NN we can reduce the probability of failure arbitrarily. If we let δ>0\delta>0 then we can make

2×exp(13ε2N|S||U|)δ2\times\exp\left(-\frac{1}{3}\varepsilon^{2}N\frac{|S|}{|U|}\right)\leq\delta (32.14)

through choosing N=3|U||S|1ε2×ln(δ/2)N=-3\frac{|U|}{|S|}\frac{1}{\varepsilon^{2}}\times\ln(\delta/2) (we essentially want to cancel everything out so that we get eln(δ/2)=δ/2e^{\ln(\delta/2)}=\delta/2 and then the 22 cancels and we are left with δ\delta).

Thus we can make the failure probability arbitrarily small.