9 BJR and Patra-Sen
This chapter performs the Bonhomme-Jochmans-Robin (BJR) and Patra-Sen estimation. Section 3.2 and Section 3.3 explain the purpose of the BJR and Patra-Sen estimators, respectively.
The BJR estimator takes the transformed age data from Chapter 8 as input. It outputs four estimated nonparametric cumulative distribution functions (CDFs) of the four components of the mixture distribution and their estimated mixing proportions. It estimates the CDFs at 101 support points: . For a good estimate of the CDFs, I set the support points to have roughly equal probability mass between them. We have transformed the data into a roughly uniform distribution, so the CDF support points are set at a uniform distance from each other, between 0 and 1.
The estimated component with the highest mixing proportion is selected. This component is used as an input to the Patra-Sen estimator. A Uniform(0,1) distribution is used for the “decoy” distribution, . The output of the Patra-Sen estimation is the estimated real spend distribution.
We could stop here, but we can do a second-pass estimation for better results. The estimated real spend distribution from the initial Patra-Sen estimation will not have roughly equal probability mass between each CDF support point. We can interpolate the estimated real spend CDF to get new CDF support points that do have roughly equal probability mass between each point. Then the BJR estimation is performed again with the new CDF support points.
This procedure is followed three times:
- With all rings
- With all rings except for those marked as
is.nonstandard.rucknium
- With all rings except for those marked as
is.nonstandard.rucknium
and/oris.nonstandard.isthmus
The results from these three data subsets can be compared.
This code will take about two weeks to run on a powerful machine. The code in this chapter can run while the code in Chapter 8 is running, but it needs to stay “behind” the transformed age data as it is being written to the cdf.dir
and ring.member.ages.dir
directories.
The threads
variable is the number of CPU threads to use. The cdf.dir
and ring.member.ages.dir
variables should be the same as specified in the previous chapter, Chapter 8 . results.dir
is the name of the directory where the BJR estimate will be written to.
9.1 Code
library(decoyanalysis)
library(future)
library(data.table)
<- 64 - 17
threads
<- "weekly-weighted-cdf"
cdf.dir <- "weekly-ring-member-ages"
ring.member.ages.dir <- "results"
results.dir
dir.create(results.dir)
<- paste0(results.dir, "/results-01/")
results.dir.run
dir.create(results.dir.run)
<- paste0(results.dir, "bjr/")
results.dir.run.bjr
dir.create(results.dir.run.bjr)
options(future.globals.maxSize = 8000*1024^2)
::plan(future::multicore(workers = threads))
future
<- NULL
cluster.threads
<- FALSE
is.raw
if (is.raw) {
<- "weekly-ring-member-ages-raw"
ring.member.ages.dir
}
for (i in c("all", "rucknium", "rucknium_isthmus")) {
dir.create(paste0(results.dir.run.bjr, i))
dir.create(paste0(results.dir.run.bjr, i, "/first-pass"))
}
<- list.files(ring.member.ages.dir)
xmr.rings.ages.weeks <- xmr.rings.ages.weeks[grepl(".qs", xmr.rings.ages.weeks)]
xmr.rings.ages.weeks <- Reduce(intersect, list(list.files(paste0(results.dir.run.bjr, "all")),
already.done list.files(paste0(results.dir.run.bjr, "rucknium")), list.files(paste0(results.dir.run.bjr, "rucknium_isthmus"))))
<- setdiff(xmr.rings.ages.weeks, already.done)
xmr.rings.ages.weeks <- rev(sort(xmr.rings.ages.weeks))
xmr.rings.ages.weeks
<- 10
II <- 4
K
# cdf.points <- (0:100)/100
<- c(0.001, (1:99)/100, 0.999)
cdf.points #cdf.points <- c(0.0001, (1:99)/100, 0.9999)
# Distribution should be approximately uniform
while ( length(xmr.rings.ages.weeks) > 0) {
<- xmr.rings.ages.weeks[1]
week
<- qs::qread(paste0(ring.member.ages.dir, week))[[1]]
y.with.labels
if (is.raw) {
<- unname(quantile(c(y), probs = (0:100)/100, na.rm = TRUE))
cdf.points
}
for (label in c("all", "rucknium", "rucknium_isthmus")) {
if ( ! (label == "rucknium_isthmus" &
sum(is.nonstandard.rucknium) == sum(is.nonstandard.rucknium | is.nonstandard.isthmus)] ) ) {
y.with.labels[, # If is.nonstandard.isthmus does not have additional TRUEs, then just skip and output the Rucknium one
<- switch(label,
y all = y.with.labels[, -(1:2), with = FALSE],
rucknium = y.with.labels[! y.with.labels$is.nonstandard.rucknium, -(1:2), with = FALSE],
rucknium_isthmus = y.with.labels[! y.with.labels$is.nonstandard.rucknium &
! y.with.labels$is.nonstandard.isthmus, -(1:2), with = FALSE])
print(system.time(bjr.results <- bjr(y, II = II, K = K, cdf.points = cdf.points,
estimate.mean.sd = FALSE, basis = "Chebychev", control = list(cluster.threads = cluster.threads ))))
::qsave(bjr.results, file = paste0(results.dir.run.bjr, label, "/first-pass/", week))
qs
<- which.max(bjr.results$mixing.proportions)
wallet2.dist.index <- bjr.results$cdf$CDF[, wallet2.dist.index]
Fn.hat.value <- bjr.results$cdf$cdf.points
supp.points <- punif
decoy <- decoy
Fb = 16
M <- 1/M
alpha
<- patra.sen.bjr(Fn.hat.value, supp.points, Fb, alpha)
patra.sen.bjr.results
<- approx(c(0, patra.sen.bjr.results$Fs.hat, 1), c(0, supp.points, 1), supp.points)$y
new.cdf.points
print(system.time(bjr.results.new <- bjr(y, II = II, K = K, cdf.points = new.cdf.points,
estimate.mean.sd = FALSE, basis = "Chebychev", control = list(cluster.threads = cluster.threads ))))
}
::qsave(bjr.results.new, file = paste0(results.dir.run.bjr, label, "/", week))
qs
}
<- list.files(ring.member.ages.dir)
xmr.rings.ages.weeks <- xmr.rings.ages.weeks[grepl(".qs", xmr.rings.ages.weeks)]
xmr.rings.ages.weeks <- Reduce(intersect, list(list.files(paste0(results.dir.run.bjr, "all")),
already.done list.files(paste0(results.dir.run.bjr, "rucknium")), list.files(paste0(results.dir.run.bjr, "rucknium_isthmus"))))
<- setdiff(xmr.rings.ages.weeks, already.done)
xmr.rings.ages.weeks <- rev(sort(xmr.rings.ages.weeks))
xmr.rings.ages.weeks
cat(week, "\n")
}