Welcome to the LimeSurvey Community Forum


From "ethical anonymity" to provable anonymity — a differential-privacy plugin c

VERAPROTOCOLE
Topic Author
Offline
New Member


1 month 1 week ago #274602 by VERAPROTOCOLE

From "ethical anonymity" to provable anonymity — a differential-privacy plugin c was created by VERAPROTOCOLE

Thanks for pushing on this, Holch. You're right that my previous answer didn't fully address what you're describing. Let me separate the two issues clearly, because they need different answers.
1. The survey creator passing token-table details into the response table
You're describing a real risk: in a standard LimeSurvey setup, the survey creator is the same entity that controls both the token table (with attributes like department) and the survey configuration. Whoever can configure the survey to use the department attribute could, with that same access, also leak other token-table fields (name, email, token) into the response table. I agree — within LimeSurvey alone, there's no separation between "the authority that knows the attribute" and "the system that processes the survey." It's one and the same actor.
This is exactly the gap VERA is built to address — not by hiding the attribute from the survey creator (it can't, and doesn't try to), but by splitting that single role into separate ones that never share the full picture:
- The HR/organizer role generates a local opaque identifier per participant (never the email itself) and sends VERA only (opaque_id, department), never identity.
- VERA returns (opaque_id, token) without ever seeing the email.
- HR alone reconstitutes the mapping locally and sends the token through a standard delivery channel (email/SMS), which never receives or sees the department, only "send this content to this address."
- The response, once submitted with the token, is processed and aggregated by VERA without ever being combined with identity or department at the individual level.
So: HR/the organizer still knows (identity, department) — I'm not claiming otherwise, and I want to be precise about that, because overclaiming it would be dishonest. What VERA prevents is that this knowledge propagates into the response-processing system itself. There's no single point in the pipeline — outside of HR's own local files — where identity, department, and an individual response ever coexist. VERA's contribution is introducing two independent boundaries (HR ↔ VERA, VERA ↔ delivery channel) that LimeSurvey's own architecture doesn't have by default.
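The issuance steps above can be sketched roughly as follows. This is a minimal illustration, not code from the VERA repo: the function names (`hr_prepare`, `vera_issue_tokens`, `hr_build_deliveries`) and the use of Python's `secrets` module are assumptions made for the example. How VERA later maps a token to a cohort for aggregation is not described in this post, so that step is left out.

```python
import secrets

def hr_prepare(roster):
    """HR side. `roster` is a list of (email, department) pairs."""
    local_map = {}   # opaque_id -> email; never leaves HR's local files
    outbound = []    # what HR sends to VERA: (opaque_id, department) only
    for email, department in roster:
        opaque_id = secrets.token_hex(16)  # random, not derived from the email
        local_map[opaque_id] = email
        outbound.append((opaque_id, department))
    return local_map, outbound

def vera_issue_tokens(outbound):
    """VERA side (hypothetical). Sees opaque ids and departments, never emails."""
    return [(opaque_id, secrets.token_urlsafe(16)) for opaque_id, _dept in outbound]

def hr_build_deliveries(local_map, issued):
    """HR side. Builds (email, token) messages for the delivery channel.

    The department is deliberately absent from what the channel receives.
    """
    return [(local_map[opaque_id], token) for opaque_id, token in issued]

roster = [("a@example.org", "Sales"), ("b@example.org", "IT")]
local_map, outbound = hr_prepare(roster)
issued = vera_issue_tokens(outbound)
deliveries = hr_build_deliveries(local_map, issued)
```

In this sketch the only place where an email and a department sit side by side is `roster`, which stays with HR.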
I want to be honest about what this doesn't solve: HR (or whoever manages attribution) still has the underlying knowledge at the moment of issuance — that's an inherent property of any system distributing a right tied to a real-world attribute, not something cryptography can remove. What changes is who else has access to it downstream, and that's the actual question you raised.
On self-declaration specifically: you're right that if self-declaration is acceptable for the use case, LimeSurvey's native anonymous mode already solves the first problem on its own — no need for VERA there. VERA's separation of roles only matters when self-declaration isn't good enough for the task, which is exactly your cross-department evaluation example: you need the department attribute to be reliable, not just claimed by the respondent, which is the case where (b) or (c) apply and where the leakage risk you described is real.
2. Small sample sizes (your real-world client example with n=1 or n=2)
This one VERA handles directly through a hard threshold, not through noise alone.
VERA refuses to publish any cohort result below K_MIN=100 — not a degraded or noisy version, nothing at all. This isn't arbitrary: at ε=0.5, n≥100 is required to keep aggregate error within ±5%, demonstrated and reproducible in demo_cohortes.py in the repo, which simulates exactly your case — a small department (e.g. n=7) gets refused, a larger one (e.g. n=120) gets published with bounded error.
For your actual client case (departments at n=1 or n=2): with VERA, those departments would simply never produce a published result. If an organizer disables or bypasses that threshold to publish anyway, the result falls outside VERA's guarantee entirely — it's no longer "VERA with a small cohort," it's a choice made outside the protocol.
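To make the refuse-or-publish behaviour concrete, here is a minimal sketch using the values quoted above. It is not the code in demo_cohortes.py: `publish_count` and `laplace_noise` are hypothetical names, and a sensitivity of 1 per count is an assumption.

```python
import random

K_MIN = 100    # threshold as stated in this post
EPSILON = 0.5  # privacy parameter as stated in this post

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def publish_count(true_count, cohort_size, rng, sensitivity=1.0):
    """Return a noisy count, or None when the cohort is below K_MIN."""
    if cohort_size < K_MIN:
        return None  # hard refusal: no degraded or noisy value is released
    return true_count + laplace_noise(sensitivity / EPSILON, rng)

rng = random.Random(0)
small = publish_count(4, 7, rng)     # department with n=7: refused
large = publish_count(80, 120, rng)  # department with n=120: noisy count
print(small)  # None
```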
Summary
- Token-table leakage into responses → prevented by design, because no component outside HR ever receives both identity and department together. This comes not from fixing LimeSurvey's architecture but from introducing a separation of roles that LimeSurvey alone doesn't have.
- Small cohorts → solved by a hard refusal to publish below K_MIN=100, demonstrated with numbers in the repo.
I'd rather be precise about where the guarantee actually applies than oversell it. Happy to keep being challenged on this.

2 weeks 5 days ago #274695 by VERAPROTOCOLE

Replied by VERAPROTOCOLE on topic From "ethical anonymity" to provable anonymity — a differential-privacy plugin c

Quick update, since this thread raised the right questions and I want to follow through honestly.

I ran an adversarial audit of the actual codebase this week (not just the design doc), cross-checked against my own empirical tests. It surfaced a real, critical bug I want to be transparent about: the noisy result returned by /api/rh/resultats was being re-sampled from the Laplace mechanism on every call, instead of being fixed after the first publication. That meant a caller could, in principle, average many repeated reads and cancel out the noise — silently breaking the ε guarantee after the second read.

Fixed today: the noisy result is now computed once, persisted, and returned identically on every subsequent call. Verified empirically: 5 consecutive calls to the same endpoint now return an identical noisy result, not 5 different ones. The same session also closed a few smaller issues: test endpoints that were consuming real tokens were removed from prod, the brute-force protection no longer ignores real client IPs behind the reverse proxy, and a schema gap that would have broken a clean install was filled.
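The difference between the two behaviours can be illustrated with a small sketch. Everything here is hypothetical (the `ResultStore` class, the in-memory dict standing in for real persistence, a sensitivity of 1), and it is not the code behind /api/rh/resultats; it only shows why re-sampling on every read lets a caller average the noise away, while a stored result does not.

```python
import random

EPSILON = 0.5  # privacy parameter quoted in this thread

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

class ResultStore:
    def __init__(self, rng):
        self._rng = rng
        self._published = {}  # cohort_id -> noisy result (stand-in for a database)

    def resample_on_read(self, true_count):
        """Buggy behaviour: fresh noise on every call."""
        return true_count + laplace_noise(1 / EPSILON, self._rng)

    def read_persisted(self, cohort_id, true_count):
        """Fixed behaviour: noise is drawn once, then the stored value is reused."""
        if cohort_id not in self._published:
            noisy = true_count + laplace_noise(1 / EPSILON, self._rng)
            self._published[cohort_id] = noisy
        return self._published[cohort_id]

store = ResultStore(random.Random(42))

# Averaging many re-sampled reads converges on the true count of 130.
leaky = [store.resample_on_read(130) for _ in range(10_000)]
leaky_mean = sum(leaky) / len(leaky)

# Repeated reads of the persisted result are identical, so averaging gains nothing.
fixed = [store.read_persisted("dept-A", 130) for _ in range(5)]
print(len(set(fixed)))  # 1
```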

I'd rather surface this than not — a threat model that only survives until someone reads the actual code isn't much of a threat model. Full writeup, dated, with the bug description and the fix: github.com/taha-vera/Protocole-Vera/blob...AT_MODEL_COMPLETE.md

Also did a full end-to-end pass today — HR login, token generation, code-based vote, results — from an actual phone over HTTPS, not just curl. It holds together.

Still happy to keep being challenged on any of this.

2 weeks 5 days ago #274696 by VERAPROTOCOLE

Replied by VERAPROTOCOLE on topic From "ethical anonymity" to provable anonymity — a differential-privacy plugin c

Follow-up, in the same spirit as yesterday's post.

Reviewing the code again this morning, I found that the K_MIN=100 threshold — which I described in this thread as a hard refusal to publish below 100 respondents — was defined as a constant but never actually enforced anywhere in the publication path. The DP guarantee (ε=0.5) held, but small cohorts were still getting published. That directly contradicts what I told you here, and I'd rather say so than quietly patch it.

Now implemented and verified: cohorts below 100 respondents are refused outright, before any privacy budget is consumed. Tested — departments with 1-2 respondents now return an explicit refusal, not a noisy result.
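The ordering matters: the size check has to come before any budget accounting. A minimal sketch of that ordering, assuming a simple per-survey ledger (`BudgetLedger` and `release` are hypothetical names, not taken from the commit below):

```python
class BudgetLedger:
    """Tracks how much privacy budget (epsilon) is still available."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

def release(cohort_size, ledger, epsilon, k_min, mechanism):
    """Refuse small cohorts first; only then spend budget and run the mechanism."""
    if cohort_size < k_min:
        # Explicit refusal: nothing is computed and no budget is consumed.
        return {"status": "refused", "reason": f"cohort below K_MIN={k_min}"}
    ledger.spend(epsilon)
    return {"status": "published", "result": mechanism()}

ledger = BudgetLedger(total_epsilon=1.0)
refused = release(2, ledger, epsilon=0.5, k_min=100, mechanism=lambda: 0)
budget_after_refusal = ledger.remaining  # still 1.0
published = release(150, ledger, epsilon=0.5, k_min=100, mechanism=lambda: 42)
```

Here `mechanism` is a placeholder callable for whatever noisy computation would run on an accepted cohort.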

Commit: github.com/taha-vera/Protocole-Vera/commit/fe2c7fb

Two real bugs in two days, both found by actually reading the code rather than the docs. Which is roughly the point.

2 weeks 4 days ago #274700 by VERAPROTOCOLE

Replied by VERAPROTOCOLE on topic From "ethical anonymity" to provable anonymity — a differential-privacy plugin c

Third follow-up this week — and the most significant one.

Two things I stated in this thread turned out not to match the actual behaviour of the code, and I'd rather correct them here than leave them standing.

First: I said VERA refuses to publish below K_MIN=100. The constant existed but was never enforced in the publication path — small cohorts were in fact getting published. Now fixed and enforced.

Second, and more important: I measured the real accuracy properly. Publishing three counts per department at ε=0.5 does not hold "±5% at n≥100" — the honest figure is ~12% at the 95th percentile for n=100. To actually keep error under 5% at ε=0.5, the threshold has to be n≥240. So K_MIN is now 240, not 100 — a measured value, not a guessed one. I also switched to a vectorised Laplace with projection onto the simplex (Hay et al. 2010), which cuts error ~25% and makes the published counts sum exactly to N.
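For readers unfamiliar with the projection step, here is a sketch of the standard sort-based Euclidean projection onto the scaled simplex {x ≥ 0, Σx = N}. This is a generic textbook algorithm swapped in for illustration; whether the repo implements the projection this way is an assumption, and `project_to_simplex` is a hypothetical name. The output is real-valued, so any rounding to integers would be a separate step.

```python
def project_to_simplex(values, total):
    """Closest point (in Euclidean distance) with non-negative entries summing to `total`."""
    descending = sorted(values, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for j, u in enumerate(descending, start=1):
        running_sum += u
        candidate = (running_sum - total) / j
        # The condition holds for a prefix of the sorted values;
        # the last index where it holds gives the final shift theta.
        if u - candidate > 0:
            theta = candidate
    return [max(v - theta, 0.0) for v in values]

# Three noisy counts for a department of N=100 that no longer sum to 100
# and include a negative value.
noisy_counts = [52.3, -1.7, 49.9]
consistent = project_to_simplex(noisy_counts, total=100)
```

After projection the negative count is clipped to zero and the remaining counts are shifted so that the total matches N again.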

Consequence I'll state plainly: this narrows the realistic target to larger organisations (departments of 240+), not small associations. That's a mathematical constraint of ε=0.5, not a marketing choice — and I'd rather scope it honestly than overpromise.
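One set of assumptions that reproduces both figures, offered as a plausibility check rather than as the repo's actual derivation: each published count receives independent Laplace noise X with scale b = Δ/ε, the sensitivity is Δ = 2 (changing one respondent's answer moves one unit from one count to another), and the error is the absolute per-count error divided by n. The effect of the simplex projection is ignored here.

```latex
\begin{align*}
\Pr\bigl(|X| > t\bigr) &= e^{-t/b}, \qquad b = \frac{\Delta}{\varepsilon} = \frac{2}{0.5} = 4 \\
q_{0.95} &= b \ln 20 \approx 4 \times 2.996 \approx 11.98 \text{ respondents} \\
\left.\frac{q_{0.95}}{n}\right|_{n=100} &\approx 12.0\% \\
\frac{q_{0.95}}{n} \le 5\% &\iff n \ge \frac{11.98}{0.05} \approx 239.7 \;\Rightarrow\; n \ge 240
\end{align*}
```

Under these assumptions the required cohort size scales as 1/ε, so halving ε would roughly double the threshold.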

Every accuracy figure now has a reproducible test (test_precision_kmin.py): the build fails if the docs claim a precision the mechanism doesn't deliver.
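A build gate of this kind could look roughly like the sketch below. It is not the content of test_precision_kmin.py: the claims table, the function names, and the analytic error model (Laplace noise with an assumed sensitivity of 2, error taken as the 95th-percentile absolute error divided by n) are all assumptions made for illustration.

```python
import math

EPSILON = 0.5      # privacy parameter quoted in this thread
SENSITIVITY = 2.0  # assumption: one changed answer moves a unit between two counts

# Hypothetical table of documented claims:
# (cohort size n, maximum relative error claimed at the 95th percentile).
DOCUMENTED_CLAIMS = [
    (240, 0.05),
]

def q95_relative_error(n, epsilon=EPSILON, sensitivity=SENSITIVITY):
    """95th percentile of |Laplace(0, sensitivity/epsilon)| divided by n."""
    scale = sensitivity / epsilon
    return scale * math.log(20) / n

def overstated_claims(claims):
    """Return the claims whose stated precision the mechanism does not reach."""
    return [(n, claimed) for n, claimed in claims if q95_relative_error(n) > claimed]

def test_documented_precision_is_delivered():
    # Under a test runner such as pytest, a failing assert here fails the build.
    assert overstated_claims(DOCUMENTED_CLAIMS) == []
```

With this model, a documented claim of 5% at n=100 would be flagged, while 5% at n=240 passes.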

Full detail: github.com/taha-vera/Protocole-Vera/blob...A_AUDIT_REFERENCE.md
