Welcome to the LimeSurvey Community Forum

From "ethical anonymity" to provable anonymity — a differential-privacy plugin c

2 days 16 hours ago #274558 by VERAPROTOCOLE
 Hi everyone,

Reading through the recent threads on anonymous surveys here, three points stood out to me:

- Joffm called the current situation "ethical anonymity" — because name/email are often still collected (for incentives, reminders), so the tool isn't really in a fully anonymous state; anonymity rests on the organizer behaving well.
- Holch pointed out that with fine structures (department, sub-department, unit...), you quickly drop into single-digit groups where one or two answers re-identify a specific person — and concluded "staff surveys are the hardest thing there is."
- And several users captured the core feeling: even when you promise the tokens are separated, someone always thinks "if the boss really wants to know what I wrote, he'll find a way." What they wanted was to be able to say "I cannot attribute responses" — not just promise it.

That exact gap — trust-based anonymity, and the small-cohort re-identification problem — is what I've been building an open-source protocol for (VERA), and I'd value this community's feedback on whether it could become a LimeSurvey plugin.

The idea: instead of publishing raw counts, publish an aggregate protected by differential privacy, so that no individual response — not even the lone outlier in a single-digit unit — is recoverable from the result. I structured it as a threat model of explicit "gates", each with an honest status:

- Gate 1 — Noise mechanism (CLOSED): aggregates perturbed via OpenDP (an audited DP library); epsilon = 0.5 computed analytically with meas.map(), not estimated.
- Gate 3 — Small cohorts: below a minimum participant threshold, nothing is published at all (directly addresses Holch's single-digit problem).
- Gate 4 — Composition: a capped privacy budget refuses further queries once exhausted, so anonymity can't be peeled away by re-querying.
- Gate 7 — Cohort differencing, the "49/1" attack (prototype, crypto to harden): one single-use token per participant per consultation enforces a partition. The partition logic works and is tested; the blind-signature primitive is still a homemade prototype that must be replaced by an audited library (RFC 9474) before production — I'm explicit about this.
- Gate 8 — Direct outlier inference (measured): leakage on the atypical respondent stays negligible thanks to the Gate 7 partition.
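
To make Gates 1, 3 and 4 concrete, here is a minimal sketch in plain Python. It is not the VERA code and does not use OpenDP: the class name, the threshold of 10 and the budget cap of 1.0 are assumptions chosen for illustration, and the floating-point Laplace sampler is only a stand-in for an audited mechanism. It assumes a yes/no count query, where adding or removing one respondent changes the result by at most 1, so a Laplace scale of 1/epsilon gives epsilon-DP per query.

```python
import random

MIN_COHORT = 10          # hypothetical publication threshold (Gate 3)
EPSILON_PER_QUERY = 0.5  # per-query epsilon quoted for Gate 1
TOTAL_BUDGET = 1.0       # hypothetical cap on cumulative epsilon (Gate 4)


class BudgetExhausted(Exception):
    """Raised when a query would push spending past the budget cap."""


class PrivateCounter:
    """Toy DP counter: suppression, Laplace noise, capped budget."""

    def __init__(self, total_budget=TOTAL_BUDGET, rng=None):
        self.remaining = total_budget
        self.rng = rng or random.Random()

    def _laplace(self, scale):
        # The difference of two i.i.d. exponentials with mean `scale`
        # follows a Laplace(0, scale) distribution.
        rate = 1.0 / scale
        return self.rng.expovariate(rate) - self.rng.expovariate(rate)

    def noisy_count(self, answers, epsilon=EPSILON_PER_QUERY):
        # Gate 3: publish nothing for cohorts below the threshold.
        if len(answers) < MIN_COHORT:
            return None
        # Gate 4: refuse the query once the budget cannot cover it.
        if epsilon > self.remaining:
            raise BudgetExhausted("privacy budget exhausted")
        self.remaining -= epsilon
        # Gate 1: a count has sensitivity 1, so scale = 1 / epsilon.
        true_count = sum(1 for a in answers if a)
        return true_count + self._laplace(1.0 / epsilon)
```

With these assumed values, a cohort of 5 returns None, two queries on a cohort of 50 each return a noisy float, and a third query raises BudgetExhausted because 0.5 + 0.5 has already used the whole cap.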

No raw responses retained after aggregation.
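
For readers unfamiliar with the "49/1" differencing problem that Gate 7 targets, here is a toy illustration on made-up data. The function names are hypothetical and this is not how VERA implements its tokens; it only shows why two exact counts over overlapping cohorts leak one person's answer, and what a disjointness (partition) check over cohorts looks like.

```python
def count_yes(answers, members):
    """Exact (non-private) count of 'yes' answers among `members`."""
    return sum(answers[m] for m in members)


def differencing_leak(answers, target):
    """Subtract the count without `target` from the count with `target`."""
    everyone = set(answers)
    return count_yes(answers, everyone) - count_yes(answers, everyone - {target})


def is_partition(cohorts):
    """True if no participant appears in more than one cohort."""
    seen = set()
    for cohort in cohorts:
        if seen & cohort:
            return False
        seen |= cohort
    return True


# 49 colleagues answered "no", one target answered "yes" (made-up data).
answers = {f"p{i}": False for i in range(49)}
answers["target"] = True

leak = differencing_leak(answers, "target")  # 1 means the target said "yes"
everyone = set(answers)
overlapping_ok = is_partition([everyone, everyone - {"target"}])
```

Here `leak` is 1, which reveals the target's answer exactly, and `overlapping_ok` is False: the two cohorts used for the attack overlap, so a scheme that only ever publishes disjoint cohorts would not release both counts.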

The goal is exactly what people in those threads asked for: turning "trust us, the tables are separated" into a mathematical guarantee that the link simply cannot be made.

What I deliberately do NOT claim: network-level observers (IP upstream) and coercion are out of scope; the GDPR qualification (anonymization vs pseudonymization, Art. 5) is left to a DPO/CNIL opinion.

Code, full threat model, reproducible proof: github.com/taha-vera/Protocole-Vera

Questions:
1. Is there interest in a plugin that publishes DP-protected aggregates instead of raw counts?
2. Which LimeSurvey event/hook would be the right place to intercept aggregation before results are stored or exported?
3. Has anyone here explored differential privacy (beyond pseudonymization tools like ALIIAS)?

Happy to be challenged on any gate.

Taha

2 days 1 minute ago #274569 by holch

The idea: instead of publishing raw counts, publish an aggregate protected by differential privacy, so that no individual response — not even the lone outlier in a single-digit unit — is recoverable from the result. I structured it as a threat model of explicit "gates", each with an honest status:
 
I am not sure your protocol can solve the "anonymity" problem that Joffm and I are addressing here.

Some people want "technical anonymity". This can be done with the anonymous mode in LimeSurvey. But this mode has some drawbacks, because you can't use data from the participant table / token table.

The issue of anonymity only arises if you need data from the token table to be connected with the survey. For example, you want to base a question on data that is stored in the participant/token table. Let's say you want to ask specific questions based on the department someone works in. Or you want to ask specific questions only to a certain age group, etc.

The issue is, if you can make this connection, you also have the possibility to pass data that might identify a person (name, token, email, date of birth, etc.), which means the survey is not fully anonymous. So you need to guarantee "ethical anonymity" (you won't connect identifiable data with the survey data, even if you can) instead of technical anonymity (there is no technical way to connect identifiable data from the token table with the survey data).

So the anonymous mode really is anonymous, but it has drawbacks and limitations and is less flexible.

I don't see how the VERA protocol can overcome this, as you (from my understanding) could still download the full data set.

Help us to help you!
  • Provide your LS version and where it is installed (own server, uni/employer, SaaS hosting, etc.).
  • Always provide a LSS file (not LSQ or LSG).
Note: I answer at this forum in my spare time, I'm not a LimeSurvey GmbH employee.


Moderators: holch, tpartner
