Welcome to the LimeSurvey Community Forum

Ask the community, share ideas, and connect with other LimeSurvey users!

From "ethical anonymity" to provable anonymity — a differential-privacy plugin c

VERAPROTOCOLE
Topic Author
New Member

3 weeks 1 day ago #274558 by VERAPROTOCOLE


Hi everyone,

Reading through the recent threads on anonymous surveys here, three points stood out to me:

- Joffm called the current situation "ethical anonymity" — because name/email are often still collected (for incentives, reminders), so the tool isn't really in a fully anonymous state; anonymity rests on the organizer behaving well.
- Holch pointed out that with fine structures (department, sub-department, unit...), you quickly drop into single-digit groups where one or two answers re-identify a specific person — and concluded "staff surveys are the hardest thing there is."
- And several users captured the core feeling: even when you promise the tokens are separated, someone always thinks "if the boss really wants to know what I wrote, he'll find a way." What they wanted was to be able to say "I cannot attribute responses" — not just promise it.

That exact gap — trust-based anonymity, and the small-cohort re-identification problem — is what I've been building an open-source protocol for (VERA), and I'd value this community's feedback on whether it could become a LimeSurvey plugin.

The idea: instead of publishing raw counts, publish an aggregate protected by differential privacy, so that no individual response — not even the lone outlier in a single-digit unit — is recoverable from the result. I structured it as a threat model of explicit "gates", each with an honest status:

- Gate 1 — Noise mechanism (CLOSED): aggregates perturbed via OpenDP (an audited DP library); epsilon = 0.5 computed analytically with meas.map(), not estimated.
- Gate 3 — Small cohorts: below a minimum participant threshold, nothing is published at all (directly addresses Holch's single-digit problem).
- Gate 4 — Composition: a capped privacy budget refuses further queries once exhausted, so anonymity can't be peeled away by re-querying.
- Gate 7 — Cohort differencing, the "49/1" attack (prototype, crypto to harden): one single-use token per participant per consultation enforces a partition. The partition logic works and is tested; the blind-signature primitive is still a homemade prototype that must be replaced by an audited library (RFC 9474) before production — I'm explicit about this.
- Gate 8 — Direct outlier inference (measured): leakage on the atypical respondent stays negligible thanks to the Gate 7 partition.

No raw responses retained after aggregation.
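To make Gates 1, 3 and 4 concrete, here is a minimal sketch of how a noisy count, a minimum-cohort threshold and a capped budget could fit together. It is not the VERA code and does not use OpenDP: the class name, parameters and the stdlib Laplace sampler are all assumptions for illustration, and floating-point Laplace sampling like this is not hardened the way an audited library is. Note also that the threshold decision here looks at the exact cohort size, which a real design would have to account for.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class BudgetedPublisher:
    """Hypothetical helper illustrating Gates 1, 3 and 4 only."""

    def __init__(self, total_epsilon: float, min_cohort: int, seed=None):
        self.remaining = total_epsilon   # Gate 4: capped privacy budget
        self.min_cohort = min_cohort     # Gate 3: publication threshold
        self.rng = random.Random(seed)

    def publish_count(self, answers: list[bool], epsilon: float):
        # Gate 4: refuse any query that would exceed the remaining budget.
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        # Gate 3: below the threshold, publish nothing at all.
        if len(answers) < self.min_cohort:
            return None
        self.remaining -= epsilon
        # Gate 1: a count has sensitivity 1, so the Laplace scale is 1/epsilon.
        return sum(answers) + laplace_noise(1.0 / epsilon, self.rng)


publisher = BudgetedPublisher(total_epsilon=1.0, min_cohort=10, seed=42)
print(publisher.publish_count([True] * 3, epsilon=0.5))                  # suppressed: None
print(publisher.publish_count([True] * 20 + [False] * 20, epsilon=0.5))  # noisy count
```

In this sketch a suppressed query is not charged against the budget; whether it should be is a design choice left open here.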

The goal is exactly what people in those threads asked for: turning "trust us, the tables are separated" into a mathematical guarantee that the link simply cannot be made.

What I deliberately do NOT claim: network-level observers (IP upstream) and coercion are out of scope; the GDPR qualification (anonymization vs pseudonymization, Art. 5) is left to a DPO/CNIL opinion.

Code, full threat model, reproducible proof: github.com/taha-vera/Protocole-Vera

Questions:
1. Is there interest in a plugin that publishes DP-protected aggregates instead of raw counts?
2. Which LimeSurvey event/hook would be the right place to intercept aggregation before results are stored or exported?
3. Has anyone here explored differential privacy (beyond pseudonymization tools like ALIIAS)?

Happy to be challenged on any gate.

Taha

holch
LimeSurvey Community Team

3 weeks 1 day ago #274569 by holch


VERAPROTOCOLE wrote:
The idea: instead of publishing raw counts, publish an aggregate protected by differential privacy, so that no individual response — not even the lone outlier in a single-digit unit — is recoverable from the result. I structured it as a threat model of explicit "gates", each with an honest status:

I am not sure if your protocol can solve the "anonymity" problem that Joffm and I are addressing here.

Some people want "technical anonymity". This can be done with the anonymous mode in LimeSurvey. But this mode has some drawbacks, because you can't use data from the participant table / token table.

The issue of anonymity only arises if you need data from the token table to be connected with the survey. For example, you want to base a question on data that is stored in the participant/token table. Let's say you want to ask specific questions based on the department someone works in. Or you want to ask specific questions only to a certain age group, etc.

The issue is, if you can make this connection, you also have the possibility to pass data that might identify a person (name, token, email, date of birth, etc.), which means the survey is not fully anonymous. So you need to guarantee "ethical anonymity" (you won't connect identifiable data with the survey data, even if you can) instead of technical anonymity (there is no technical way to connect identifiable data from the token table with the survey data).

So the anonymous mode is truly anonymous, but it has some drawbacks and limitations, and it is less flexible.

I don't see how the Vera protocol can overcome this, as you (from my understanding) could still download the full data set.

Help us to help you!

Provide your LS version and where it is installed (own server, uni/employer, SaaS hosting, etc.).
Always provide a LSS file (not LSQ or LSG).

Note: I answer at this forum in my spare time, I'm not a LimeSurvey GmbH employee.


VERAPROTOCOLE
Topic Author
New Member

2 weeks 5 days ago #274577 by VERAPROTOCOLE


Thanks Holch — I took your objection seriously and tested it rather than argue from theory.

First, to avoid confusion: I'm not describing LimeSurvey's native anonymous mode, but a prototype aggregation layer (VERA) that would sit on top of it. So when I say "token", I mean how that layer handles things, not how LimeSurvey stores responses.

Your key question, as I read it: if the system knows which department a respondent belongs to, that information exists somewhere — so how do you keep it from being linked back to the answer?

On the storage side: the department is used only to decide which aggregation bucket the answer goes into. The routing attribute is consumed before storage, not stored alongside the answer. For one-response-per-person, the token is hashed and that hash is kept in a separate dedup set — it is never written next to the answer, so there's no joinable token↔response row (the dedup set and the answers are different structures, never joined). What's published per department is a DP aggregate (noisy sum over noisy count, so even the exact headcount is hidden), not an individual-level dataset to download or join.
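A minimal sketch of that storage split, under stated assumptions: the class and method names are hypothetical, the salt is my addition (the description above only says the token is hashed), and this is not the code in the repo. The point it shows is that the dedup set holds only token hashes, the buckets hold only answers, and the department is used as a routing key rather than stored next to a token.

```python
import hashlib
from collections import defaultdict


class CohortCollector:
    """Hypothetical collector: token hashes and answers live in separate structures."""

    def __init__(self, salt: bytes):
        self._salt = salt                       # assumption: salted hash
        self._seen: set[str] = set()            # dedup set: token hashes only
        self._buckets: dict[str, list[int]] = defaultdict(list)  # answers only

    def submit(self, token: str, department: str, rating: int) -> bool:
        digest = hashlib.sha256(self._salt + token.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False                        # one response per token
        self._seen.add(digest)
        # The department only selects the bucket; neither the token nor
        # its hash is written alongside the rating.
        self._buckets[department].append(rating)
        return True

    def bucket_sizes(self) -> dict[str, int]:
        return {dept: len(ratings) for dept, ratings in self._buckets.items()}


collector = CohortCollector(salt=b"per-survey-secret")
print(collector.submit("tok-001", "HR", 4))   # True: accepted
print(collector.submit("tok-001", "HR", 1))   # False: duplicate token
print(collector.bucket_sizes())
```

What the sketch does not address: the order and timing of writes to the two structures could still be correlated by someone with access to the server, and a hash of a guessable token can be brute-forced if the salt leaks.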

But you're pointing at the real upstream question: where does the department come from in the first place? Honestly I haven't settled that, and the choice matters — it could be (a) self-declared by the respondent as a survey question (no token attribute at all), (b) carried as a token attribute, or (c) looked up from an external/HR directory at entry. Each has a different exposure: (a) trusts the declaration, (b) means the token carries an attribute, (c) means that directory exists and could be correlated. You know the LimeSurvey workflow far better than I do — which of these would actually fit how people run cohort surveys there?

What this does NOT solve, and I won't pretend otherwise:
- Trust doesn't fully disappear, it moves: a non-technical respondent still can't verify at submission time that things work as described — they'd have to trust the implementation (or an independent audit), not just take my word for the code. I think that's honest to state up front.
- Small cohorts: with DP a tiny department's aggregate just gets too noisy to be useful (DP still holds at n=7, it's the result that's unusable, not the privacy), which is why a minimum-cohort threshold makes sense for publication.
- The per-department ε only holds cleanly if cohorts are disjoint and you don't also publish cross-cutting aggregates over the same people — otherwise the budget composes and you have to account for it.
- If you need per-individual linked data kept for reminders, response updates or later cross-analysis, this isn't the tool — it's a different class of system from standard LimeSurvey exports.
- Out of scope too: timing correlation, network metadata, and the blind-signature route (still a non-audited prototype).
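To put rough numbers on the small-cohort and composition points, here is a back-of-the-envelope sketch. The assumptions are mine: a 1 to 5 rating scale, Laplace noise, and the total ε = 0.5 split evenly between the noisy sum and the noisy count (basic sequential composition). The real mechanism in the repo may be calibrated differently.

```python
import math


def laplace_std(sensitivity: float, epsilon: float) -> float:
    # Laplace(0, b) has standard deviation b * sqrt(2), with b = sensitivity / epsilon.
    return math.sqrt(2.0) * sensitivity / epsilon


EPSILON_TOTAL = 0.5   # value quoted earlier in the thread
MAX_RATING = 5        # assumption: ratings on a 1-5 scale

# Assumption: the budget is split evenly between the sum and the count.
eps_each = EPSILON_TOTAL / 2
sum_std = laplace_std(MAX_RATING, eps_each)   # one person shifts the sum by at most 5
count_std = laplace_std(1, eps_each)          # one person shifts the count by 1

for n in (7, 50, 500):
    max_true_sum = n * MAX_RATING
    print(
        f"n={n:>3}: sum noise std {sum_std:.1f} vs max true sum {max_true_sum}, "
        f"count noise std {count_std:.1f}"
    )
```

Under these assumptions the noise on the sum has a standard deviation of about 28, against a maximum possible true sum of 35 at n=7, so the published mean would be mostly noise. The privacy guarantee is unchanged; only the utility collapses.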

The reproducible test script for exactly this (routing by cohort + DP aggregate, seed=42) is demo_cohortes.py in the repo, if you want to break it: github.com/taha-vera/Protocole-Vera/blob/main/demo_cohortes.py


holch
LimeSurvey Community Team

2 weeks 5 days ago #274584 by holch


VERAPROTOCOLE wrote:
On the storage side: the department is used only to decide which aggregation bucket the answer goes into. The routing attribute is consumed before storage, not stored alongside the answer. For one-response-per-person, the token is hashed and that hash is kept in a separate dedup set — it is never written next to the answer, so there's no joinable token↔response row (the dedup set and the answers are different structures, never joined). What's published per department is a DP aggregate (noisy sum over noisy count, so even the exact headcount is hidden), not an individual-level dataset to download or join.

But the information about the participant must come from somewhere. The common way is to store it in the token / participant table as a custom attribute. You could pass it on via the URL as well, but this has its own challenges.

If you store it in the token table, you need to get it into the survey somehow, so that you can use it.

VERAPROTOCOLE wrote:
But you're pointing at the real upstream question: where does the department come from in the first place? Honestly I haven't settled that, and the choice matters — it could be (a) self-declared by the respondent as a survey question (no token attribute at all),

If it is self-declared, we do not need the additional layer, at least not for the main issue, which is the connection between personal data and response data. It could help with the second scenario, where people can be identified because of small sample sizes. But let me give you a real-life example: even if you aggregate the data, this can still be an issue. We had an internal survey for a client and we would only present aggregate information to the client. We knew who was behind the answers, but the client (in theory) didn't. To protect respondents we didn't even show the sample size (which in many cases was below n=5, in some cases n=2 or even n=1). But employees know how many people work in a certain department. If you receive a low rating from a department where only one or two people work, you basically know who gave you a low rating, even if the data is aggregated.
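The arithmetic behind that example can be sketched in a few lines. This is only an illustration with made-up numbers and a hypothetical function name: it assumes the exact (un-noised) mean is published and that the reader knows the department headcount and one of the ratings.

```python
def infer_remaining_sum(published_mean: float, cohort_size: int, known_ratings: list[float]) -> float:
    # An exact mean times the known headcount gives the exact total;
    # subtracting the ratings you already know leaves the rest.
    return published_mean * cohort_size - sum(known_ratings)


# Department of two: one member knows their own rating (4) and sees the mean (2.5),
# so the colleague's rating is fully determined.
print(infer_remaining_sum(published_mean=2.5, cohort_size=2, known_ratings=[4]))

# Department of one: the "aggregate" is simply that person's answer.
print(infer_remaining_sum(published_mean=1.0, cohort_size=1, known_ratings=[]))
```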

You could now say: do not present the data by department. But then I do not need the information about the department at all and everything is fine. That would be a totally different scenario. The task was that every "client" department would evaluate every "service provider" department. So you need to show which department evaluated which department, and how.

VERAPROTOCOLE wrote:
(b) carried as a token attribute, or (c) looked up from an external/HR directory at entry. Each has a different exposure: (a) trusts the declaration, (b) means the token carries an attribute, (c) means that directory exists and could be correlated. You know the LimeSurvey workflow far better than I do — which of these would actually fit how people run cohort surveys there?

What this does NOT solve, and I won't pretend otherwise:
- Trust doesn't fully disappear, it moves: a non-technical respondent still can't verify at submission time that things work as described — they'd have to trust the implementation (or an independent audit), not just take my word for the code. I think that's honest to state up front.
- Small cohorts: with DP a tiny department's aggregate just gets too noisy to be useful (DP still holds at n=7, it's the result that's unusable, not the privacy), which is why a minimum-cohort threshold makes sense for publication.
- The per-department ε only holds cleanly if cohorts are disjoint and you don't also publish cross-cutting aggregates over the same people — otherwise the budget composes and you have to account for it.
- If you need per-individual linked data kept for reminders, response updates or later cross-analysis, this isn't the tool — it's a different class of system from standard LimeSurvey exports.
- Out of scope too: timing correlation, network metadata, and the blind-signature route (still a non-audited prototype).

The reproducible test script for exactly this (routing by cohort + DP aggregate, seed=42) is demo_cohortes.py in the repo, if you want to break it: github.com/taha-vera/Protocole-Vera/blob/main/demo_cohortes.py

To be honest, I still do not see how your proposed approach would solve the issues presented. Maybe that is because I do not fully understand your approach.

But I don't see how it could solve the problem when the attributes are stored in the token table and passed on to the survey, either to store them there or to use them for filtering questions, etc. If you allow this, you would still have the issue that the survey creator could pass on details from the token table that should not be in the response table (name, email, token, etc.).

Now if you use the anonymous mode, everything is fine. But you can't use the attributes (e.g. the department). Your first solution was self-declaration. But then none of this needs to be discussed, because one could use the anonymous mode. In general, from what I see in the forum, most people want to pass the information into the survey rather than rely on self-declaration. With self-declaration the first anonymity problem can be easily solved without your protocol, because you can keep the token table and the response table 100% separate, which is what the current anonymous mode does.

As for the second anonymity issue described (small sample sizes allow identification), this is still an issue even with only aggregate numbers, just as I described above. So I currently really do not see how the Vera protocol could help with these two issues.


Moderators: holch, tpartner
