On 16 April 2026, the European Commission (EC) released its preliminary findings in the case DMA.100209 – SP against Alphabet. In a document open to consultation for only two weeks, the EC sets out the proposed measures that Alphabet would be required to implement to comply with its data sharing obligations under Article 6(11) of the Digital Markets Act (DMA).

This document is significant. It offers the first concrete illustration of how the EC approaches anonymisation under the EU digital rulebook, a rulebook one of whose foundational pillars is the General Data Protection Regulation (GDPR). As a reminder, the proposed Digital Omnibus Regulation includes an attempt to re-delineate the concept of personal data and to recalibrate the institutional balance between the European Data Protection Board (EDPB) and the Commission through a revised Article 4 and a new Article 41a (see here for a review of the quality of the drafting).

For context, Article 6(11) DMA provides that: “The gatekeeper shall provide to any third-party undertaking providing online search engines, at its request, with access on fair, reasonable and non-discriminatory terms to ranking, query, click and view data in relation to free and paid search generated by end users on its online search engines. Any such query, click and view data that constitutes personal data shall be anonymised.”

The goal, as expressed in Recital 61 DMA, is to “ensure the protection of the personal data of end users, including against possible re-identification risks, by appropriate means, such as anonymisation of such personal data, without substantially degrading the quality or usefulness of the data.”

As explained here, the European Data Protection Board (EDPB) and the EC, in their draft guidelines, state that the goal is “to protect the interests of end users authoring the queries and not necessarily the interests of natural persons whose data may appear in search queries. Personal data other than end-user personal data should be at a minimum pseudonymised.”

Because the DMA and the GDPR must be read together (the concept of personal data comes from the GDPR, see Article 2(25) DMA), the anonymisation standard should be coherent with GDPR Recital 26.

The EC’s preliminary findings will be followed by an implementing act (see para 99). Under Article 8(2) DMA, the EC has the power to specify the measures gatekeepers must implement to effectively comply with the obligations in Article 6 DMA. However, only supervisory authorities in the sense of the GDPR, acting under the control of competent courts, can apply the definitions of the GDPR in an independent manner as guaranteed by Article 8(3) of the European Union Charter of Fundamental Rights, as mentioned in the EDPB/EC draft guidelines.

By opening its preliminary findings to consultation, the EC is seeking feedback on three specific issues: whether the measures are effective, whether they are complete, and whether the implementation timeline is realistic.

This blog post aims to raise three concerns about the EC’s approach, against the backdrop of developing a coherent approach to anonymisation under the GDPR that must be informed by state-of-the-art techniques. To echo Guillaume Champeau’s words, the goal is not to throw the baby out with the bathwater but to find a more principled framework for balancing the competing interests at stake.

Two preliminary remarks are worth making. First, not every third-party undertaking providing online search engines in the EU and in the EEA is entitled to Alphabet’s search data: eligibility conditions must be met. Second, each search data recipient is subject to a series of downstream restrictions. As a result, eligible data recipients are only allowed to access the search data for the purpose of “optimising or improving [online search engine] services” (para. 40); they are under security obligations (see section 3.2.2); and they are prohibited from sharing the data beyond their processors acting within the scope of their mandate (see para. 42).

1. The EC’s Approach to Anonymisation: Roots and Main Components

The EC pursues a hybrid approach combining both technical and contractual controls with a view to “ensure anonymisation of end users’ personal data in the Search Data” (para. 19).

As regards technical controls, the approach appears to be based on a series of heuristics reminiscent of statistical disclosure methods, which, from a regulatory perspective, trace some of their roots to guidance developed to comply with laws such as the United States Health Insurance Portability and Accountability Act (HIPAA), and its section on de-identification. The Guidance regarding methods for the de-identification of protected health information under the HIPAA Privacy Rule was released in its most significant form in 2012. A heuristic is a practical rule of thumb used to solve a problem. Heuristics are not guaranteed to be perfect; rather, they are employed because they are workable and considered sufficiently reliable in complex situations, typically where no widely accepted rule exists.

In practice, the de-identification referred to by HIPAA is often understood merely as the removal or suppression of direct personal identifiers. This, however, is a far weaker privacy objective than addressing all information that, whether alone or in combination with other data, could enable the re-identification of an individual.

Most notably, HIPAA practice today leverages formal privacy models such as k-anonymity, which is not what the EC is doing in its preliminary findings: it sticks to heuristics.

Unlike mere heuristics, formal anonymisation models provide mathematically grounded guarantees that a specific privacy objective is met, such as the indistinguishability of individual records within a dataset, even when an attacker has access to the dataset itself.

Under the EC’s approach, Alphabet would be asked to share search data daily at the record level, with each record containing a query and its associated metadata. Before sharing, a set of attributes is stripped out entirely: user identifiers, precise timestamps, screen layout information, and image-based queries. What remains is the text of the query itself, alongside metadata fields covering location, inferred language, device type, and access point, plus behavioural signals such as click-back time (i.e., the amount of time a user spends on a search result page before clicking back to the search results), which is retained but rounded into broad time buckets rather than shared precisely.
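As a rough sketch of this pre-sharing treatment, the snippet below drops the suppressed attributes and rounds click-back time into coarse buckets. The field names and bucket boundaries are illustrative assumptions, not the EC’s specification.

```python
# Hypothetical field names; the bucket boundaries are illustrative only.
SUPPRESSED_FIELDS = {"user_id", "timestamp", "screen_layout", "image_query"}

CLICK_BACK_BUCKETS = [
    (0, 5, "0-5s"),
    (5, 30, "5-30s"),
    (30, 120, "30s-2min"),
    (120, float("inf"), ">2min"),
]


def bucket_click_back(seconds: float) -> str:
    """Round a precise click-back time into a broad time bucket."""
    for low, high, label in CLICK_BACK_BUCKETS:
        if low <= seconds < high:
            return label
    return "unknown"


def prepare_record(raw: dict) -> dict:
    """Drop stripped-out attributes and coarsen precise behavioural signals."""
    record = {k: v for k, v in raw.items() if k not in SUPPRESSED_FIELDS}
    if "click_back_seconds" in record:
        record["click_back_bucket"] = bucket_click_back(record.pop("click_back_seconds"))
    return record
```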

The core privacy mechanism mandated by the EC is query suppression, operated through two filters applied in sequence. First, an entity-based filter checks whether every meaningful unit (entity) in the query (e.g., names, addresses, phone numbers, …) appears on a pre-built allowlist of terms submitted by at least 50 signed-in users over the past 13 months. Second, a length-based filter removes queries that exceed a character threshold calculated weekly per language, set at the point where 95% of queries fall below it. Any query failing either filter is dropped entirely from the dataset. The same thresholds apply to related text fields such as query refiners and search filters.
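A minimal sketch of the two sequential filters might look as follows. It assumes that entity extraction can be reduced to simple tokenisation and that the allowlist and the per-language length threshold have already been computed elsewhere, which is a considerable simplification of the actual mechanism.

```python
# Simplifying assumptions: "entities" are whitespace tokens; the allowlist and
# per-language length threshold are precomputed inputs.

def passes_entity_filter(query: str, allowlist: set[str]) -> bool:
    """Entity-based filter: every unit of the query must appear on the allowlist."""
    return all(token.lower() in allowlist for token in query.split())


def passes_length_filter(query: str, max_length: int) -> bool:
    """Length-based filter: the query must not exceed the per-language threshold."""
    return len(query) <= max_length


def suppress_queries(records: list[dict], allowlist: set[str], max_length: int) -> list[dict]:
    """Drop any record whose query fails either filter (query suppression)."""
    return [
        r
        for r in records
        if passes_entity_filter(r["query"], allowlist)
        and passes_length_filter(r["query"], max_length)
    ]
```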

For queries that survive suppression, metadata is further protected through generalisation rather than removal. Location is expressed as a country plus a geographic cell covering at least 3km² and containing at least 1,000 signed-in users; if that population threshold is not met, the cell is progressively widened, device information is dropped, and location is coarsened further until the threshold is satisfied. If it still cannot be met, the record is removed. A metadata threshold is also applied, requiring that at least 50 users share the same combination of metadata. Alphabet includes all search records that have passed through the five steps (attribute suppression, allowlist creation, length-based threshold determination, query suppression, and metadata generalisation) in the final search dataset that is shared daily.
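The progressive generalisation logic could be sketched as follows. The `population` and `widen` helpers, the metadata key and the bounded number of widening steps are hypothetical simplifications of what the findings describe.

```python
MIN_CELL_POPULATION = 1_000   # signed-in users per geographic cell (per the findings)
MIN_METADATA_GROUP = 50       # users sharing the same metadata combination


def generalise_location(record: dict, population, widen, max_steps: int = 5):
    """Coarsen the location cell (dropping device info first) until the population
    threshold is met; return None when the record must be removed instead."""
    cell = record["location_cell"]
    for step in range(max_steps):
        if population(cell) >= MIN_CELL_POPULATION:
            record["location_cell"] = cell
            return record
        if step == 0:
            record.pop("device_type", None)  # device information is dropped first
        cell = widen(cell)                   # hypothetical helper: next coarser cell
    return None


def passes_metadata_threshold(record: dict, metadata_counts: dict) -> bool:
    """At least 50 users must share this exact combination of metadata fields."""
    key = (
        record.get("location_cell"),
        record.get("language"),
        record.get("device_type"),
        record.get("access_point"),
    )
    return metadata_counts.get(key, 0) >= MIN_METADATA_GROUP
```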

These technical measures are complemented by contractual measures, which act as a second line of defence. This second layer of the framework is designed to reduce residual re-identification risks to an insignificant level through obligations binding on those who receive the data. These obligations must translate into a package of organisational, administrative and technical measures applied to the environment in which the data is held by the recipient, such as data segmentation and unlinkability across datasets (including auxiliary advertising and analytics datasets), traceability of data flows, fine-grained access control (see section 3.2.2), as well as monitoring obligations on the part of Alphabet (see para. 138). An independent assurance mechanism must be established, and an independent reasonable assurance report must be shared by the candidate data recipient (see para. 52). These measures confirm that anonymisation, as a hybrid process, is a regulated processing activity to be carried out for a specific and legitimate purpose, in this case “the genuine development, improvement, or optimisation of its own [online search engine] services,” and that the purpose must remain legitimate over time. They also underscore the importance of transparency and auditability in relation to such activities. Although necessary, such safeguards should not, however, be viewed as substitutes for robust technical measures, because data transformation measures protect the data itself, precisely in case of unauthorised access or unlawful disclosure.

As hinted here, for a contextual approach to anonymisation to function effectively, it is important not to base it on the actual means available to the data holder, but rather on state-of-the-art techniques and a constructive approach to controls. Building such a framework requires a well-founded understanding of the current state-of-the-art on data anonymisation, particularly given today’s markedly different environment, in which re-identification is increasingly facilitated through AI-driven pipelines. A constructive approach measures the data holder’s conduct against an objective standard of reasonable care, asking not what this particular holder did, but what a prudent and diligent actor in the same position should have done in the light of the state-of-the-art and foreseeable technological developments.

2. Three Remarks on Proposed Technical Measures That Lag Behind the State of the Art

Three types of considerations are important for assessing the approach the EC is suggesting.

  • The proposed technical measures do not offer robust privacy guarantees: unique search records may still remain.

The presence of unique records within a dataset is a key driver of re-identification risk. It enables an attacker, equipped with the dataset and some background information about an individual, to single out that individual’s record and then re-identify him or her. Robust anonymisation should at least guarantee that unequivocally unique records are not present in the dataset. Moreover, it should bound the probability of random re-identification at a level considered safe or acceptably low. Technically, these objectives constitute a formal privacy guarantee: they can be expressed and verified using precise mathematical definitions, rather than informal assurances. This is particularly important in contexts involving large-scale personal data processing and data sharing ecosystems whose participants do not trust each other and are in competition.

The suppression mechanism proposed by the EC works by removing queries that contain rare words or entities, which certainly targets unique queries. But it does so by checking each query component (text entities or individual words) in isolation. Because it does not check whether the combination of all query components is unique, a unique query can, in effect, still pass all the thresholds. To make matters worse, the query suppression and metadata generalisation mechanisms also operate in isolation from each other. As a result, they do not guarantee that unique search records (i.e., the combination of metadata and query) are eliminated from the so-called “anonymised” output.

In plainer terms, the allowlist confirms that the words in a query are common; the length threshold confirms that the query is not unusually long; the metadata threshold confirms that at least 50 users share the same metadata combination. But none of these checks asks the fundamental question: how many people submitted this exact query with this metadata combination? The answer may well be just one.
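The gap can be made concrete in a few lines of code: even after every EC-style filter has passed, a simple count over the full (query, metadata) combinations can reveal records shared by no one else. Field names are illustrative, and metadata is assumed to be a hashable tuple of the generalised fields.

```python
from collections import Counter


def find_unique_survivors(records: list[dict]) -> list[dict]:
    """Among records that passed the EC-style filters, flag those whose full
    (query, metadata) combination occurs exactly once: they remain re-identifiable."""
    combo_counts = Counter((r["query"], r["metadata"]) for r in records)
    return [r for r in records if combo_counts[(r["query"], r["metadata"])] == 1]
```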

This is not a theoretical concern. Nearly 30 years of academic research has shown that the combination of a small number of data points is enough to re-identify an individual, and that effective anonymisation requires assessing data attributes in combination, rather than in isolation (see here and here).

As mentioned above, the EC’s approach is heuristic-based. It applies a set of practical rules (i.e., check the words, check the length, check the metadata), but those rules are not derived from a formal privacy model. They are engineering judgments, not mathematical guarantees. This has two consequences: privacy protection is not guaranteed, and the utility loss is not calibrated to any explicit privacy requirement.

The well-known k-anonymity privacy model, which is the privacy notion most closely aligned with how the EC’s rules are defined, offers more meaningful protection. Under k-anonymity, a search record would only be released if at least k individuals share the same combination of data points. This is a direct answer to the combinatorial re-identification problem described above: it checks the combination, not just the individual attributes, thereby guaranteeing that unequivocal re-identification is no longer possible and bounding the probability of random re-identification by 1/k.
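A minimal sketch of such a record-level check, assuming each record has been reduced to a hashable tuple of its query and metadata attributes, could look like this (suppression-based, for simplicity; utility-preserving variants are discussed below):

```python
from collections import Counter


def k_anonymise_by_suppression(records: list[tuple], k: int = 50) -> list[tuple]:
    """Release a record only if at least k records share the exact same combination
    of query and metadata, bounding random re-identification at 1/k."""
    counts = Counter(records)
    return [r for r in records if counts[r] >= k]
```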

  • The proposed technical measures would substantially reduce the utility of the search data: they tend to ignore utility-preserving query masking. 

To understand the implications of the proposed technical measures on utility preservation, it is important to bear in mind the distinction between suppression and utility-preserving masking. Suppression removes data entirely when it cannot satisfy the required privacy threshold and, thus, offers no residual utility for the removed information. Utility-preserving masking refers to privacy-enhancing transformations that reduce disclosure risk while retaining as much analytical value as possible. Rather than deleting data, such methods modify it only to the extent necessary to meet the target privacy requirement. Generalisation is the clearest example, but other techniques, such as perturbation, swapping, aggregation, or partial suppression, may serve the same function depending on the context.
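To make the contrast concrete, the following sketch masks a failing record instead of dropping it. The placeholder token, field names and generalisation step are illustrative assumptions, not the EC’s or Alphabet’s specification.

```python
def redact_rare_tokens(query: str, allowlist: set[str]) -> str:
    """Partial suppression: replace rare tokens with a placeholder, keep the rest."""
    return " ".join(t if t.lower() in allowlist else "[REDACTED]" for t in query.split())


def mask_record(record: dict, allowlist: set[str]) -> dict:
    """Utility-preserving masking: coarsen the risky parts but keep the record."""
    masked = dict(record)
    masked["location"] = record["country"]  # generalise: fine-grained cell -> country
    masked["query"] = redact_rare_tokens(record["query"], allowlist)
    return masked


def suppress_record(record: dict) -> None:
    """Suppression: the record is removed outright and retains no utility."""
    return None
```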

As mentioned above, the core privacy mechanism mandated by the EC in its preliminary findings is query suppression, operated through two filters. Metadata generalisation only happens once the query suppression decision has been made. In other words, utility-preserving query masking is not considered to be an alternative to query suppression.

In this respect, k-anonymity is again a meaningful solution that does not necessarily require suppressing records that fail to meet the indistinguishability condition: it is sufficient to modify such records to render them indistinguishable. Furthermore, alternative formulations such as probabilistic k-anonymity allow for a broader range of masking techniques, such as data swapping. Note that we assume k-anonymity is enforced across all metadata attributes and query components, so that the privacy guarantee does not depend on any prior choice of which attributes are considered (quasi-)identifying and confidential. This is important because these prior choices are usually based upon conventions, which may vary from one expert to another.

Although a k-anonymity-based approach would mask more queries than the EC’s suppression mechanism, because enforcing indistinguishability at the record level is more stringent than doing so at the level of individual data points, it is compatible with non-suppression-based data masking. Crucially, non-suppression-based data masking can preserve partial data utility, whereas suppression does not. Masking techniques such as generalisation, swapping or even partial suppression alter search records just enough to meet the privacy requirement while keeping them in the dataset, thereby partially preserving utility. For the downstream tasks the data are intended to support (e.g., training search algorithms, analysing query patterns, measuring market behaviour), this difference in utility is potentially significant.

The EC’s formulation of the indistinguishability threshold for suppression (i.e., the minimum number of records that must share the same relevant characteristics) is also problematic. The EC’s choice of 50 users for the allowlist and metadata checks is a policy decision without a formal justification. Privacy thresholds can and should be calibrated to the specific risk profile of the data and the use case. A well-established reference point is the European Medicines Agency, which considers a re-identification risk threshold of 0.09, corresponding to a k-anonymity level of 11 (since the probability of random re-identification is bounded by 1/k, and 1/11 ≈ 0.09), a conservative choice for privacy protection in the context of medical data. If a threshold of 11 is considered sufficient for data falling under some of the most sensitive categories regulated by the GDPR, then the choice of a threshold of 50 for general search data requires explanation.

Going further, the entity-based method outlined in para. 22 of the EC’s preliminary findings is poorly designed. Named entities encompassing information capable of identifying natural persons play a central role in search queries, as users typically search for places, organisations, events or persons they have a connection with. At least two reasons make the detection of such entities in search queries particularly difficult: (1) search queries are typically short, sometimes poorly formed, and lack the context needed to properly disambiguate possible referents, and (2) the approach needs to be applicable to hundreds of languages and dialects (not only the EU’s official languages, but all languages that may be used in queries originating from Europe).

Operating with a weekly updated “allowlist” of entities is also bound to create many problems due to the inherent latency of the mechanism. For instance, if a new person becomes famous overnight (like the CEO caught having an affair during a Coldplay concert), that name will only appear on the allowlist one week later, which means that queries containing it will have to be suppressed from that week’s search logs, leaving the shared data blind to the emerging topic. It would be more appropriate to evaluate the indistinguishability of search records at the dataset level, considering the full set of up to 13 months of historical search records, as is standard in k-anonymity-based approaches.

All in all, the EC’s framework suppresses more than necessary, protects less than it should, and lacks the formal foundation needed to defend either choice.

  • The proposed technical measures give the gatekeeper more discretion than needed.

Although there is clearly an attempt to confine the decision-making power of Alphabet through a regulated process for search data acquisition that limits the grounds for refusing, suspending and terminating data access (see section 5.5), the technical measures for anonymisation, as drafted, leave significant room for interpretation at the implementation stage. This is not a minor technical detail. Discretion in how privacy measures are applied is discretion over whose data is protected and to what degree. That discretion should rest with the regulator as much as possible, not the gatekeeper.

The EC’s framework sets out a multi-step procedure (allowlists, length thresholds, metadata generalisation) but leaves key parameters undefined or insufficiently constrained. What counts as a “personal data detector”? How exactly are entities split and resolved? Which auxiliary data sources does the gatekeeper use to build the allowlist? These are not secondary questions. They are the questions on which the privacy guarantee depends. Yet the framework provides no verifiable answer to any of them. More specifically, defining what “entities” exactly encompass in this setting is far from straightforward, as there are many definitions of what should count as information capable of identifying natural persons (also called personally identifying information) and, for free-text queries, the space of possible entity types is effectively unbounded.

The consequence is that the gatekeeper retains substantial discretion over how the measures are implemented in practice. A gatekeeper that controls the implementation of its own privacy obligations is a gatekeeper that can, intentionally or not, calibrate those obligations to its own advantage. This is precisely the conflict of interest that regulatory intervention is designed to prevent.

Alphabet’s own proposal, which enforces 30-anonymity at the query level through query suppression, is more legible. It sets a single auditable threshold. A regulator, an independent auditor, or a court could then check whether it has been applied correctly. Unfortunately, the EC’s multi-step heuristic cannot be verified in the same way.

The above does not mean Alphabet’s proposal is entirely satisfactory. Two legitimate questions remain. First, whether a threshold of 30 is the right number: there does not seem to be a formal justification for why 30 was chosen rather than 11, 20, or 50. Second, and more fundamentally, whether suppression is the right mechanism at all for records falling below the threshold. As explained above, non-suppression-based masking techniques such as generalisation and swapping preserve more data utility without weakening the privacy guarantee. A formal k-anonymity model implemented through utility-preserving masking rather than suppression would be both more protective and more useful. Related to this, the methodology used to count equivalent queries also affects data utility. Rather than relying on exact string matching (as Alphabet would presumably do), incorporating lightweight syntactic analysis to identify semantically equivalent queries that differ only in surface form, such as morphological inflexions, can help minimise the number of masked queries, as more of them will meet the defined threshold.
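A lightweight sketch of such counting might normalise queries before grouping them. The normalisation steps shown here (lowercasing, accent folding, naive plural stripping) are illustrative assumptions; a production system would rely on proper per-language morphological analysis.

```python
import unicodedata
from collections import Counter


def normalise(query: str) -> str:
    """Lowercase, fold accents and strip a trailing plural 's' from each token."""
    text = unicodedata.normalize("NFKD", query.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    return " ".join(t.rstrip("s") for t in text.split())


def equivalent_query_counts(queries: list[str]) -> Counter:
    """Count queries by normalised form so surface variants pool toward the threshold."""
    return Counter(normalise(q) for q in queries)
```

Pooling surface variants in this way means more query groups reach the chosen threshold, so fewer queries end up masked or suppressed.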

3. Conclusion

The EC’s undocumented measures do not seem to reflect the state of the art in the field of search query anonymisation, or data anonymisation in general, resulting in a framework that neither adequately protects user privacy nor preserves data utility. Most importantly, they leave the gatekeeper with more discretion than a robust regulatory framework can afford. The consultation should be an opportunity to course-correct. A coherent approach to anonymisation under the EU digital rulebook, one grounded in formal privacy models, calibrated thresholds, and above all verifiable implementation standards, is a better way forward.
