Executive summary:
In January 2021, the cabinet of Dutch prime minister Mark Rutte was forced to resign following the revelation that an AI system used by the Dutch tax administration (Belastingdienst) had erroneously demanded the repayment of childcare allowances from an estimated 35,000 parents, most of them non-Dutch nationals or foreign residents.
These welfare recipients were wrongfully designated as fraudsters by a predictive machine-learning classifier that relied on nationality (Dutch/non-Dutch) as one of its risk indicators for predicting fraud.
Childcare costs in the Netherlands are among the highest in the OECD, and these errors drove tens of thousands of parents into serious financial hardship. In addition, because the output of the machine-learning model was shared with public and private bodies, some parents were blacklisted by their banks and lost custody of their children.
Facts of the case:
The toeslagenaffaire is the product of a confluence of legal, political and socio-economic factors, in addition to the use of machine-learning algorithms. In this short piece, we will focus on the use of machine learning and the discrimination that resulted from it. Should you be interested in a complete account of these events, we invite you to read: D. Hadwick & S. Lan, ‘Lessons to be learned from the Dutch childcare allowance scandal: A comparative review of algorithmic governance by tax administrations in the Netherlands, France and Germany’ (2021) World Tax Journal 13(4).
As noted above, at the centre of the toeslagenaffaire is the fact that the Belastingdienst’s risk-scoring algorithm labelled a disproportionate number of foreign residents and non-Dutch welfare recipients as fraudsters.
This disparity was not due to natural differences in the behaviour of Dutch and non-Dutch welfare recipients; it was the result of biased institutional practices at the Belastingdienst. In its audit of the AI risk-scoring system, the Dutch Data Protection Authority (Autoriteit Persoonsgegevens, AP) details a number of questionable data-governance practices, most notably the ‘black list’ (zwarte lijst), but also ‘recurrent queries’ and ‘spontaneous queries’.
In recurrent queries, the Belastingdienst would regularly filter its data through the exclusive prism of nationality in order to find correlations between certain nationalities and non-compliance. For instance, if, from month to month, the amount of social allowances disbursed to welfare recipients of a particular nationality increased, upper management would request an audit of members of that group. Naturally, other reasons why such an amount could increase come to mind, simple demographic change for instance.
In spontaneous queries, the Belastingdienst would decide to audit every welfare recipient of a particular nationality upon receiving signals about specific individuals. For instance, upon receiving information that over a hundred Ghanaians in Amsterdam were potentially committing fraud, the Belastingdienst decided to audit all six thousand Ghanaians residing in the Netherlands.
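To make these querying practices concrete, here is a minimal sketch in Python of what a nationality-first selection looks like. The column names, figures and thresholds are all invented for illustration; nothing here reproduces the Belastingdienst’s actual systems.

```python
import pandas as pd

# Hypothetical disbursement totals per nationality for two consecutive
# months; all figures are invented for illustration.
disbursed = pd.DataFrame({
    "nationality": ["NL", "GH", "TR"],
    "january":     [9_200_000, 410_000, 630_000],
    "february":    [9_250_000, 460_000, 650_000],
})

# "Recurrent query": read the data through the single prism of nationality
# and flag any group whose disbursements rose month over month. Note that
# the query cannot distinguish fraud from, say, simple demographic growth.
disbursed["delta"] = disbursed["february"] - disbursed["january"]
flagged_for_audit = disbursed.loc[disbursed["delta"] > 0, "nationality"]

# "Spontaneous query": a signal about a handful of individuals of one
# nationality triggers an audit of *every* recipient of that nationality.
recipients = pd.DataFrame({
    "recipient_id": range(1, 7),
    "nationality": ["NL", "GH", "GH", "NL", "TR", "GH"],
})
signal_nationality = "GH"  # hypothetical tip-off concerning a few people
audit_queue = recipients[recipients["nationality"] == signal_nationality]
```

In both cases the selection criterion is nationality itself, not any behavioural marker of fraud, which is what seeds the bias discussed next.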
These practices had the effect of increasing the proportion of foreigners being audited, in turn leading the machine-learning model to wrongfully conclude that foreign nationality was an indicator of fraud. In truth, the Belastingdienst committed a systemic sampling bias, attaching a disproportionate level of importance to the variable of nationality to the detriment of objective markers of fraud.
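The self-reinforcing nature of this sampling bias is easy to reproduce in a toy simulation. In the minimal Python sketch below, all rates are invented: true fraud is equally rare in both groups, yet because non-Dutch recipients are audited ten times more often, the ‘detected fraud’ labels the administration actually observes teach a classifier that nationality is a strong predictor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 100_000

# Ground truth (invented): fraud is equally rare (3%) in both groups, so
# nationality carries no genuine signal.
non_dutch = rng.random(n) < 0.2
true_fraud = rng.random(n) < 0.03

# Biased auditing: non-Dutch recipients are audited ten times more often.
audit_rate = np.where(non_dutch, 0.50, 0.05)
audited = rng.random(n) < audit_rate

# The administration only observes the fraud it audits, so the training
# label is "detected fraud", not true fraud.
detected_fraud = audited & true_fraud

# A classifier trained on detected fraud concludes that being non-Dutch
# "predicts" fraud, purely as an artefact of who was looked at.
X = non_dutch.astype(float).reshape(-1, 1)
model = LogisticRegression().fit(X, detected_fraud)
print(model.coef_)  # strongly positive weight on the nationality feature
```

Feed the model’s scores back into audit selection and the loop closes: the group that is audited more yields more detected fraud, which in turn appears to justify auditing it even more.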
Key takeaways:
The toeslagenaffaire is a complex case, too complex to be summarized in a few paragraphs. Hence, we will stick to the key takeaways for tax compliance and fiscal algorithmic governance.
In our opinion, the toeslagenaffaire acts as a reminder that tax compliance risk-management is an inductive process which rests entirely on data quality. Enormous importance is attached to the AI systems and machine-learning models used by tax administrations (particularly by us at taxadmin.ai!), but the data fed to the model, and its quality, should be scrutinized just as closely. Spontaneous and recurrent queries aimed specifically at the variable of nationality deviate from the inductive process that risk-management should be: they theorize before the data and ultimately become a self-fulfilling prophecy.
Second, the toeslagenaffaire highlights the need for explainability and the right of AI subjects to request disclosure of the underlying logic of the model and (at least some of) its features. In the toeslagenaffaire, a simple counterfactual explanation outlining the predictive weight of the ‘Dutch/non-Dutch’ feature would most likely have spared victims a very lengthy ordeal. Solutions for providing explanations without opening the black box have been proposed in the literature and could be implemented at little cost.
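As an illustration of how little such disclosure requires, the sketch below trains a stand-in risk model on invented data and produces a counterfactual report by flipping one binary feature. The model, the feature names (non_dutch, n_dependants) and the weights are all hypothetical; the point is that the explanation needs nothing more than two calls to the scoring function, leaving the black box closed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for a risk-scoring model; the binary features
# [non_dutch, n_dependants] and the training data are invented.
rng = np.random.default_rng(1)
X_train = rng.random((1_000, 2)).round()  # two binary features
y_train = (0.8 * X_train[:, 0] + 0.1 * X_train[:, 1]
           + rng.random(1_000)) > 1.0
model = LogisticRegression().fit(X_train, y_train)

def counterfactual_report(model, x, feature, names):
    """Report how the risk score changes when one binary feature flips."""
    x_cf = x.copy()
    x_cf[feature] = 1.0 - x_cf[feature]
    before = model.predict_proba([x])[0, 1]
    after = model.predict_proba([x_cf])[0, 1]
    return (f"Flipping '{names[feature]}' changes the predicted fraud "
            f"risk from {before:.2f} to {after:.2f}.")

# One hypothetical applicant: non-Dutch, with dependants.
applicant = np.array([1.0, 1.0])
print(counterfactual_report(model, applicant, 0,
                            ["non_dutch", "n_dependants"]))
```

An explanation of this form discloses the decisive role of a single feature without revealing the model’s internals, which is precisely what the black-box-preserving proposals in the literature aim at.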