Work Package 1: Ground Truth Data Collection

The data required to develop and apply the comprehensive methodology are substantial: in both volume and complexity, they exceed any single data source on corruption and fraud in French public procurement known to us. A substantial effort is therefore needed to collect technical and judicial ground truth data. In a nutshell, Tasks 1.1 and 1.2 will allow us to collect and cross-match exhaustive raw data from contracting and judicial sources in France. Tasks 1.3 and 1.4 will provide a basis for comparison and may complement the raw data.

Leader: F. Lombard

Task 1.1: Public procurement database collection

The public procurement process generates data at multiple stages of contracting (planning, public contract announcements, tender notices, tender documents, award notices).

Since the end of 2017, the API provided by the BOAMP (Bulletin Officiel des Annonces des Marchés Publics) has enabled us to obtain all public contract announcements and award notices. It allows us to collect, contract by contract and lot by lot, numerous attributes (as enumerated in the arrêté du 14 avril 2017) relating to the characteristics of the buyer, the characteristics of the call for tenders (procedure type, contract purpose, selection criteria, length of advertisement period, allotment, notification date…) and the characteristics of the selected suppliers. To gauge the difficulty of Task 1.1, we have already collected more than 58 400 award notices (representing 67 410 B.e).

Less exhaustive but more structured data will be added to, or consolidated in, our database. The open data publication of essential data on public procurement (DECP – Données Essentielles de la Commande Publique), together with a visualisation platform for awarded-procurement data (available since October 2018), offers a central point of visualisation for all data coming from the French DECP and SIRENE LD. In addition, a project that aims at building a queryable graph around open data referring to legal units or institutions via their SIREN / SIRET number will provide useful additional resources.
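Because the sources above refer to legal units through their SIREN / SIRET number, consolidating them requires a common key. The sketch below is only an illustration under our own assumptions: it relies on the facts that a SIRET is a 14-digit string whose first 9 digits are the SIREN, and that both identifiers normally satisfy the Luhn checksum. The function names are hypothetical, the sample values in the usage note are merely checksum-valid strings, and the sketch ignores any special cases in the official numbering rules.

```python
from typing import Optional


def luhn_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def normalize_identifier(raw: str) -> Optional[str]:
    """Map a raw SIREN or SIRET string to a 9-digit SIREN key, or None if invalid."""
    cleaned = "".join(ch for ch in raw if not ch.isspace())
    if len(cleaned) not in (9, 14) or not luhn_valid(cleaned):
        return None
    # The SIREN is the first 9 digits of a SIRET.
    return cleaned[:9]
```

For instance, `normalize_identifier("732 829 320")` and `normalize_identifier("73282932000074")` both yield the same 9-digit key, so records coming from different sources can be joined on it.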

In addition, France's local administrative organization is based on a pyramid of administrative levels and a large number of local authorities; inter-communal public agencies are numerous and heterogeneous public buyers. The National Register of Electoral Mandates of the Ministry of the Interior (RNE) will allow us to uncover the functional and technical links between local authorities, as well as between public buyers and some of the 29 000 companies whose manager is also an elected official.
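As a rough illustration of how links between public buyers and companies managed by elected officials might be detected, the sketch below joins three tables on a normalized person key. All field names (`last_name`, `first_name`, `birth_date`, `authority_siren`, `company_siren`, `buyer_siren`, `supplier_siren`) are hypothetical, and we assume that a birth date is available on both sides to limit homonyms; the actual layout of the RNE and of company registers may differ.

```python
import unicodedata


def _normalize(text):
    """Strip accents, fold case, and unify hyphens and spacing."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.casefold().replace("-", " ").split())


def person_key(record):
    """Comparison key for a person (hypothetical fields: names and birth date)."""
    return (
        _normalize(record["last_name"]),
        _normalize(record["first_name"]),
        record["birth_date"],
    )


def politically_connected_awards(officials, managers, awards):
    """Return awards whose supplier is managed by an official of the buying authority."""
    # Person key -> SIRENs of the authorities where that person holds a mandate.
    mandates = {}
    for official in officials:
        mandates.setdefault(person_key(official), set()).add(official["authority_siren"])

    # Company SIREN -> authorities reachable through one of its managers.
    linked_authorities = {}
    for manager in managers:
        for authority in mandates.get(person_key(manager), ()):
            linked_authorities.setdefault(manager["company_siren"], set()).add(authority)

    return [
        award
        for award in awards
        if award["buyer_siren"] in linked_authorities.get(award["supplier_siren"], ())
    ]
```

Exact matching on a normalized key will miss spelling variants and may still confuse homonyms, so any flagged award would only be a candidate for manual review, not evidence of fraud.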

The task will mainly be carried out by LBNC, LIA and Datactivist, since it requires expertise in collecting, formatting and analyzing data. Note that although this will be the largest database on corruption in French public procurement, it does not qualify as Big Data, and we will be able to analyze it with traditional tools.

  • Deliverables: final reports, open database of public procurement.
  • Success indicators: scientific publications.
  • Partners involved: LIA, LBNC, Datactivist.

Task 1.2: Court data and judiciary sources collection

As stated before, a public procurement fraud case may be a matter for criminal justice; it may also fall under administrative justice. Collecting court data will therefore involve multiple sources, notably Legifrance (Open Data of Court Decisions), which contains judicial decisions expressed in legal terms, in natural language (French). Once the decisions are collected, identification work is needed to link each judgment to a specific public procurement tender. Even though judicial data are anonymized, the name of the public authority, the commercial name of the company and the date remain available, which will enable us to construct our raw data. Secondary sources (newspaper articles) will also be used, with the help of Transparency International France.

Besides being of interest for WP 2, this data collection is worthwhile in itself: it will enable the first exhaustive studies of French procurement fraud and corruption in law and economics. Datactivist will adapt and fine-tune Natural Language Processing methods and tools to the primary (judicial) and secondary (newswire) corpora, enabling a faster and deeper analysis of the documents. At the end of this task, a law symposium bringing together specialists in procurement law will provide an opportunity to share analyses of court decisions. The task will mainly be carried out by CRA, Datactivist and LBNC.

  • Deliverables: database of fraud in Public Procurement.
  • Success indicators: scientific publications.
  • Partners involved: LBNC, CRA, Datactivist.
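The identification step of Task 1.2 (linking a judgment to a tender through the public authority, the company name and the date) could be prototyped along the following lines. This is a minimal sketch under our own assumptions, not the method the project will necessarily adopt: the record fields (`authority`, `company`, `date`, `buyer_name`, `supplier_name`, `award_date`, `award_id`), the similarity threshold and the time window are all hypothetical, and string similarity is computed with the standard library's `difflib.SequenceMatcher`.

```python
from datetime import date
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def candidate_links(decision, awards, threshold=0.85, max_years=10):
    """Return (score, award_id) pairs for awards that plausibly match a decision."""
    candidates = []
    for award in awards:
        # A decision can only concern a contract awarded before it was issued.
        delay_days = (decision["date"] - award["award_date"]).days
        if delay_days < 0 or delay_days > max_years * 365:
            continue
        buyer_score = name_similarity(decision["authority"], award["buyer_name"])
        supplier_score = name_similarity(decision["company"], award["supplier_name"])
        # Both names must match, so the weaker of the two scores decides.
        score = min(buyer_score, supplier_score)
        if score >= threshold:
            candidates.append((score, award["award_id"]))
    return sorted(candidates, reverse=True)
```

Taking the minimum of the two scores is a deliberately conservative choice: a candidate link is kept only if both the buyer and the supplier resemble the names found in the decision. Candidates would still need to be confirmed by a legal expert.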

Task 1.3: Survey of expert opinions and alternative datasets

To manage the risk of the project, a statistical survey will be sent to experts (magistrates, auditors of the Regional Chambers of Accounts…) to collect their opinions on a sample of potentially corrupt cases. Comparing expert opinion with reality will allow us to improve our ground truth. From the perspective of DeCoMaP's risk management, the survey will enlarge the sample of grey cases if Task 1.2 turns out not to be sufficiently successful. This part of the task will be mainly addressed by LBNC and, secondarily, CRA and Datactivist.

Simultaneously, again to manage the risk of the project and to provide a comparative context, we will collect alternative datasets. As described in the state of the art, although the methodology of DeCoMaP is innovative, some works have dealt with comparable datasets on procurement and procurement fraud in other countries. From a computer science point of view, these datasets will provide ready-to-use data at the beginning of the project, so that some methodological assumptions can be tested without waiting for the results of Task 1.2. From a law and economics point of view, they will make it possible to compare practices between countries subject to the same requirements (EU countries) or subject to foreign laws. At the end of this task, an international workshop bringing together the researchers contacted will be organized to discuss the relative merits of alternative methods. This will, in turn, be useful for Task 2.3. The task will be addressed by all partners.

  • Deliverables: workshop report.
  • Success indicators: scientific publications.
  • Partners involved: LIA, LBNC, CRA, Datactivist.
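One common way to quantify the comparison between expert opinion and reality mentioned in Task 1.3 is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The choice of this statistic is our own illustration and is not prescribed by the task description; the function name and the label encoding (1 for a case judged corrupt, 0 otherwise) are hypothetical.

```python
from collections import Counter


def cohen_kappa(expert_labels, truth_labels):
    """Cohen's kappa between expert labels and ground-truth labels."""
    if len(expert_labels) != len(truth_labels) or not expert_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(expert_labels)
    # Observed agreement: share of cases where expert and ground truth coincide.
    observed = sum(e == t for e, t in zip(expert_labels, truth_labels)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expert_counts = Counter(expert_labels)
    truth_counts = Counter(truth_labels)
    expected = sum(expert_counts[label] * truth_counts[label] for label in expert_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both sides use a single identical label.
        return float("nan")
    return (observed - expected) / (1 - expected)
```

A value close to 1 would indicate that experts largely agree with judicial outcomes, while a value close to 0 would indicate agreement no better than chance; cases of disagreement are natural candidates for the grey cases discussed above.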

Task 1.4: Descriptive analysis and missing data

The first objective of this task is to conduct a descriptive analysis of the database built in Tasks 1.1 and 1.2 (or the alternative database from Task 1.3), as well as of the experts' opinions from Task 1.3. Two types of econometric analyses will be carried out. First, we will analyse the determinants of local and/or politically connected purchasing (which may be considered indirect cues of fraud). Second, based on the theory of corruption, we will use the best-known predictors of corruption to test whether sub-groups diverge statistically in their risk of corruption, and whether sub-groups with many corruption cases clearly indicate corruption issues. Such findings will allow us to draw policy recommendations suggesting simple remedies to reduce the risk of corruption. Specifically, by formatting our data according to the Open Contracting Data Standard (OCDS), we will be able to use the red flags highlighted by OCDS to infer fraud in the collected data. This will provide a useful benchmark for the rest of the project.

We will also perform a cluster analysis (unsupervised classification) to estimate several typologies of buyers, suppliers and tenders. Using interpretable supervised methods such as Decision Trees, we will identify the characteristic profiles of the resulting clusters. These typologies will be useful to 1) better understand our database; 2) provide a potential additional input variable for the classification methods from Tasks 2.1 and 2.1; and 3) ease and improve the interpretation of our fraud detection results during WP 3. We will also leverage our ground truth to cross-check these clusters and study the distribution of corruption among them (is it concentrated in certain clusters?).
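To make the last question concrete, the following sketch computes, for each cluster, the share of ground-truth corruption cases and its ratio to the overall share (a "lift"). It assumes that cluster assignments and binary ground-truth flags are already available as two parallel lists; the function name and the output keys are hypothetical.

```python
from collections import defaultdict


def corruption_rate_by_cluster(cluster_ids, is_fraud_case):
    """Per-cluster size, corruption rate, and lift relative to the overall rate."""
    if len(cluster_ids) != len(is_fraud_case) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for cluster, flagged in zip(cluster_ids, is_fraud_case):
        totals[cluster] += 1
        frauds[cluster] += int(bool(flagged))
    overall = sum(frauds.values()) / len(cluster_ids)
    report = {}
    for cluster, size in totals.items():
        rate = frauds[cluster] / size
        # A lift above 1 means corruption is over-represented in this cluster.
        lift = rate / overall if overall > 0 else float("nan")
        report[cluster] = {"size": size, "rate": rate, "lift": lift}
    return report
```

Lifts well above 1 in a few clusters would suggest that corruption is concentrated there; with small clusters, such differences would of course need a proper statistical test before being interpreted.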

The second objective is to analyze missing data, looking for patterns characteristic of accidentally or intentionally missing data, in particular by cross-referencing sources of similar data (for example, BOAMP, DECP and SIRENE LD). The classification tools developed in WP 2 can leverage such characteristic patterns.
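As a simple illustration of cross-referencing, the sketch below compares two sources describing the same contracts and counts, field by field, how often a value is missing in one source but present in the other. It assumes that each source has already been loaded as a dictionary keyed by a shared contract identifier; the function names, field names and output keys are hypothetical.

```python
def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_field_report(source_a, source_b, fields):
    """Count, per field, values missing in source A only, in B only, or in both."""
    report = {field: {"a_only": 0, "b_only": 0, "both": 0} for field in fields}
    # Only contracts present in both sources can be cross-checked.
    for contract_id in source_a.keys() & source_b.keys():
        record_a, record_b = source_a[contract_id], source_b[contract_id]
        for field in fields:
            missing_a = is_missing(record_a.get(field))
            missing_b = is_missing(record_b.get(field))
            if missing_a and missing_b:
                report[field]["both"] += 1
            elif missing_a:
                report[field]["a_only"] += 1
            elif missing_b:
                report[field]["b_only"] += 1
    return report
```

A field that is frequently absent from one source while present in the other points to a reporting gap rather than to genuinely unavailable information; aggregating such counts by buyer could then help separate accidental from potentially intentional omissions.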

The third objective is to develop a theoretical model combining auction theory and law and economics (by weighing the benefits of deflecting a contract award against the associated legal risks). Comparing the theoretical results with the data obtained in Task 1.2 will make it possible to identify the risk that certain types of fraud are under-represented among the cases judged, and therefore in our database (under the assumption that certain practices are not prosecuted).

  • Deliverables: mid-term and final reports.
  • Success indicators: scientific publications.
  • Partners involved: CRA, LBNC, Datactivist.