Work Package 2: Classification tool for corruption detection in public procurement

Each task in this WP focuses on a specific type of classification method. Task 2.1 consists in applying classic classification tools, including statistical models and machine learning approaches, to detect fraud in the DB built during WP 1. Task 2.2 aims at defining different methods to extract various types of graphs from this DB, and at proposing efficient approaches to perform fraud prediction based on this relational information. Finally, the goal of Task 2.3 is to combine both types of information in a single method.

In order to assess classification performance, we will adopt a classic machine learning methodology, taking advantage of the dataset built in WP 1, and especially of its ground truth. These data will be split into three parts: the first will be used to train the supervised methods; the second to tune their parameters when needed; and the third to evaluate their ability to generalize, i.e. to obtain similar performance on unseen data.
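
As a minimal illustration, this split could be implemented as follows, assuming the WP 1 data come as a feature matrix X with binary fraud labels y; the 60/20/20 proportions and the stratification on the label are illustrative choices, not project specifications.

```python
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    """Split the data into train (60%), validation (20%) and test (20%)
    sets, stratifying on the label to preserve the fraud/lawful ratio."""
    # First split off the test set (20% of the data).
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    # Then split the remainder into train (75% of 80% = 60%)
    # and validation (25% of 80% = 20%).
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test
```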

Leader: V. Labatut (LIA)

Task 2.1: Application of classic methods

This task will be mainly addressed by LBNC and LIA, as we will apply both a classic econometric approach and state-of-the-art machine learning tools to the dataset collected in WP 1. We expect the classes in our corpus to be unbalanced, so we will work with balanced samples of the data.
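
A possible sketch of this sampling step is given below, assuming the simplest strategy of randomly undersampling the majority (lawful) class; oversampling, synthetic-example generation, or class weighting could be substituted.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly undersample the majority class so that both classes
    have the same number of examples (assumes binary labels 0/1)."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)   # e.g. corrupt cases
    idx_neg = np.flatnonzero(y == 0)   # e.g. lawful cases
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([rng.choice(idx_pos, n, replace=False),
                           rng.choice(idx_neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```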

We will first fit a Probit model aiming at providing the best possible probability estimates, and allowing us to assess the explanatory power of the attributes, not only separately but also in combination. We will also use the LASSO to perform feature selection. These results will be useful when designing the hybrid approaches (Task 2.3). We will then apply the estimated models to the rest of the dataset, for which we cannot distinguish between corruption and lawfulness, thereby estimating whether public procurement markets are corrupt or lawful. By constructing a multistep approach combining different statistical tests, inspired by Imhof et al., we will isolate a group of public procurement markets that clearly exhibit potential corruption issues. By analyzing them more closely, we will draw policy recommendations identifying simple remedies to lower the risk of corruption.
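
For concreteness, these two estimation steps could look as follows. This is a minimal sketch assuming a pandas DataFrame df holding the tender attributes and a binary corrupt label; the library choices (statsmodels for the Probit, an L1-penalized logistic regression as the LASSO-style selector) are illustrative assumptions rather than the final methodology.

```python
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def fit_probit(df, features, label="corrupt"):
    """Fit a Probit model on the labeled part of the dataset and return
    the fitted results (coefficients, p-values, marginal effects...)."""
    X = sm.add_constant(df[features])          # add the intercept term
    return sm.Probit(df[label], X).fit(disp=0)

def lasso_selected_features(df, features, label="corrupt", C=0.1):
    """LASSO-style feature selection: keep the attributes whose
    coefficient survives the L1 penalty (C controls its strength)."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(df[features], df[label])
    return [f for f, w in zip(features, clf.coef_[0]) if w != 0.0]
```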

In addition to the classification tools traditionally used in economics, we plan to apply a selection of state-of-the-art machine learning classifiers, in order to assess the improvement they can deliver. We will consider Support Vector Machines, Decision Trees, Multilayer Perceptrons, Boosting, Random Forests, and ensemble methods. The main objective here being to improve classification performance, we will also be able to use black-box approaches such as neural networks. We also plan to identify the most informative attributes by applying standard feature selection methods, in order to compare them with those already identified through the econometric models.
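
The comparison could be set up along the following lines; this is a sketch assuming the three-way split described above, with scikit-learn default hyperparameters that would in practice be tuned on the validation set.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score

# Candidate classifiers named in the text; parameters are defaults.
CLASSIFIERS = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "MLP": MLPClassifier(max_iter=1000),
    "Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
}

def compare_classifiers(X_train, y_train, X_val, y_val):
    """Train each candidate classifier and report its F1 score on the
    validation set (F1 is more informative than raw accuracy here,
    given the expected class imbalance)."""
    scores = {}
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_train, y_train)
        scores[name] = f1_score(y_val, clf.predict(X_val))
    return scores
```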

Finally, we will also perform various robustness analyses to ascertain the predictive quality of our best predictors. For example, we will drop the best predictors to gauge whether others step in as substitutes, and how this potentially affects the correct classification rate. We will also randomly mislabel corrupt and lawful cases to investigate how the correct classification rate changes.
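
The second analysis amounts to injecting label noise into the ground truth before retraining. A minimal sketch, assuming binary 0/1 labels and an illustrative noise rate:

```python
import numpy as np

def flip_labels(y, rate=0.05, seed=0):
    """Randomly flip a given fraction of the labels (0 <-> 1) to
    simulate annotation errors in the ground truth."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

# Retraining on flipped labels at increasing noise rates shows how
# quickly the correct classification rate degrades.
```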

  • Deliverables: mid-term and final reports.
  • Success indicators: scientific publications.
  • Partners involved: LIA, LBNC.

Task 2.2: Development of graph-based methods

This task will be mainly addressed by LIA, since it requires expertise in graph partitioning and link prediction, with input from LBNC and CRA on judicial and practical aspects. The Ph.D. candidate advised by LIA will be particularly involved in this task and the next one.

We will start by extracting several types of signed graphs from the raw data obtained in WP 1. R. Figueiredo and V. Labatut have conducted preliminary work leading to the identification of two methods of interest. First, we will use vertices to represent buyers/suppliers, and edges to connect a supplier to a buyer when it wins a contract, with a negative sign in case of misbehavior and a positive sign otherwise. Multiple bids between a supplier and a buyer can be modeled using a pair of edge weights (positive vs. negative). This method offers the advantage of producing a bipartite graph, on which many existing tools can be used, but we lose the temporal dimension of the data. Second, we will explicitly introduce time and produce a so-called dynamic signed graph. According to our recent work on different data, this can significantly improve the relevance of the classification results. We will perform a descriptive study of these graphs, similar in principle to that conducted on our DB in Task 1.4, and aiming at the same objectives (better understanding the data and providing valuable information for the interpretation work of Task 3.1). For this purpose, we will apply standard complex network tools, in particular centrality measures and frequent pattern mining.
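
The first extraction method could be sketched as follows, assuming contract records carrying buyer, supplier and a boolean misbehavior field; these field names and the use of networkx are illustrative assumptions.

```python
import networkx as nx

def build_signed_bipartite_graph(contracts):
    """Build a signed bipartite graph from an iterable of contract
    records (dicts with 'buyer', 'supplier' and 'misbehavior' keys).
    Repeated interactions between a buyer and a supplier accumulate
    in a pair of positive/negative edge weights, so no multigraph
    is needed."""
    g = nx.Graph()
    for c in contracts:
        # Typed vertex ids keep the two sides of the bipartite graph
        # disjoint even if an organization appears in both roles.
        b = ("buyer", c["buyer"])
        s = ("supplier", c["supplier"])
        g.add_node(b, bipartite=0)
        g.add_node(s, bipartite=1)
        if not g.has_edge(b, s):
            g.add_edge(b, s, pos=0, neg=0)
        g[b][s]["neg" if c["misbehavior"] else "pos"] += 1
    return g
```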

The second step is to explore the existing definitions of the concept of Structural Balance (SB), a property of signed networks. We will perform graph partitioning and link prediction based on these definitions and study the obtained results. We will investigate practical questions, such as: “How balanced are the graphs?”, “Can we identify clusters with a specific role?”, “Is fraud correctly predicted?”. Depending on the answers to these questions, and based on the expertise of our partners from Economics and Law, we will propose new variants of SB. Once the relevant SB definitions have been identified, we will formalize the corresponding graph partitioning and link prediction problems, and develop efficient optimization methods to solve them, based on the techniques previously used by members of our team for related problems: integer programming approaches, as well as sequential and parallel heuristics. We will start with the static versions of the graphs, before turning to the dynamic graphs, which are more difficult to handle. We will compare the prediction performance of the graph-based approaches with that obtained in Task 2.1.
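
To make the first of these questions concrete: a standard way of quantifying balance counts the “frustrated” edges induced by a partition, i.e. positive edges between clusters and negative edges within them. A minimal sketch, assuming each edge carries a sign attribute of +1/-1 and that the partition maps each vertex to a cluster identifier:

```python
def frustration(g, partition):
    """Count the frustrated edges of a signed graph under a partition:
    a positive edge is frustrated when it joins two clusters, and a
    negative edge when it lies inside one. A perfectly balanced graph
    admits a partition with zero frustration."""
    count = 0
    for u, v, data in g.edges(data=True):
        same_cluster = partition[u] == partition[v]
        if data["sign"] > 0 and not same_cluster:
            count += 1
        elif data["sign"] < 0 and same_cluster:
            count += 1
    return count
```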

  • Deliverables: mid-term and final reports.
  • Success indicators: scientific publications.
  • Partners involved: LIA, LBNC.

Task 2.3: Development of hybrid methods

This task will be mainly addressed by LIA. The classic approaches used in Task 2.1 rely only on the individual information present in our data, whereas those proposed in Task 2.2 will be purely based on the relational information present in the extracted graphs. The goal of this task is to combine both types of information, in order to assess how they affect fraud prediction. Put differently, we want to consider at the same time the part of the data that describes separately each element of our system (suppliers and issuers), and the part that represents how they are connected.

For this purpose, we plan to explore three options. All of them are graph-based, and integrate the individual information into the relational model in different ways. First, one can represent individual information in the graph by revising its edge weights, a method previously adopted to partition unsigned graphs. On the one hand, this allows directly applying our methods from Task 2.2 to the resulting graph; on the other hand, its construction is likely to incur significant information loss. Second, it is possible to add a new type of vertices to the graph, representing the attribute values of the original vertices. This method was also previously used to partition unsigned graphs. Third, the most natural approach is to use attributed graphs, i.e. to define individual attributes at the level of vertices, and to adapt existing graph-based methods so that they take this additional information into account when processing the graph. In all three cases, we will develop methods by generalizing our work from Task 2.2 and adapting approaches from the literature to our specific type of graphs. As in Task 2.2, we will first deal with our static graphs, before turning to the dynamic ones. We will also compare our prediction results with those of the previous methods.
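
As an illustration of the first option only, edge weights could be modulated by the attribute similarity of the incident vertices; the cosine similarity and the multiplicative combination below are illustrative assumptions, not the method we will necessarily retain.

```python
import numpy as np

def reweight_by_attributes(g, attrs):
    """Fold individual information into the relational model (option 1):
    scale each edge weight by the cosine similarity of the attribute
    vectors of its endpoints (attrs maps vertex -> numpy array)."""
    for u, v, data in g.edges(data=True):
        a, b = attrs[u], attrs[v]
        sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        data["weight"] = data.get("weight", 1.0) * sim
    return g
```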

  • Deliverables: mid-term and final reports, fraud prediction engine.
  • Success indicators: scientific publications.
  • Partners involved: LIA, LBNC.