
Iterative process to apply Standard Dialectic methodologies in conjunction with Bayesian inference

In today’s world, many business decisions are made based on insights derived from raw data that has passed through a wide range of handling processes. While there are variations in how an organisation chooses to wrangle its data, the process can be broken down into a series of general steps which include, but are not limited to:

1. Hosting data from multiple sources: Some organisations receive data as a continuous gush through a hose rather than as batches of cleaned, structured and organised blocks. So the first step is always to identify the relevant sources of data and provision apt storage mechanisms.
2. Cleaning up the stored data: This is by far the most rigorous step. This is where data quality is checked and the raw data is wrangled and made ready for use in building models.
3. Building data models geared towards deriving a subset of the original data that is relevant to the insights the business is looking to achieve.
4. Analytics & visualisation: While there is an industry fad of calling the whole process “analytics”, the modelled data is subjected to analytics and visualisation frameworks only in this last step.
5. Consumption: The output of step 4 is what business stakeholders and decision-makers consume. We may call those outputs “insights” in common parlance.
6. Governance: Not really a step, but an overarching process that guides all of the above steps. (A minimal skeleton of this flow is sketched just after this list.)
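To make the six-step flow concrete, here is a minimal, hypothetical skeleton in Python. The stage names (ingest, clean, model, analyse) and the govern decorator are illustrative assumptions, not a prescribed framework or a specific tool’s API.

```python
# A minimal sketch of the six-step flow described above; stage names are illustrative.
from typing import Any, Callable


def govern(stage: Callable[..., Any]) -> Callable[..., Any]:
    """Step 6: governance wraps every stage (think logging, lineage, approvals)."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        print(f"governance check: {stage.__name__}")
        return stage(*args, **kwargs)
    return wrapped


@govern
def ingest(sources: list[str]) -> list[dict]:
    """Step 1: land raw records from the identified sources into storage."""
    return [{"source": s, "value": None} for s in sources]


@govern
def clean(records: list[dict]) -> list[dict]:
    """Step 2: check quality and wrangle the raw records into a usable shape."""
    return [r for r in records if r["source"]]


@govern
def model(records: list[dict]) -> list[dict]:
    """Step 3: derive the subset of data relevant to the business question."""
    return records


@govern
def analyse(records: list[dict]) -> dict:
    """Steps 4-5: run analytics/visualisation and hand the 'insight' to stakeholders."""
    return {"insight": f"{len(records)} usable records"}


print(analyse(model(clean(ingest(["crm", "clickstream"])))))
```

The point of the sketch is only the shape of the flow: every stage runs inside the governance wrapper, and only the final stage produces what the business would call an insight.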

This process is easy to implement for smaller organisations but gets messier as we move into the realm of Big Data. What we know as “Big Data” is essentially data that has crossed agreed thresholds (“agreed” varies with organisation size, infrastructure capacity and, of course, industry) along the 4V framework describing the state of data: Volume, Velocity, Variety and Veracity. For the scope of this article, we focus on the most important dimension, the VERACITY of data. It is also the dimension most neglected by consumers, modellers and service providers, yet the very relevance and usability of insights depends directly on the VERACITY of the data.

Application of the Hegelian dialectic process to resolve the inconsistencies of Bayesian inference, and its backward extrapolation to test the usability of raw data as premises for distinct insights.

Dialectic methodology is a massive subject with a wide array of implications. For the purpose of this article, we stick to applying Plato’s reductio ad absurdum argument and Hegelian triads not to the source data, but to the insights that stand on the assumed authenticity of that data. We have traditionally used sets of rules to determine veracity, but even then the incoming data has no truth pillar against which its authenticity is referenced. This article aims to narrate a simple idea: run the ‘insight’ through the prism of the Hegelian triad (thesis-antithesis-synthesis). While the Hegelian dialectic provides a qualitative base for such an analysis, it still must be quantified through the application of Bayesian inference.

Bayesian inference: Bayesian inference is a method of statistical inference in which Bayes’ theorem is used to update the probability of a hypothesis as more evidence or information becomes available. The problem with Bayesian inference is the very automated process of reaching the insight. It treats the source data as sacrosanct and builds the Bayesian prior on it. Put simply, any addition of material data points to the source data changes the probabilistic estimates of the hypothesis, creating multiple Bayesian posteriors. This leads to the second, bigger problem: both the prior and the posterior are taken as subjective states of knowledge, which removes any chance of analysing the degree of bias in the source data that forms the premise of the Bayesian prior. This article argues that each of the Bayesian posteriors can be passed through a sequential series of Plato’s dialectic model and Hegelian triads, designed specifically for each posterior, in order to take the end insight closer to the point of fair truth.

Let us take an example to illustrate the proposed process. Problem statement: El Dorado Business School receives 10,000+ applications for its coveted MBA program. The school has only 100 seats. It is humanly impossible for the admission staff to run through all the applications manually, so they must use an analytics model to shortlist the 300 candidates with the highest probability of being admitted. Every application gets reduced to six major fields: Name, Country of Origin, Citizenship, Standardised Test Score, GPA, Ethnicity. The AI inference engine has been trained on a data set of a million past applications bearing the same fields and their corresponding decision outcomes. The training data, a collection of the last 30 years of application data, has an unfortunate correlation between a certain ethnicity X and “failure to gain admission”. The socio-economic conditions of ethnic group X across those three decades make the system biased and skewed. The system cannot distinguish correlation from causation, and hence it inadvertently assigns a low probability to anyone belonging to ethnicity X. Thus, when the system is queried to churn out a list of 300 candidates with the highest probability of admission, very few candidates belonging to ethnicity X feature in the list. The automated nature of the elimination process renders it unquestionable unless you question the veracity or usability of the training data. The Bayesian prior here is clearly not representative, simply because it is not fair. We could spend nights discussing the definition of “fair”, though.
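To make the prior/posterior mechanics concrete, here is a minimal toy sketch in Python. The posterior_admit function, the field names, the 0.03 base rate (roughly 300 shortlisted out of 10,000 applications) and every likelihood number are invented for illustration; this is not the school’s actual model, only a hand-rolled, naive-Bayes-style update.

```python
# Toy Bayesian update for the admission example; every number below is invented
# and simply encodes the historical bias against ethnicity X described above.

def posterior_admit(evidence, likelihoods, prior_admit=0.03):
    """P(admit | evidence) via Bayes' theorem, with a naive independence
    assumption across fields. prior_admit is an illustrative base rate."""
    p_evidence_given_admit, p_evidence_given_reject = 1.0, 1.0
    for field, value in evidence.items():
        p_given_admit, p_given_reject = likelihoods[field][value]
        p_evidence_given_admit *= p_given_admit
        p_evidence_given_reject *= p_given_reject
    numerator = p_evidence_given_admit * prior_admit
    return numerator / (numerator + p_evidence_given_reject * (1 - prior_admit))


# (P(value | admitted), P(value | rejected)) "estimated" from the biased 30-year
# history: ethnicity X was rarely admitted -- a correlation, not a causation.
likelihoods = {
    "test_score": {"high": (0.80, 0.30), "low": (0.20, 0.70)},
    "gpa":        {"high": (0.75, 0.35), "low": (0.25, 0.65)},
    "ethnicity":  {"X": (0.02, 0.15), "other": (0.98, 0.85)},
}

strong_candidate = {"test_score": "high", "gpa": "high"}
print(posterior_admit(strong_candidate, likelihoods))                        # ~0.15
print(posterior_admit({**strong_candidate, "ethnicity": "X"}, likelihoods))  # ~0.02
```

With these invented numbers, adding the ethnicity field drags an otherwise strong candidate’s estimate from roughly 0.15 down to roughly 0.02: the extra data point has not moved the hypothesis closer to the truth, it has simply encoded the historical bias.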
Now the admission board adds “Need for scholarship” as an extra field, moving away from the business school’s need-blind stance, and creates a complex model linking the demand for scholarship to the school’s available endowment fund. You now build a Bayesian posterior with this additional data point, which removes even more candidates from ethnicity X from the shortlist. Thus, having more data points does not always move a hypothesis closer to the truth; in fact, it can move it further away.

As an analyst, you now have a Bayesian posterior on which you can apply the Hegelian triad. Let us take the Bayesian posterior as the Hegelian thesis. But before we do that, let us apply Plato’s dialectic method for a bit. This is the standard reductio ad absurdum argument, which states that if the premises of an argument lead to a contradiction, we must conclude that the premises are false. With just three fields, viz. Name, GPA and Standardised Test Score, our Bayesian model gives a certain probabilistic hypothesis. The moment one applies the ‘Ethnicity’ field, the probabilistic model changes drastically because of the inherent bias in the source data. There arises a straight contradiction between the Bayesian prior and the Bayesian posterior of this logical system. If we go by Plato’s logic, we can eliminate the premise entirely; therefore, we eliminate the “Ethnicity” field from the admission model.

Now let us look closely at the “GPA” and “Standardised Test Score” fields. If you eliminate one at a time, you may get contradictory results, and if you use Plato’s argument on a logical system containing only these two data points, you end up eliminating the premise and going nowhere. This is where Hegel saves us. The reason you were getting contradictory results on these two data points is that someone with a good GPA, earned through long-term perseverance and consistently good performance, may not score well on a standardised test, while someone with an average GPA can be efficiently trained in prep schools to get a good standardised score. So you have to keep both data points to counterbalance each other. You can create a Hegelian thesis out of the Bayesian prior built on this two-point system, and two Hegelian antitheses by removing one of the data points at a time.

Now the final step. This is where you take each antithesis as a starting point and gradually sublate towards a more central synthesis. You may bring additional data points into your source data to help achieve this. In our example, you may bring in data points such as access to prep school, socio-economic condition or family income of the applicant in order to create a negative of the antithesis, thereby gradually sublating towards the thesis without quite reaching it. As a by-product of including these additional data points in the model, you can also counterbalance the effect of the other data point, the one about the requirement of a scholarship. You can therefore nudge a need-aware admission system towards a more need-blind one without making it fully need-blind. At the time of model creation, you can run your models through iterative sequences of traditional and Hegelian dialectic to eliminate or add usable data points. This would render the ‘insight’ as close as possible to the ideal form of an unbiased truth. Even then, it would not be perfect! One possible coding of this eliminate-or-counterbalance loop is sketched below.
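The sketch below is one possible way to operationalise the loop just described, restating the toy posterior_admit and likelihoods from the previous sketch so it runs on its own. The 50% relative-shift test for a “contradiction” and the counterweight pairing of GPA with the standardised test score are assumptions made purely for illustration, not a prescribed algorithm.

```python
# One possible operationalisation of the eliminate-or-counterbalance loop above.
# The contradiction threshold and counterweight pairs are invented assumptions.

def posterior_admit(evidence, likelihoods, prior_admit=0.03):
    """P(admit | evidence) via Bayes' theorem with a naive independence assumption."""
    p_admit, p_reject = 1.0, 1.0
    for field, value in evidence.items():
        pa, pr = likelihoods[field][value]
        p_admit, p_reject = p_admit * pa, p_reject * pr
    num = p_admit * prior_admit
    return num / (num + p_reject * (1 - prior_admit))


likelihoods = {
    "test_score": {"high": (0.80, 0.30), "low": (0.20, 0.70)},
    "gpa":        {"high": (0.75, 0.35), "low": (0.25, 0.65)},
    "ethnicity":  {"X": (0.02, 0.15), "other": (0.98, 0.85)},
}


def shift_when_dropped(evidence, field):
    """Relative change in the posterior when `field` is removed from the evidence."""
    full = posterior_admit(evidence, likelihoods)
    reduced = posterior_admit({k: v for k, v in evidence.items() if k != field}, likelihoods)
    return abs(full - reduced) / max(reduced, 1e-9)


def dialectic_screen(evidence, counterweights, threshold=0.5):
    """Drop fields whose removal exposes a contradiction (Plato), unless a
    counterweight field is present to keep them in balance (Hegel)."""
    kept = dict(evidence)
    for field in list(kept):
        if shift_when_dropped(kept, field) <= threshold:
            continue                    # no contradiction: the field stays
        partner = counterweights.get(field)
        if partner and partner in kept:
            continue                    # thesis/antithesis pair counterbalance: keep both
        kept.pop(field)                 # contradictory premise with no counterweight: eliminate
    return kept


counterweights = {"gpa": "test_score", "test_score": "gpa"}
applicant = {"test_score": "high", "gpa": "high", "ethnicity": "X"}
print(dialectic_screen(applicant, counterweights))
# -> {'test_score': 'high', 'gpa': 'high'}: ethnicity is eliminated, GPA and test score retained
```

The synthesis step of the triad, bringing in counterbalancing fields such as prep-school access or family income, would amount to extending likelihoods with those fields and re-running the same screen; that extension is left out of this sketch.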

