Here we are... It's time to talk about ethics and artificial intelligence, the scary stuff... 😱
Today, we increasingly question the ethics of the algorithms that surround us and the future impact that artificial intelligence will have on legal decisions. Disruptive technologies and machines progress faster than the law, which adapts after the fact to make up for its shortcomings.
But beyond the headline technologies that show us the future (autonomous cars, ChatGPT, voice assistants, etc.), we are surrounded by algorithms whose performance and ethics are sometimes questionable...
Let's take the example of an application that has an impact on the lives - and emotional choices - of 75 million people: Tinder. The book “L'amour sous Algorithme” by Judith Duportail (a short overview of her book here) provides a frightening account of an investigation into how Tinder's algorithms work (1). The author explains that each user has a hidden attractiveness score and that Tinder recommends profiles that show "coincidences" to create emotions and matches worthy of the most beautiful love stories (for example, finding the same initials between two profiles that meet). Tinder also recommends the profiles of our future loves based on a set of macho criteria. For example, a man is more likely to be shown profiles of younger women with a lower level of education or wealth.
This maddening observation pushed me to look for the ultimate source: Tinder's patent (2).
Indeed, after reading just a few lines, I learned that:
Each user has an attractiveness score computed by Tinder.
Recommended profiles are those with similar attractiveness scores (attractiveness as rated by other users). Let's put it bluntly: the ugly meet the ugly and the beautiful are condemned to meet the beautiful.
Men are more likely to come across profiles of women who are 5 years younger than them.
Women are more likely to come across profiles of men between -2 and +5 years of their age.
All the points raised by Judith Duportail are correct.
In other words, Tinder considers that a match is more plausible between a man from a higher social background (with a high level of education and/or a high income) and a poor young "girl" (5 years younger) who is still learning about life and who comes from a more modest background (fewer years of study and/or a lower income). In short, Tinder promotes a deplorable representation of the world and of morality.
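To make these rules concrete, here is a deliberately naive sketch in Python. It is in no way Tinder's actual code: the attractiveness threshold of 10 points and the boolean output are my own assumptions; only the age windows and the score-similarity idea come from the patent.
# Toy sketch (NOT Tinder's code) of the matching rules described above
def is_recommended(viewer: dict, candidate: dict) -> bool:
    # Rule 1: attractiveness scores must be close (threshold chosen arbitrarily here)
    if abs(viewer["attractiveness"] - candidate["attractiveness"]) > 10:
        return False
    # Rule 2: asymmetric age windows depending on the viewer's sex
    age_gap = candidate["age"] - viewer["age"]
    if viewer["sex"] == "m":
        return -5 <= age_gap <= 0   # men mostly see women up to 5 years younger
    return -2 <= age_gap <= 5       # women mostly see men from 2 years younger to 5 years older

# Example: a 30-year-old man and a 26-year-old woman with close scores
print(is_recommended({"sex": "m", "age": 30, "attractiveness": 60},
                     {"sex": "f", "age": 26, "attractiveness": 55}))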
In this world of algorithms created consciously or unconsciously in an immoral manner, I ask myself the following question: can we easily detect the ethical failings of the algorithms around us?
This is part of the objectives of Giskard (3): an open-source Python library whose mantra is the following:
Eliminate risks of biases, performance issues & security holes in Machine Learning models.
So I pushed it to the limit... I created an algorithm worthy of Black Mirror and The Handmaid's Tale using data from dating apps (OkCupid). I created a rotten algorithm that counts our social points by imagining a society that advocates inequality between the sexes, inequality of origins, the expectation that women should want children, the supremacy of heterosexuals, absolute body aestheticism and an impeccable lifestyle (such a great society).
More precisely, if you are a woman, you earn points if:
You want children
You do not have a high level of education or you are unemployed
You don't smoke, you don't do drugs, you never drink
You are thin, athletic, skinny, or fit
You are white
I applied the same reasoning for men, except for one difference:
You earn points when you have a high level of education or when you work in a sector that earns a lot of money
The perfect couple would therefore be...
And I submitted my algorithm to Giskard.
Happy reading.
A - Evaluation of a happiness score prediction model depending on the environment
To test the Giskard library, I first used simpler data: a dataset that describes people's happiness through environment variables such as housing costs or school quality. The objective here is not to work on the performance of the algorithm but to test the reliability of Giskard first.
I applied a basic classification algorithm (a Logistic Regression) on my dataset.
1 - Implementation of Logistic Regression
First, let's install the necessary libraries for our study.
# Import libraries
import numpy as np
import pandas as pd
from scipy.special import softmax
from datasets import load_dataset
from giskard import Dataset, Model, scan, testing, GiskardClient, Suite
Installing the Giskard library is very easy; we just need to follow the steps below.
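In practice, the installation typically boils down to a single command (run it in a terminal, or prefix it with "!" in a notebook cell); the exact procedure may vary with your environment.
# Install the Giskard library from PyPI
pip install giskard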
Secondly, I imported my dataset (downloaded from Kaggle (4)), defined the important columns and indicated the label to predict, that is, the 0 or 1 score of the "happy" column.
# Import Dataset
df_happy = pd.read_csv("happydata.csv")
df_happy
# Define X and Y
columns_X = ['infoavail', 'housecost', 'schoolquality', 'policetrust',
             'streetquality', 'events']
columns_Y = ['happy']
# Define dataframe X and Y
X = df_happy.loc[:, columns_X]
Y = df_happy.loc[:, columns_Y]
I separated the dataset into training and test data using the sklearn library.
# Train, test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
I called up sklearn's logistic regression and fit it to the data.
# Model of happy data
from sklearn.linear_model import LogisticRegression
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()
# Fit to data
logisticRegr.fit(x_train, y_train)
I defined my algorithm's predictions and displayed them.
# Make predictions
predictions = logisticRegr.predict(x_test)
# Display predictions
y_test['predictions']=predictions
An example here:
Now let's display the score of my algorithm. Here, since this is a classification, we can use the "Accuracy" metric (5,6). It measures the proportion of correct predictions, over both the positive and the negative class.
# Use score method to get accuracy of model
score = logisticRegr.score(X, Y)
print(score)
This accuracy amounts to 0.6013986013986014.
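Note that this score is computed on the full dataset (training and test data together). As a sanity check, one could also measure accuracy on the held-out test set only; this is a small sketch and the number will differ from the score above.
# Accuracy on the held-out test set only (sanity check)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test['happy'], predictions))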
An accuracy of around 0.60 is not great. Probably because the X variables are only weakly linked to the happiness score. For example, having an expensive house probably doesn't affect happiness. To make sure this didn't come from the algorithm itself, I tested an AutoML method (AutoGluon, 11) which tries a set of classification or regression algorithms and returns the best-performing one for our problem. It turns out that the chosen algorithm performed worse than the logistic regression I had used beforehand (results in the annexes). Therefore, I kept the initial logistic regression for our study.
2 - Testing the performance of the algorithm with the Giskard library
The chosen data was very clean, so there was no need to perform any data preprocessing. I therefore used the Giskard library directly.
giskard_dataset = Dataset(
df=df_happy, # A pandas.DataFrame that contains the raw data (before all the pre-processing steps) and the actual ground truth variable (target).
target="happy", # Ground truth variable.
name="Happiness score", # Optional.
cat_columns=columns_X
)
Afterward, I evaluated my model and specified "classification" in the "model_type" variable.
# Evaluate our model
giskard_model = Model(
model=logisticRegr, # A prediction function that encapsulates all the data pre-processing steps and that could be executed with the dataset used by the scan.
model_type="classification", # Either regression, classification or text_generation.
name="Logistic Regression on Happiness score", # Optional
classification_labels=logisticRegr.classes_, # Their order MUST be identical to the prediction_function's output order
)
We can now scan our model and view the results!
# Scan our model
results = scan(giskard_model, giskard_dataset)
display(results)
Drum roll....
Giskard indicates that there are 4 major performance issues. I am particularly surprised by its precision and by the significant contribution the library makes to the explainability of our algorithms. It evaluates the model from different angles: accuracy, precision and recall, depending on the type of model (here, a classification).
Let's look at the first problem in more detail:
I learned that my algorithm does not perform well and that it is much less precise (44.57% lower) on the "street quality" column. On the right, we can find the "Predicted 'happy'" column with the original values and the predicted values.
Giskard also tells me that my algorithm is particularly bad with the following columns: infoavail, events and schoolquality.
From this analysis, we can identify several actions to improve the performance of our algorithm. These conclusions remain the responsibility of the data scientist and allow us to analyze the first version of our algorithms in order to optimize them as much as possible.
Following this expertise, I could:
conduct a more in-depth study on my data to identify the columns that make the most sense in my prediction. For example, I could delete the "street quality" column if I conclude that this data has no impact on the prediction of the happiness score.
optimize my model by transforming my data (for example, applying a PCA or other transformations, as sketched right after this list) to extract more value.
use a different classification algorithm (example: an SVM).
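To illustrate the second point, here is a minimal sketch of the PCA idea mentioned in the list above (the number of components is an arbitrary choice for illustration, not a tuned value):
# Sketch: chain a PCA with the logistic regression (n_components chosen arbitrarily)
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pca_model = make_pipeline(PCA(n_components=3), LogisticRegression())
pca_model.fit(x_train, y_train.values.ravel())
print(pca_model.score(x_test, y_test['happy']))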
I am pleasantly surprised by this first test.
Now let's get down to business with my death algorithm...☠️
B - Evaluation of a prediction model devoid of moral sense
Let's take the data from the dating app OKCupid. Here is an example of the first columns:
This data is a gold mine for creating our death algorithm. We can find the following columns:
'age', 'status', 'sex', 'orientation', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity', 'height', 'income', 'job', 'last_online', 'location', 'offspring', 'pets', 'religion', 'sign', 'smokes', 'speaks', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9'.
The objective is twofold:
Create an additional column that adds social points based on profile.
Apply a regression algorithm (Random Forest, 12) to predict the social points of future profiles.
Random Forest is an ensemble technique that trains a set of decision trees. Reading this article will enlighten you about when to choose the Random Forest algorithm.
1 - Data processing
First, let's import our dataset and the necessary libraries that we did not import in the previous test.
# Import libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# Import dataset
df = pd.read_csv('okcupid_profiles.csv')
df.head()
Next, let's create our social score.
# Create a classification based on points
#list_social_point = []
for index, row in df.iterrows():
#print(index)
social_count = 0
if row['sex']=='f':
#Orientation
if row['orientation']=='straight':
social_count += 10
if row['orientation']=='gay':
social_count -= 10
# Body Type
if row['body_type']=='thin'or row['body_type']=='athletic' or row['body_type']=='fit' or row['body_type']== 'skinny':
social_count += 10
if row['body_type']=='a little extra' or row['body_type']=='curvy' or row['body_type']=='jacked':
social_count -= 5
if row['body_type']=='full figured' or row['body_type']=='used up' or row['body_type']=='overweight':
social_count -= 10
# Drinks
if row['drinks']=='not at all':
social_count += 10
if row['drinks']=='rarely' or row['drinks']=='socially':
social_count += 5
if row['drinks']=='often' or row['drinks']=='very often':
social_count -= 5
if row['drinks']=='desperately':
social_count -= 10
#Drugs
if row['drugs']=='never':
social_count += 10
if row['drugs']=='sometimes':
social_count -= 5
if row['drugs']=='often':
social_count -= 10
#Education
if 'masters' in str(row['education']) or 'ph.d' in str(row['education']) or 'med' in str(row['education']) or 'law' in str(row['education']):
social_count -= 10
#Ethnicity
if 'white' in str(row['ethnicity']):
social_count += 10
#Job
if 'unemployed' in str(row['job']):
social_count += 10
#offspring
if 'might want them' in str(row['offspring']) or 'but wants them' in str(row['offspring']) or 'wants kids' in str(row['offspring']) or 'has a kid, and wants more' in str(row['offspring']) or 'might want more' in str(row['offspring']) or 'might want kids' in str(row['offspring']) or 'and wants more' in str(row['offspring']):
social_count += 10
if "doesn't want kids" in str(row['offspring']) or "doesn't want any" in str(row['offspring']) or "doesn't want more" in str(row['offspring']) :
social_count -= 10
#Smoke
if row['smokes']=='no':
social_count += 10
if row['smokes']=='sometimes' or row['smokes']=='when drinking':
social_count -= 5
if row['smokes']=='yes' or row['smokes']=='trying to quit':
social_count -= 10
df.loc[index,'social points'] = social_count
if row['sex']=='m':
#Orientation
if row['orientation']=='straight':
social_count += 10
if row['orientation']=='gay':
social_count -= 10
# Body Type
if row['body_type']=='thin'or row['body_type']=='athletic' or row['body_type']=='fit' or row['body_type']== 'skinny':
social_count += 10
if row['body_type']=='a little extra' or row['body_type']=='curvy' or row['body_type']=='jacked':
social_count -= 5
if row['body_type']=='full figured' or row['body_type']=='used up' or row['body_type']=='overweight':
social_count -= 10
# Drinks
if row['drinks']=='not at all':
social_count += 10
if row['drinks']=='rarely' or row['drinks']=='socially':
social_count += 5
if row['drinks']=='often' or row['drinks']=='very often':
social_count -= 5
if row['drinks']=='desperately':
social_count -= 10
#Drugs
if row['drugs']=='never':
social_count += 10
if row['drugs']=='sometimes':
social_count -= 5
if row['drugs']=='often':
social_count -= 10
#Education
if 'masters' in str(row['education']) or 'ph.d' in str(row['education']) or 'med' in str(row['education']) or 'law' in str(row['education']):
social_count += 10
#Ethnicity
if 'white' in str(row['ethnicity']):
social_count += 10
#Job
if 'unemployed' in str(row['job']):
social_count -= 10
#offspring
if 'might want them' in str(row['offspring']) or 'but wants them' in str(row['offspring']) or 'wants kids' in str(row['offspring']) or 'has a kid, and wants more' in str(row['offspring']) or 'might want more' in str(row['offspring']) or 'might want kids' in str(row['offspring']) or 'and wants more' in str(row['offspring']):
social_count += 10
if "doesn't want kids" in str(row['offspring']) or "doesn't want any" in str(row['offspring']) or "doesn't want more" in str(row['offspring']) :
social_count -= 10
#Smoke
if row['smokes']=='no':
social_count += 10
if row['smokes']=='sometimes' or row['smokes']=='when drinking':
social_count -= 5
if row['smokes']=='yes' or row['smokes']=='trying to quit':
social_count -= 10
df.loc[index,'social points'] = social_count
Let's check the result by displaying all the unique values in our new column.
df['social points'].unique()
We see that there are negative scores (the worst profiles have a score of -45) and high scores (the best profiles have a score of 80).
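To check these extremes directly, a quick sanity check is to print the minimum and maximum of the column:
# Range of the social score
print(df['social points'].min(), df['social points'].max())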
Now let's fill in the missing data with the most frequent value of each column. To do this, we will use the "mode" method provided by the pandas library.
# Replace NaN values with the most common value of each column
df = df.fillna(df.mode().iloc[0])
2 - Implementation of the Random Forest algorithm
Let's define the columns X and Y which will serve as input for our algorithm. Then, we can create their respective dataframes. We removed the "income" column (income per profile), which seemed inconsistent, the "last_online" column (last connection of each profile), which did not provide relevant information for predicting our social score, and the "essay..." columns (profile descriptions), which would require additional natural language processing (NLP).
# Define X and Y
columns_X_s = ['age', 'status', 'sex', 'orientation', 'body_type', 'diet', 'drinks','drugs', 'education', 'ethnicity', 'height', 'job', 'location', 'offspring', 'pets', 'religion', 'sign','smokes', 'speaks']
columns_Y_s = ['social points']
# Define dataframe X and Y
X = df.loc[:, columns_X_s]
Y = df.loc[:, columns_Y_s]
We end up with the following data formats:
X.shape : (59946, 21)
Y.shape : (59946, 1)
Now, we can define the categorical columns and the numerical columns. This step is essential to be able to invoke our processing pipeline and to work with the Giskard library.
# Define categorical columns
CAT_COLUMNS = ['status', 'sex', 'orientation', 'body_type', 'diet', 'drinks','drugs', 'education', 'ethnicity', 'height', 'job','location', 'offspring', 'pets', 'religion', 'sign','smokes', 'speaks']
# Define numerical columns
NUMERICAL_COLS = ['age','height']
Let's then separate the dataset into training and test data using the sklearn library.
# Train, test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
The Random Forest regressor expects the target as a 1-dimensional array: 2-dimensional column vectors are not supported. We will therefore flatten our y dataframes into 1-dimensional arrays in order to use them.
y_train_r = y_train.values.ravel()
y_test_r = y_test.values.ravel()
Then, we can build our processing pipeline. In this pipeline, we apply two processing steps to our data:
The Standard Scaler on our numerical columns (13)
One-Hot encoding on our categorical columns (14)
The goal is to normalize our numerical data and transform our categorical data into binary data to be able to use our algorithm.
# Define preprocessing pipelines
preprocessor = ColumnTransformer(transformers=[
("scaler", StandardScaler(), NUMERICAL_COLS),
("one_hot_encoder", OneHotEncoder(handle_unknown="ignore", sparse=False), CAT_COLUMNS),
])
We can now call our Random Forest algorithm in regression mode (note that classification mode is available for this same algorithm).
We used a small number of decision trees (n_estimators) because the objective here is to evaluate the ethics of our model rather than to optimize its performance.
pipeline = Pipeline(steps=[
("preprocessor", preprocessor),
("Random Forest", RandomForestRegressor(n_estimators = 30, random_state = 42))
])
pipeline.fit(x_train, y_train_r)
y_train_pred = pipeline.predict(x_train)
y_test_pred = pipeline.predict(x_test)
Let's evaluate our model with the following metrics: MAE (7), MSE (8) and RMSE (9). These are the most commonly used metrics to evaluate a Random Forest regression model.
# Evaluation of our model
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test_r, y_test_pred))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test_r, y_test_pred))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test_r, y_test_pred)))
Here are the results:
Mean Absolute Error (MAE): 5.880798199709589
Mean Squared Error (MSE): 59.6977579536871
Root Mean Squared Error (RMSE): 7.726432420832211
These metrics can be interpreted as follows:
The MAE represents the average absolute difference between the predicted values and the actual values.
The MSE represents the dispersion of the predictions by averaging the squared errors between actual values and predictions.
The RMSE is the square root of the MSE, which brings the error back to the scale of our data.
We must therefore always interpret these values relative to the scale of our data (remember that our social score ranges from -45 to 80). Given these definitions, our model is not perfect but seems promising for a first version.
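For reference, these three metrics can be rewritten in a few lines of NumPy; this sketch is equivalent to the sklearn calls above.
# Hand-written versions of the three metrics (equivalent to sklearn.metrics)
errors = y_test_r - y_test_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
print(mae, mse, rmse)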
3 - Testing the algorithm with the Giskard library
The Giskard library requires as input a table that contains our raw data together with the ground truth. To do this, let's reuse our dataframes (without ravel) resulting from the train/test split.
# Define raw data
raw_data = pd.concat([x_test, y_test], axis=1)
raw_data
Let's define our dataset with raw_data as input.
# Define Giskard Dataset
giskard_dataset = Dataset(
df=raw_data, # A pandas.DataFrame that contains the raw data (before all the pre-processing steps) and the actual ground truth variable (target).
target="social points", # Ground truth variable.
name="Social points", # Optional.
cat_columns=CAT_COLUMNS
)
Let's evaluate our model with the Giskard library.
# Evaluate our model
giskard_model = Model(
model=pipeline.predict, # A prediction function that encapsulates all the data pre-processing steps and that could be executed with the dataset used by the scan.
model_type="regression", # Either regression, classification or text_generation.
name="Random Forest on social points", # Optional.
feature_names=x_train.columns # Default: all columns of your dataset.
)
To be sure you don't get any errors, I advise you to test the following line of code:
raw_data[x_test.columns].head()
This allows you to check if your giskard_dataset is consistent with your giskard_model.
Now let's scan our model and display the results.
# Scan our model
results = scan(giskard_model, giskard_dataset)
display(results)
Drum roll number 2....
Giskard indicates that our algorithm has performance issues. In particular, it tells us that our MSE is worse for certain values in the “sign”, “religion”, “orientation”, “job” and “location” columns.
It is understandable to find the columns that were not used to build our social score (for example: sign, religion, location). However, it is strange to see a performance drop on the "orientation" and "job" columns. Indeed, remember that the social score is lower when a profile indicates homosexuality in the "orientation" column.
This conclusion is interesting because it tells us that the Random Forest has not yet fully captured the logic behind our social score calculation.
However, the main problem with our algorithm lies in its ethics. Yet Giskard does not diagnose any problem of this kind.
The library gives us the possibility of using advanced parameters to force the diagnosis of our algorithm. Let's test this to evaluate the ethics of our model.
import giskard as gsk
report = gsk.scan(giskard_model, giskard_dataset, only="ethical")
Unfortunately, this returns an error telling us that no ethical issue was detected.
C - What is next?
The Giskard library test is particularly promising. The indications of vulnerability regarding the performance of our two algorithms are very interesting, particularly for understanding data and questioning models.
Giskard also suggests incorporating this process into the testing phase of a CI/CD chain. I definitely recommend this approach.
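Concretely, the scan results obtained above can be turned into a test suite that a CI job runs on every change; the sketch below follows Giskard's documented API at the time of writing, and the exact attribute names may evolve.
# Turn the scan results (from section B) into a reusable test suite
test_suite = results.generate_test_suite("Social points regression suite")
suite_results = test_suite.run()
# Assumption: the result object exposes a boolean `passed`; in a CI job,
# a failed suite can simply fail the build.
print(suite_results.passed)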
One of the major limitations detected remains the poor detection of the ethical problems of our second algorithm. This problem probably fades when the dataset contains few ethically questionable columns: it is easier to spot a dissonance in the data when only one column is problematic. However, this article proposes a more pessimistic scenario: the use of algorithms that are fundamentally bad by design and whose predictions rest on no moral foundation at all.
Giskard's next improvement could be a more general diagnostic of the ethics of algorithms, but also of the dataset itself. Proposing an upstream analysis of the input data could be a first way to prevent the potentially immoral nature of the prediction.
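As a very simple illustration of such an upstream check (my own sketch, not an existing Giskard feature), one could look at how the target behaves across sensitive groups before training anything:
# Upstream check on the raw data: average social score per sensitive group.
# The gaps between groups expose the bias baked into the target itself.
print(df.groupby('sex')['social points'].mean())
print(df.groupby('orientation')['social points'].mean())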
We saw it in the introduction: some algorithms, like those of Tinder, are based on questionable societal rules. It is now imperative to find a way to evaluate these algorithms: those that manage our romantic choices, our loan applications, our insurance profile, our future autonomous car, and so on.
The ethics of algorithmic decision-making IS fundamental. It is an essential first filter for our current and future lives as artificial intelligence takes an ever more important place in them.
However, there is a crucial question: would this type of ethical filter be beneficial for all companies?
Unfortunately no. I am sure that Tinder would see no point in using this type of filter unless its algorithms generated damaging matches that led to criminal convictions (a macho algorithm is not yet punishable (🤞), but an algorithm that disadvantages non-Caucasian ethnicities is more open to legal challenge).
But Giskard offers general evaluations that go beyond ethics (for example: robustness, performance, hallucinations for LLMs).
Thus, I can only recommend continuing the development of this library, which takes a real step forward in the explainability of our algorithms.
And you, what do you think?
Thanks for reading and feel free to comment!
Annexes - AutoML with AutoGluon
# Define train and test
df_happy_train = df_happy[:110]
df_happy_test = df_happy[111:]
from autogluon.tabular import TabularDataset, TabularPredictor
predictor = TabularPredictor(label="happy").fit(train_data=df_happy_train)
predictions = predictor.predict(df_happy_test)
predictor.evaluate(df_happy_test)
Results:
{'accuracy': 0.5,
'balanced_accuracy': 0.5476190476190477,
'mcc': 0.14285714285714285,
'roc_auc': 0.5813492063492063,
'f1': 0.6190476190476191,
'precision': 0.4642857142857143,
'recall': 0.9285714285714286}
BIBLIOGRAPHY
What are the sources that dissect Tinder's algorithms?
1.“L’Amour Sous Algorithme, Judith Duportail | Livre de Poche.” Accessed January 1, 2024. https://www.livredepoche.com/livre/lamour-sous-algorithme-9782253101437.
2. Rad, Sean, Todd M. Carrico, Kenneth B. Hoskins, James C. Stone, and Jonathan Badeen. Matching process system and method. United States US9733811B2, filed October 21, 2013, and issued August 15, 2017. https://patents.google.com/patent/US9733811B2/en?q=(tinder+attractivness)&oq=tinder+attractivness.
What is Giskard?
Where to find open-source and free data?
How to evaluate a Machine Learning model?
How to use AutoGluon to do AutoML?
How to choose your regression algorithm?
How to transform your data to apply a regression?