Multilingual Toxic Comment Classification using TensorFlow and TPUs
In this post, we’ll create an application that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. If these toxic contributions can be identified, we could have a safer, more collaborative internet.
The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation. In 2020, Jigsaw organized a competition on Kaggle where competitors had to build machine learning models that can identify exactly this kind of toxicity in online conversations.
Dataset Description
As part of the competition, competitors were provided several files, specifically:
training files (read in the code below via toxic_comment_train_csv_path and unintended_bias_train_csv_path) - English comments with toxicity labels, taken from Jigsaw's two previous toxic comment competitions.
validation.csv - comments from Wikipedia talk pages in different non-English languages.
test.csv - comments from Wikipedia talk pages in different non-English languages.
Evaluation Metric
Submissions were evaluated based on Area Under the ROC Curve between the predicted probability and the observed target.
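For illustration, here is how this metric can be computed with scikit-learn on a toy set of labels and predicted probabilities (this is not competition code, just a sketch of the metric):

```python
# ROC AUC between predicted probabilities and observed binary targets
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]           # observed labels (1 = toxic)
y_pred = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities
print(roc_auc_score(y_true, y_pred))  # 0.75
```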
Strategies to Tackle
Monolingual Approach
Monolingual models are language models trained on a single language.
They focus on understanding and generating text in that one language.
For example, a monolingual model trained on English will be proficient in understanding and generating English text.
These models are typically used for tasks such as text classification, sentiment analysis and more within a specific language.
Monolingual models are beneficial when our training and testing datasets, as well as the upcoming unseen data, are all in one specific language.
Multilingual Approach
Multilingual models, on the other hand, are trained on multiple languages.
They are designed to handle and process text in many languages, allowing them to perform across different languages.
Multilingual models have the advantage of being able to provide language-agnostic solutions, as they can handle a wide range of languages.
They can be used for zero-shot and few-shot learning, where the model performs a task in a language it has not been specifically trained on by leveraging its knowledge of other languages.
Which models to use for our problem?
As per the dataset given in the competition, we have only English data in the training dataset; the validation dataset contains only a few samples, covering just Spanish, Turkish and Italian; and the test dataset contains Turkish, Spanish, Italian, Russian, French and Portuguese.
Since our validation and test datasets contain non-English languages, it is a better approach to build multilingual models rather than monolingual ones.
If we had only one language (as stated above), building monolingual models would be the better choice.
Let’s discuss the multilingual modelling approach a bit more:
How are multilingual models trained?
Multilingual models are pre-trained on a mix of different languages, and they don’t distinguish between them.
English BERT was pre-trained on English Wikipedia and the BookCorpus dataset, while multilingual models like mBERT were pre-trained on Wikipedia in 102 different languages, and XLM-Roberta was pre-trained on CommonCrawl data in 100 different languages.
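To make this concrete, here is a small illustration (not from the competition notebook) showing that a single multilingual tokenizer handles text in many languages with one shared subword vocabulary; I use the smaller xlm-roberta-base here to keep the download light:

```python
from transformers import AutoTokenizer

# One shared subword vocabulary covers all pre-training languages
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tok.tokenize("This comment is toxic."))       # English
print(tok.tokenize("Ce commentaire est toxique."))  # French
print(tok.tokenize("Bu yorum çok kaba."))           # Turkish
```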
Cross Lingual Transfer
Cross-lingual transfer refers to transfer learning that uses data and models from a language with ample resources (e.g., English) to solve tasks in another, commonly lower-resource, language.
In our case, we are trying to create an application that can automatically detect whether a sentence or phrase is toxic or not.
Models like XLM-Roberta give us the ability to fine-tune on an English dataset and then identify toxic comments in other languages.
XLM-R is able to take the knowledge learnt in one language and apply it to other languages, even though it never saw those languages during fine-tuning.
This concept of transfer learning applied from one language to another is referred to as Cross-Lingual Transfer (also known as Zero-Shot Learning).
Another reason to use pre-trained multilingual models for a task like ours is the scarcity of resources for many languages: different languages have very different amounts of training data available for building models like BERT and its variants.
Languages like English, Chinese, Russian, Indonesian and Vietnamese are high-resource languages, whereas languages like Sundanese and Assamese are low-resource ones.
Training our own BERT-like model for these low-resource languages would be very expensive in terms of data collection, and performance would likely suffer; therefore, we should leverage these multilingual models.
What experiments did I perform?
Overall, I performed 9 experiments with the following ideas in mind:
Preprocessing: techniques like stopword removal, URL removal, expansion of contractions, collapsing repeated characters within words, and punctuation removal (a sketch follows this list).
Models: we experimented with 2 models, mBERT & XLM-Roberta.
Training data: we used 2 variants of the dataset: one where we used the provided training datasets, balanced by the target, and another where we used the training data translated into the languages present in the test set (along with the original English), again with class balancing.
We also always trained on the validation dataset to further boost the model’s performance.
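As a rough sketch (not the exact code from my experiments, and assuming the nltk stopwords corpus is downloaded), the preprocessing steps above could look like this:

```python
import re
import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))
# Small illustrative contraction map; a real one would be much larger
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot"}

def preprocess(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # remove URLs
    for short, expanded in CONTRACTIONS.items():        # expand contractions
        text = text.replace(short, expanded)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)          # collapse repeated characters ("sooo" -> "soo")
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    return ' '.join(w for w in text.split() if w not in STOPWORDS)    # drop stopwords

print(preprocess("I'm sooo annoyed!!! See https://example.com"))
```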
Now we build a model with the following ideas: training on the original training & validation datasets, class balancing (undersampling), fine-tuning the model for 2 epochs each on the training and the validation dataset, and no preprocessing of the data. We will be leveraging the TPUs offered by Kaggle.
Collecting nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 19.0 MB/s eta 0:00:00
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/site-packages (from nltk) (4.65.0)
Requirement already satisfied: click in /usr/local/lib/python3.8/site-packages (from nltk) (8.1.3)
Requirement already satisfied: joblib in /usr/local/lib/python3.8/site-packages (from nltk) (1.2.0)
Collecting regex>=2021.8.3
Downloading regex-2023.5.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 771.9/771.9 KB 38.4 MB/s eta 0:00:00
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.5.5
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
Initializing the TPU configuration and other constants: number of epochs, batch size (16 × the number of cores offered on the TPU), MAX_LEN (maximum sentence length), the xlm-roberta-large model, number of non-toxic samples to keep (for undersampling) = 150k, learning rate = 1e-5, etc.
```python
#################### TPU Configurations ####################
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

AUTO = tf.data.experimental.AUTOTUNE

# Configuration
EPOCHS = 2
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
MODEL = 'xlm-roberta-large'
NUM_SAMPLES = 150000
RANDOM_STATE = 42
LEARNING_RATE = 1e-5
######################### MAIN CHANGE ############################
WEIGHT_DECAY = 1e-6
```
Running on TPU
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: local
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
REPLICAS: 8
Reading & Balancing the data by Target column
```python
## Reading csv files
train1 = pd.read_csv(toxic_comment_train_csv_path)
train2 = pd.read_csv(unintended_bias_train_csv_path)
valid = pd.read_csv(validation_csv_path)
test = pd.read_csv(test_csv_path)
sub = pd.read_csv(submission_csv_path)

## Converting floating-point toxicity scores to integer labels
train2.toxic = train2['toxic'].round().astype(int)

##### BALANCING THE DATA #####
# Taking all the data from the toxic comment train file, all toxic observations
# from the unintended bias train file, and randomly sampling 150k observations
# from the non-toxic observation population.
train = pd.concat([
    train1[['comment_text', 'toxic']],
    train2[['comment_text', 'toxic']].query('toxic==1'),
    train2[['comment_text', 'toxic']].query('toxic==0').sample(n=NUM_SAMPLES, random_state=RANDOM_STATE)
])

## Dropping missing observations in the comment_text column
train = train.dropna(subset=['comment_text'])
```
```python
def encode(texts, tokenizer, max_len):
    """
    Tokenizes a list of texts with a Hugging Face tokenizer,
    truncating/padding every sequence to max_len tokens.
    """
    tokens = tokenizer(texts,
                       max_length=max_len,
                       truncation=True,
                       padding='max_length',
                       add_special_tokens=True,
                       return_tensors='np')
    return tokens
```
Encoding comment_text
We first initialize the tokenizer from the Hugging Face transformers library and encode the comment_text column of our training, validation and test datasets.
Next, we write a function to create tuples of inputs and outputs, where the inputs are a dictionary.
We’ll be leveraging the tf.data.Dataset API to pass our inputs and outputs as tuples, i.e., (inputs, outputs), where inputs is {"input_ids": input_ids, "attention_mask": attention_mask} and outputs are the labels.
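Here is a minimal sketch of what that function might look like (the helper name and column names are illustrative; the token dictionaries come from encode() above):

```python
def to_dataset(tokens, labels=None, batch_size=BATCH_SIZE, training=False):
    # Inputs dictionary matching the model's named Input layers
    inputs = {"input_ids": tokens["input_ids"],
              "attention_mask": tokens["attention_mask"]}
    if labels is not None:
        ds = tf.data.Dataset.from_tensor_slices((inputs, labels))
    else:
        ds = tf.data.Dataset.from_tensor_slices(inputs)
    if training:
        ds = ds.repeat().shuffle(2048)
    return ds.batch(batch_size).prefetch(AUTO)

train_dataset = to_dataset(encode(train.comment_text.tolist(), tokenizer, MAX_LEN), train.toxic.values, training=True)
valid_dataset = to_dataset(encode(valid.comment_text.tolist(), tokenizer, MAX_LEN), valid.toxic.values)
test_dataset = to_dataset(encode(test.content.tolist(), tokenizer, MAX_LEN))
```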
```python
def build_model(transformer_layer, max_len):
    """
    Builds the input layers and classification head, then compiles the model.
    Returns: the compiled model object.
    """
    input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    # [1] selects the pooled output of the transformer
    output = transformer_layer.roberta(input_ids, attention_mask=attention_mask)[1]
    x = tf.keras.layers.Dense(1024, activation='relu')(output)
    y = tf.keras.layers.Dense(1, activation='sigmoid', name='outputs')(x)
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=y)
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    loss = tf.keras.losses.BinaryCrossentropy()
    AUC = tf.keras.metrics.AUC()
    model.compile(loss=loss, metrics=[AUC], optimizer=optimizer)
    return model
```
Loading model on TPUs
It is important to initialize & compile the model inside the with strategy.scope().
One thing I want to point out: for some reason I was getting slightly different results on each run, even though I was setting the seed before initializing the model; the differences are very small every time we run the pipeline, so the results are broadly consistent.
```python
with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    tf.random.set_seed(RANDOM_STATE)
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
```
Downloading tf_model.h5: 100%|██████████| 2.24G/2.24G [00:46<00:00, 48.0MB/s]
All model checkpoint layers were used when initializing TFXLMRobertaModel.
All the layers of TFXLMRobertaModel were initialized from the model checkpoint at xlm-roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaModel for predictions without further training.
/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/indexed_slices.py:459: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 256002048 elements. This may consume a large amount of memory.
warnings.warn(
2023-05-08 07:28:46.072958: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add_790/ReadVariableOp.
2023-05-08 07:28:48.332211: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add_790/ReadVariableOp.
2023-05-08 07:53:43.374173: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add/ReadVariableOp.
2023-05-08 07:53:43.934692: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add/ReadVariableOp.
Further fine-tuning the model on the validation data for 2 epochs
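As a sketch (dataset and variable names assumed from the earlier cells), the two-stage fine-tuning looks roughly like this:

```python
# Stage 1: fine-tune on the balanced English training data
n_steps = train.shape[0] // BATCH_SIZE
model.fit(train_dataset, steps_per_epoch=n_steps, epochs=EPOCHS)

# Stage 2: fine-tune further on the multilingual validation data
n_valid_steps = valid.shape[0] // BATCH_SIZE
model.fit(valid_dataset.repeat(), steps_per_epoch=n_valid_steps, epochs=EPOCHS)
```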
Public leaderboard score on Kaggle (test dataset): 0.936; private leaderboard score: 0.9346.
Results
| Experiment | Public Test LeaderBoard Score | Private Test LeaderBoard Score |
| --- | --- | --- |
| 1 (mBERT + no preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.8850 | 0.8869 |
| 2 (xlm-roberta-large + no preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.9259 | 0.9264 |
| 3 (mBERT + preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.8259 | 0.8239 |
| 4 (xlm-roberta-large + preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.8755 | 0.8754 |
| 5 (mBERT + no preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.9195 | 0.9212 |
| 6 (xlm-roberta-large + no preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.9329 | 0.9212 |
| 7 (mBERT + preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.8696 | 0.9212 |
| 8 (xlm-roberta-large + preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.8861 | 0.8866 |
| 9 (xlm-roberta-large + no preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 1e-5) | 0.936 | 0.9346 |
Saving the fine-tuned model for later use writes the model assets to disk:

WARNING:absl:Found untraced functions such as _update_step_xla, encoder_layer_call_fn, encoder_layer_call_and_return_conditional_losses, pooler_layer_call_fn, pooler_layer_call_and_return_conditional_losses while saving (showing 5 of 829). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: ../working/Multilingual_toxic_comment_classifier/assets
Loading the model
```python
import tensorflow as tf

model_save_path = "../working/Multilingual_toxic_comment_classifier"
loaded_model = tf.keras.models.load_model(model_save_path)
y = loaded_model.predict(test_dataset.take(1))
y[:6]
```
Writing a function to prepare new text for the model: we encode the text using the tokenizer with a maximum sentence length of 192.
```python
from transformers import AutoTokenizer

tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")
text = "politicians are like cancer for this country"

def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(text,
                       max_length=max_len,
                       truncation=True,
                       padding='max_length',
                       add_special_tokens=True,
                       return_tensors='tf')
    return {"input_ids": tokens['input_ids'],
            "attention_mask": tokens['attention_mask']}
```
Predicting the probability of toxic and non-toxic on a sample text.
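A minimal sketch, using the prep_data helper and the sample sentence defined above:

```python
# Probability of toxicity for the sample text
prob_toxic = loaded_model.predict(prep_data(text=text, tokenizer=tokenizer_, max_len=192))[0][0]
print(f"Probability of toxic: {prob_toxic:.3f}, non-toxic: {1 - prob_toxic:.3f}")
```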
Testing the model with a Gradio app before finally pushing the model to HuggingFace Spaces
```python
!pip3 install gradio --quiet

import tensorflow as tf
import gradio as gr
from transformers import AutoTokenizer

loaded_model = tf.keras.models.load_model(model_save_path)
tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")

examples_list = [
    "politicians are like cancer for this country",
    "Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было,",
    "Для каких стан является эталоном современная система здравоохранения РФ? Для Зимбабве? Ты тупой? хох",
]

def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(text,
                       max_length=max_len,
                       truncation=True,
                       padding='max_length',
                       add_special_tokens=True,
                       return_tensors='tf')
    return {"input_ids": tokens['input_ids'],
            "attention_mask": tokens['attention_mask']}

def predict(text):
    prob_of_toxic_comment = loaded_model.predict(prep_data(text=text, tokenizer=tokenizer_, max_len=192))[0][0]
    prob_of_non_toxic_comment = 1 - prob_of_toxic_comment
    probs = {"prob_of_toxic_comment": float(prob_of_toxic_comment),
             "prob_of_non_toxic_comment": float(prob_of_non_toxic_comment)}
    return probs

interface = gr.Interface(fn=predict,
                         inputs=gr.components.Textbox(lines=4, label='Comment'),
                         outputs=[gr.Label(label='Probabilities')],
                         examples=examples_list,
                         title='Multi-Lingual Toxic Comment Classification.',
                         description='XLM-Roberta Large model')
interface.launch(debug=False, share=True)
```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Running on local URL: http://127.0.0.1:7865
Running on public URL: https://af370decb4339b429e.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
1/1 [==============================] - 8s 8s/step
1/1 [==============================] - 1s 530ms/step
1/1 [==============================] - 1s 513ms/step
Wow!!!
Our application is up and running. This link is only temporary and remains live for only 72 hours; for permanent hosting, we can upload our Gradio app interface to HuggingFace Spaces.
Now, manually download all the files and folders from the Kaggle output of this kernel to your local machine.
Turning our Multi-Lingual Toxic Comment Classification Gradio Demo into a deployable app
We’ll deploy the demo application on HuggingFace Spaces.
What is HuggingFace Spaces?
It is a resource that allows anybody to host and share machine learning applications.
Deployed Gradio App Structure
To upload our Gradio app, we’ll want to put everything together into a single directory.
For example, our demo might live at the path demos/multilingual_toxic_comment_files with the following structure:
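(A sketch of the layout, inferred from the file descriptions below.)

```
demos/
└── multilingual_toxic_comment_files/
    ├── Multilingual_toxic_comment_classifier/   # saved fine-tuned model (binary files)
    ├── examples/
    │   └── sample_comments.csv
    ├── app.py
    └── requirements.txt
```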
Where:
- Multilingual_toxic_comment_classifier is our saved fine-tuned model (the associated binary files).
- app.py contains our Gradio app, our data preprocessing function and our predict function. Note: app.py is the default filename used for HuggingFace Spaces, if we deploy our apps there.
- examples contains a sample dataframe with toxic & non-toxic comments in Russian, Spanish, English, Italian, Turkish, Portuguese and French to showcase our Gradio application.
- requirements.txt lists the dependencies/packages needed to run our application, such as tensorflow, gradio and transformers.
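As a minimal sketch (unpinned; exact versions are an assumption), requirements.txt might contain:

```
tensorflow
transformers
gradio
```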
Creating a demo folder to store our Multilingual Toxic Comment Classifier App files
To begin, we’ll create an empty directory demos/ that will contain all our necessary files for the application.
We can achieve this using Python’s pathlib.Path("path_of_dir") to establish the directory path and then pathlib.Path("path_of_dir").mkdir() to create it.
```python
## ROOT_DIR: I have put the files in my E: drive
## Importing packages
import shutil
from pathlib import Path
import os

ROOT_DIR = "\\".join(os.getcwd().split("\\")[:2])

## Create the multilingual toxic comment demo path
multilingual_toxic_comment_demo_path = Path(f"{ROOT_DIR}/demos/multilingual_toxic_comment_files")

## Remove files that might already exist and create a new directory
if multilingual_toxic_comment_demo_path.exists():
    shutil.rmtree(multilingual_toxic_comment_demo_path)
    multilingual_toxic_comment_demo_path.mkdir(parents=True,   # make parent folders if needed
                                               exist_ok=True)  # don't error if it already exists
else:
    ## If the path doesn't exist, create it
    multilingual_toxic_comment_demo_path.mkdir(parents=True, exist_ok=True)
```
Creating a folder of example comments to use with our multilingual toxic comment demo
Now we’ll create an empty directory called examples and store a sample dataset containing comments in Russian, Turkish, English, Spanish, Portuguese, French and Italian. I collected these comments online and created a CSV file from them.
To do so we’ll:
Create an empty directory examples/ within the demos/multilingual_toxic_comment_files directory.
Collect some comment samples in these languages from online sources and create a CSV file from them containing both toxic and non-toxic comments.
```python
import pandas as pd
from pathlib import Path

## Create the examples directory
multilingual_toxic_comment_examples_path = multilingual_toxic_comment_demo_path / "examples"
multilingual_toxic_comment_examples_path.mkdir(parents=True, exist_ok=True)

sample_comments = Path("sample_comments.csv")
comments = {"comment_text": [
    "Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.",
    "Страницу обнови, дебил. Это тоже не оскорбление, а доказанный факт - не-дебил про себя во множественном числе писать не будет. Или мы в тебя верим - это ты и твои воображаемые друзья?",
    "В шапке были ссылки на инфу по текущему фильму марвел. Эти ссылки были заменены на фразу Репортим брипидора, игнорируем его посты. Если этого недостаточно, чтобы понять, что модератор абсолютный неадекват, и его нужно лишить полномочий, тогда эта борда пробивает абсолютное дно по неадекватности.",
    "Про графику было обидно) я так то проходил все серии гта со второй части по пятую, кроме гта 4. И мне не мешала графика ни в одной из частей. На компе у меня было куча видеокарт. Начиная с 32мб RIVA TNT и заканчивая 2Гб 560Ti на которой я спокойно играю который год в танки, гта5, ведьмака3 купил на распродаже и начал проходить. Да, не на ультрах. С пониженными текстурами. И не мешает. Я не понимаю дрочева на графике, требовать графику уровня плойки 4 минимум. Мне надо чтобы глаза не резало, только и всего. По поводу управления, мне не хватает переходника на type c. У меня джойстик есть от иксбокса360. Потенциала в мобильных играх достаточно чтобы забить кнопки как забивались в той же NFS MW в 2005. Не самая плохая игра была.",
    "This is such an urgent design problem; kudos to you for taking it on. Very impressive!",
    "haha you guys are a bunch of losers.",
    "ur a sh*tty comment.",
    "Il solito vizio,o moda, della sinistra di andare ad aiutare tutti tranne chi ne ha bisogno in casa nostra. Quanti autorespiratori si sarebbero potuti acquistare con 50 milioni di euro? Bastardi traditori della patria!!",
    "SIAMO ALLA FOLLIA",
    "20px Caro editor, encontramos problemas na edição que fez na página Sertanejo universitário. A edição teve de ser revertida por não ser adequada para a Wikipédia. Se quiser experimentar a edição de páginas pode fazê-lo à vontade na página de testes da Wikipédia. Recomenda-se a leitura das páginas Breve introdução sobre a Wikipédia, O que a Wikipédia não é e Erros comuns na Wikipédia. Obrigado pela compreensão. Vitor Mazuco Msg ",
    "Le contributeur y tente de prouver par l absurde que le commentaire de diff du contributeur x est ridicule en recopiant ce dernier, et supprime sans autre explication un passage apparemment parfaitement consensuel. Qui plus est, le contributeur y ne prend pas la peine de discuter de la précédente contribution du contributeur x , alors que l article a déjà un bandeau d avertissement à ne pas se lancer dans des guerres d édition. Bref, la prochaine fois, je vous bloque pour désorganisation du projet en vue d une argumentation personnelle. L article est déjà assez instable pour que vous n y mêliez pas une guerre d ego - et si vous n aimez pas qu on vous rappelle de ne pas jouer au con , qui n est en rien une insulte, mais la détection d un problème de comportement, n y jouez pas. SammyDay (discuter) "
]}
pd.DataFrame(comments, columns=['comment_text']).to_csv(multilingual_toxic_comment_examples_path / sample_comments, index=False)
```
Now that our example comments are in place, let’s read them back from the demos/multilingual_toxic_comment_files/examples/ directory and format them into a list of lists (to make them compatible with the examples parameter of Gradio’s gradio.Interface()).
```python
example_list = [[example] for example in pd.read_csv(multilingual_toxic_comment_examples_path / sample_comments)['comment_text'].tolist()]
example_list
```
[['Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.'],
['Страницу обнови, дебил. Это тоже не оскорбление, а доказанный факт - не-дебил про себя во множественном числе писать не будет. Или мы в тебя верим - это ты и твои воображаемые друзья?'],
['В шапке были ссылки на инфу по текущему фильму марвел. Эти ссылки были заменены на фразу Репортим брипидора, игнорируем его посты. Если этого недостаточно, чтобы понять, что модератор абсолютный неадекват, и его нужно лишить полномочий, тогда эта борда пробивает абсолютное дно по неадекватности.'],
['Про графику было обидно) я так то проходил все серии гта со второй части по пятую, кроме гта 4. И мне не мешала графика ни в одной из частей. На компе у меня было куча видеокарт. Начиная с 32мб RIVA TNT и заканчивая 2Гб 560Ti на которой я спокойно играю который год в танки, гта5, ведьмака3 купил на распродаже и начал проходить. Да, не на ультрах. С пониженными текстурами. И не мешает. Я не понимаю дрочева на графике, требовать графику уровня плойки 4 минимум. Мне надо чтобы глаза не резало, только и всего. По поводу управления, мне не хватает переходника на type c. У меня джойстик есть от иксбокса360. Потенциала в мобильных играх достаточно чтобы забить кнопки как забивались в той же NFS MW в 2005. Не самая плохая игра была.'],
['This is such an urgent design problem; kudos to you for taking it on. Very impressive!'],
['haha you guys are a bunch of losers.'],
['ur a sh*tty comment.'],
['Il solito vizio,o moda, della sinistra di andare ad aiutare tutti tranne chi ne ha bisogno in casa nostra. Quanti autorespiratori si sarebbero potuti acquistare con 50 milioni di euro? Bastardi traditori della patria!!'],
['SIAMO ALLA FOLLIA'],
['20px Caro editor, encontramos problemas na edição que fez na página Sertanejo universitário. A edição teve de ser revertida por não ser adequada para a Wikipédia. Se quiser experimentar a edição de páginas pode fazê-lo à vontade na página de testes da Wikipédia. Recomenda-se a leitura das páginas Breve introdução sobre a Wikipédia, O que a Wikipédia não é e Erros comuns na Wikipédia. Obrigado pela compreensão. Vitor Mazuco Msg '],
['Le contributeur y tente de prouver par l absurde que le commentaire de diff du contributeur x est ridicule en recopiant ce dernier, et supprime sans autre explication un passage apparemment parfaitement consensuel. Qui plus est, le contributeur y ne prend pas la peine de discuter de la précédente contribution du contributeur x , alors que l article a déjà un bandeau d avertissement à ne pas se lancer dans des guerres d édition. Bref, la prochaine fois, je vous bloque pour désorganisation du projet en vue d une argumentation personnelle. L article est déjà assez instable pour que vous n y mêliez pas une guerre d ego - et si vous n aimez pas qu on vous rappelle de ne pas jouer au con , qui n est en rien une insulte, mais la détection d un problème de comportement, n y jouez pas. SammyDay (discuter) ']]
Moving our trained XLM-Roberta model binary files into our multilingual_toxic_comment_files demo directory.
We saved our fine-tuned model in the output/working/Multilingual_toxic_comment_classifier/ directory, and we’ll move the model files into the demos/multilingual_toxic_comment_files/ directory as specified above.
We use Python’s shutil.move() method, passing in the src (the source path of the target file) and dst (the destination folder the target file should be moved into) parameters.
```python
## Importing libraries
import shutil

## Source path of our trained model
multilingual_toxic_comment_model_dir_path = f"{ROOT_DIR}\\output\\working\\Multilingual_toxic_comment_classifier\\"

## Destination path for our trained model
multilingual_toxic_comment_model_dir_destination = multilingual_toxic_comment_demo_path

## Try to move the model directory
try:
    print(f"Attempting to move the {multilingual_toxic_comment_model_dir_path} to {multilingual_toxic_comment_model_dir_destination}")
    ## Move the model
    shutil.move(src=multilingual_toxic_comment_model_dir_path,
                dst=multilingual_toxic_comment_model_dir_destination)
    print("Model move completed")
## If the model has already been moved, check whether it exists at the destination
except:
    print(f"No model found at {multilingual_toxic_comment_model_dir_path}, perhaps it's already moved.")
    print(f"Model already exists at {multilingual_toxic_comment_model_dir_destination}: {multilingual_toxic_comment_model_dir_destination.exists()}")
```
Attempting to move the E:\MultiLingual-Toxic-Comment-Classification\output\working\Multilingual_toxic_comment_classifier\ to E:\MultiLingual-Toxic-Comment-Classification\demos\multilingual_toxic_comment_files
Model move completed
Turning our Gradio App into a Python Script (app.py)
```python
## Check which directory we are currently in
import os
os.getcwd()
```
Change into the multilingual_toxic_comment_files directory (cd multilingual_toxic_comment_files).
Create an environment (python3 -m venv env) or use (python -m venv env).
Activate the environment (source env/Scripts/activate).
Install the requirements using pip install -r requirements.txt. > If you face any errors, you might need to upgrade pip using pip install --upgrade pip.
Run the app (python3 app.py).
This should result in a Gradio demo running locally at a URL such as: http://127.0.0.1:7860/.
Uploading to Hugging Face
We’ve verified our multilingual toxic comment classification application is working on our local system.
To upload our application to Hugging Face Spaces, we need to do the following.
Start a new Hugging Face Space by going to our profile at the top right corner and then selecting New Space.
Give the Space a name, like Chirag1994/multilingual_toxic_comment_classification_app.
Select a license (I am using MIT license).
Select Gradio as the Space SDK (software development kit).
Choose whether your Space is Public or Private (I am keeping it Public).
Click Create Space.
Clone the repository locally by running git clone https://huggingface.co/spaces/[YOUR_USERNAME]/[YOUR_SPACE_NAME] in the terminal or command prompt. In my case: git clone https://huggingface.co/spaces/Chirag1994/multilingual_toxic_comment_classification_app.
Copy/Move the contents of the downloaded multilingual_toxic_comment_classification_app folder to the cloned repo folder.
To upload and track larger files (e.g., files greater than 10MB), we need to install Git LFS, which stands for Git Large File Storage.
Open the cloned directory in VS Code (I’m using VS Code) and use the terminal (Git Bash in my case); after installing Git LFS, run git lfs install and then git lfs track to start tracking the files we want tracked, for example the files in the Multilingual_toxic_comment_classifier directory (see the commands below).
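For example (the track pattern here is an assumption; adjust it to your file layout):

git lfs install
git lfs track "Multilingual_toxic_comment_classifier/**"
git add .gitattributes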
Create a new .gitignore file listing the files & folders that we don’t want git to track, like:
__pycache__/
.vscode/
venv/
.gitignore
.gitattributes
Add the rest of the files and commit them with:
git add .
git commit -m "commit message that you want"
Push (upload) the files to Hugging Face:
git push
It might take a couple of minutes to finish, and then the app will be up and running.
Our Final Application deployed on HuggingFace Spaces
```python
# IPython is a library to help make Python interactive
from IPython.display import IFrame

# Embed the Multilingual Toxic Comment Classifier Gradio demo
IFrame(src="https://chirag1994-multilingual-toxic-comment-classifier.hf.space", width=1000, height=800)
```