Multilingual Toxic Comment Classification using TensorFlow and TPUs
In this post, we’ll create an application that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion. If these toxic contributions can be identified, we could have a safer, more collaborative internet.
The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voices in conversation. In 2020, Jigsaw organized a competition on Kaggle where competitors had to build machine learning models that can identify exactly this kind of toxicity in online conversations.
Dataset Description
As part of the competition, competitors were provided several files, specifically:
training files (read in the code below via toxic_comment_train_csv_path and unintended_bias_train_csv_path) - English comments with toxicity labels, taken from Jigsaw's two previous toxic comment competitions.
validation.csv - comments from Wikipedia talk pages in different non-English languages.
test.csv - comments from Wikipedia talk pages in different non-English languages.
Evaluation Metric
Submissions were evaluated based on Area Under the ROC Curve between the predicted probability and the observed target.
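For illustration, here is how this metric can be computed with scikit-learn on a toy set of labels and predicted probabilities (this is not competition code, just a sketch of the metric):

```python
# ROC AUC between predicted probabilities and observed binary targets
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]           # observed labels (1 = toxic)
y_pred = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities
print(roc_auc_score(y_true, y_pred))  # 0.75
```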
Strategies to Tackle
Monolingual Approach
Monolingual models are language models trained on a single language.
They focus on understanding and generating text in that one language.
For example, a monolingual model trained on English will be proficient in understanding and generating English text.
These models are typically used for tasks such as text classification, sentiment analysis and more within a specific language.
Monolingual models are beneficial when our training and testing datasets, as well as the upcoming unseen data, are all in one specific language.
Multilingual Approach
Multilingual models, on the other hand, are trained on multiple languages.
They are designed to handle and process text in many languages, allowing them to perform across different languages.
Multilingual models have the advantage of being able to provide language-agnostic solutions, as they can handle a wide range of languages.
They can be used for zero-shot and few-shot learning, where the model performs a task in a language it has not been specifically trained on by leveraging its knowledge of other languages.
Which models to use for our problem?
As per the dataset given in the competition, we have only English data in the training dataset; the validation dataset contains only a few samples, covering just Spanish, Turkish and Italian; and the test dataset contains Turkish, Spanish, Italian, Russian, French and Portuguese.
Since our validation and test datasets contain non-English languages, it is a better approach to build multilingual models rather than monolingual ones.
If we had only one language (as stated above), building monolingual models would be the better choice.
Let’s discuss the multilingual modelling approach a bit more:
How are multilingual models trained?
Multilingual models are pre-trained on a mix of different languages, and they don’t distinguish between them.
English BERT was pre-trained on English Wikipedia and the BookCorpus dataset, while multilingual models like mBERT were pre-trained on Wikipedia in 102 different languages, and XLM-Roberta was pre-trained on CommonCrawl data in 100 different languages.
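To make this concrete, here is a small illustration (not from the competition notebook) showing that a single multilingual tokenizer handles text in many languages with one shared subword vocabulary; I use the smaller xlm-roberta-base here to keep the download light:

```python
from transformers import AutoTokenizer

# One shared subword vocabulary covers all pre-training languages
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tok.tokenize("This comment is toxic."))       # English
print(tok.tokenize("Ce commentaire est toxique."))  # French
print(tok.tokenize("Bu yorum çok kaba."))           # Turkish
```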
Cross Lingual Transfer
Cross-lingual transfer refers to transfer learning that uses data and models from a language with ample resources (e.g., English) to solve tasks in another, commonly lower-resource, language.
In our case, we are trying to create an application that can automatically detect whether a sentence or phrase is toxic or not.
Models like XLM-Roberta give us the ability to fine-tune on an English dataset and then identify toxic comments in other languages.
XLM-R is able to take the knowledge learnt in one language and apply it to other languages, even though it never saw those languages during fine-tuning.
This concept of transfer learning applied from one language to another is referred to as Cross-Lingual Transfer (also known as Zero-Shot Learning).
Another reason to use pre-trained multilingual models for a task like ours is the scarcity of resources for many languages: different languages have very different amounts of training data available for building models like BERT and its variants.
Languages like English, Chinese, Russian, Indonesian and Vietnamese are high-resource languages, whereas languages like Sundanese and Assamese are low-resource ones.
Training our own BERT-like model for these low-resource languages would be very expensive in terms of data collection, and performance would likely suffer; therefore, we should leverage these multilingual models.
What experiments did I perform?
Overall, I performed 9 experiments with the following ideas in mind:
Preprocessing: techniques like stopword removal, URL removal, expansion of contractions, collapsing repeated characters within words, and punctuation removal (a sketch follows this list).
Models: we experimented with 2 models, mBERT & XLM-Roberta.
Training data: we used 2 variants of the dataset: one where we used the provided training datasets, balanced by the target, and another where we used the training data translated into the languages present in the test set (along with the original English), again with class balancing.
We also always trained on the validation dataset to further boost the model’s performance.
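As a rough sketch (not the exact code from my experiments, and assuming the nltk stopwords corpus is downloaded), the preprocessing steps above could look like this:

```python
import re
import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))
# Small illustrative contraction map; a real one would be much larger
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot"}

def preprocess(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # remove URLs
    for short, expanded in CONTRACTIONS.items():        # expand contractions
        text = text.replace(short, expanded)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)          # collapse repeated characters ("sooo" -> "soo")
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    return ' '.join(w for w in text.split() if w not in STOPWORDS)    # drop stopwords

print(preprocess("I'm sooo annoyed!!! See https://example.com"))
```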
Now we build a model with the following ideas: training on the original training & validation datasets, class balancing (undersampling), fine-tuning the model for 2 epochs each on the training and the validation dataset, and no preprocessing of the data. We will be leveraging the TPUs offered by Kaggle.
Collecting nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 19.0 MB/s eta 0:00:00
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/site-packages (from nltk) (4.65.0)
Requirement already satisfied: click in /usr/local/lib/python3.8/site-packages (from nltk) (8.1.3)
Requirement already satisfied: joblib in /usr/local/lib/python3.8/site-packages (from nltk) (1.2.0)
Collecting regex>=2021.8.3
Downloading regex-2023.5.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 771.9/771.9 KB 38.4 MB/s eta 0:00:00
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.5.5
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
Initializing the TPU configuration and other constants: number of epochs, batch size (16 × the number of cores offered on the TPU), MAX_LEN (maximum sentence length), the xlm-roberta-large model, number of non-toxic samples to keep (for undersampling) = 150k, learning rate = 1e-5, etc.
```python
#################### TPU Configurations ####################
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

AUTO = tf.data.experimental.AUTOTUNE

# Configuration
EPOCHS = 2
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
MODEL = 'xlm-roberta-large'
NUM_SAMPLES = 150000
RANDOM_STATE = 42
LEARNING_RATE = 1e-5
######################### MAIN CHANGE ############################
WEIGHT_DECAY = 1e-6
```
Running on TPU
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: local
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
REPLICAS: 8
Reading & Balancing the data by Target column
```python
## Reading csv files
train1 = pd.read_csv(toxic_comment_train_csv_path)
train2 = pd.read_csv(unintended_bias_train_csv_path)
valid = pd.read_csv(validation_csv_path)
test = pd.read_csv(test_csv_path)
sub = pd.read_csv(submission_csv_path)

## Converting floating-point toxicity scores to integer labels
train2.toxic = train2['toxic'].round().astype(int)

##### BALANCING THE DATA #####
# Taking all the data from the toxic comment train file, all toxic observations
# from the unintended bias train file, and randomly sampling 150k observations
# from the non-toxic observation population.
train = pd.concat([
    train1[['comment_text', 'toxic']],
    train2[['comment_text', 'toxic']].query('toxic==1'),
    train2[['comment_text', 'toxic']].query('toxic==0').sample(n=NUM_SAMPLES, random_state=RANDOM_STATE)
])

## Dropping missing observations in the comment_text column
train = train.dropna(subset=['comment_text'])
```
```python
def encode(texts, tokenizer, max_len):
    """
    Tokenizes a list of texts with a Hugging Face tokenizer,
    truncating/padding every sequence to max_len tokens.
    """
    tokens = tokenizer(texts,
                       max_length=max_len,
                       truncation=True,
                       padding='max_length',
                       add_special_tokens=True,
                       return_tensors='np')
    return tokens
```
Encoding comment_text
We first initialize the tokenizer from the Hugging Face transformers library and encode the comment_text column of our training, validation and test datasets.
Next, we write a function to create tuples of inputs and outputs, where the inputs are a dictionary.
We’ll be leveraging the tf.data.Dataset API to pass our inputs and outputs as tuples, i.e., (inputs, outputs), where inputs is {"input_ids": input_ids, "attention_mask": attention_mask} and outputs are the labels.
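Here is a minimal sketch of what that function might look like (the helper name and column names are illustrative; the token dictionaries come from encode() above):

```python
def to_dataset(tokens, labels=None, batch_size=BATCH_SIZE, training=False):
    # Inputs dictionary matching the model's named Input layers
    inputs = {"input_ids": tokens["input_ids"],
              "attention_mask": tokens["attention_mask"]}
    if labels is not None:
        ds = tf.data.Dataset.from_tensor_slices((inputs, labels))
    else:
        ds = tf.data.Dataset.from_tensor_slices(inputs)
    if training:
        ds = ds.repeat().shuffle(2048)
    return ds.batch(batch_size).prefetch(AUTO)

train_dataset = to_dataset(encode(train.comment_text.tolist(), tokenizer, MAX_LEN), train.toxic.values, training=True)
valid_dataset = to_dataset(encode(valid.comment_text.tolist(), tokenizer, MAX_LEN), valid.toxic.values)
test_dataset = to_dataset(encode(test.content.tolist(), tokenizer, MAX_LEN))
```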
```python
def build_model(transformer_layer, max_len):
    """
    Builds the input layers and classification head, then compiles the model.
    Returns: the compiled model object.
    """
    input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    # [1] selects the pooled output of the transformer
    output = transformer_layer.roberta(input_ids, attention_mask=attention_mask)[1]
    x = tf.keras.layers.Dense(1024, activation='relu')(output)
    y = tf.keras.layers.Dense(1, activation='sigmoid', name='outputs')(x)
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=y)
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    loss = tf.keras.losses.BinaryCrossentropy()
    AUC = tf.keras.metrics.AUC()
    model.compile(loss=loss, metrics=[AUC], optimizer=optimizer)
    return model
```
Loading model on TPUs
It is important to initialize & compile the model inside the with strategy.scope().
One thing I want to point out: for some reason I was getting slightly different results on each run, even though I was setting the seed before initializing the model; the differences are very small every time we run the pipeline, so the results are broadly consistent.
```python
with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    tf.random.set_seed(RANDOM_STATE)
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
```
Downloading tf_model.h5: 100%|██████████| 2.24G/2.24G [00:46<00:00, 48.0MB/s]
All model checkpoint layers were used when initializing TFXLMRobertaModel.
All the layers of TFXLMRobertaModel were initialized from the model checkpoint at xlm-roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaModel for predictions without further training.
/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/indexed_slices.py:459: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 256002048 elements. This may consume a large amount of memory.
warnings.warn(
2023-05-08 07:28:46.072958: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add_790/ReadVariableOp.
2023-05-08 07:28:48.332211: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add_790/ReadVariableOp.
2023-05-08 07:53:43.374173: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add/ReadVariableOp.
2023-05-08 07:53:43.934692: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Add/ReadVariableOp.
Further fine-tuning the model on the validation data for 2 epochs
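As a sketch (dataset and variable names assumed from the earlier cells), the two-stage fine-tuning looks roughly like this:

```python
# Stage 1: fine-tune on the balanced English training data
n_steps = train.shape[0] // BATCH_SIZE
model.fit(train_dataset, steps_per_epoch=n_steps, epochs=EPOCHS)

# Stage 2: fine-tune further on the multilingual validation data
n_valid_steps = valid.shape[0] // BATCH_SIZE
model.fit(valid_dataset.repeat(), steps_per_epoch=n_valid_steps, epochs=EPOCHS)
```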
Public leaderboard score on Kaggle (test dataset): 0.936; private leaderboard score: 0.9346.
Results
| Experiment | Public Test LeaderBoard Score | Private Test LeaderBoard Score |
| --- | --- | --- |
| 1 (mBERT + no preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.8850 | 0.8869 |
| 2 (xlm-roberta-large + no preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.9259 | 0.9264 |
| 3 (mBERT + preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.8259 | 0.8239 |
| 4 (xlm-roberta-large + preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 2e-5) | 0.8755 | 0.8754 |
| 5 (mBERT + no preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.9195 | 0.9212 |
| 6 (xlm-roberta-large + no preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.9329 | 0.9212 |
| 7 (mBERT + preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.8696 | 0.9212 |
| 8 (xlm-roberta-large + preprocessing + BCE loss + fine-tuned on the training data translated into the test-set languages (along with the original English) and the validation dataset for 2 epochs each + learning rate = 1e-5) | 0.8861 | 0.8866 |
| 9 (xlm-roberta-large + no preprocessing + BCE loss + fine-tuned on the original training and validation datasets for 2 epochs each + learning rate = 1e-5) | 0.936 | 0.9346 |
Saving the fine-tuned model for later use writes the model assets to disk:

WARNING:absl:Found untraced functions such as _update_step_xla, encoder_layer_call_fn, encoder_layer_call_and_return_conditional_losses, pooler_layer_call_fn, pooler_layer_call_and_return_conditional_losses while saving (showing 5 of 829). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: ../working/Multilingual_toxic_comment_classifier/assets
Loading the model
```python
import tensorflow as tf

model_save_path = "../working/Multilingual_toxic_comment_classifier"
loaded_model = tf.keras.models.load_model(model_save_path)
y = loaded_model.predict(test_dataset.take(1))
y[:6]
```
Writing a function to prepare new text for the model: we encode the text using the tokenizer with a maximum sentence length of 192.
```python
from transformers import AutoTokenizer

tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")
text = "politicians are like cancer for this country"

def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(text,
                       max_length=max_len,
                       truncation=True,
                       padding='max_length',
                       add_special_tokens=True,
                       return_tensors='tf')
    return {"input_ids": tokens['input_ids'],
            "attention_mask": tokens['attention_mask']}
```
Predicting the probability of toxic and non-toxic on a sample text.
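A minimal sketch, using the prep_data helper and the sample sentence defined above:

```python
# Probability of toxicity for the sample text
prob_toxic = loaded_model.predict(prep_data(text=text, tokenizer=tokenizer_, max_len=192))[0][0]
print(f"Probability of toxic: {prob_toxic:.3f}, non-toxic: {1 - prob_toxic:.3f}")
```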
Testing the model with a Gradio app before finally pushing the model to HuggingFace Spaces
```python
!pip3 install gradio --quiet

import tensorflow as tf
import gradio as gr
from transformers import AutoTokenizer

loaded_model = tf.keras.models.load_model(model_save_path)
tokenizer_ = AutoTokenizer.from_pretrained("xlm-roberta-large")

examples_list = [
    "politicians are like cancer for this country",
    "Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было,",
    "Для каких стан является эталоном современная система здравоохранения РФ? Для Зимбабве? Ты тупой? хох",
]

def prep_data(text, tokenizer, max_len=192):
    tokens = tokenizer(text,
                       max_length=max_len,
                       truncation=True,
                       padding='max_length',
                       add_special_tokens=True,
                       return_tensors='tf')
    return {"input_ids": tokens['input_ids'],
            "attention_mask": tokens['attention_mask']}

def predict(text):
    prob_of_toxic_comment = loaded_model.predict(prep_data(text=text, tokenizer=tokenizer_, max_len=192))[0][0]
    prob_of_non_toxic_comment = 1 - prob_of_toxic_comment
    probs = {"prob_of_toxic_comment": float(prob_of_toxic_comment),
             "prob_of_non_toxic_comment": float(prob_of_non_toxic_comment)}
    return probs

interface = gr.Interface(fn=predict,
                         inputs=gr.components.Textbox(lines=4, label='Comment'),
                         outputs=[gr.Label(label='Probabilities')],
                         examples=examples_list,
                         title='Multi-Lingual Toxic Comment Classification.',
                         description='XLM-Roberta Large model')
interface.launch(debug=False, share=True)
```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Running on local URL: http://127.0.0.1:7865
Running on public URL: https://af370decb4339b429e.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
1/1 [==============================] - 8s 8s/step
1/1 [==============================] - 1s 530ms/step
1/1 [==============================] - 1s 513ms/step
Wow!!!
Our application is up and running. This link is only temporary and remains live for only 72 hours; for permanent hosting, we can upload our Gradio app interface to HuggingFace Spaces.
Now, manually download all the files and folders from the Kaggle output of this kernel to your local machine.
Turning our Multi-Lingual Toxic Comment Classification Gradio Demo into a deployable app
We’ll deploy the demo application on HuggingFace Spaces.
What is HuggingFace Spaces?
It is a resource that allows anybody to host and share machine learning applications.
Deployed Gradio App Structure
To upload our Gradio app, we’ll want to put everything together into a single directory.
For example, our demo might live at the path demos/multilingual_toxic_comment_files with the following structure:
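(A sketch of the layout, inferred from the file descriptions below.)

```
demos/
└── multilingual_toxic_comment_files/
    ├── Multilingual_toxic_comment_classifier/   # saved fine-tuned model (binary files)
    ├── examples/
    │   └── sample_comments.csv
    ├── app.py
    └── requirements.txt
```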
Where:
- Multilingual_toxic_comment_classifier is our saved fine-tuned model (the associated binary files).
- app.py contains our Gradio app, our data preprocessing function and our predict function. Note: app.py is the default filename used for HuggingFace Spaces, if we deploy our apps there.
- examples contains a sample dataframe with toxic & non-toxic comments in Russian, Spanish, English, Italian, Turkish, Portuguese and French to showcase our Gradio application.
- requirements.txt lists the dependencies/packages needed to run our application, such as tensorflow, gradio and transformers.
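As a minimal sketch (unpinned; exact versions are an assumption), requirements.txt might contain:

```
tensorflow
transformers
gradio
```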
Creating a demo folder to store our Multilingual Toxic Comment Classifier App files
To begin, we’ll create an empty directory demos/ that will contain all our necessary files for the application.
We can achieve this using Python’s pathlib.Path("path_of_dir") to establish the directory path and then pathlib.Path("path_of_dir").mkdir() to create it.
```python
## ROOT_DIR: I have put the files in my E: drive
## Importing packages
import shutil
from pathlib import Path
import os

ROOT_DIR = "\\".join(os.getcwd().split("\\")[:2])

## Create the multilingual toxic comment demo path
multilingual_toxic_comment_demo_path = Path(f"{ROOT_DIR}/demos/multilingual_toxic_comment_files")

## Remove files that might already exist and create a new directory
if multilingual_toxic_comment_demo_path.exists():
    shutil.rmtree(multilingual_toxic_comment_demo_path)
    multilingual_toxic_comment_demo_path.mkdir(parents=True,   # make parent folders if needed
                                               exist_ok=True)  # don't error if it already exists
else:
    ## If the path doesn't exist, create it
    multilingual_toxic_comment_demo_path.mkdir(parents=True, exist_ok=True)
```
Creating a folder of example comments to use with our multilingual toxic comment demo
Now we’ll create an empty directory called examples and store a sample dataset containing comments in Russian, Turkish, English, Spanish, Portuguese, French and Italian. I collected these comments online and created a CSV file from them.
To do so we’ll:
Create an empty directory examples/ within the demos/multilingual_toxic_comment_files directory.
Collect some comment samples in these languages from online sources and create a CSV file from them containing both toxic and non-toxic comments.
```python
import pandas as pd
from pathlib import Path

## Create the examples directory
multilingual_toxic_comment_examples_path = multilingual_toxic_comment_demo_path / "examples"
multilingual_toxic_comment_examples_path.mkdir(parents=True, exist_ok=True)

sample_comments = Path("sample_comments.csv")
comments = {"comment_text": [
    "Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.",
    "Страницу обнови, дебил. Это тоже не оскорбление, а доказанный факт - не-дебил про себя во множественном числе писать не будет. Или мы в тебя верим - это ты и твои воображаемые друзья?",
    "В шапке были ссылки на инфу по текущему фильму марвел. Эти ссылки были заменены на фразу Репортим брипидора, игнорируем его посты. Если этого недостаточно, чтобы понять, что модератор абсолютный неадекват, и его нужно лишить полномочий, тогда эта борда пробивает абсолютное дно по неадекватности.",
    "Про графику было обидно) я так то проходил все серии гта со второй части по пятую, кроме гта 4. И мне не мешала графика ни в одной из частей. На компе у меня было куча видеокарт. Начиная с 32мб RIVA TNT и заканчивая 2Гб 560Ti на которой я спокойно играю который год в танки, гта5, ведьмака3 купил на распродаже и начал проходить. Да, не на ультрах. С пониженными текстурами. И не мешает. Я не понимаю дрочева на графике, требовать графику уровня плойки 4 минимум. Мне надо чтобы глаза не резало, только и всего. По поводу управления, мне не хватает переходника на type c. У меня джойстик есть от иксбокса360. Потенциала в мобильных играх достаточно чтобы забить кнопки как забивались в той же NFS MW в 2005. Не самая плохая игра была.",
    "This is such an urgent design problem; kudos to you for taking it on. Very impressive!",
    "haha you guys are a bunch of losers.",
    "ur a sh*tty comment.",
    "Il solito vizio,o moda, della sinistra di andare ad aiutare tutti tranne chi ne ha bisogno in casa nostra. Quanti autorespiratori si sarebbero potuti acquistare con 50 milioni di euro? Bastardi traditori della patria!!",
    "SIAMO ALLA FOLLIA",
    "20px Caro editor, encontramos problemas na edição que fez na página Sertanejo universitário. A edição teve de ser revertida por não ser adequada para a Wikipédia. Se quiser experimentar a edição de páginas pode fazê-lo à vontade na página de testes da Wikipédia. Recomenda-se a leitura das páginas Breve introdução sobre a Wikipédia, O que a Wikipédia não é e Erros comuns na Wikipédia. Obrigado pela compreensão. Vitor Mazuco Msg ",
    "Le contributeur y tente de prouver par l absurde que le commentaire de diff du contributeur x est ridicule en recopiant ce dernier, et supprime sans autre explication un passage apparemment parfaitement consensuel. Qui plus est, le contributeur y ne prend pas la peine de discuter de la précédente contribution du contributeur x , alors que l article a déjà un bandeau d avertissement à ne pas se lancer dans des guerres d édition. Bref, la prochaine fois, je vous bloque pour désorganisation du projet en vue d une argumentation personnelle. L article est déjà assez instable pour que vous n y mêliez pas une guerre d ego - et si vous n aimez pas qu on vous rappelle de ne pas jouer au con , qui n est en rien une insulte, mais la détection d un problème de comportement, n y jouez pas. SammyDay (discuter) "
]}
pd.DataFrame(comments, columns=['comment_text']).to_csv(multilingual_toxic_comment_examples_path / sample_comments, index=False)
```
Now that our example comments are in place, let’s read them back from the demos/multilingual_toxic_comment_files/examples/ directory and format them into a list of lists (to make them compatible with the examples parameter of Gradio’s gradio.Interface()).
```python
example_list = [[example] for example in pd.read_csv(multilingual_toxic_comment_examples_path / sample_comments)['comment_text'].tolist()]
example_list
```
[['Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.'],
['Страницу обнови, дебил. Это тоже не оскорбление, а доказанный факт - не-дебил про себя во множественном числе писать не будет. Или мы в тебя верим - это ты и твои воображаемые друзья?'],
['В шапке были ссылки на инфу по текущему фильму марвел. Эти ссылки были заменены на фразу Репортим брипидора, игнорируем его посты. Если этого недостаточно, чтобы понять, что модератор абсолютный неадекват, и его нужно лишить полномочий, тогда эта борда пробивает абсолютное дно по неадекватности.'],
['Про графику было обидно) я так то проходил все серии гта со второй части по пятую, кроме гта 4. И мне не мешала графика ни в одной из частей. На компе у меня было куча видеокарт. Начиная с 32мб RIVA TNT и заканчивая 2Гб 560Ti на которой я спокойно играю который год в танки, гта5, ведьмака3 купил на распродаже и начал проходить. Да, не на ультрах. С пониженными текстурами. И не мешает. Я не понимаю дрочева на графике, требовать графику уровня плойки 4 минимум. Мне надо чтобы глаза не резало, только и всего. По поводу управления, мне не хватает переходника на type c. У меня джойстик есть от иксбокса360. Потенциала в мобильных играх достаточно чтобы забить кнопки как забивались в той же NFS MW в 2005. Не самая плохая игра была.'],
['This is such an urgent design problem; kudos to you for taking it on. Very impressive!'],
['haha you guys are a bunch of losers.'],
['ur a sh*tty comment.'],
['Il solito vizio,o moda, della sinistra di andare ad aiutare tutti tranne chi ne ha bisogno in casa nostra. Quanti autorespiratori si sarebbero potuti acquistare con 50 milioni di euro? Bastardi traditori della patria!!'],
['SIAMO ALLA FOLLIA'],
['20px Caro editor, encontramos problemas na edição que fez na página Sertanejo universitário. A edição teve de ser revertida por não ser adequada para a Wikipédia. Se quiser experimentar a edição de páginas pode fazê-lo à vontade na página de testes da Wikipédia. Recomenda-se a leitura das páginas Breve introdução sobre a Wikipédia, O que a Wikipédia não é e Erros comuns na Wikipédia. Obrigado pela compreensão. Vitor Mazuco Msg '],
['Le contributeur y tente de prouver par l absurde que le commentaire de diff du contributeur x est ridicule en recopiant ce dernier, et supprime sans autre explication un passage apparemment parfaitement consensuel. Qui plus est, le contributeur y ne prend pas la peine de discuter de la précédente contribution du contributeur x , alors que l article a déjà un bandeau d avertissement à ne pas se lancer dans des guerres d édition. Bref, la prochaine fois, je vous bloque pour désorganisation du projet en vue d une argumentation personnelle. L article est déjà assez instable pour que vous n y mêliez pas une guerre d ego - et si vous n aimez pas qu on vous rappelle de ne pas jouer au con , qui n est en rien une insulte, mais la détection d un problème de comportement, n y jouez pas. SammyDay (discuter) ']]
Moving our trained XLM-Roberta model binary files into our multilingual_toxic_comment_files demo directory.
We saved our fine-tuned model in the output/working/Multilingual_toxic_comment_classifier/ directory, and we’ll move the model files into the demos/multilingual_toxic_comment_files/ directory as specified above.
We use Python’s shutil.move() method, passing in the src (the source path of the target file) and dst (the destination folder the target file should be moved into) parameters.
```python
## Importing libraries
import shutil

## Source path of our trained model
multilingual_toxic_comment_model_dir_path = f"{ROOT_DIR}\\output\\working\\Multilingual_toxic_comment_classifier\\"

## Destination path for our trained model
multilingual_toxic_comment_model_dir_destination = multilingual_toxic_comment_demo_path

## Try to move the model directory
try:
    print(f"Attempting to move the {multilingual_toxic_comment_model_dir_path} to {multilingual_toxic_comment_model_dir_destination}")
    ## Move the model
    shutil.move(src=multilingual_toxic_comment_model_dir_path,
                dst=multilingual_toxic_comment_model_dir_destination)
    print("Model move completed")
## If the model has already been moved, check whether it exists at the destination
except:
    print(f"No model found at {multilingual_toxic_comment_model_dir_path}, perhaps it's already moved.")
    print(f"Model already exists at {multilingual_toxic_comment_model_dir_destination}: {multilingual_toxic_comment_model_dir_destination.exists()}")
```
Attempting to move the E:\MultiLingual-Toxic-Comment-Classification\output\working\Multilingual_toxic_comment_classifier\ to E:\MultiLingual-Toxic-Comment-Classification\demos\multilingual_toxic_comment_files
Model move completed
Turning our Gradio App into a Python Script (app.py)
```python
## Check which directory we are currently in
import os
os.getcwd()
```
Change into the multilingual_toxic_comment_files directory (cd multilingual_toxic_comment_files).
Create an environment (python3 -m venv env) or use (python -m venv env).
Activate the environment (source env/Scripts/activate).
Install the requirements using pip install -r requirements.txt. > If you face any errors, you might need to upgrade pip using pip install --upgrade pip.
Run the app (python3 app.py).
This should result in a Gradio demo running locally at a URL such as: http://127.0.0.1:7860/.
Uploading to Hugging Face
We’ve verified our multilingual toxic comment classification application is working on our local system.
To upload our application to Hugging Face Spaces, we need to do the following.
Start a new Hugging Face Space by going to our profile at the top right corner and then selecting New Space.
Give the Space a name, like Chirag1994/multilingual_toxic_comment_classification_app.
Select a license (I am using MIT license).
Select Gradio as the Space SDK (software development kit).
Choose whether your Space is Public or Private (I am keeping it Public).
Click Create Space.
Clone the repository locally by running git clone https://huggingface.co/spaces/[YOUR_USERNAME]/[YOUR_SPACE_NAME] in the terminal or command prompt. In my case: git clone https://huggingface.co/spaces/Chirag1994/multilingual_toxic_comment_classification_app.
Copy/Move the contents of the downloaded multilingual_toxic_comment_classification_app folder to the cloned repo folder.
To upload and track larger files (e.g., files greater than 10MB), we need to install Git LFS, which stands for Git Large File Storage.
Open the cloned directory in VS Code (I’m using VS Code) and use the terminal (Git Bash in my case); after installing Git LFS, run git lfs install and then git lfs track to start tracking the files we want tracked, for example the files in the Multilingual_toxic_comment_classifier directory (see the commands below).
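For example (the track pattern here is an assumption; adjust it to your file layout):

git lfs install
git lfs track "Multilingual_toxic_comment_classifier/**"
git add .gitattributes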
Create a new .gitignore file listing the files & folders that we don’t want git to track, like:
__pycache__/
.vscode/
venv/
.gitignore
.gitattributes
Add the rest of the files and commit them with:
git add .
git commit -m "commit message that you want"
Push (upload) the files to Hugging Face:
git push
It might take a couple of minutes to finish, and then the app will be up and running.
Our Final Application deployed on HuggingFace Spaces
```python
# IPython is a library to help make Python interactive
from IPython.display import IFrame

# Embed the Multilingual Toxic Comment Classifier Gradio demo
IFrame(src="https://chirag1994-multilingual-toxic-comment-classifier.hf.space", width=1000, height=800)
```