Building a Multi-Label Classification Model with BERT
Steps to create a multi-label classification model using BERT and the Hugging Face Transformers library:

1. Load the Data: Read the CSV file containing text snippets and their corresponding labels.
2. Parse Labels: Convert the comma-separated label strings into Python lists.
3. Prepare Labels: Use MultiLabelBinarizer to transform the label lists into multi-hot vectors.
4. Train/Test Split: Split the data into training and validation sets.
5. Tokenize Text: Use a BERT tokenizer to preprocess the text snippets.
6. Create Dataset: Define a custom Dataset class for the tokenized data and labels.
7. Load Model: Initialize a BERT model for sequence classification with multi-label support.
8. Training Arguments: Set up training parameters such as batch size, learning rate, and number of epochs.
9. Train the Model: Use the Hugging Face Trainer to train the model.
10. Save the Model: Save the trained model and tokenizer for future use.

Example Code

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Load and preprocess data
df = pd.read_csv("data/calls_with_context_ALL_2025-03-07T01-09-11-931Z_labeled.csv")
df["labels"] = df["labels"].fillna("").apply(
    lambda x: [lbl.strip() for lbl in x.split(",") if lbl.strip() != ""]
)

all_labels = [
    "cancel appointment", "collect patient info", "collect medicaid info",
    "collect insurance info", "confirm appointment", "general question",
    "intro/outro", "question about patient's chart", "reschedule appointment",
    "running late", "schedule appointment", "taking a message",
]

# Transform label lists into multi-hot vectors
mlb = MultiLabelBinarizer(classes=all_labels)
label_matrix = mlb.fit_transform(df["labels"])

# Train/test split
train_df, val_df, train_labels, val_labels = train_test_split(
    df, label_matrix, test_size=0.1, random_state=42
)

# Tokenize text
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(list(train_df["snippetText"]), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(list(val_df["snippetText"]), truncation=True, padding=True, max_length=128)

# Create dataset
class IntentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Multi-label targets must be floats for BCEWithLogitsLoss
        item["labels"] = torch.tensor(self.labels[idx]).float()
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IntentDataset(train_encodings, train_labels)
val_dataset = IntentDataset(val_encodings, val_labels)

# Load model and set training arguments;
# problem_type="multi_label_classification" makes the model use BCEWithLogitsLoss
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(all_labels),
    problem_type="multi_label_classification",
)
training_args = TrainingArguments(
    output_dir="./multi_intent_model",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    save_strategy="epoch",
)

# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

# Save the model
trainer.save_model("./multi_intent_model")
tokenizer.save_pretrained("./multi_intent_model")
print("Training complete!")
```

Important

Ensure you have the necessary libraries installed: transformers, torch, pandas, scikit-learn.
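With load_best_model_at_end=True and no metric specified, the Trainer selects the best checkpoint by evaluation loss. If you would rather track a task metric, you can pass a compute_metrics function when constructing the Trainer. The sketch below is a minimal example, not part of the original script: it reports micro-averaged F1, and the 0.5 sigmoid threshold is an assumed starting point you would tune on validation data.

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    # eval_pred carries raw logits and the multi-hot ground-truth labels
    probs = 1 / (1 + np.exp(-eval_pred.predictions))  # sigmoid over logits
    preds = (probs >= 0.5).astype(int)                # assumed 0.5 threshold per label
    return {"micro_f1": f1_score(eval_pred.label_ids, preds, average="micro", zero_division=0)}

# Same Trainer as above, with the metric plugged in
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)
```

Micro-averaging counts every (snippet, label) decision equally, which is a reasonable default when label frequencies are imbalanced, as they often are in call-intent data.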
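Once the model is saved, you can run inference on new snippets. The following is a minimal sketch under two assumptions: it reuses the all_labels list from the training script to map class indices back to names, and the 0.5 probability threshold (and the example sentence) are placeholders, not values from the original.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "./multi_intent_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

text = "Hi, I need to move my appointment to next Tuesday."  # hypothetical snippet
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label: independent sigmoid per class, then threshold;
# a snippet can therefore receive zero, one, or several intents
probs = torch.sigmoid(logits)[0]
predicted = [label for label, p in zip(all_labels, probs) if p >= 0.5]
print(predicted)
```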