Prompt Category Classification (Part 1)

In the quest to enhance the quality and user experience of language models, prompt classification plays a vital role. To address this challenge, pulze.ai conducted an experiment to evaluate different approaches for classifying prompts given to Large Language Models (LLMs) into distinct categories. Our goal was to identify a robust and effective prompt classification setup that could contribute to improving the overall performance of LLMs.

The outcome of the study outlined in this blog post has been integrated into our pulze.ai API. It not only improves the selection of the most suitable model, but also returns the categorization to the user, providing further insights into their prompts.

For this experiment, we generated a dataset using GPT-4; however, it is important to note that the diversity of this dataset may not fully reflect that of real-world data. Additionally, annotating real-world datasets can be costly. Hence, we also evaluated the performance of the prompt classification approaches under few-shot learning conditions.

This blog post highlights the four scenarios we explored and their corresponding evaluation results.

  1. Classical TF-IDF n-gram SVC: Inspired by the research paper "A Comparison of SVM against Pre-trained Language Models (PLMs) for Text Classification Tasks" (https://arxiv.org/pdf/2211.02563), we implemented a classical approach using Term Frequency-Inverse Document Frequency (TF-IDF) features with Support Vector Classification (SVC) and Logistic Regression (LR); a minimal sketch of this setup follows the list below. This scenario served as our baseline for comparison and provided insights into the performance of traditional methods for prompt classification.
  2. Masked Language Modeling-Based Sequence Classification (BERT & others): The second approach we evaluated involved fine-tuning masked language models such as BERT (https://arxiv.org/pdf/1810.04805) and its variants for sequence classification, aiming to harness the power of contextualized word representations to classify prompts effectively. We refer to this scenario as FT (fine-tuning) in the results below.
  3. Pretrained Sentence BERT-Based Sentence Embeddings + LR/SVC without CL Tuning: In this scenario, we explored the use of Sentence BERT (SBERT) (https://arxiv.org/pdf/1908.10084) for generating sentence embeddings. We combined SBERT's embeddings with simple linear models such as Logistic Regression (LR) or Support Vector Classification (SVC) for prompt classification. This setup allowed us to evaluate the effectiveness of sentence-level representations for prompt categorization without any Contrastive Learning (CL) tuning.
  4. SetFit-Based End-to-End Training of Sentence Embeddings with CL and Classification Head FT: Our final scenario involved employing the SetFit (https://arxiv.org/pdf/2209.11055) framework for prompt classification. SetFit facilitates end-to-end training of sentence embeddings by leveraging Contrastive Learning (CL) and fine-tuning the classification head. By integrating CL into the training pipeline, we aimed to enhance the discriminative power of sentence embeddings for prompt classification. 
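To make the baseline concrete, here is a minimal sketch of scenario 1 with scikit-learn, assuming a small placeholder dataset of (prompt, category) pairs; the n-gram range and classifier settings are illustrative rather than the exact values used in our experiment.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Placeholder data; in the experiment these are GPT-4-generated prompts
# annotated with one category each.
train_prompts = ["Summarize this article about climate change.",
                 "Write a Python function that reverses a string.",
                 "Condense the following report into three sentences.",
                 "Implement a binary search in Java."]
train_labels = ["summarization", "code-generation",
                "summarization", "code-generation"]
test_prompts = ["Give me a short summary of this email thread."]
test_labels = ["summarization"]

# TF-IDF over word uni- and bigrams, followed by a linear SVC.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LinearSVC()),
])
pipeline.fit(train_prompts, train_labels)

predictions = pipeline.predict(test_prompts)
print("macro F1:", f1_score(test_labels, predictions, average="macro"))
```

Swapping LinearSVC for LogisticRegression gives the corresponding LR baseline reported in the tables below.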

Experimental Evaluation: 

Throughout the experiment, we assessed the performance of each scenario by F1 score. The evaluation was conducted on a diverse, class-balanced test dataset of prompts to ensure comprehensive results. We compared the effectiveness of each approach to determine its suitability for prompt classification in LLMs, and addressed the limitation of having only a few training examples per category by also evaluating few-shot training scenarios.

Results and Discussion: 

In the full dataset experiment, we observed interesting results across the different prompt classification approaches. Firstly, the classical combination of Support Vector Classification (SVC) and TF-IDF yielded surprisingly strong performance, achieving an F1 score of 0.941. This traditional approach, known for its computational efficiency, showcased its effectiveness in prompt classification. However, considering the expected diversity of vocabulary in real-world settings, we suspect that this approach may not generalize well.


Full Dataset:


| Type   | Model                                  | F1    |
|--------|----------------------------------------|-------|
| TFIDF  | SVC                                    | 0.941 |
| TFIDF  | LR                                     | 0.897 |
| SBERT  | paraphrase-distilroberta-base-v1 + SVC | 0.965 |
| SBERT  | paraphrase-distilroberta-base-v1 + LR  | 0.961 |
| SBERT  | paraphrase-mpnet-base-v2 + SVC         | 0.985 |
| SBERT  | paraphrase-mpnet-base-v2 + LR          | 0.981 |
| SBERT  | all-mpnet-base-v2 + SVC                | 0.976 |
| SBERT  | all-mpnet-base-v2 + LR                 | 0.971 |
| SetFit | paraphrase-mpnet-base-v2               | 0.964 |
| SetFit | all-mpnet-base-v2                      | 0.955 |
| SetFit | paraphrase-distilroberta-base-v1       | 0.970 |
| FT     | bert-large-uncased                     | 0.940 |
| FT     | roberta-base                           | 0.795 |
| FT     | ernie-2.0-base-en                      | 0.960 |

On the other hand, leveraging pretrained sentence embeddings with classifiers proved to be highly effective. The SBERT paraphrase-mpnet-base-v2 + SVC approach stood out with an impressive F1 score of 0.985 and demonstrated the potential of sentence representations pretrained for semantic similarity for prompt classification. This approach strikes a good balance between generalizability and computational efficiency.
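For illustration, here is a minimal sketch of this setup (scenario 3): the SBERT encoder is used as-is to embed prompts, and a linear classifier is trained on top of the frozen embeddings. The placeholder data and classifier settings are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Placeholder prompts and categories.
train_prompts = ["Summarize this article about climate change.",
                 "Write a Python function that reverses a string."]
train_labels = ["summarization", "code-generation"]
test_prompts = ["Give me a short summary of this email thread."]
test_labels = ["summarization"]

# The encoder stays frozen; prompts are embedded once up front.
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
X_train = encoder.encode(train_prompts)  # shape: (n_train, 768)
X_test = encoder.encode(test_prompts)

clf = LinearSVC()
clf.fit(X_train, train_labels)
print("macro F1:", f1_score(test_labels, clf.predict(X_test), average="macro"))
```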

The SetFit-based approaches, which involve end-to-end training of sentence embeddings with Contrastive Learning (CL) and fine-tuning of the classification head, showed promising results as well. While not quite reaching the performance of pretrained SBERT, they came close. However, it is important to note that a larger-scale hyperparameter study was not conducted in this experiment. We expect that with further optimization, the SetFit approaches could perform on par with pretrained SBERT. Nevertheless, it is worth considering that the SetFit approaches involve contrastive learning and fine-tuning, making them less computationally efficient than the pretrained SBERT models.
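Below is a rough sketch of how such a SetFit run can be set up with the setfit library, assuming its current Trainer API; the placeholder data and hyperparameters are illustrative, not the settings used in our experiments.

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Placeholder few-shot data with integer labels (two examples per class).
train_ds = Dataset.from_dict({
    "text": ["Summarize this article about climate change.",
             "Condense the following report into three sentences.",
             "Write a Python function that reverses a string.",
             "Implement a binary search in Java."],
    "label": [0, 0, 1, 1],
})
eval_ds = Dataset.from_dict({
    "text": ["Give me a short summary of this email thread."],
    "label": [0],
})

# Contrastive fine-tuning of the sentence encoder plus a classification head.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
print(trainer.evaluate())  # reports accuracy by default
```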

Both the SBERT models and the SetFit approaches exhibited high F1 scores even with limited labeled examples per class, i.e. in the few-shot learning scenario. This suggests that these methods can accommodate additional categories and perspectives on prompts without requiring a large training set of annotated examples. This is a significant finding, as it indicates the potential for prompt classification to incorporate diverse perspectives and adapt to different prompt categories even with limited labeled data.
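For reference, here is a small sketch of how such per-class few-shot subsets can be drawn from a full training set (N = 100 or 10 in the experiments below); the column names and placeholder data are hypothetical.

```python
import pandas as pd

def sample_few_shot(df: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    """Return a training subset with n_per_class examples for each category."""
    return (df.groupby("category")
              .sample(n=n_per_class, random_state=seed)
              .reset_index(drop=True))

# Placeholder full training set with 50 prompts per category.
full_train = pd.DataFrame({
    "prompt": ["Summarize ...", "Translate ...", "Write code ..."] * 50,
    "category": ["summarization", "translation", "code-generation"] * 50,
})
few_shot_10 = sample_few_shot(full_train, n_per_class=10)
print(few_shot_10["category"].value_counts())  # 10 examples per category
```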


Few-shot (100 per class):


| Type   | Model                                  | F1    |
|--------|----------------------------------------|-------|
| TFIDF  | SVC                                    | 0.878 |
| TFIDF  | LR                                     | 0.825 |
| SBERT  | paraphrase-distilroberta-base-v1 + SVC | 0.925 |
| SBERT  | paraphrase-distilroberta-base-v1 + LR  | 0.946 |
| SBERT  | paraphrase-mpnet-base-v2 + SVC         | 0.985 |
| SBERT  | paraphrase-mpnet-base-v2 + LR          | 0.966 |
| SBERT  | all-mpnet-base-v2 + SVC                | 0.960 |
| SBERT  | all-mpnet-base-v2 + LR                 | 0.945 |
| SetFit | paraphrase-distilroberta-base-v1       | 0.960 |
| SetFit | paraphrase-mpnet-base-v2               | 0.960 |
| SetFit | all-mpnet-base-v2                      | 0.960 |
| FT     | bert-large-uncased                     | 0.058 |
| FT     | roberta-base                           | 0.218 |
| FT     | ernie-2.0-base-en                      | 0.311 |


Few-shot (10 per class):


| Type   | Model                                  | F1    |
|--------|----------------------------------------|-------|
| TFIDF  | SVC                                    | 0.491 |
| TFIDF  | LR                                     | 0.490 |
| SBERT  | paraphrase-distilroberta-base-v1 + SVC | 0.822 |
| SBERT  | paraphrase-distilroberta-base-v1 + LR  | 0.837 |
| SBERT  | paraphrase-mpnet-base-v2 + SVC         | 0.916 |
| SBERT  | paraphrase-mpnet-base-v2 + LR          | 0.931 |
| SBERT  | all-mpnet-base-v2 + SVC                | 0.883 |
| SBERT  | all-mpnet-base-v2 + LR                 | 0.896 |
| SetFit | paraphrase-distilroberta-base-v1       | 0.906 |
| SetFit | paraphrase-mpnet-base-v2               | 0.924 |
| SetFit | all-mpnet-base-v2                      | 0.920 |
| FT     | bert-large-uncased                     | 0.017 |
| FT     | roberta-base                           | 0.005 |
| FT     | ernie-2.0-base-en                      | 0.025 |

As anticipated, fine-tuned Masked Language Models (MLMs) such as BERT performed well for sequence classification on the full dataset, although performance varied across models. In the few-shot settings, however, the fine-tuned MLMs fell clearly behind all other approaches (including the TF-IDF baselines with SVC and LR), as indicated by their much lower F1 scores; this points to lower suitability for few-shot learning.
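For completeness, here is a minimal sketch of how scenario 2 (FT) can be set up with Hugging Face transformers; the placeholder data and training arguments are illustrative, not the exact configuration used in our experiments.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["summarization", "code-generation"]

# Placeholder training data with integer class labels.
train_ds = Dataset.from_dict({
    "text": ["Summarize this article about climate change.",
             "Write a Python function that reverses a string."],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=len(labels))

def tokenize(batch):
    # Tokenize prompts to fixed-length input IDs for the classifier.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="prompt-classifier",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```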

Conclusion

Overall, the results suggest that leveraging pretrained SBERT-based sentence embeddings with SVC provides a strong and efficient solution for prompt classification in real-world settings. The classical combination of SVC and TF-IDF, while surprisingly effective, may face challenges in handling diverse vocabularies. The SetFit approaches show promise but require further optimization, and fine-tuned MLM models perform well for sequence classification but may not be as suitable for few-shot learning tasks.

It's important to note that these observations were made using a dataset generated with GPT-4, which may not fully capture the complexity and diversity of real-world data. Further research, including hyperparameter studies and evaluation on diverse real-world datasets, is needed to validate the generalizability and effectiveness of these approaches.

The findings of this study have been successfully implemented in the pulze.ai engine, which now returns the prompt categorization transparently to the user.

In the next post of this series, we will focus on how to deploy these models in a production environment.