Hierarchical website classification system (by category)

  • Category: NLP
  • Client: Commercial Client
  • Project date: 2022

The system assigns each website to one or more categories from a hierarchy of over 500 and is used for marketing and advertising campaigns.

Approaches

  • TF-IDF + NodesLocalClassifier (NLC)
  • ByteLevelBPE + HMCN
  • Word-piece + RuBert

Data specifics

  • Inaccessible web pages
  • Incorrect labels
  • Class imbalance

Model evaluation

Model quality is evaluated with the following hierarchical metrics:

  • h-fbeta (MSE)
  • h-precision
  • h-recall
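The source does not spell out how these metrics are computed. A minimal sketch of one common set-based definition follows: each true and predicted label set is extended with all of its ancestor categories, and precision and recall are computed over the extended sets, pooled across samples. The taxonomy, category names, and function names below are hypothetical, and beta = 1 is an assumption, since the source does not state the value used.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical toy taxonomy: child -> parent (None marks a top-level category).
PARENT: Dict[str, Optional[str]] = {
    "sports": None,
    "football": "sports",
    "tennis": "sports",
    "finance": None,
    "banking": "finance",
}


def with_ancestors(labels: Iterable[str],
                   parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the labels together with all of their ancestor categories."""
    expanded: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        while node is not None:
            expanded.add(node)
            node = parent[node]
    return expanded


def hierarchical_scores(y_true: List[Set[str]],
                        y_pred: List[Set[str]],
                        parent: Dict[str, Optional[str]],
                        beta: float = 1.0) -> Tuple[float, float, float]:
    """Compute (h-precision, h-recall, h-fbeta) pooled over all samples."""
    overlap = n_pred = n_true = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        t = with_ancestors(true_labels, parent)
        p = with_ancestors(pred_labels, parent)
        overlap += len(t & p)
        n_pred += len(p)
        n_true += len(t)
    h_precision = overlap / n_pred if n_pred else 0.0
    h_recall = overlap / n_true if n_true else 0.0
    denom = beta ** 2 * h_precision + h_recall
    h_fbeta = (1 + beta ** 2) * h_precision * h_recall / denom if denom else 0.0
    return h_precision, h_recall, h_fbeta


# One near miss (wrong leaf, correct parent) and one exact match.
y_true = [{"football"}, {"banking"}]
y_pred = [{"tennis"}, {"banking"}]
hp, hr, hf = hierarchical_scores(y_true, y_pred, PARENT)
print(hp, hr, hf)
```

Under this definition, predicting the wrong leaf under the correct parent still earns partial credit, which is the usual reason to prefer hierarchical metrics over flat ones for a large category tree.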

Metrics


Metric        HMCN    NLC     RuBert
H-fbeta       0.53    0.39    0.2
H-precision   0.87    0.48    0.93
H-recall      0.38    0.33    0.11