Web page classification with Google Image Search results
Classifying web pages has been a challenging problem from the beginning: the Internet is continuously growing, and web pages have dynamically changing features that make them difficult to classify.
In the academic literature there are quite a few studies on classifying web pages. These studies can be divided into four main groups according to the features used for classification: (1) textual features: URL address, text content, title, HTML description, HTML code, etc.; (2) visual features: images, design, videos, etc.; (3) graph-based features: hyperlink structure, neighboring web sites; (4) other information: user behavior, web directories, the semantic web, raw domain data (IP address, owner, hosting server, hosting country).
We use Google Image Search results to classify web pages. This is a new approach to the problem, since Google Image Search results have not previously been considered as features of web pages. We call these features “descriptive images”: they are provided by an external source to describe the web page and may not appear in the page content itself. Our “descriptive images” approach can easily be applied to similar deep learning problems.
As in all machine learning problems, we have (1) training and (2) testing processes.
Our training process is as follows:
- We run a Google Image Search for the URL of each web page in the training set and save the first 20 result images.
- We train well-known CNN architectures (VGG16, DenseNet, ResNet, etc.) with ImageNet transfer learning to obtain a trained model.
- We save the trained model every 5 epochs to graph the method's performance (optional; you can use only the model from the last epoch).
Our testing process is as follows:
- We run a Google Image Search for the URL of the test web page and save the first 20 result images.
- For each image, using the trained model, we obtain the image's degree of belonging to each class.
- Using reduction metrics, we derive the test web page's class from the classes of its 20 descriptive images.
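Both processes start by collecting the first 20 image results for a URL. The original text does not specify how the results were retrieved; one programmatic option is Google's Custom Search JSON API, sketched below with the standard library only. The `api_key` and `cx` (Programmable Search Engine ID) values are placeholders you must supply; the API returns at most 10 results per request, so 20 images take two pages.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def image_search_params(query: str, api_key: str, cx: str, start: int = 1) -> dict:
    """Query parameters for one page (max 10 results) of image search."""
    return {
        "key": api_key,           # your API key (placeholder)
        "cx": cx,                 # your Programmable Search Engine ID (placeholder)
        "q": query,
        "searchType": "image",
        "num": 10,
        "start": start,           # 1-based index of the first result to return
    }

def first_n_image_urls(query: str, api_key: str, cx: str, n: int = 20) -> list:
    """Collect the first n image result URLs, paging 10 results at a time."""
    urls = []
    start = 1
    while len(urls) < n:
        params = image_search_params(query, api_key, cx, start)
        url = API_ENDPOINT + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            items = json.load(resp).get("items", [])
        if not items:
            break  # fewer than n results exist
        urls.extend(item["link"] for item in items)
        start += len(items)
    return urls[:n]
```

Each returned URL can then be downloaded and resized to 224x224 before being fed to the CNN.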
We have 3 reduction metrics (summation, one-hot, and average reordered), and we applied each metric to the first 5, first 10, first 15, and all 20 images. Average reordered applied to the first 15 images (A15) achieves a noticeably high success rate. The graph below shows the success rates of all 12 possibilities and the per-image accuracy (the success rate for individual images, not for test web pages) on a test set of 2000 web pages, using the DenseNet169 architecture.
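The reduction step can be sketched in plain Python. The summation and one-hot metrics below follow their usual meanings (summing probability vectors vs. majority voting on per-image top classes); the "average reordered" implementation is only our assumed reading (average the first k vectors after reordering images by their top confidence), so consult the paper for the exact definition.

```python
def argmax(xs):
    """Index of the largest value (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])

def reduce_summation(probs, k):
    """Sum the first k probability vectors per class; largest total wins."""
    chosen = probs[:k]
    totals = [sum(p[c] for p in chosen) for c in range(len(chosen[0]))]
    return argmax(totals)

def reduce_one_hot(probs, k):
    """Each of the first k images votes for its top class; majority wins."""
    votes = [0] * len(probs[0])
    for p in probs[:k]:
        votes[argmax(p)] += 1
    return argmax(votes)

def reduce_average_reordered(probs, k):
    """Assumed reading: reorder images by top confidence, then average
    the first k probability vectors per class; largest mean wins."""
    ordered = sorted(probs, key=max, reverse=True)
    chosen = ordered[:k]
    means = [sum(p[c] for p in chosen) / len(chosen) for c in range(len(chosen[0]))]
    return argmax(means)

# Toy example: 4 descriptive images, 3 classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]
print(reduce_summation(probs, 4))  # 0  (class totals: 1.5, 1.3, 1.2)
print(reduce_one_hot(probs, 4))    # 0  (votes: 2, 1, 1)
```

Applying each metric with k in {5, 10, 15, 20} gives the 12 combinations reported above.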
In this work we used the WebScreenshots dataset, which contains 20,000 web pages in four classes. The dataset contains the web pages' URLs, classes, text contents, and screenshots at both 1440x900 and 224x224 pixels. Google Image Search results are not part of the dataset.
You can read more in our academic research paper: https://arxiv.org/abs/2006.00226
You can download the WebScreenshots dataset from: