Dataset for Cross-Media Retrieval
Welcome to PKU XMediaNet, a large-scale dataset of texts, images, videos, audio clips and 3D models designed for cross-media retrieval. PKU XMediaNet is the first large-scale cross-media dataset consisting of five media types, with more than 100,000 media instances. The "X" in the name evokes a cross, reflecting cross-media retrieval among all the different media types. We have also released the PKU XMedia dataset, the first cross-media dataset with five media types, containing 12,000 media instances.
The details of the PKU XMediaNet and PKU XMedia datasets are given below.
A cross-media retrieval benchmark built on these datasets is available on the benchmark page.
PKU XMediaNet:
We have constructed a new dataset named PKU XMediaNet, which consists of 5 media types (text, image, video, audio and 3D model). We selected 200 category nodes from WordNet to build the dataset with a semantic hierarchy. The categories fall into two main groups: animals and artifacts, with 48 kinds of animals (e.g., elephant, owl, bee and frog) and 152 kinds of artifacts (e.g., violin, airplane, shotgun and camera). The total number of media instances exceeds 100,000. The media instances were collected as follows:
- Text: Text paragraphs extracted from Wikipedia articles whose topics belong to the category.
- Image: Pictures from Flickr containing objects of the category.
- Video: Video clips from YouTube containing objects of the category, with an average duration of about 100 seconds.
- Audio: Audio clips from Findsounds and Freesound containing sounds made by objects of the category, such as dog barking, clock alarm and keyboard typing.
- 3D Model: 3D models from Yobi3D representing objects of the category.
The dataset is randomly split into a training set of 81,600 media instances and a test set of 20,400 media instances. The split is performed on each media type separately, with a training-to-test ratio of 4:1. Table 1 summarizes the split of each media type, and a minimal split sketch follows the table.
Table 1: Split of each media type of PKU XMediaNet dataset
| Media    | Text   | Image  | Video | Audio | 3D    |
|----------|--------|--------|-------|-------|-------|
| Training | 32,000 | 32,000 | 8,000 | 8,000 | 1,600 |
| Testing  | 8,000  | 8,000  | 2,000 | 2,000 | 400   |
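As a concrete illustration of the 4:1 per-media-type split described above, here is a minimal sketch in Python. The instance counts follow Table 1, but the function and seed are illustrative assumptions, not the script actually used to build the release.

```python
import random

# Per-media-type totals of PKU XMediaNet (training + test counts from Table 1).
COUNTS = {"text": 40000, "image": 40000, "video": 10000, "audio": 10000, "3d": 2000}

def split_media(instance_ids, train_ratio=0.8, seed=0):
    """Randomly split one media type's instances into training and test sets (4:1)."""
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# Split each media type independently, as described above.
for media, total in COUNTS.items():
    train_ids, test_ids = split_media(range(total), seed=42)
    print(media, len(train_ids), len(test_ids))   # e.g. text 32000 8000
```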
Figure 1 shows several randomly selected examples of the different media types in the PKU XMediaNet dataset.
If you use the PKU XMediaNet dataset, please cite:
Y. Peng, X. Huang, and Y. Zhao, "An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges", IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), DOI: 10.1109/TCSVT.2017.2705068, 2017.
Y. Peng, J. Qi and Y. Yuan, "Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network", IEEE Transactions on Image Processing (TIP), Vol. 27, No. 11, pp. 5585-5599, Nov. 2018.

Figure 1: Examples of PKU XMediaNet dataset.
PKU XMedia:
The PKU XMedia dataset consists of 5,000 texts, 5,000 images, 500 videos, 1,000 audio clips and 500 3D models, all of which were crawled from well-known websites on the Internet.
Each media instance has a corresponding category label. The dataset is evenly divided into 20 categories: insect, bird, wind, dog, tiger, explosion, elephant, flute, airplane, drum, train, laughter, wolf, thunder, horse, autobike, gun, stream, piano and violin, so each category contains 600 media instances.
Each text is a paragraph from a Wikipedia article about the category, and most texts are shorter than 200 words. The images are high-resolution pictures containing the object of each category. Long YouTube videos are segmented into short clips that precisely represent the category, and the video clips in this dataset are mostly shorter than one minute. The collected audio clips, such as a wolf howl, are also mostly shorter than one minute and are representative of the category. The 3D models are objects representing the 20 semantic categories, such as easily recognizable dog and tiger models. The file formats of the five media types are txt, jpg, avi, wav and obj respectively, which can be processed with common tools, as sketched below.
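Because the files use standard formats, they can be opened with off-the-shelf libraries. The sketch below shows one possible way to read each format in Python; the library choices (Pillow, OpenCV, soundfile, trimesh) are our own illustrative assumptions, not packages prescribed by the dataset.

```python
# Minimal readers for the five file formats (txt, jpg, avi, wav, obj).
# Library choices below are illustrative assumptions, not part of the release.
from PIL import Image      # pip install pillow
import cv2                 # pip install opencv-python
import soundfile as sf     # pip install soundfile
import trimesh             # pip install trimesh

def load_text(path):
    with open(path, encoding="utf-8") as f:
        return f.read()                       # the text paragraph

def load_image(path):
    return Image.open(path).convert("RGB")    # PIL image

def load_video_frames(path, max_frames=16):
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)                  # BGR numpy arrays
    cap.release()
    return frames

def load_audio(path):
    waveform, sample_rate = sf.read(path)     # numpy array + sampling rate
    return waveform, sample_rate

def load_3d_model(path):
    return trimesh.load(path)                 # mesh with vertices and faces
```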
The dataset is randomly split into a training set of 9,600 media instances and a test set of 2,400 media instances. The split is performed on each media type separately, with a training-to-test ratio of 4:1. Table 2 summarizes the split of each media type.
Table 2: Split of each media type of PKU XMedia dataset
| Media    | Text  | Image | Video | Audio | 3D  |
|----------|-------|-------|-------|-------|-----|
| Training | 4,000 | 4,000 | 400   | 800   | 400 |
| Testing  | 1,000 | 1,000 | 100   | 200   | 100 |
Figure 2 shows several randomly selected examples of the different media types in the PKU XMedia dataset.
If you use the PKU XMedia dataset, please cite:
Y. Peng, X. Zhai, Y. Zhao, and X. Huang, "Semi-supervised cross-media feature learning with unified patch graph regularization", IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 26, pp. 583-596, 2016.
X. Zhai, Y. Peng, and J. Xiao, "Learning cross-media joint representation with sparse and semi-supervised regularization", IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 24, no. 6, pp. 965–978, 2014.

Figure 2: Examples of PKU XMedia dataset.
Dataset Download:
The PKU XMedia dataset is available with feature files (text: 10-dimensional LDA and 3,000-dimensional BoW; image: 128-dimensional BoVW and 4,096-dimensional CNN; video: 128-dimensional BoVW and 4,096-dimensional CNN; audio: 29-dimensional MFCC; 3D: 4,700-dimensional LightField).
The PKU XMediaNet dataset is available with the source URLs and features of the media instances.
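Note that the released features have different dimensionalities per media type, so cross-media retrieval methods first project them into a common space before comparing them. The sketch below shows how retrieval in such a common space is typically scored with mean average precision (MAP); the toy features, labels and dimensionality are illustrative assumptions, not part of the release.

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked result list; `relevant` is a boolean array in rank order."""
    if relevant.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precision_at_hits * relevant).sum() / relevant.sum())

def mean_average_precision(query_feats, query_labels, gallery_feats, gallery_labels):
    """MAP for retrieving gallery items (e.g. texts) with queries (e.g. images),
    assuming both feature sets already live in a learned common space."""
    # Cosine similarity between every query and every gallery item.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                        # rank gallery by similarity
        relevant = gallery_labels[order] == query_labels[i] # same category = relevant
        aps.append(average_precision(relevant))
    return float(np.mean(aps))

# Toy usage with random 64-dimensional "common space" features and 20 categories.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64)); img_lab = rng.integers(0, 20, 100)
txt = rng.normal(size=(200, 64)); txt_lab = rng.integers(0, 20, 200)
print("image->text MAP:", mean_average_precision(img, img_lab, txt, txt_lab))
```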
Please download the Release Agreement, read it carefully, and complete it appropriately. Note that the agreement requires a handwritten signature from a full-time staff member (a student signature is not acceptable). Please scan the signed agreement and send it to Dr. Huang (huangxin_14@pku.edu.cn). If you are from mainland China, please sign the Chinese version of the agreement rather than the English one. We will then verify your request and contact you with download instructions.