Video Captioning with Visual and Semantic Features

Sujin Lee and Incheol Kim
Volume: 14, No: 6, Page: 1318 ~ 1330, Year: 2018
DOI: 10.3745/JIPS.02.0098
Keywords: Attention-Based Caption Generation, Deep Neural Networks, Semantic Feature, Video Captioning

Abstract
Video captioning refers to the process of extracting features from a video and generating captions for the video from those features. This paper introduces a deep neural network model and its learning method for effective video captioning. The study uses both visual features and semantic features, which effectively describe the video content. The visual features are extracted using convolutional neural networks such as C3D and ResNet, while the semantic features are extracted using a semantic feature extraction network proposed in this paper. In addition, an attention-based caption generation network is proposed to generate video captions effectively from the extracted features. The performance and effectiveness of the proposed model are verified through experiments on two large-scale video benchmarks: the Microsoft Video Description (MSVD) and the Microsoft Research Video-To-Text (MSR-VTT) datasets.
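The abstract describes an attention-based caption generation network that weights the extracted video features at each decoding step. The paper's exact architecture is not reproduced here; the following is a minimal NumPy sketch of the general idea of soft (additive) attention over per-frame features, where all function names, weight matrices, and dimensions are hypothetical placeholders.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, hidden, W_f, W_h, w):
    """Soft attention over per-frame video features (illustrative sketch).

    features: (T, D) per-frame visual features (e.g., from a CNN backbone)
    hidden:   (H,)  decoder hidden state from the previous time step
    W_f, W_h, w: learned projection weights (random here, for illustration)
    Returns the context vector (D,) and the attention weights (T,).
    """
    # Score each frame against the decoder state (additive attention).
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w  # shape (T,)
    alpha = softmax(scores)                              # weights sum to 1
    context = alpha @ features                           # weighted sum, (D,)
    return context, alpha

# Toy dimensions and random weights purely for demonstration.
rng = np.random.default_rng(0)
T, D, H, A = 5, 8, 6, 4          # frames, feature dim, hidden dim, attn dim
feats = rng.standard_normal((T, D))
h = rng.standard_normal(H)
W_f = rng.standard_normal((D, A))
W_h = rng.standard_normal((H, A))
w = rng.standard_normal(A)

context, alpha = attend(feats, h, W_f, W_h, w)
```

At each decoding step, the context vector would be fed, together with the previously generated word, into the caption decoder; the weights `alpha` indicate which frames the model attends to for the current word.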


Cite this article
IEEE Style
S. Lee and I. Kim, "Video Captioning with Visual and Semantic Features," Journal of Information Processing Systems, vol. 14, no. 6, pp. 1318-1330, 2018. DOI: 10.3745/JIPS.02.0098.

ACM Style
Sujin Lee and Incheol Kim. 2018. Video Captioning with Visual and Semantic Features. Journal of Information Processing Systems 14, 6 (2018), 1318-1330. DOI: 10.3745/JIPS.02.0098.