Image Understanding for Visual Dialog

Yeongsu Cho, Incheol Kim, Journal of Information Processing Systems Vol. 15, No. 5, pp. 1171-1178, Oct. 2019  

Keywords: Attribute Recognition, Image Understanding, Visual Dialog


This study proposes a deep neural network model with an encoder-decoder structure for visual dialog. In visual dialog, a sequence of questions and answers about an image is exchanged, so generating a correct answer requires an ongoing linguistic understanding of the dialog history and context. In many cases, however, a visual understanding that can identify the scenes or object attributes contained in the image is also beneficial. The proposed model therefore supplements the visual features that a convolutional neural network extracts from the entire input image at the encoding stage with a separate person detector and an attribute recognizer. In this way, it emphasizes attributes of the people in the image, such as gender, age, and dress concept, and uses them to generate answers. Experiments conducted on VisDial v0.9, a large benchmark dataset, confirmed that the proposed model performed well.
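
As a rough illustration of the encoding idea described in the abstract, the sketch below joins whole-image features with attribute scores for the detected people. It is not the authors' implementation: the attribute names, the averaging over people, the plain concatenation, and the function names `pool_person_attributes` and `encode_image` are all assumptions made for this example, and the CNN, person detector, and attribute recognizer are represented only by their outputs.

```python
from typing import Dict, List, Sequence

# Hypothetical attribute vocabulary. The abstract names gender, age, and
# dress as examples; the real attribute set is not given there.
ATTRIBUTES = ("gender", "age", "dress")


def pool_person_attributes(people: Sequence[Dict[str, float]]) -> List[float]:
    """Average each attribute score over all detected people.

    Returns a zero vector when no person was detected. Averaging is an
    assumption of this sketch, not a detail taken from the paper.
    """
    if not people:
        return [0.0] * len(ATTRIBUTES)
    return [
        sum(person.get(name, 0.0) for person in people) / len(people)
        for name in ATTRIBUTES
    ]


def encode_image(
    global_features: Sequence[float],
    people: Sequence[Dict[str, float]],
) -> List[float]:
    """Concatenate whole-image features with pooled person attributes.

    `global_features` stands in for the CNN output, and `people` stands in
    for the attribute recognizer's scores on each detected person.
    """
    return list(global_features) + pool_person_attributes(people)


if __name__ == "__main__":
    # Two detected people with made-up attribute scores.
    detected = [
        {"gender": 1.0, "age": 0.5, "dress": 0.0},
        {"gender": 0.0, "age": 0.5, "dress": 1.0},
    ]
    print(encode_image([0.1, 0.2], detected))
```

In a full model, the resulting vector would be passed to the decoder together with an encoding of the question and dialog history; that part is omitted here because the abstract does not describe it in detail.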

Cite this article
[APA Style]
Yeongsu Cho and Incheol Kim (2019). Image Understanding for Visual Dialog. Journal of Information Processing Systems, 15(5), 1171-1178. DOI: 10.3745/JIPS.04.0141.

[IEEE Style]
Y. Cho and I. Kim, "Image Understanding for Visual Dialog," Journal of Information Processing Systems, vol. 15, no. 5, pp. 1171-1178, 2019. DOI: 10.3745/JIPS.04.0141.

[ACM Style]
Yeongsu Cho and Incheol Kim. 2019. Image Understanding for Visual Dialog. Journal of Information Processing Systems, 15, 5, (2019), 1171-1178. DOI: 10.3745/JIPS.04.0141.