Summarizing the Differences in Chinese-Vietnamese Bilingual News

Jinjuan Wu* , Zhengtao Yu* , Shulong Liu* , Yafei Zhang* and Shengxiang Gao*

Abstract

Abstract: Summarizing the differences in Chinese-Vietnamese bilingual news plays an important supporting role in the comparative analysis of news views between China and Vietnam. To address the cross-language problems in analyzing the differences between Chinese and Vietnamese bilingual news, we propose a new method of summarizing the differences based on an undirected graph model. The method extracts elements to represent the sentences, and builds a bridge between the two languages based on Wikipedia’s multilingual concept description pages. First, we calculate the similarity between Chinese and Vietnamese news sentences, and filter the bilingual sentences accordingly. Then we use the filtered sentences as nodes and the similarity grades as the weights of the edges to construct an undirected graph model. Finally, combining the random walk algorithm, the weight of each node is calculated from the weights of its edges, and the sentences with the highest weights are extracted as the difference summary. The experimental results show that our proposed approach achieves a score of 0.1837 on the annotated test set, outperforming state-of-the-art summarization models.

Keywords: Bilingual News , Chinese-Vietnamese , Sentence Similarity , Summarizing the Difference , Undirected Graph

1. Introduction

In the Internet Age, information spreads rapidly regardless of borders. The media in different countries will report on the same event and express different opinions because of their different positions. For example, on the theme of “One Belt, One Road”, news reports in both Chinese and Vietnamese describe the content of the cooperation project agreement. However, Chinese reports tend to emphasize the promotion of trade cooperation and cultural exchange, while Vietnamese articles tend to describe improvements in infrastructure construction and industrial development. This paper aims to summarize the differences in reporting between different languages, and to generate a difference summary that helps people understand events more comprehensively and accurately.

Within the field of summarizing the differences in bilingual news, cross-language analysis is a difficult issue. This problem can generally be addressed by the bilingual dictionary approach [1], the parallel corpus approach [2,3], and the machine translation approach [4,5]. The dictionary approach first builds a bilingual alignment dictionary and aligns the key words (e.g., emotional words or entities) as required; it assumes that bridges can be built between languages through clearly aligned keywords. For example, Mihalcea et al. [1] use both an English-Romanian dictionary and a general dictionary to construct a bilingual-aligned subjective dictionary. The parallel corpus approach mainly uses the alignment relationships of a parallel corpus composed of source-language text and its translations into other languages. Parallel corpora include word-level, sentence-level, and chapter-level alignments, but the difficulty is that such corpora are not easy to obtain. For example, Banea et al. [3] propose using the subjective and objective classifiers of the source language, together with a parallel corpus, to classify the target language. In recent years, the quality of machine translation has improved, and it is gradually becoming an effective means of cross-language analysis. For example, Banea et al. [4] propose two different cross-language methods using machine translation: translating the source language into the target language, and translating the target language into the source language.

Research into news summary extraction can be divided into topic representation approaches and indicator representation approaches [6]. Topic representation approaches first convert the text into a series of topics, then calculate the importance of the sentences according to the topics, and finally select the important sentences as the summary. Gillick et al. [7] propose using higher-frequency words as topic representations; these words tend to be domain specific. Celikyilmaz and Hakkani-Tur [8] suggest using the hLDA model to calculate important topics in multi-document news and then generate a summary. The authors of [9] propose using cosine distance to compute sentence similarity and clustering sentences to extract topics. Indicator representation approaches directly express each sentence as a feature vector and then calculate the importance of the sentence. For example, graph models [10,11] are used to calculate the importance of sentences: graph vertices represent sentences, edges represent the cosine similarity between sentences, the random walk algorithm is used to calculate the weights of the vertices, and the sentences with the highest weights are selected as the summary. Wan and Zhang [12] propose a novel system that incorporates the new factor of information certainty into the summarization task, producing summaries of better content quality. The rise of deep learning has also contributed to the extractive summarization task. Some methods use neural networks in a single-document summarization framework [13-15], formulating sentence ranking as a hierarchical regression process, given sentences with labeled importance scores [13] or with 0/1 labels [14,15] indicating whether a sentence should be extracted into the summary. Unfortunately, applying neural network methods to bilingual multi-document summarization is difficult: encoding and decoding long sequences of multiple sentences still lacks satisfactory solutions [16], and a large-scale corpus for training is also lacking.

The existing approaches to summary extraction mainly involve single-language documents, and aim to extract the important content of news while eliminating redundant information. In this paper, we analyze multilingual news and extract the differing information. Singh et al. [17] propose using a restricted Boltzmann machine to generate a summary that retains the important information. In recent years, graph-based ranking algorithms have been widely used for this task, such as the research conducted by Wan et al. [18], who propose a graph-based ranking method to score the importance and difference of sentences in Chinese-English documents and then select the sentences with high scores to generate a summary. The current article focuses on Chinese and Vietnamese news documents, with the research method divided into two steps. First, similar information is filtered out according to the cosine similarity between Chinese and Vietnamese news sentences. Second, the graph model is constructed, and the random walk algorithm is used to extract representative sentences to generate the summary.

2. A Summary Method of News Difference Based on a Graph Model

To reflect the differences between Chinese and Vietnamese news, we proceed in four steps. First, we extract the elements contained in the news documents to characterize the sentences. Second, we calculate the similarity between cross-language news sentences to filter out the highly similar ones. Third, we use the sentences that have not been filtered out as vertices to construct the graph model. Finally, we use the random walk algorithm to obtain the weight of each vertex, that is, the importance of the sentence, with the n most important sentences selected as the summary.

The method of implementation is shown in Fig. 1.

Fig. 1.
A summary method of news difference based on a graph model.
2.1 The Extraction of Bilingual News Elements

The elements [19] contain important information such as the time, place, participant, and institution in the news events.

This paper aims to extract the elements contained in Chinese and Vietnamese sentences and use them to characterize the sentences. The extraction of Chinese elements uses the LTP cloud platform [20]. We take named entities as news elements and obtain the collection of Chinese elements [TeX:] $$E_{c n}=\left\{e_{c 1}, e_{c 2}, \ldots, e_{c m}\right\}$$. Due to the lack of Vietnamese named entity recognition tools, a word segmentation tool [21] is used to segment the sentences and perform part-of-speech tagging. We then manually extract the elements from the processing results to obtain the collection of Vietnamese elements [TeX:] $$E_{v e}=\left\{e_{v 1}, e_{v 2}, \ldots, e_{v n}\right\}$$. Chinese and Vietnamese sentences are thus characterized by their elements, for example [TeX:] $$S_{k}=\left\{e_{1}, e_{2}, \ldots, e_{k}\right\}$$.
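
As an illustration only, the following minimal Python sketch shows how sentences might be reduced to element sets; the `ner_tagger` callable is a hypothetical stand-in for the real pipeline (LTP named-entity recognition for Chinese; segmentation, POS tagging and manual selection for Vietnamese), not an actual API of either tool.

```python
def extract_elements(sentence, ner_tagger):
    """Return the elements (named entities) found in one sentence.
    `ner_tagger` is a hypothetical callable yielding (token, tag) pairs."""
    return [token for token, tag in ner_tagger(sentence) if tag != "O"]

def characterize(sentences, ner_tagger):
    """Map each sentence s_k to its element set S_k = {e_1, ..., e_k}."""
    return [set(extract_elements(s, ner_tagger)) for s in sentences]
```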

2.2 Filter Similar News Sentences

Chinese and Vietnamese sentences with high similarity do not reflect differences. Based on this consideration, initial filtering is carried out according to similarity before the sentences are analyzed. As Chinese-Vietnamese machine translation technology is not mature, we cannot simply translate the bilingual news into one language. We therefore draw on the multilingual concept description pages of Wikipedia [22]. The concepts on these pages are aligned across languages, so they are used in the calculation of Chinese-Vietnamese semantic similarity to realize the analysis of sentence relations.

There are many language editions of Wikipedia, in which the aligned Chinese and Vietnamese concepts form the basis for the similarity calculation between Chinese and Vietnamese words. Following this method [22], we first extract the set of Chinese-Vietnamese concept pairs with correspondences in Wikipedia, constructing a bilingual concept feature space. Then, words are represented as vectors by mapping them into this feature space. Finally, the similarity between two vectors is calculated by the cosine. In our proposed approach, the inputs are a Chinese word [TeX:] $$e^{c n}$$ and a Vietnamese word [TeX:] $$e^{v e}$$; let the two vectors be represented by [TeX:] $$\vec{e}^{c n}=\left\{e_{1}^{c n}, e_{2}^{c n}, \ldots, e_{n}^{c n}\right\}$$ and [TeX:] $$\vec{e}^{v e}=\left\{e_{1}^{v e}, e_{2}^{v e}, \ldots, e_{n}^{v e}\right\}$$, respectively. The formula for the semantic similarity of Chinese and Vietnamese words is as follows:

(1)
[TeX:] $$\operatorname{sim}\left(e^{c n}, e^{v e}\right)=\frac{\sum_{i=1}^{n} e_{i}^{c n} e_{i}^{v e}}{\sqrt{\sum_{i=1}^{n}\left(e_{i}^{c n}\right)^{2}} \sqrt{\sum_{i=1}^{n}\left(e_{i}^{v e}\right)^{2}}}$$
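
A direct implementation of Eq. (1) is straightforward once both words are mapped into the shared Wikipedia concept space; the sketch below assumes the concept-space vectors have already been built.

```python
import numpy as np

def word_similarity(e_cn, e_ve):
    """Cosine similarity of Eq. (1) between a Chinese and a Vietnamese word,
    both represented as vectors over the shared Wikipedia concept space."""
    e_cn, e_ve = np.asarray(e_cn, dtype=float), np.asarray(e_ve, dtype=float)
    denom = np.linalg.norm(e_cn) * np.linalg.norm(e_ve)
    return float(e_cn @ e_ve / denom) if denom else 0.0
```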

Each news sentence is characterized by one or more elements, so the similarity of two sentences can be computed from the similarity of the elements they contain. Assume two sentences [TeX:] $$s_{i}$$ and [TeX:] $$s_{j}$$ contain the elements [TeX:] $$e_{1}, e_{2}, \cdots e_{m}$$ and [TeX:] $$e_{1}, e_{2}, \cdots e_{n}$$ after word segmentation and part-of-speech tagging; that is, [TeX:] $$s_{i}$$ is characterized by [TeX:] $$m$$ elements and [TeX:] $$s_{j}$$ by [TeX:] $$n$$ elements.

The sentence similarity calculation is based on the sets of extracted elements. Words are selected one by one from the element set of a sentence, and their similarity is calculated with the words in the element set of the other-language sentence. The word pair with the maximum similarity is selected each time, until the sentence’s element set is empty. The similarities of these word pairs are then summed and divided by the number of words in the sentence’s element set to determine the similarity of the two sentences. The formula is as follows:

(2)
[TeX:] $$w_{i j}=\frac{1}{m} \sum_{u=1}^{m} \max _{1 \leq v \leq n} \operatorname{sim}\left(e_{u}, e_{v}\right)$$

where [TeX:] $$w_{i j}$$ represents the similarity between sentences [TeX:] $$s_{i}$$ and [TeX:] $$s_{j}$$ from the two language document sets, and [TeX:] $$\operatorname{sim}\left(e_{u}, e_{v}\right)$$ means the similarity between the elements [TeX:] $$e_{u}$$ and [TeX:] $$e_{v}$$. Assuming that [TeX:] $$S_{c n}=\left\{s_{1}^{c n}, s_{2}^{c n}, \ldots, s_{m}^{c n}\right\}$$ contains m Chinese sentences, [TeX:] $$S_{v e}=\left\{s_{1}^{v e}, s_{2}^{v e}, \ldots, s_{n}^{v e}\right\}$$ contains n Vietnamese sentences, and [TeX:] $$W_{i j}, i \in[1, m], j \in[1, n]$$ represents the similarity matrix between Chinese and Vietnamese sentences, which can be shown as:

(3)
[TeX:] $$W_{i j}=\left[\begin{array}{ccccc} {w_{11}} & {w_{12}} & {\cdots} & {w_{1 n-1}} & {w_{1 n}} \\ {w_{21}} & {w_{22}} & {\cdots} & {w_{2 n-1}} & {w_{2 n}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & {\vdots} \\ {w_{m-11}} & {w_{m-12}} & {\cdots} & {w_{m-1 n-1}} & {w_{m-1 n}} \\ {w_{m 1}} & {w_{m 2}} & {\cdots} & {w_{m n-1}} & {w_{m n}} \end{array}\right]$$
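
Assuming the `word_similarity` function sketched above, Eq. (2) and the matrix of Eq. (3) can be implemented as follows; each sentence is given as a list of element vectors.

```python
import numpy as np

def sentence_similarity(elems_i, elems_j, sim):
    """Eq. (2): take the best match in s_j for each element of s_i,
    then average over the m elements of s_i."""
    if not elems_i or not elems_j:
        return 0.0
    return sum(max(sim(eu, ev) for ev in elems_j) for eu in elems_i) / len(elems_i)

def similarity_matrix(cn_sents, ve_sents, sim):
    """Eq. (3): W[i, j] is the similarity of Chinese sentence i and
    Vietnamese sentence j; each sentence is a list of element vectors."""
    return np.array([[sentence_similarity(si, sj, sim) for sj in ve_sents]
                     for si in cn_sents])
```

In use, `similarity_matrix(cn_sents, ve_sents, word_similarity)` yields the m×n matrix of Eq. (3).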

After obtaining the similarities between Chinese and Vietnamese sentences, it would clearly be unreasonable to filter sentences directly according to pairwise similarity. For example, assume that the threshold is [TeX:] $$\alpha$$ and [TeX:] $$w_{24} \geq \alpha$$, satisfying the condition that the similarity is greater than the threshold. If the sentences [TeX:] $$s_{2}^{c n}$$ and [TeX:] $$s_{4}^{v e}$$ are filtered out solely because of the high similarity between them, the accuracy of the summary will be affected. [TeX:] $$w_{24} \geq \alpha$$ only indicates that there is little difference between sentences [TeX:] $$s_{2}^{c n}$$ and [TeX:] $$s_{4}^{v e}$$; sentence [TeX:] $$s_{2}^{c n}$$ may still differ from the Vietnamese sentences other than [TeX:] $$s_{4}^{v e}$$, and sentence [TeX:] $$s_{4}^{v e}$$ may still differ from the Chinese sentences other than [TeX:] $$s_{2}^{c n}$$.

Based on the above considerations, the following method is adopted. First, a global similarity is calculated for each sentence. Second, the sentences are filtered according to whether their global similarity satisfies the threshold condition.

(4)
[TeX:] $$\operatorname{sim}\left(s_{i}^{c n}\right)=\frac{1}{n} \sum_{j=1}^{n} w_{i j}, i=1,2, \ldots, m$$

(5)
[TeX:] $$\operatorname{sim}\left(s_{j}^{\mathrm{ve}}\right)=\frac{1}{m} \sum_{i=1}^{m} w_{i j}, j=1,2, \ldots, n$$

where [TeX:] $$\operatorname{sim}\left(s_{i}^{c n}\right)$$ and [TeX:] $$\operatorname{sim}\left(s_{j}^{v e}\right)$$ represent the global similarity of the Chinese sentence [TeX:] $$s_{i}^{c n}$$ and the Vietnamese sentence [TeX:] $$s_{j}^{v e}$$, respectively. To be specific, [TeX:] $$\operatorname{sim}\left(s_{i}^{c n}\right)$$ measures the similarity between a Chinese sentence and the full Vietnamese text. If the global similarity is higher than the threshold, the difference between the Chinese sentence and the Vietnamese full text is small, and the sentence should be filtered out. Vietnamese news sentences are handled in a similar way. We set the global similarity threshold to 0.2 in the experiments.
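
Given the matrix W of Eq. (3), the filtering of Eqs. (4) and (5) reduces to row and column means. A minimal sketch, using the 0.2 threshold adopted in our experiments:

```python
import numpy as np

def filter_by_global_similarity(W, cn_sents, ve_sents, threshold=0.2):
    """Eqs. (4)-(5): drop any sentence whose mean similarity to the
    entire other-language document set reaches the threshold."""
    sim_cn = W.mean(axis=1)   # Eq. (4): one score per Chinese sentence
    sim_ve = W.mean(axis=0)   # Eq. (5): one score per Vietnamese sentence
    keep_cn = [s for s, g in zip(cn_sents, sim_cn) if g < threshold]
    keep_ve = [s for s, g in zip(ve_sents, sim_ve) if g < threshold]
    return keep_cn, keep_ve
```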

2.3 Graph Model Construction

After initial filtering of the sentence, the Chinese sentence [TeX:] $$S_{c n}=\left\{s_{1}^{c n}, s_{2}^{c n}, \ldots, s_{m}^{c n}\right\}$$ and the Vietnamese sentence [TeX:] $$S_{v e}=\left\{s_{1}^{v e}, s_{2}^{v e}, \ldots, s_{n}^{v e}\right\}$$ are obtained, where m and n are used to indicate the quantity of the remaining Chinese and Vietnamese sentences, respectively.

The remaining sentences can, in some cases, reflect the differences between news in different languages. The purpose of this paper is to summarize the differences between Chinese and Vietnamese news. To achieve this goal, we need to meet two conditions. First, the extracted news sentences should reflect that sentences in different languages contain different information. Second, the extracted sentences should reflect the nature of a summary; that is, they should be representative or important. The filtering based on the global similarity of the sentences satisfies the first condition, so the Chinese and Vietnamese sentences remaining after filtering need to be processed into a summary. To achieve this, we calculate the scores of the sentences in each language separately, and extract the n highest-scoring sentences as the summary of news differences.

To evaluate the importance of a sentence, we consider the following features: the similarity of sentences within the same-language document set, and the difference of sentences across the different-language document sets. The higher the similarity within the same-language documents, the better the sentence reflects the news document content. The greater the difference across the different-language document sets, the better the difference between the Chinese and Vietnamese news expressions is reflected. Based on this analysis, we construct the undirected graph model shown in Fig. 2.

Fig. 2.
Undirected graph model.

The vertices in Fig. 2 indicate Chinese or Vietnamese sentences. [TeX:] $$E^{c n}$$ represents the similarity between Chinese sentences; [TeX:] $$E^{v e}$$ represents the similarity between Vietnamese sentences; and [TeX:] $$E^{c n v e}$$ represents the difference between Chinese and Vietnamese sentences. The similarity between sentences in the same language is calculated by cosine similarity. We select unigram + bigram features, represent each sentence as a vector using the vector space model (VSM), and calculate the similarity as the cosine distance between the vectors.
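
For the same-language edges, a minimal sketch of the unigram + bigram VSM cosine is shown below; it assumes sentences arrive pre-tokenized and is not tied to any particular tokenizer.

```python
from collections import Counter
import math

def vsm_cosine(tokens_a, tokens_b):
    """Cosine similarity over unigram + bigram counts, used for the
    same-language edges E_cn and E_ve (sentences given as token lists)."""
    def features(tokens):
        # Unigrams are the tokens themselves; bigrams are adjacent pairs.
        return Counter(tokens + list(zip(tokens, tokens[1:])))
    fa, fb = features(tokens_a), features(tokens_b)
    dot = sum(fa[k] * fb[k] for k in fa.keys() & fb.keys())
    norm = math.sqrt(sum(v * v for v in fa.values())) * \
           math.sqrt(sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0
```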

Bilingual sentence similarity calculation is based on Wikipedia, and the similarity of each sentence pair is obtained by calculating the Euclidean distance between the element vectors, where the Chinese word vector is [TeX:] $$\vec{e}^{c n}=\left\{e_{1}^{c n}, e_{2}^{c n}, \ldots, e_{n}^{c n}\right\}$$ and the Vietnamese word vector is [TeX:] $$\vec{e}^{v e}=\left\{e_{1}^{v e}, e_{2}^{v e}, \ldots, e_{n}^{v e}\right\}$$.

The formula of similarity for Chinese and Vietnamese words is as follows:

(6)
[TeX:] $$\operatorname{Dis}\left(e_{i}^{c n}, e_{j}^{v e}\right)=\frac{\left\|\vec{e}_{i}^{c n}-\vec{e}_{j}^{v e}\right\|}{\left\|\vec{e}_{i}^{c n}\right\|+\left\|\vec{e}_{j}^{v e}\right\|}$$

Similarity between Chinese-Vietnamese news sentences is as follows:

(7)
[TeX:] $$w_{i j}=\frac{1}{m} \sum_{u=1}^{m} \max _{1 \leq v \leq n} \operatorname{Dis}\left(e_{u}^{c n}, e_{v}^{v e}\right)$$

where [TeX:] $$w_{i j}$$ represents the similarity between sentences [TeX:] $$s_{i}$$ and [TeX:] $$s_{j}$$ in the different language document sets, and [TeX:] $$\operatorname{Dis}\left(e_{u}^{c n}, e_{v}^{v e}\right)$$ represents the similarity between the elements [TeX:] $$e_{u}^{c n}$$ and [TeX:] $$e_{v}^{v e}$$.

We construct the similarity matrix between Chinese and Vietnamese [TeX:] $$W_{i j}^{c n v e}$$ , [TeX:] $$i \in[1, m] ; j \in[1, n]$$, and let [TeX:] $$\left(W_{i j}^{\text {cnve }}\right)^{T}=W_{i j}^{\text {vecn }}$$ .
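
A sketch of the cross-language edge weights of Eqs. (6) and (7): the normalized Euclidean distance plays for the difference edges the role that cosine plays within a language, and transposing the resulting matrix gives [TeX:] $$W^{vecn}$$. Non-empty element sets are assumed.

```python
import numpy as np

def element_distance(e_cn, e_ve):
    """Eq. (6): normalized Euclidean distance between element vectors."""
    e_cn, e_ve = np.asarray(e_cn, dtype=float), np.asarray(e_ve, dtype=float)
    denom = np.linalg.norm(e_cn) + np.linalg.norm(e_ve)
    return float(np.linalg.norm(e_cn - e_ve) / denom) if denom else 0.0

def cross_language_edges(cn_sents, ve_sents):
    """Eq. (7): max-matching average of element distances; each sentence
    is a non-empty list of element vectors. Returns W^cnve and W^vecn."""
    W = np.array([[sum(max(element_distance(eu, ev) for ev in sj)
                       for eu in si) / len(si)
                   for sj in ve_sents] for si in cn_sents])
    return W, W.T
```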

2.4 Graph Model Solving

The matrices [TeX:] $$w_{i j}^{c n}$$, [TeX:] $$w_{i j}^{v e}$$, and [TeX:] $$w_{i j}^{c n v e}$$ represent the similarity between Chinese sentences, between Vietnamese sentences, and between Chinese and Vietnamese sentences, respectively. Each matrix element is equivalent to the weight of an edge in the graph model. The weight of a vertex can be calculated from the weights of its edges [18]; let [TeX:] $$u\left(s_{i}^{c n}\right)_{m \times 1}$$ and [TeX:] $$v\left(s_{j}^{v e}\right)_{n \times 1}$$ represent the scores of the Chinese and Vietnamese sentences. To this end, each matrix is first normalized to obtain [TeX:] $$\tilde{w}_{i j}^{c n}$$, [TeX:] $$\tilde{w}_{i j}^{v e}$$, [TeX:] $$\tilde{w}_{i j}^{c n v e}$$, ensuring the sum of the elements in each row of the matrix is 1.

(8)
[TeX:] $$u\left(s_{i}^{c n}\right)=\alpha \sum_{j} \tilde{w}_{i j}^{c n} u\left(s_{j}^{c n}\right)+\beta \sum_{j} \tilde{w}_{i j}^{v e c n} u\left(s_{j}^{v e}\right)$$

(9)

[TeX:] $$v\left(s_{j}^{v e}\right)=\alpha \sum_{i} \tilde{w}_{i j}^{v e} v\left(s_{i}^{v e}\right)+\beta \sum_{i} \tilde{w}_{i j}^{c n v e} u\left(s_{i}^{c n}\right)$$

where [TeX:] $$\alpha$$ and [TeX:] $$\beta$$ indicate the influence of same-language and cross-language similarity, respectively. Based on these assumptions, [TeX:] $$\alpha>0$$, [TeX:] $$\beta>0$$, and we let [TeX:] $$\alpha+\beta=1$$. The above formulae are solved iteratively. To make the solution converge, [TeX:] $$u\left(s_{i}^{c n}\right)_{m \times 1}$$ and [TeX:] $$v\left(s_{j}^{v e}\right)_{n \times 1}$$ are normalized after each iteration. When the difference between the results of two successive iterations is less than a threshold, the iteration ends. The scores of the Chinese sentences [TeX:] $$u\left(s_{i}^{c n}\right)_{m \times 1}$$ and Vietnamese sentences [TeX:] $$v\left(s_{j}^{v e}\right)_{n \times 1}$$ are obtained by this method.
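
A minimal sketch of this iterative solution of Eqs. (8) and (9); the convergence threshold `tol` is an illustrative choice, as the paper does not fix a specific value.

```python
import numpy as np

def row_normalize(W):
    """Scale each row to sum to 1 (all-zero rows are left untouched)."""
    W = np.asarray(W, dtype=float)
    s = W.sum(axis=1, keepdims=True)
    return np.divide(W, s, out=np.zeros_like(W), where=s != 0)

def random_walk_scores(W_cn, W_ve, W_cnve, alpha=0.5, beta=0.5, tol=1e-6):
    """Iterate Eqs. (8)-(9): each sentence's score mixes support from its
    own language (weight alpha) and from the other language (weight beta),
    with the score vectors re-normalized after every iteration."""
    A_cn, A_ve = row_normalize(W_cn), row_normalize(W_ve)
    A_cnve = row_normalize(W_cnve)
    A_vecn = row_normalize(np.asarray(W_cnve).T)
    m, n = A_cnve.shape
    u, v = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    while True:
        u_new = alpha * A_cn @ u + beta * A_cnve @ v   # Eq. (8)
        v_new = alpha * A_ve @ v + beta * A_vecn @ u   # Eq. (9)
        u_new, v_new = u_new / u_new.sum(), v_new / v_new.sum()
        if max(abs(u_new - u).max(), abs(v_new - v).max()) < tol:
            return u_new, v_new   # final Chinese and Vietnamese scores
        u, v = u_new, v_new
```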

To further filter redundant information, we apply a greedy algorithm [23] to the current scores to obtain the final sentence scores. The algorithm for Chinese sentences is as follows:

(1) Initialize two collections: [TeX:] $$A=\emptyset$$ and [TeX:] $$B=\left\{s_{i}, i=1,2, \dots, m\right\}$$, where set B represents the Chinese sentence set and set A will hold the extracted sentences.

(2) The elements in set B are sorted in descending order of their original scores [TeX:] $$u\left(s_{i}^{c n}\right)_{m \times 1}$$.

(3) Assuming that [TeX:] $$s_{i}$$ is ranked first, it is moved from set B to set A, and the scores of the sentences in set B that are similar to [TeX:] $$s_{i}$$ are recalculated. Let [TeX:] $$s_{j}$$ denote such a sentence. The score is calculated as follows: [TeX:] $$\operatorname{score}\left(s_{j}\right)=u\left(s_{j}^{c n}\right)-\varphi \cdot w_{i j}^{c n} \cdot u\left(s_{j}^{c n}\right)$$, where [TeX:] $$u\left(s_{j}^{c n}\right)$$ represents the original score of sentence [TeX:] $$s_{j}$$, [TeX:] $$\varphi$$ represents the penalty factor, and [TeX:] $$w_{i j}^{c n}$$ represents the similarity of [TeX:] $$s_{i}$$ and [TeX:] $$s_{j}$$. When the penalty factor [TeX:] $$\varphi$$ is 0 there is no penalty, and the score of sentence [TeX:] $$s_{j}$$ is unchanged. We experimentally selected a penalty factor of 0.5.

(4) The scores calculated in the previous step are used to re-sort the elements in set B in descending order; the process then returns to step (3) until the number of elements in set B is zero.

The algorithm for Vietnamese news processing is identical, with the input replaced by the Vietnamese sentence set, the original score vector replaced by [TeX:] $$v\left(s_{j}^{v e}\right)_{n \times 1}$$, and the similarity matrix replaced by [TeX:] $$w_{i j}^{v e}$$. Using this method, the final scores of the Chinese and Vietnamese sentences are calculated; the sentences are then sorted by their final scores for each language, and the top n sentences are extracted as the summary of news differences, as sketched below.
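
The greedy re-scoring can be sketched as follows; note that this version penalizes the current (already penalized) score of each remaining sentence, which is one reasonable reading of step (3).

```python
import numpy as np

def greedy_rescore(scores, W, phi=0.5):
    """Greedy redundancy removal (steps (1)-(4)): repeatedly move the
    top-scoring sentence from B to A and penalize the sentences still in B
    in proportion to their similarity to it."""
    current = np.asarray(scores, dtype=float).copy()
    B = set(range(len(current)))
    final = np.zeros_like(current)
    while B:
        i = max(B, key=lambda k: current[k])   # step (3): best remaining
        final[i] = current[i]                  # score frozen on entering A
        B.discard(i)
        for j in B:   # score(s_j) -= phi * w_ij * score(s_j)
            current[j] -= phi * W[i, j] * current[j]
    return final

# The top-n sentences by final score form the difference summary, e.g.:
# summary_idx = np.argsort(-greedy_rescore(u, W_cn))[:5]
```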

3. Experiments and Results

3.1 Data Set

The experimental data set contains Chinese and Vietnamese news on three topics. We searched http://google.com.hk/ to obtain news documents related to the topics, and some documents were collected manually. The specific information is shown in Table 1.

Table 1.
Specific data for the experiment

To evaluate the results, we read the Chinese and Vietnamese sentences on each of the three topics. Based on our full understanding of these news items, we chose 5 sentences from each language to form a summary of the differences in the news.

3.2 Evaluation Metrics

The top 5 sentences from each language were extracted as difference sentences for the experiment. To evaluate the effect of the algorithm, we used the n-gram co-occurrence measure proposed by Lin and Hovy [24]. This method evaluates the model by calculating the degree of n-gram co-occurrence between the model summary and the manual summary; the higher the co-occurrence, the better the model. The calculation method is as follows:

(10)
[TeX:] $$C_{n}=\frac{\sum_{C \in\{\text {Model}\}} \sum_{n\text{-gram} \in C} \operatorname{Count}_{\text {match}}(n\text{-gram})}{\sum_{C \in\{\text {Model}\}} \sum_{n\text{-gram} \in C} \operatorname{Count}(n\text{-gram})}$$

where [TeX:] $$\operatorname{Count}_{\text {match}}(n\text{-gram})$$ represents the number of n-gram co-occurrences between the model summary and the manual summary, and [TeX:] $$\operatorname{Count}(n\text{-gram})$$ represents the number of n-grams in the model summary.

(11)
[TeX:] $$\operatorname{Ngram}(i, j)=\exp \left(\sum_{n=i}^{j} w_{n} \log C_{n}\right), \quad i \leq j ; i, j \in[1,4]$$

where [TeX:] $$w_{n}$$ is the normalization factor and [TeX:] $$w_{n}=\frac{1}{j-i+1}$$. When [TeX:] $$i=j=1$$, [TeX:] $$\operatorname{Ngram}(1,1)$$ represents the degree of unigram co-occurrence, and [TeX:] $$\operatorname{Ngram}(1,2)$$ represents the degree of unigram+bigram co-occurrence.
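
A sketch of the evaluation of Eqs. (10) and (11), with each summary given as a token list; the clipping of matched counts at the manual frequency follows the usual ROUGE-style convention and is our assumption.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def c_n(model_sums, manual_sums, n):
    """Eq. (10): fraction of model n-grams that also occur in the
    corresponding manual summary (counts clipped at the manual frequency)."""
    match = total = 0
    for mod, ref in zip(model_sums, manual_sums):
        g_mod, g_ref = ngram_counts(mod, n), ngram_counts(ref, n)
        match += sum(min(cnt, g_ref[g]) for g, cnt in g_mod.items())
        total += sum(g_mod.values())
    return match / total if total else 0.0

def ngram_score(model_sums, manual_sums, i, j):
    """Eq. (11): geometric mean of C_n for n = i..j, with w_n = 1/(j-i+1).
    A small floor avoids log(0) when no n-grams match."""
    w = 1.0 / (j - i + 1)
    return math.exp(sum(w * math.log(max(c_n(model_sums, manual_sums, n), 1e-12))
                        for n in range(i, j + 1)))
```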

3.3 Evaluation Results

This paper selected the following three baseline methods to show the effectiveness of our proposed approach.

Centroid [25]: A centroid-based method is used to calculate the saliency scores of sentences in each language. First, three scores are calculated: the centroid value, the position value, and the first-sentence overlap value. Second, the three values are linearly combined to give the sentence score. Finally, redundant information is removed to obtain the summary sentences. It is worth noting that this method does not use cross-language information.

Centroid++: This is an improved method based on the centroid method that integrates cross-language information. The final score of a sentence is obtained by subtracting the cross-language similarity from the score calculated by the centroid method, which further reflects the differences between the languages.

PBES [26]: Phrase-based extractive summarization [26] uses phrase-based scoring to represent saliency scores of sentences. We can assign phrase-based scores to sentences from the translated documents for summarization purposes. The model can operate on lexical entries with more than one word in the source and target languages. This works well with cross-language document summarization.

In the initial filtering of bilingual sentences according to global similarity, the global similarity threshold was set to 0.2; at this value, about 30% of the sentences are filtered out. In addition, the purpose of this paper is to extract a difference summary, which concerns not only the difference between the languages but also the importance of sentences within the same language. We set [TeX:] $$\alpha=0.5$$, [TeX:] $$\beta=0.5$$ in the random walk algorithm, which means that the similarity between different languages and within the same language contribute equally to the final score of a sentence. We used these settings to implement the method presented in this article and the three baseline methods. Based on the n-gram co-occurrence measure, the [TeX:] $$\operatorname{Ngram}(1,1)$$ and [TeX:] $$\operatorname{Ngram}(1,2)$$ of the different methods were calculated. Table 2 shows the experimental results for the Chinese difference summaries, and Table 3 shows the experimental results for the Vietnamese difference summaries.

Table 2.
Chinese difference summaries results
Table 3.
Vietnamese difference summaries results

We compared the output of our model to the other summary systems. The first two methods pay more attention to the positional characteristics of sentences during extraction, while PBES analyzes the relation between bilingual sentences through machine translation. This paper studies cross-language document summarization for Chinese and Vietnamese; Vietnamese is a low-resource language, and machine translation results for it are not optimal. In response, our method builds a bridge between the languages based on Wikipedia’s multilingual concept description pages, extracting elements to represent the sentences. As can be seen from Tables 2 and 3, our method is superior to Centroid, Centroid++ and PBES under the same evaluation method, for both the Chinese and the Vietnamese news data.

Fig. 3.
The influence of α value on Chinese summary.
Fig. 4.
The influence of α value on Vietnamese summary.

The effect of α on the experimental results can be observed in Figs. 3 and 4, which show the influence on the Chinese and Vietnamese summaries, respectively. It can be seen that the experimental results gradually improve as the α value increases, peak at about α = 0.5, and then gradually decline as α increases further.

Table 4.
Summary of news differences

Finally, we selected the topic “Mekong River” and used our method to summarize the differences in the Chinese-Vietnamese bilingual news, as shown in Table 4. The proposed method extracts different viewpoints from the Chinese and Vietnamese news on this topic. The Chinese summary pays attention to Vietnam’s severe drought and provides an objective analysis of the shortage of water resources. The Vietnamese summary emphasizes the limited flow of the Mekong into particular areas and the need for China’s hydropower stations to discharge water. To a certain extent, the differences between the Chinese and Vietnamese news are reflected here.

4. Conclusions

In this paper, we have proposed a method based on a graph model to summarize the differences between Chinese and Vietnamese bilingual news. In the proposed method, multilingual concept description pages on Wikipedia are used to analyze sentence similarity, which contributes to solving the graph model and completing the summarization task. The experimental results demonstrate the effectiveness of our proposed approach.

Acknowledgement

This work was supported by the National Key Research and Development Plan Project (No. 2018YFC0830105, 2018YFC0830100), the National Natural Science Foundation of China (No. 61732005, 61672271, 61761026, 61662041, 61762056), the High-tech Industry Development Project of Yunnan Province (No. 201606), and the Natural Science Foundation of Yunnan Province (No. 2018FB104).

Biography

Jinjuan Wu
https://orcid.org/0000-0003-1577-6445

She is currently a postgraduate student at the Kunming University of Science and Technology, Kunming, China. She focuses on natural language processing and information retrieval.

Biography

Zhengtao Yu
https://orcid.org/0000-0002-4012-461X

He is currently a professor and Ph.D. supervisor at School of Information Engineering and Automation, and the chairman of Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, China. He received the Ph.D. degree in Computer Application Technology from Beijing Institute of Technology, Beijing, China, in 2005. His main research interests include natural language processing, machine translation and information retrieval.

Biography

Shulong Liu
https://orcid.org/0000-0003-3063-8454

He is currently a postgraduate student at the Kunming University of Science and Technology, Kunming, China. He focuses on natural language processing and information retrieval.

Biography

Yafei Zhang
https://orcid.org/0000-0003-2347-5642

She is currently a lecturer and master’s supervisor at College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. She received the Ph.D. degree in Signal and information processing from Institute of Electronics, Chinese Academy of Sciences, Beijing, China, in 2008. Her main research interests include image processing and natural language processing.

Biography

Shengxiang Gao
https://orcid.org/0000-0002-2980-8420

She is a lecturer at Kunming University of Science and Technology, Kunming, China. She has been a CCF member since 2013. She received the bachelor’s degree in industrial automation, the M.S. degree in pattern recognition and intelligent systems, and the Ph.D. degree from Kunming University of Science and Technology in 2000, 2005, and 2016, respectively. Her research interests include natural language processing, machine translation, and information retrieval.

References

  • 1 R. Mihalcea, C. Banea, J. Wiebe, "Learning multilingual subjective language via cross-lingual projections," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 2007, pp. 976-983.
  • 2 M. S. Almeida, C. Pinto, H. Figueira, P. Mendes, A. F. Martins, "Aligning opinions: cross-lingual opinion mining with dependencies," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 2015, pp. 408-418.
  • 3 C. Banea, R. Mihalcea, J. Wiebe, "Porting multilingual subjectivity resources across languages," IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 211-225, 2013. doi: 10.1109/T-AFFC.2013.1
  • 4 C. Banea, R. Mihalcea, J. Wiebe, "Multilingual subjectivity: are more languages better?," in Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 2010, pp. 28-36.
  • 5 C. Banea, R. Mihalcea, J. Wiebe, S. Hassan, "Multilingual subjectivity analysis using machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, 2008, pp. 127-135.
  • 6 A. Nenkova, K. McKeown, "A survey of text summarization techniques," in Mining Text Data. Boston, MA: Springer, 2012, pp. 43-76.
  • 7 D. Gillick, B. Favre, and D. Hakkani-Tur, 2008; https://pageperso.lis-lab.fr/benoit.favre/papers/favre_tac2008.pdf
  • 8 A. Celikyilmaz, D. Hakkani-Tur, "A hybrid hierarchical model for multi-document summarization," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 815-824.
  • 9 G. Salton, A. Singhal, M. Mitra, C. Buckley, "Automatic text structuring and summarization," Information Processing & Management, vol. 33, no. 2, pp. 193-207, 1997. doi: 10.1016/S0306-4573(96)00062-3
  • 10 Y. Li, S. Li, "Query-focused multi-document summarization: combining a topic model with graph-based semi-supervised learning," in Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, 2014, pp. 1197-1207.
  • 11 D. Parveen, M. Strube, "Integrating importance, non-redundancy and coherence in graph-based extractive summarization," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 2015, pp. 1298-1304.
  • 12 X. Wan, J. Zhang, "CTSUM: extracting more certain summaries for news articles," in Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, Gold Coast, Australia, 2014, pp. 787-796.
  • 13 Z. Cao, F. Wei, L. Dong, S. Li, M. Zhou, "Ranking with recursive neural networks and its application to multi-document summarization," in Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, TX, 2015, pp. 2153-2159.
  • 14 J. Cheng and M. Lapata, 2016; https://arxiv.org/abs/1603.07252
  • 15 S. Narayan, N. Papasarantopoulos, S. B. Cohen, and M. Lapata, 2017; https://arxiv.org/abs/1704.04530
  • 16 J. G. Yao, X. Wan, J. Xiao, "Recent advances in document summarization," Knowledge and Information Systems, vol. 53, no. 2, pp. 297-336, 2017. doi: 10.1007/s10115-017-1042-4
  • 17 S. P. Singh, A. Kumar, A. Mangal, S. Singhal, "Bilingual automatic text summarization using unsupervised deep learning," in Proceedings of 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, India, 2016, pp. 1195-1200.
  • 18 X. Wan, H. Jia, S. Huang, J. Xiao, "Summarizing the differences in multilingual news," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 2011, pp. 735-744.
  • 19 Linguistic Data Consortium, 2005; https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications
  • 20 W. Che, Z. Li, T. Liu, "LTP: a Chinese language technology platform," in Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, Beijing, China, 2010, pp. 13-16.
  • 21 SourceForge.Net, 2010; http://jvntextpro.sourceforge.net/
  • 22 Q. Yang, Z. Yu, X. Hong, S. Gao, Z. Tang, "Chinese-Vietnamese word similarity computation based on Wikipedia," Journal of Nanjing University of Science and Technology, vol. 40, no. 4, pp. 461-466, 2016. doi: 10.14177/j.cnki.32-1397n.2016.40.04.014
  • 23 X. Wan, J. Yang, J. Xiao, "Manifold-ranking based topic-focused multi-document summarization," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 2903-2908.
  • 24 C. Y. Lin, E. Hovy, "Automatic evaluation of summaries using n-gram co-occurrence statistics," in Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Canada, 2003, pp. 150-157.
  • 25 D. R. Radev, H. Jing, M. Stys, D. Tam, "Centroid-based summarization of multiple documents," Information Processing & Management, vol. 40, no. 6, pp. 919-938, 2004. doi: 10.1016/j.ipm.2003.10.006
  • 26 J. G. Yao, X. Wan, J. Xiao, "Phrase-based compressive cross-language summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 118-127.

Table 1.

Specific data for the experiment
Topic Language Number of sentences Average length
Nguyen Phu Trong’s visit to China Chinese 421 34
Nguyen Phu Trong’s visit to China Vietnamese 394 46
Releasing water into the Mekong river Chinese 405 29
Releasing water into the Mekong river Vietnamese 399 31
Defense Minister meeting Chinese 388 38
Defense Minister meeting Vietnamese 149 42

Table 2.

Chinese difference summaries results
[TeX:] $$\operatorname{Ngram}(1,1)$$ [TeX:] $$\operatorname{Ngram}(1,2)$$
Centroid 0.1056 0.0837
Centroid++ 0.1301 0.1194
PBES 0.1536 0.1372
Our method 0.1837 0.1487

Table 3.

Vietnamese difference summaries results
[TeX:] $$\operatorname{Ngram}(1,1)$$ [TeX:] $$\operatorname{Ngram}(1,2)$$
Centroid 0.0949 0.0643
Centroid++ 0.1403 0.1125
PBES 0.1447 0.1292
Our method 0.1821 0.1462

Table 4.

Summary of news differences
Chinese difference summary

很多人为中国慷慨激昂的大国风范点赞; 越南干旱的问题显然不能怨我们,那到底该怨谁呢? 越南正在遭遇近一个世纪以来的最严重干旱, 湄公河三角洲地区农业受到严重打击.而且由于当地种植水稻,需水量也进一步加大,使得旱情显得越发严重.这次是当地90年来最严重旱灾,近百万人缺乏日常用水,近16万公顷稻田受灾. (A lot of people praise China for our manners as a big country. The drought in Vietnam obviously cannot be blamed on us, but who should take the blame? Vietnam is now suffering its worst drought in nearly a century, and the agriculture of the Mekong Delta region has been hit hard. The drought becomes even more severe because of the increasing demand for water as a result of local planting of rice. This is the most serious drought in 90 years, which causes nearly a million people to lose access to water for everyday needs, and nearly 160,000 hectares of rice fields are affected.)

Vietnamese difference summary

Trong khi đó, mùa này, nước thượng nguồn sông Mê Kông lại đổ về rất ít do bị ngăn cản bởi hàng loạt các công trình thủy điện của Trung Quốc, Lào, Campuchia. Vậy thì làm sao đập Cảnh Hồng có thể xả cho chúng ta trong nhiều đợt khi không đủ nước? Dòng chảy sẽ chảy qua các nước phía trên, trong khi, Thái Lan, Lào, Campuchia cũng đang bị hạn rất nặng nề. Hiện nay, các nước thuộc hệ thống sông Mekong có một cơ chế hợp tác quan trọng thông qua Hiệp hội sông Mekong. Việt Nam đề nghị Trung Quốc xả lũ cứu hạn đồng bằng sông Cửu Long. (Meanwhile, the upstream Mekong River is falling back very little in this season because it is blocked by a series of hydropower projects in China, Laos and Cambodia. Why are we still lacking water when the Jinghong dam released water? The water will flow through Thailand, Laos, Cambodia and other countries, which also suffer severe drought. The Mekong River Basin countries currently have an important cooperation mechanism through the Mekong River Commission. Vietnam calls on China to increase its discharge flow to relieve the drought in the Mekong Delta.)