[READNOTE]

A novel method for topic linkages between scientific publications and patents

💡 MetaData

Title	A novel method for topic linkages between scientific publications and patents
Journal	Journal of the Association for Information Science and Technology
Authors	Shuo Xu; Dongsheng Zhai; Feifei Wang; Xin An; Hongshen Pang; Yirong Sun
Pub. date	2019
DOI	10.1002/asi.24175
JINFO	_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/asi.24175 JCR quartile: Q3; CAS journal ranking (upgraded edition): Management, Tier 3; Impact factor: 3.28; 5-year impact factor: 3.697; EI: yes; SSCI: Q2; FMS: A; JCI: 0.85

**Abstract**

It is increasingly important to build topic linkages between scientific publications and patents for the purpose of understanding the relationships between science and technology. Previous studies on the linkages mainly focus on the analysis of nonpatent references on the front page of patents, or the resulting citation-link networks, but with unsatisfactory performance. In the meanwhile, abundant mentioned entities in the scholarly articles and patents further complicate topic linkages. To deal with this situation, a novel statistical entity-topic model (named the CCorrLDA2 model), armed with the collapsed Gibbs sampling inference algorithm, is proposed to discover the hidden topics respectively from the academic articles and patents. In order to reduce the negative impact on topic similarity calculation, word tokens and entity mentions are grouped by the Brown clustering method. Then a topic linkages construction problem is transformed into the well-known optimal transportation problem after topic similarity is calculated on the basis of symmetrized Kullback–Leibler (KL) divergence. Extensive experimental results indicate that our approach is feasible to build topic linkages with more superior performance than the counterparts.

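📜 Research Overview

Problem:

Build topic linkages between science and technology.

Current state:

Previous science-technology (ST) linkage studies rely on nonpatent references (NPRs), which raises several problems:
- Differences between the countries where patents are filed are ignored
- NPRs are relatively rare, so the citation network is sparse
- A citation is not necessarily an actual knowledge linkage
This work builds on (Xu et al., 2012), but:
- Proposes a new model to build the linkages
- Groups words and transforms topic-linkage construction into an optimal transportation problem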

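Approach:

Propose a statistical entity-topic model, CCorrLDA2; compute topic similarity with the KL divergence and transform the topic-linkage problem into an optimal transportation problem.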
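
For reference, the two ingredients of this approach can be written out in their standard textbook forms. This is a sketch and not the paper's own notation: the symbols p_i, q_j, c_ij, f_ij, a_i, b_j, the factor 1/2 in the symmetrization, and the choice of topic weights are all assumptions made here for illustration.

```latex
% Symmetrized KL divergence between a paper topic p and a patent topic q,
% both distributions over a shared vocabulary (or set of word clusters) V.
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{v \in V} p(v) \log \frac{p(v)}{q(v)},
\qquad
D_{\mathrm{sKL}}(p, q) = \tfrac{1}{2}\bigl[ D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, p) \bigr]

% Transportation problem with paper topics as sources (weights a_i),
% patent topics as sinks (weights b_j), and cost c_{ij} = D_{\mathrm{sKL}}(p_i, q_j).
\min_{f_{ij} \ge 0} \; \sum_{i=1}^{K_s} \sum_{j=1}^{K_t} c_{ij}\, f_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{K_t} f_{ij} = a_i, \qquad
\sum_{i=1}^{K_s} f_{ij} = b_j, \qquad
\sum_{i=1}^{K_s} a_i = \sum_{j=1}^{K_t} b_j
```

Under this reading, a nonzero flow f_ij would indicate a candidate linkage between paper topic i and patent topic j.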

Contributions:

Better performance than the counterparts
Builds topic linkages between papers and patents

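📊 Research Details

Data:
- CHEMDNER corpus and CHEMDNER-patents corpus
Methods:
- Propose the statistical entity-topic model CCorrLDA2 (derivation omitted; based on CorrLDA2 and LDA, and accounting for both word tokens and entity mentions) to detect the hidden topics of scientific publications and patents
- Cluster word tokens and entity mentions with the Brown clustering method: entity-topic distribution; entity-category distribution
- Compute topic similarity based on the symmetrized KL divergence
- Construct the topic linkages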
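
The clustering and similarity steps above can be illustrated with a small sketch. It assumes each topic is a dictionary of word probabilities and that a word-to-cluster mapping (as Brown clustering would produce) is already available; the helper names, the epsilon smoothing, the 1/2 factor, and the toy data are all hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def group_by_cluster(topic_word_probs, word_to_cluster):
    """Sum a topic's word probabilities within each cluster (hypothetical helper)."""
    grouped = defaultdict(float)
    for word, prob in topic_word_probs.items():
        # Words missing from the mapping stay in their own singleton group.
        grouped[word_to_cluster.get(word, word)] += prob
    return dict(grouped)


def symmetrized_kl(p, q, eps=1e-12):
    """Return 0.5 * (KL(p||q) + KL(q||p)) over the union of keys.

    eps smoothing (an assumption of this sketch) avoids log(0) when a key
    appears in only one of the two distributions.
    """
    keys = sorted(set(p) | set(q))
    ps = [p.get(k, 0.0) + eps for k in keys]
    qs = [q.get(k, 0.0) + eps for k in keys]
    zp, zq = sum(ps), sum(qs)
    ps = [x / zp for x in ps]
    qs = [x / zq for x in qs]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(ps, qs))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(ps, qs))
    return 0.5 * (kl_pq + kl_qp)


# Toy data, invented for illustration only.
paper_topic = {"synthesis": 0.5, "compound": 0.3, "yield": 0.2}
patent_topic = {"preparation": 0.5, "composition": 0.3, "yield": 0.2}
clusters = {
    "synthesis": "C1", "preparation": "C1",
    "compound": "C2", "composition": "C2",
    "yield": "C3",
}

raw = symmetrized_kl(paper_topic, patent_topic)
grouped = symmetrized_kl(
    group_by_cluster(paper_topic, clusters),
    group_by_cluster(patent_topic, clusters),
)
print(f"without grouping: {raw:.3f}, with grouping: {grouped:.3f}")
```

In this toy case the two topics use different surface words for the same concepts, so the divergence is large on raw words and collapses once the words are mapped to shared clusters, which is the effect the grouping step is meant to achieve.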
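Empirical study:
- 10,000 papers and 14,000 patents
- Stop words removed with NLTK; statistics reported per entity category; vocabulary of 41,221 words in papers and 24,848 words in patents, with an intersection of 14,225 and a union of 51,844
- Word extraction, then Gibbs sampling
- Compute the negative log-likelihood of the samples and the KL divergence; find the most similar patent/paper topics and swap them, then recompute, to obtain the final topic set
- Whether going from papers to patents or from patents to papers, reordering the topics reveals a sparse diagonal structure → most topics on one side are linked to only a few topics on the other side
- Evaluation: compare the linkage quality of CCorrLDA2, CorrLDA2, and LDA with and without clustering. Topic linkages are rated in five classes (poor, fair, average, good, excellent)
  - Without clustering, or without both word tokens and entity mentions, fewer topic linkages are built → because papers and patents phrase things differently, many topics that should be linked are missed
  - Clustering improves the results: merging word tokens helps build topic linkages
  - CCorrLDA2 far outperforms the others, indicating that accounting for entity mentions contributes substantially to building topic linkages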
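
The vocabulary sizes listed in the empirical notes above can be cross-checked with inclusion-exclusion. The sketch below only uses the four counts recorded in this note; the Jaccard index and the two coverage ratios are extra quantities computed here for illustration and are not reported in the note.

```python
# Counts as recorded in the note.
paper_vocab = 41221
patent_vocab = 24848
shared = 14225
reported_union = 51844

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
union = paper_vocab + patent_vocab - shared
counts_consistent = union == reported_union

# Illustrative overlap measures (not from the paper).
jaccard = shared / union
share_of_paper_vocab = shared / paper_vocab
share_of_patent_vocab = shared / patent_vocab

print(f"union matches reported value: {counts_consistent}")
print(f"Jaccard overlap: {jaccard:.3f}")
print(f"shared / paper vocabulary: {share_of_paper_vocab:.3f}")
print(f"shared / patent vocabulary: {share_of_patent_vocab:.3f}")
```

The counts are mutually consistent, and the limited overlap between the two vocabularies is in line with the observation that papers and patents express the same content differently.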

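🚩 Main Conclusions

Confirms that, because papers and patents phrase things differently, many topics that should be linked are missed
Proposes a method that improves the construction of science-technology topic linkages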

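📌 Innovation Insights

Resolving the heterogeneity of expression in scientific and technical texts is important groundwork for detecting topic linkages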

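🔬 Outlook and Reflections

Beyond the probabilistic wizardry, it should also be possible to introduce other types of scholarly or invention entities for linking science and technology topics. Linking such topics can also be viewed as a disambiguation task; the clustering idea here is a good one, and introducing other entities later to build linkages also seems feasible.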

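📜 Excerpts from the Original

[comment]

Scientific and technical literature contains a large number of entities of different types
[comment]

Citations in papers and patents are noisy, in terms of both topic and people (authors vs. reviewers, applicants vs. examiners)
[comment]

As early as 2004, scholars pioneered linking patent classifications to scientific disciplines
[comment]

A problem with using citations to study topic linkages: many patents lie outside the giant component of the citation network and are therefore overlooked
[comment]

Differences between patents and papers: purpose, statements (claims), quality, and so on
[comment]

Language heterogeneity between patents and papers: synonymous knowledge entities may be scattered as a result
[comment]

Treat the topic distances (similarities) between papers and patents as a problem of sources and sinks, which turns them into an optimal transportation distance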
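
The source/sink view in the last excerpt can be sketched in code. This is a deliberately simplified special case and not the paper's solver: it assumes the same number of paper and patent topics with equal, uniform mass, in which case an optimal transport plan can be taken to be a permutation and found by brute force for a handful of topics. A practical implementation would use a linear-programming or assignment solver instead. The function name and the toy cost matrix are hypothetical.

```python
from itertools import permutations


def uniform_transport_plan(cost):
    """Optimal transport between n sources and n sinks with uniform mass 1/n.

    With equal uniform masses, some optimal plan ships each source's whole
    mass to a single distinct sink, so enumerating permutations suffices
    for tiny n. Returns (plan, total_cost).
    """
    n = len(cost)
    if n == 0 or any(len(row) != n for row in cost):
        raise ValueError("this sketch only handles non-empty square cost matrices")
    best_perm = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    mass = 1.0 / n
    plan = [[mass if best_perm[i] == j else 0.0 for j in range(n)] for i in range(n)]
    total_cost = mass * sum(cost[i][best_perm[i]] for i in range(n))
    return plan, total_cost


# Toy dissimilarities (invented): rows are paper topics, columns are patent topics.
cost = [
    [0.1, 2.0, 3.0],
    [2.5, 0.2, 1.5],
    [3.0, 1.8, 0.3],
]
plan, total = uniform_transport_plan(cost)

# Nonzero flows are read as candidate topic linkages (paper topic, patent topic).
links = [(i, j) for i, row in enumerate(plan) for j, flow in enumerate(row) if flow > 0]
print(links, round(total, 3))
```

With non-uniform topic weights or unequal topic counts, a single source may split its mass across several sinks, which this brute-force shortcut does not cover.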