On Semantic-Based Intelligent Crime Information Retrieval on the Internet

 

 

顏志平 Chih-Ping Yen

Central Police University

Computer Center

peter@sun4.cpu.edu.tw

 

徐熊健 Shyong-Jian Shyu

Ming Chuan University

Associate Professor, Graduate Institute of Information Management

sjshyu@mcu.edu.tw

 

 

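Prosecutors and police agencies usually rely on the search engines of portal sites to gather intelligence on Internet crime. Because the precision and recall of these engines are low, they often return many irrelevant Web pages, and investigators must then spend additional time filtering the results one by one, which is inefficient. This paper therefore applies intelligent algorithms to raise precision and recall and so alleviate the problem.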

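First, the theory of semantic fields is used to organize the synonymy relations between terms into a term database with a hierarchical structure similar to WordNet, and this term database is then used to compare the similarity of Web page contents. The paper derives five similarity computations: "term-frequency weighted similarity", "classified exponential similarity", "classified exponential weighted similarity", "error-correction similarity", and "term-frequency reweighting". Their strengths and weaknesses are compared in order to select the best method and to infer a threshold value.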
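The idea of comparing pages through a term database rather than through exact keywords can be illustrated with a minimal sketch. It assumes a flat, hand-made synonym table (`TERM_DB`, hypothetical, and far simpler than the hierarchical term database described above) and uses plain cosine similarity over concept counts as a stand-in for the term-frequency weighted similarity; the paper's actual formulas are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical term database: each surface term maps to one concept label.
TERM_DB = {
    "pirated": "illegal_copy",
    "bootleg": "illegal_copy",
    "cracked": "illegal_copy",
    "cd": "disc",
    "cd-r": "disc",
    "disc": "disc",
    "software": "software",
    "program": "software",
}

def concept_vector(text):
    """Count concept occurrences; terms outside TERM_DB are ignored."""
    tokens = text.lower().split()
    return Counter(TERM_DB[t] for t in tokens if t in TERM_DB)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative inputs (assumptions, not data from the paper).
query = "pirated software cd"
page_a = "cheap bootleg program on cd-r and more cracked program titles"
page_b = "weather forecast for the weekend"

sim_a = cosine_similarity(concept_vector(query), concept_vector(page_a))
sim_b = cosine_similarity(concept_vector(query), concept_vector(page_b))
print(f"page_a: {sim_a:.4f}  page_b: {sim_b:.4f}")
```

In this toy example `page_a` shares no literal word with the query, so exact keyword matching would miss it, while the concept-level comparison still scores it highly.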

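In addition, the results of this study are compared with those of the 1999 research project by 陳志誠, "A High-Precision Crime Information Search System on the Internet" (referred to as the e-Detective system). Using the search for Web pages "selling illegal software" on the Internet as the evaluation case, the experiments show that the system proposed in this paper achieves a best F-measure of 0.5581, whereas the earlier system achieves at best 0.2376, so the proposed system clearly performs better.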
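For readers unfamiliar with the F-measure, the sketch below shows how it is commonly computed from precision and recall. It assumes the balanced form F = 2PR / (P + R); the abstract does not state which variant was used, and the counts in the example are hypothetical rather than taken from the experiments.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, F) from raw counts.

    tp: relevant pages retrieved
    fp: irrelevant pages retrieved
    fn: relevant pages missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hypothetical counts, not results from the paper.
p, r, f = f_measure(tp=30, fp=20, fn=10)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

Because the balanced F-measure is a harmonic mean, it stays low unless precision and recall are both reasonably high, which is why it suits a comparison between two retrieval systems.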

 

Keywords: Search Engine, Information Retrieval, Text Classification, Similarity, Precision, Recall, Semantic Field, Cybercrime, e-Detective

 

Abstract

We usually search the Web with the help of search engines. Because the search results are imprecise, we often face too many recommended pages. Search engines return many irrelevant pages because they merely match the exact search word(s) that the user entered. To cope with this problem, we suggest determining similarities with the aid of a knowledge base for a given topic, which reduces the number of irrelevant pages significantly.

In this research we first apply the theory of semantic fields, in which a term (concept) forms a term database through its relationships to other concepts. Based on the term databases, we suggest several models to evaluate the similarity between search concepts and the contents of Web pages: the model of weighted terms (a modified vector space model), the model of classified weighted terms, and the exponential model of classified weighted terms. The last of these is designed on the basis of the Facet Analysis Method. We also evaluate the similarity with error correction and term reweighting. The approaches described in this paper are used to construct a search engine for identifying Web pages that advertise pirated compact discs (CDs), which are very difficult to distinguish from pages advertising legitimate CDs. We further determine an adequate threshold of term weights for our search purpose as a trade-off between recall and precision. A comparison of our search results with those of previous work shows the advantage of this approach.
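The trade-off between recall and precision when choosing a threshold can be sketched as a simple sweep. The example below is only an illustration under stated assumptions: it thresholds a per-page similarity score (the abstract speaks of a threshold of term weights), it selects the cut-off that maximizes the balanced F-measure, and the scored pages in `SCORED` are hypothetical.

```python
def f1_at(threshold, scored_pages):
    """Balanced F-measure when pages scoring >= threshold are retrieved."""
    tp = sum(1 for s, rel in scored_pages if s >= threshold and rel)
    fp = sum(1 for s, rel in scored_pages if s >= threshold and not rel)
    fn = sum(1 for s, rel in scored_pages if s < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored_pages):
    """Try every observed score as a cut-off and keep the one with the highest F."""
    candidates = sorted({s for s, _ in scored_pages})
    return max(candidates, key=lambda t: f1_at(t, scored_pages))

# Hypothetical (similarity score, is_relevant) pairs, not data from the paper.
SCORED = [
    (0.92, True), (0.81, True), (0.67, False), (0.55, True),
    (0.40, False), (0.35, True), (0.10, False),
]

best = best_threshold(SCORED)
print(f"best threshold={best}, F={f1_at(best, SCORED):.4f}")
```

Raising the threshold removes irrelevant pages (higher precision) but also drops relevant ones (lower recall); the sweep makes that trade-off explicit for a labeled sample.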

 

Keywords: Search Engine, Information Retrieval, Text Classification, Similarity, Precision, Recall, Semantic Field, Cybercrime, e-Detective