This is the repository for the source code accompanying the manuscript:

    A.R. Cohen and P.M.B. Vitányi, "Web Similarity in Sets of Search Terms Using Database Queries," SN Computer Science, 1, 161 (2020).

The source code is available under the MIT license; see LICENSE.txt in the repository.

Questions: see https://bioimage.coe.drexel.edu or email andrew.r.cohen 'at' drexel.edu

<h1> prerequisites </h1>
The code here uses web search result counts from Wikipedia, PubMed, and Reddit as the basis for a normalized (non-metric) distance measure among multisets of objects. Support for Google is included, but is not considered reliable because (1) Google result counts are approximate and (2) API access to search result counts generally requires payment. The function Count/GetGoogleCount parses search result HTML directly to extract counts.

For Wikipedia, search results are extracted from downloaded HTML. For PubMed and Reddit, results are obtained via RESTful APIs that require user credential information as follows:

For PubMed, the API asks that you provide an email address and an application ID. Create variables called 'email' and 'appID' (any meaningful string) and save them in +Count/pubmed.mat.

For Reddit, you must register to obtain a (free) client ID and secret. Create variables called 'client_id' and 'secret' and save them in +Count/reddit.mat.
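
As a minimal sketch, the two credential files could be created from the MATLAB command window as shown here. It assumes the current folder is the repository root (so that the `+Count` package folder is directly below it) and that the values are stored as plain character vectors; all values are placeholders to replace with your own.

```matlab
% Placeholder values only: substitute your own credentials.
% Assumes the current folder is the repository root.

% PubMed: contact email and an application ID (any meaningful string).
email = 'you@example.org';
appID = 'my-search-app';
save(fullfile('+Count', 'pubmed.mat'), 'email', 'appID');

% Reddit: client ID and secret obtained by registering with Reddit.
client_id = 'YOUR_CLIENT_ID';
secret = 'YOUR_SECRET';
save(fullfile('+Count', 'reddit.mat'), 'client_id', 'secret');
```

Because these files hold personal credentials, it may be sensible to keep them out of version control.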