This is the repository for the source code accompanying the manuscript "Web Similarity," A. R. Cohen and P. M. B. Vitanyi. While the paper is in review, this source code and all associated products are licensed only for evaluation by reviewers.
Once the manuscript is accepted for publication, the code will be released as free and open-source software. The specific license has not yet been decided (candidates include GPL, BSD, and MIT).
questions: see or email andrew.r.cohen 'at'
<h1> prerequisites </h1>
The code here uses web search result counts from Wikipedia, PubMed, and Reddit searches as the basis for a normalized (non-metric) distance measure among multisets of objects. Support for Google is not included, because Google result counts are (1) approximate and (2) generally require payment for API access.
For Wikipedia, search result counts are extracted from downloaded HTML. For PubMed and Reddit, results are obtained via RESTful APIs that require user credentials, as follows:
PubMed requests that you provide an email address and an application ID. Create variables named 'email' and 'appID' (any meaningful string) and save them in +Count/pubmed.mat.
For Reddit, you must register to obtain a (free) client ID and secret. Create variables named 'client_id' and 'secret' and save them in +Count/reddit.mat.
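
One way to create the two credential files is sketched below. It assumes MATLAB is running with the repository root as the current folder, and every value shown is a hypothetical placeholder to be replaced with your own details. Only the variable names and file paths come from the instructions above.

```matlab
% Run from the repository root so that the +Count package folder resolves.
% All values below are hypothetical placeholders, not working credentials.

% PubMed: a contact email address and an application ID.
email = 'you@example.org';        % placeholder: your contact address
appID = 'my-web-similarity';      % placeholder: any meaningful string
save(fullfile('+Count', 'pubmed.mat'), 'email', 'appID');

% Reddit: the client ID and secret issued when you register.
client_id = 'YOUR_CLIENT_ID';     % placeholder
secret    = 'YOUR_SECRET';        % placeholder
save(fullfile('+Count', 'reddit.mat'), 'client_id', 'secret');
```

Both .mat files hold private credentials, so you may want to keep them out of version control.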
