Reviewed Credibility Signals

W3C Internal Draft Community Group Report

Editor:
Sandro Hawke
Participate:
GitHub w3c/credweb
File a bug
Document data source
https://docs.google.com/document/d/1z2vU7g_VzDBezPYDfQg3kivSnn4xEZ6zy3NEIzmMGP4/edit

Abstract

Credibility signals are observations, made by humans or machines, which are used in deciding how much to trust some information. This document specifies some types of these observations which seem particularly useful in online credibility assessments, especially when assisted by machine processing and a network of people and systems making related observations.

In this release, we include only the handful of signals which have been reviewed and approved as “promising” by the W3C Credible Web Community Group, following a process started in late January 2020. The contents of this document are expected to be expanded and revised over time. Feedback and participation are encouraged.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This document was produced by the W3C CredWeb CG. This version is meant for internal review and has not been approved or endorsed by anyone.

GitHub Issues are preferred for discussion of this specification.

Publication as a Draft Community Group Report does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 March 2019 W3C Process Document.

§ 1. Introduction

This document is a small step toward a world where software “credibility tools” let people know which online information is trustworthy. In this small step, we present a handful of observations a human or machine can make about websites and web content which seem likely to be useful in deciding how trustworthy the information might be. These observations, which we call “credibility signals,” are selected for having several desirable features, such as being reasonably verifiable and interoperable.

We expect many more potential credibility signals also meet these criteria. We are publishing this small set to get community feedback early in the process. If the feedback is sufficiently positive, we may proceed to document additional signals. In earlier work we gathered a crowdsourced list of potential credibility signals and documented technologies for making credibility assessments.

The focus of this document is on the meaning of the signals, not the data protocols and formats for exchanging them; those are expected to be the subject of a related document. For now, to give a flavor of how the data might be exchanged, we include illustrative data alongside each signal example. For more details on the design issues in using Semantic Web technologies for this data interchange, see Options for RDF Expression of Credibility Data.

§ 2. Signals

These signals have been reviewed and approved as "promising" by the W3C Credible Web Community Group, following a process started in January 2020. This initial release contains only the signals the group decided to document and approve during the first few weeks of that process.

These signals were not necessarily selected for being the most promising, or the best by any other measure. Later signals may turn out to be superior. These were reviewed first largely because they seemed likely to help clarify the review process. In particular, we intend no endorsement of the Pulitzer Prize as necessarily more credible than any other prize.

§ 2.1. Date Website First Archived

Assessed by CredWeb CG to be: Reasonably verifiable, Interoperable, Promising.

Definition: There was a website operational at URL [ ] as early as isodate [ ], as shown in the archive page at URL [ ]
Review Date: 2020-01-28
Other Labels:
  • Website Existed As Of Date
  • Website Age
  • First Known Website Archive
  • Website Release Date
Similar to: Domain registration date

§ 2.1.1. Examples

§ 2.1.1.1. Example

Statement: There was a website operational at URL https://news.mit.edu as early as isodate 2015-09-02 as shown in the archive page at URL https://web.archive.org/web/20150902023223/http://news.mit.edu/
CSV
site,operationalAsEarlyAs,evidence
https://news.mit.edu,2015-09-02,https://web.archive.org/web/20150902023223/http://news.mit.edu/
Turtle
_:a cred:site <https://news.mit.edu> .
_:a cred:operationalAsEarlyAs "2015-09-02"^^xs:date .
_:a cred:evidence <https://web.archive.org/web/20150902023223/http://news.mit.edu/> .
_:a rdf:type cred:ObservationOfDateWebsiteFirstArchived .
JSON-LD
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
  "evidence": "https://web.archive.org/web/20150902023223/http://news.mit.edu/",
  "operationalAsEarlyAs": "2015-09-02",
  "site": "https://news.mit.edu"
}

§ 2.1.1.2. Example

Statement: There was a website operational at URL https://nytimes.com as early as isodate 1996-11-12 as shown in the archive page at URL https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/
CSV
site,operationalAsEarlyAs,evidence
https://nytimes.com,1996-11-12,https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/
Turtle
_:a cred:site <https://nytimes.com> .
_:a cred:operationalAsEarlyAs "1996-11-12"^^xs:date .
_:a cred:evidence <https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/> .
_:a rdf:type cred:ObservationOfDateWebsiteFirstArchived .
JSON-LD
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
  "evidence": "https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/",
  "operationalAsEarlyAs": "1996-11-12",
  "site": "https://nytimes.com"
}

§ 2.1.1.3. Example

Here we see one website with two different observations, each with a different date. A sketch of how a consumer might combine such observations follows this example.

Statement: There was a website operational at URL [https://news.example] as early as isodate [2010-01-01] as shown in the archive page at URL [https://archive.example/1234]. There was a website operational at URL [https://news.example] as early as isodate [2015-01-01] as shown in the archive page at URL [https://archive.example/abcd].
CSV
site,operationalAsEarlyAs,evidence
https://news.example,2010-01-01,https://archive.example/1234
https://news.example,2015-01-01,https://archive.example/abcd
Turtle
_:a cred:site <https://news.example> .
_:a cred:operationalAsEarlyAs "2010-01-01"^^xs:date .
_:a cred:evidence <https://archive.example/1234> .
_:a rdf:type cred:ObservationOfDateWebsiteFirstArchived .
_:b cred:site <https://news.example> .
_:b cred:operationalAsEarlyAs "2015-01-01"^^xs:date .
_:b cred:evidence <https://archive.example/abcd> .
_:b rdf:type cred:ObservationOfDateWebsiteFirstArchived .
JSON-LD
{
  "@context": "https://jsonld.example",
  "@graph": [
    {
      "@id": "_:a",
      "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
      "evidence": "https://archive.example/1234",
      "operationalAsEarlyAs": "2010-01-01",
      "site": "https://news.example"
    },
    {
      "@id": "_:b",
      "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
      "evidence": "https://archive.example/abcd",
      "operationalAsEarlyAs": "2015-01-01",
      "site": "https://news.example"
    }
  ]
}
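
A consumer may receive several such observations about the same site, as in this example. The following Python sketch (illustrative only, and not part of the reviewed signal) shows one plausible way to combine them, keeping the observation with the earliest evidenced date; the tuples simply mirror the CSV columns above.

Python
from datetime import date

# Each tuple mirrors one CSV row above: (site, operationalAsEarlyAs, evidence).
observations = [
    ("https://news.example", date(2010, 1, 1), "https://archive.example/1234"),
    ("https://news.example", date(2015, 1, 1), "https://archive.example/abcd"),
]

def earliest_observation(observations, site):
    """Return the observation for `site` with the earliest evidenced date, or None."""
    matching = [o for o in observations if o[0] == site]
    return min(matching, key=lambda o: o[1]) if matching else None

print(earliest_observation(observations, "https://news.example"))
# ('https://news.example', datetime.date(2010, 1, 1), 'https://archive.example/1234')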

§ 2.1.2. Motivation

Why is this signal hard to game?
  • New attackers will have to acquire a pre-existing website with a site identity suited to their attack. Since few such sites are likely to exist, they will likely be expensive. Alternatively, attackers can set up "sleeper" sites well in advance, waiting for the right time to attack, but this is also expensive, and there is more time for them to be discovered and unmasked. This signal is likely to decrease in value if adopted, as more sleeper sites are created.
Connection to quality
  • Sites that have been around a long time have probably had more of a chance to develop good internal quality controls.
  • Sites that have been around a long time, if they are problematic, are more likely to have been flagged as such, all other things being equal.
  • Sites that have been around a long time have generated enough content/information upon which an assessment of quality can be rendered.
Reason
  • Over the years, these aspects of a site will tend to be exposed, even if efforts are made to hide them.
Potential side effects
  • Newer sites will have an even harder time reaching an audience
  • Reduced innovation on the web, as ranking on this signal presents an additional barrier-to-entry

§ 2.1.3. Availability

Methods for obtaining
  • Enter the URL of the site into the Wayback Machine at archive.org and look for the earliest snapshot that looks like a real website, not a placeholder "coming soon" site (see the sketch at the end of this section for an automated variant).
Issues with availability
  • This signal is limited by the practices of available and trustworthy archiving services. If none of them archived a given site, the signal is unavailable for that site. If one of them is compromised, observations based on that archive lose value.
Currently requires humans because
  • It requires some human judgement to exclude initial placeholder websites
Expected to become cheap because
  • Machines can probably be trained to recognize and exclude placeholder websites
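
As an illustrative sketch of how the manual lookup described above might be automated, the following Python code queries the Internet Archive's public Wayback CDX search endpoint for the earliest successful capture of a site. The earliest capture may still be a placeholder page, so the human-judgement step described above still applies.

Python
import json
import urllib.parse
import urllib.request

def earliest_snapshot(site):
    """Return (iso_date, archive_url) for the earliest 200-status capture of `site`,
    or None if the archive has no capture. The result may still be a placeholder page."""
    params = urllib.parse.urlencode({
        "url": site,
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
        "limit": "1",  # CDX results are ordered oldest-first by default
    })
    with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}") as resp:
        rows = json.load(resp)  # first row is the header, e.g. ["timestamp", "original"]
    if len(rows) < 2:
        return None
    timestamp, original = rows[1]
    iso_date = f"{timestamp[0:4]}-{timestamp[4:6]}-{timestamp[6:8]}"
    archive_url = f"https://web.archive.org/web/{timestamp}/{original}"
    return iso_date, archive_url

# earliest_snapshot("news.mit.edu") would yield a (date, archive URL) pair that can be
# recorded as site,operationalAsEarlyAs,evidence in the CSV form shown above.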

§ 2.2. Corrections Policy

Assessed by CredWeb CG to be: Promising, Reasonably verifiable, Interoperable.

Definition: The news website with its main page at URL [ ] provides a corrections policy at URL [ ] and evidence of the policy being implemented is visible at URL [ ]
Review Date: 2020-02-05
Other Labels:
  • Corrections Policy Claimed
  • Corrections Policy Present

§ 2.2.1. Examples

§ 2.2.1.1. Example

Statement: The news website with its main page at URL http://thestar.com provides a corrections policy at URL https://www.thestar.com/about/statementofprinciples.html and evidence of the policy being implemented is visible at URL https://www.thestar.com/opinion/corrections.html
CSV
site,correctionsPolicy,evidence
http://thestar.com,https://www.thestar.com/about/statementofprinciples.html,https://www.thestar.com/opinion/corrections.html
Turtle
_:a cred:site <http://thestar.com> .
_:a cred:correctionsPolicy <https://www.thestar.com/about/statementofprinciples.html> .
_:a cred:evidence <https://www.thestar.com/opinion/corrections.html> .
_:a rdf:type cred:ObservationOfCorrectionsPolicy .
JSON-LD
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "@type": "http://www.w3.org/ns/credweb#ObservationOfCorrectionsPolicy",
  "correctionsPolicy": "https://www.thestar.com/about/statementofprinciples.html",
  "evidence": "https://www.thestar.com/opinion/corrections.html",
  "site": "http://thestar.com"
}

§ 2.2.2. Motivation

Why is this signal hard to game?
  • Setting up a plausible corrections policy and showing evidence of implementation requires work that legitimate news organizations need to do anyway but is extra work for attackers.
Connection to quality
  • Sites that provide a corrections policy and implement it are indicating an openness to scrutiny around their mistakes, which seems likely to lead to greater accuracy.
  • Sites that show a record of corrections over a period of time indicate commitment to principles of accuracy and good journalistic practice.
Potential side effects
  • Sites might be downranked for keeping their corrections policies private, possibly for legitimate reasons.
  • Sites that are mistakenly identified as news outlets, such as organizations that issue press releases about themselves, might be downranked for not following this journalistic practice.
  • A bias towards larger, professional news organizations that have the resources to maintain a visible corrections practice.
  • A bias against organizations that have not provided online visibility into some of their practices.
  • A bias against sites that are exceedingly careful to never need corrections, and thus have little or no evidence of implementing their corrections policy.

§ 2.2.3. Availability

Methods for obtaining
  • Manually browsing a site, looking for any corrections policy and evidence of it being used
  • NewsQ plans to offer a feed
Issues with availability
  • Some sites mention their corrections policy non-obviously, in a single paragraph on a general ethics/journalism policies page or a public engagement page. This may lead observers to mistakenly conclude that no corrections policy is available.
  • It may be hard to find evidence of corrections being made, even when they are, especially if the site does not offer a corrections "feed" page.
Expected to become cheap because
  • Machines can probably be programmed or trained to observe this signal most of the time (see the sketch below)
  • Sites may start to announce this information in a machine-readable form (which raises some risk of gaming)
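
As an illustrative sketch of this kind of machine observation, the following Python code fetches a site's main page and flags links whose anchor text or URL hints at a corrections policy. The keyword list and the regex-based link extraction are simplifications for illustration, not a tested detector, and any candidates it surfaces would still need human review.

Python
import re
import urllib.parse
import urllib.request

# Illustrative English-only hints; a real observer would need broader, per-language lists.
POLICY_HINTS = re.compile(r"correction|statement of principles|editorial standards", re.I)

def corrections_policy_candidates(site):
    """Return link targets on `site`'s main page whose text or URL hints at a
    corrections policy. A heuristic first pass for human review, not a verdict."""
    request = urllib.request.Request(site, headers={"User-Agent": "credweb-sketch"})
    with urllib.request.urlopen(request) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    candidates = []
    for href, text in re.findall(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', html, re.I | re.S):
        if POLICY_HINTS.search(text) or POLICY_HINTS.search(href):
            candidates.append(urllib.parse.urljoin(site, href))
    return candidates

# corrections_policy_candidates("https://www.thestar.com/") might surface pages like the
# statement-of-principles and corrections pages used in the example above.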

§ 2.3. Any Award (pending)

Assessed by CredWeb CG to be: Interoperable, Reasonably verifiable.

Definition: The website with main page URL [ ] was honored as part of an awards process for the year [ ] for the prize with main page URL [ ]
Review Date: 2020-02-19

§ 2.3.1. Examples

§ 2.3.1.1. Example

This shows a Pulitzer Prize signal being conveyed as this more general "Any Award" signal attached to the Pulitzer website.

Statement: The website with main page URL https://www.propublica.org/ was honored as part of an awards process for the year 2017 for the prize with main page URL https://www.pulitzer.org/
CSV
site,awardsYear,prizeSite
https://www.propublica.org/,2017,https://www.pulitzer.org/
Turtle
_:a cred:site <https://www.propublica.org/> .
_:a cred:awardsYear "2017"^^xs:gYear .
_:a cred:prizeSite <https://www.pulitzer.org/> .
JSON-LD
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "awardsYear": "2017",
  "prizeSite": "https://www.pulitzer.org/",
  "site": "https://www.propublica.org/"
}

§ 2.3.2. Motivation

Why is this signal hard to game?
  • Establishing an award process/system takes time and resources
  • Establishing a reputable award requires socialization, which is likely to involve a variety of trust evaluations and uncover most attackers.
Connection to quality
  • Outlets that have been awarded journalism awards, especially many over time, probably have internal quality controls and an organizational structure that supports good journalism.
Potential side effects
  • A bias towards older outlets that have had more time to accumulate awards
  • Increased pressure to subvert an existing awards process.
  • A surge in fake awards.
  • A need for determining which prize-granting sites are trustworthy
  • A bias in favor of outlets that campaign for awards or otherwise shift effort toward winning awards.

§ 2.3.3. Availability

Methods for obtaining
  • Determining authoritative websites or publications with awards lists, and then looking up the winners
  • Taking data for specific awards and converting them to this form (see the sketch at the end of this section)
Issues with availability
  • There are many awards across many languages and cultures, and no central location to review all of them
  • Not all high-quality outlets win awards
Currently requires humans because
  • It requires judgement to know which awards are legitimate
  • Most awards do not provide data feeds or have available scrapers
Expected to become cheap because
  • Machines can probably scrape the information, should some way of centralizing awards locations become possible
  • Awards-granting organizations might start to publish details of their awards in a machine-readable format
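
As an illustrative sketch of "taking data for specific awards and converting them to this form", the following Python code maps a Pulitzer Prize Recognition row (§ 2.4) onto the more general Any Award columns, using the Pulitzer site as the prize's main page URL. The output matches the CSV shown in § 2.3.1.1.

Python
import csv
import io

PULITZER_HOME = "https://www.pulitzer.org/"  # main page URL of the prize

def pulitzer_to_any_award(row):
    """Map a Pulitzer Prize Recognition row (see § 2.4) to an Any Award row (§ 2.3)."""
    return {
        "site": row["site"],
        "awardsYear": row["awardsYear"],
        "prizeSite": PULITZER_HOME,
    }

pulitzer_row = {
    "site": "https://www.propublica.org/",
    "awardsYear": "2017",
    "pulitzerCategory": "Public Service",
    "nameAccordingToPulitzer": "ProPublica",
    "evidence": "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica",
}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["site", "awardsYear", "prizeSite"])
writer.writeheader()
writer.writerow(pulitzer_to_any_award(pulitzer_row))
print(out.getvalue())
# site,awardsYear,prizeSite
# https://www.propublica.org/,2017,https://www.pulitzer.org/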

§ 2.4. Pulitzer Prize Recognition (pending)

Assessed by CredWeb CG to be: Interoperable, Reasonably verifiable.

Definition: The website with main page URL [ ] was honored as part of the Pulitzer Prize awards process for the year [ ] in the Pulitzer category [ ] as the website of a prize finalist or winner under the name [ ] as shown at URL [ ]
Review Date: 2020-02-19

§ 2.4.1. Examples

§ 2.4.1.1. Example

This shows two organizations sharing the prize.

Statement: The website with main page URL https://www.nydailynews.com/ was honored as part of the Pulitzer Prize awards process for the year 2017 in the Pulitzer category Public Service as the website of a prize finalist or winner under the name New York Daily News as shown at URL [https://www.pulitzer.org/winners/new-york-daily-news-and-propublica]. The website with main page URL https://www.propublica.org/ was honored as part of the Pulitzer Prize awards process for the year 2017 in the Pulitzer category Public Service as the website of a prize finalist or winner under the name ProPublica as shown at URL [https://www.pulitzer.org/winners/new-york-daily-news-and-propublica].
CSV
site,awardsYear,pulitzerCategory,nameAccordingToPulitzer,evidence
https://www.nydailynews.com/,2017,Public Service,New York Daily News,https://www.pulitzer.org/winners/new-york-daily-news-and-propublica
https://www.propublica.org/,2017,Public Service,ProPublica,https://www.pulitzer.org/winners/new-york-daily-news-and-propublica
Turtle
_:a cred:site <https://www.nydailynews.com/> .
_:a cred:awardsYear "2017"^^xs:gYear .
_:a cred:pulitzerCategory "Public Service" .
_:a cred:nameAccordingToPulitzer "New York Daily News" .
_:a cred:evidence <https://www.pulitzer.org/winners/new-york-daily-news-and-propublica> .
_:b cred:site <https://www.propublica.org/> .
_:b cred:awardsYear "2017"^^xs:gYear .
_:b cred:pulitzerCategory "Public Service" .
_:b cred:nameAccordingToPulitzer "ProPublica" .
_:b cred:evidence <https://www.pulitzer.org/winners/new-york-daily-news-and-propublica> .
JSON-LD
{
  "@context": "https://jsonld.example",
  "@graph": [
    {
      "@id": "_:a",
      "awardsYear": "2017",
      "evidence": "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica",
      "http://www.w3.org/ns/credweb#nameAccordingToPulitzer": "New York Daily News",
      "http://www.w3.org/ns/credweb#pulitzerCategory": "Public Service",
      "site": "https://www.nydailynews.com/"
    },
    {
      "@id": "_:b",
      "awardsYear": "2017",
      "evidence": "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica",
      "http://www.w3.org/ns/credweb#nameAccordingToPulitzer": "ProPublica",
      "http://www.w3.org/ns/credweb#pulitzerCategory": "Public Service",
      "site": "https://www.propublica.org/"
    }
  ]
}

§ 2.4.2. Motivation

Why is this signal hard to game?
  • Winning or being a finalist for a Pulitzer Prize is extremely difficult, even for experienced and well-funded professionals
Connection to quality
  • The prize process reflects a mature and well-resourced effort to recognize quality.
  • Outlets that have been awarded journalism awards, especially many over time, probably have internal quality controls and an organizational structure that supports good journalism.
Potential side effects
  • A bias towards more traditional outlets
  • A bias towards older outlets that have had more time to accumulate awards
  • A bias matching whatever biases are brought in by the Pulitzer Prize process and individual prize jury members

§ 2.4.3. Availability

Methods for obtaining
  • Visit pulitzer.org and browse for relevant entries
  • NewsQ plans to offer a feed
  • Use pulitzer.org's underlying JSON query API
Issues with availability
  • It does not cover much of the news ecosystem; many high-quality sources have never won a Pulitzer Prize
Expected to become cheap because
  • The data is already available in JSON
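
As an illustrative sketch only, the following Python code maps an already-retrieved prize record onto the CSV columns used in § 2.4.1.1. The record's field names here are hypothetical placeholders; the actual pulitzer.org JSON schema is not described in this document and will differ.

Python
# Hypothetical record shape, for illustration only; the real pulitzer.org JSON differs.
record = {
    "year": 2017,
    "category": "Public Service",
    "winner_name": "New York Daily News",
    "winner_site": "https://www.nydailynews.com/",
    "detail_page": "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica",
}

def to_pulitzer_signal_row(rec):
    """Map a (hypothetical) prize record to the CSV columns of § 2.4.1.1."""
    return {
        "site": rec["winner_site"],
        "awardsYear": str(rec["year"]),
        "pulitzerCategory": rec["category"],
        "nameAccordingToPulitzer": rec["winner_name"],
        "evidence": rec["detail_page"],
    }

print(to_pulitzer_signal_row(record))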