§ 1. Introduction
This document is a small step toward a world where software “credibility tools” help people know which online information is trustworthy. In this small step, we present a handful of observations a human or machine can make about websites and web content that seem likely to be useful in deciding how trustworthy the information might be. These observations, which we call “credibility signals,” were selected for having several desirable features:
- Verifiable: When given a report of one of these observations, it is feasible to confirm the observation yourself without doing an amount of work that is prohibitive compared to the likely value of a correct credibility assessment.
- Interoperable: If multiple independent parties make these observations, guided just by this document, they will generally produce data that is effectively the same. This enables an efficient market of data-producing systems and data-consuming systems working together automatically (“interoperating”).
- Hard to game: While it seems likely that every signal could be artificially synthesized as part of a disinformation attack, for these signals the cost or difficulty of doing so seems high enough that using the signal is a net benefit.
- Available: Signal data is available now, or is likely to be soon, for a large number of websites. When desired, it is generally possible for individuals or small organizations to make these observations and produce this data themselves.
- Scalable: Machines can make these observations now, or could be programmed or trained to do so in the near future, with no need for a technical breakthrough.
- Promising: In short, while we have insufficient data to say with certainty that these signals will be useful in making accurate credibility assessments, based on the information we have, we expect their use will often be worthwhile.
We expect many more potential credibility signals also meet these criteria. We are publishing this small set to get community feedback early in the process. If the feedback is sufficiently positive, we may proceed to document additional signals. In earlier work we gathered a crowdsourced list of potential credibility signals and documented technologies for making credibility assessments.
The focus of this document is on the meaning of signals, not on the data protocols and formats for exchanging them; those are expected to be the subject of a related document. For now, to give a flavor of how the data might be exchanged, we include example data alongside the signal definitions. For more detail on the design issues in using Semantic Web technologies for this data interchange, see Options for RDF Expression of Credibility Data.
§ 2. Signals
These signals have been reviewed and approved as "promising" by the W3C Credible Web Community Group, following a process started in January 2020. This initial release contains only the signals the group decided to document and approve during the first few weeks of that process.
These signals were not necessarily selected for being the most promising or the best in any other way. Later signals may turn out to be superior. These were reviewed first largely because they seemed likely to help clarify the review process. In particular, we intend no endorsement of the Pulitzer Prize as being necessarily more credible than any other prize.
§ 2.1. Date Website First Archived
Assessed by CredWeb CG to be: Reasonably verifiable, Interoperable, Promising.
| Definition | There was a website operational at URL [ ] as early as isodate [ ], as shown in the archive page at URL [ ] |
|---|---|
| Review Date | 2020-01-28 |
| Other Label | |
| Similar to | Domain registration date |
§ 2.1.1. Examples
§ 2.1.1.1. Example
Statement: There was a website operational at URL https://news.mit.edu as early as isodate 2015-09-02, as shown in the archive page at URL https://web.archive.org/web/20150902023223/http://news.mit.edu/

CSV:

```
site,operationalAsEarlyAs,evidence
https://news.mit.edu,2015-09-02,https://web.archive.org/web/20150902023223/http://news.mit.edu/
```

Turtle:

```
_:a cred:site <https://news.mit.edu> .
_:a cred:operationalAsEarlyAs "2015-09-02"^^xs:date .
_:a cred:evidence <https://web.archive.org/web/20150902023223/http://news.mit.edu/> .
_:a rdf:type cred:ObservationOfDateWebsiteFirstArchived .
```

JSON-LD:

```
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
  "evidence": "https://web.archive.org/web/20150902023223/http://news.mit.edu/",
  "operationalAsEarlyAs": "2015-09-02",
  "site": "https://news.mit.edu"
}
```
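To give a flavor of producing this data programmatically, here is a minimal Python sketch that emits the Turtle form of the example above using the rdflib library. It assumes the cred: prefix expands to http://www.w3.org/ns/credweb#, the namespace used for @type in the JSON-LD example; it is illustrative, not a normative serialization.

```python
# A minimal sketch with rdflib; the cred: namespace below is assumed from
# the JSON-LD @type in the example above, not normatively defined here.
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

CRED = Namespace("http://www.w3.org/ns/credweb#")

g = Graph()
g.bind("cred", CRED)

obs = BNode()
g.add((obs, RDF.type, CRED.ObservationOfDateWebsiteFirstArchived))
g.add((obs, CRED.site, URIRef("https://news.mit.edu")))
g.add((obs, CRED.operationalAsEarlyAs, Literal("2015-09-02", datatype=XSD.date)))
g.add((obs, CRED.evidence,
       URIRef("https://web.archive.org/web/20150902023223/http://news.mit.edu/")))

print(g.serialize(format="turtle"))
```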
§ 2.1.1.2. Example
Statement: There was a website operational at URL https://nytimes.com as early as isodate 1996-11-12, as shown in the archive page at URL https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/

CSV:

```
site,operationalAsEarlyAs,evidence
https://nytimes.com,1996-11-12,https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/
```

Turtle:

```
_:a cred:site <https://nytimes.com> .
_:a cred:operationalAsEarlyAs "1996-11-12"^^xs:date .
_:a cred:evidence <https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/> .
_:a rdf:type cred:ObservationOfDateWebsiteFirstArchived .
```

JSON-LD:

```
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
  "evidence": "https://web.archive.org/web/19961112181513/http://www.nytimes.com:80/",
  "operationalAsEarlyAs": "1996-11-12",
  "site": "https://nytimes.com"
}
```
§ 2.1.1.3. Example
Here we see one website with two different observations, each with a different date.
Statement: There was a website operational at URL https://news.example as early as isodate 2010-01-01, as shown in the archive page at URL https://archive.example/1234. There was a website operational at URL https://news.example as early as isodate 2015-01-01, as shown in the archive page at URL https://archive.example/abcd.

CSV:

```
site,operationalAsEarlyAs,evidence
https://news.example,2010-01-01,https://archive.example/1234
https://news.example,2015-01-01,https://archive.example/abcd
```

Turtle:

```
_:a cred:site <https://news.example> .
_:a cred:operationalAsEarlyAs "2010-01-01"^^xs:date .
_:a cred:evidence <https://archive.example/1234> .
_:a rdf:type cred:ObservationOfDateWebsiteFirstArchived .
_:b cred:site <https://news.example> .
_:b cred:operationalAsEarlyAs "2015-01-01"^^xs:date .
_:b cred:evidence <https://archive.example/abcd> .
_:b rdf:type cred:ObservationOfDateWebsiteFirstArchived .
```

JSON-LD:

```
{
  "@context": "https://jsonld.example",
  "@graph": [
    {
      "@id": "_:a",
      "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
      "evidence": "https://archive.example/1234",
      "operationalAsEarlyAs": "2010-01-01",
      "site": "https://news.example"
    },
    {
      "@id": "_:b",
      "@type": "http://www.w3.org/ns/credweb#ObservationOfDateWebsiteFirstArchived",
      "evidence": "https://archive.example/abcd",
      "operationalAsEarlyAs": "2015-01-01",
      "site": "https://news.example"
    }
  ]
}
```
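A consumer receiving both observations needs a merging policy. Here is a minimal Python sketch of one plausible policy, keeping the earliest date per site; the CSV layout matches the examples above, but the merging rule is our assumption, not something this document mandates.

```python
# A minimal sketch: group observations by site and keep the earliest
# operationalAsEarlyAs. The merging policy is an assumption for
# illustration; other consumers might instead keep all observations.
import csv
import io

data = """site,operationalAsEarlyAs,evidence
https://news.example,2010-01-01,https://archive.example/1234
https://news.example,2015-01-01,https://archive.example/abcd
"""

earliest = {}
for row in csv.DictReader(io.StringIO(data)):
    best = earliest.get(row["site"])
    # ISO 8601 dates compare correctly as plain strings.
    if best is None or row["operationalAsEarlyAs"] < best["operationalAsEarlyAs"]:
        earliest[row["site"]] = row

for site, row in earliest.items():
    print(site, row["operationalAsEarlyAs"], row["evidence"])
# -> https://news.example 2010-01-01 https://archive.example/1234
```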
§ 2.1.2. Motivation
- Why is this signal hard to game?
- New attackers will have to acquire a pre-existing website with a site identity suited to their attack; since few such sites are likely to exist, they will likely be expensive. Alternatively, attackers can set up "sleeper" sites well in advance and wait for the moment to attack, but this is also expensive, and the waiting period leaves more time for them to be discovered and unmasked. The value of this signal is likely to decrease if it is widely adopted, as more sleeper sites are created.
- Connection to quality
- Sites that have been around a long time have probably had more of a chance to develop good internal quality controls.
- Sites that have been around a long time, if they are problematic, are more likely to have been flagged as such, all other things being equal.
- Sites that have been around a long time have generated enough content for an assessment of their quality to be made.
- Reasoning
- Over the years, these aspects of a site will tend to be exposed, even if efforts are made to hide them.
- Potential side effects
- Newer sites will have an even harder time reaching an audience
- Reduced innovation on the web, as ranking on this signal presents an additional barrier-to-entry
§ 2.1.3. Availability
- Methods for obtaining
- Enter the URL of the site into the Wayback Machine at archive.org and look for the earliest snapshot that shows a real website rather than a placeholder "coming soon" page (a machine-assisted sketch of this lookup appears at the end of this section).
- Issues with availability
- It is limited by the practices of available and trustworthy archiving services: if none of them archived a given site, the signal is simply unavailable for that site, and if an archive is compromised, observations citing that archive lose their value.
- Currently requires humans because
- It requires some human judgement to exclude initial placeholder websites
- Expected to become cheap because
- Machines can probably be trained to recognize and exclude placeholder websites
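As a sketch of the machine-assisted version of the lookup mentioned above, the following Python snippet queries the Internet Archive's public availability endpoint for the snapshot closest to an early timestamp, which approximates the earliest capture. Judging whether that snapshot is a real site rather than a placeholder remains the human step noted above.

```python
# A minimal sketch using the Internet Archive's availability endpoint
# (https://archive.org/wayback/available). Asking for the snapshot closest
# to a very early timestamp approximates "earliest capture"; it does not
# decide whether that capture is a real site or a placeholder.
import json
import urllib.parse
import urllib.request

def earliest_snapshot(site_url: str):
    query = urllib.parse.urlencode({"url": site_url, "timestamp": "19960101"})
    with urllib.request.urlopen(
        f"https://archive.org/wayback/available?{query}"
    ) as resp:
        payload = json.load(resp)
    return payload.get("archived_snapshots", {}).get("closest")

snap = earliest_snapshot("news.mit.edu")
if snap and snap.get("available"):
    # The timestamp and URL here are candidates for operationalAsEarlyAs
    # and evidence, pending human review for placeholder pages.
    print(snap["timestamp"], snap["url"])
```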
§ 2.2. Corrections Policy
Assessed by CredWeb CG to be: Promising, Reasonably verifiable, Interoperable.
| Definition | The news website with its main page at URL [ ] provides a corrections policy at URL [ ] and evidence of the policy being implemented is visible at URL [ ] |
|---|---|
| Review Date | 2020-02-05 |
| Other Label | |
§ 2.2.1. Examples
§ 2.2.1.1. Example
Statement: The news website with its main page at URL http://thestar.com provides a corrections policy at URL https://www.thestar.com/about/statementofprinciples.html and evidence of the policy being implemented is visible at URL https://www.thestar.com/opinion/corrections.html

CSV:

```
site,correctionsPolicy,evidence
http://thestar.com,https://www.thestar.com/about/statementofprinciples.html,https://www.thestar.com/opinion/corrections.html
```

Turtle:

```
_:a cred:site <http://thestar.com> .
_:a cred:correctionsPolicy <https://www.thestar.com/about/statementofprinciples.html> .
_:a cred:evidence <https://www.thestar.com/opinion/corrections.html> .
_:a rdf:type cred:ObservationOfCorrectionsPolicy .
```

JSON-LD:

```
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "@type": "http://www.w3.org/ns/credweb#ObservationOfCorrectionsPolicy",
  "correctionsPolicy": "https://www.thestar.com/about/statementofprinciples.html",
  "evidence": "https://www.thestar.com/opinion/corrections.html",
  "site": "http://thestar.com"
}
```
§ 2.2.2. Motivation
- Why is this signal hard to game?
- Setting up a plausible corrections policy and showing evidence of implementation requires work that legitimate news organizations need to do anyway, but is extra work for attackers
- Connection to quality
- Sites that provide a corrections policy and implement it are indicating an openness to scrutiny of their mistakes, which seems likely to lead to greater accuracy
- Sites that show a record of corrections over a period of time indicate a commitment to principles of accuracy and good journalistic practice
- Potential side effects
- Sites might be downranked for keeping their corrections policies private, possibly for legitimate reasons
- Sites that are mistakenly identified as news outlets, such as organizations that issue press releases about themselves, might be downranked for not following this journalistic practice
- A bias towards larger, professional news organizations that have the resources to maintain a visible corrections practice
- A bias against organizations that have not made some of their practices visible online
- A bias against sites that are exceedingly careful and so rarely need corrections, and thus have little or no evidence of implementing their corrections policy
§ 2.2.3. Availability
- Methods for obtaining
- Manually browsing a site, looking for a corrections policy and evidence of it being used
- NewsQ plans to offer a feed
- Issues with availability
- Some sites mention their corrections policy non-obviously, in a single paragraph on a general ethics/journalism-policies page or a public-engagement page. This may lead observers to mistakenly conclude there is no corrections policy.
- It may be hard to find evidence of corrections being made, even when they are, especially if the site does not offer a corrections "feed" page
- Expected to become cheap because
- Machines can probably be programmed or trained to observe this signal most of the time (a rough heuristic sketch appears at the end of this section)
- Sites may start to announce this information in a machine-readable form (which raises some risk of gaming)
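As a rough illustration of what machine observation might look like, here is a minimal Python sketch that scans a page's links for anchor text mentioning corrections. It is a heuristic only: it will miss policies without distinctive link text, and the page fetched here (the example site's front page) is just an assumption about where such a link would appear.

```python
# A rough heuristic sketch, not a reliable detector: scan a page's links
# for anchor text containing "correction". It will miss the harder cases
# noted above, such as a policy described mid-paragraph on a general
# ethics page with no distinctive link text.
from html.parser import HTMLParser
import urllib.request

class CorrectionsLinkFinder(HTMLParser):
    """Collects (anchor text, href) pairs whose text mentions corrections."""

    def __init__(self):
        super().__init__()
        self._href = None
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and "correction" in data.lower():
            self.hits.append((data.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

with urllib.request.urlopen("https://www.thestar.com/") as resp:
    html = resp.read().decode("utf-8", errors="replace")

finder = CorrectionsLinkFinder()
finder.feed(html)
print(finder.hits)  # candidate corrections-policy links, if any
```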
§ 2.3. Any Award (pending)
Assessed by CredWeb CG to be: Interoperable, Reasonably verifiable.
| Definition | The website with main page URL [ ] was honored as part of an awards process for the year [ ] for the prize with main page URL [ ] |
|---|---|
| Review Date | 2020-02-19 |
§ 2.3.1. Examples
§ 2.3.1.1. Example
This shows a Pulitzer Prize signal being conveyed as the more general "Any Award" signal, attached to the Pulitzer website.
Statement: The website with main page URL https://www.propublica.org/ was honored as part of an awards process for the year 2017 for the prize with main page URL https://www.pulitzer.org/

CSV:

```
site,awardsYear,prizeSite
https://www.propublica.org/,2017,https://www.pulitzer.org/
```

Turtle:

```
_:a cred:site <https://www.propublica.org/> .
_:a cred:awardsYear "2017"^^xs:gYear .
_:a cred:prizeSite <https://www.pulitzer.org/> .
```

JSON-LD:

```
{
  "@context": "https://jsonld.example",
  "@id": "_:a",
  "awardsYear": "2017",
  "prizeSite": "https://www.pulitzer.org/",
  "site": "https://www.propublica.org/"
}
```
§ 2.3.2. Motivation
- Why is this signal hard to game?
- The effort of establishing an award process and system takes time and resources
- Establishing a reputable award requires socialization, which is likely to involve a variety of trust evaluations and to uncover most attackers
- Connection to quality
- Outlets that have been awarded journalism awards, especially many over time, probably have internal quality controls and an organizational structure that supports good journalism.
- Potential side effects
- A bias towards older outlets that have had more time to accumulate awards
- Increased pressure to subvert an existing awards process.
- A surge in fake awards.
- A need for determining which prize-granting sites are trustworthy
- A bias in favor of outlets who campaign for awards or otherwise shift effort toward winning awards.
§ 2.3.3. Availability
- Methods for obtaining
- Determining authoritative websites or publications with awards lists, and then looking up the winners
- Taking data published for specific awards and converting it to this form (a sketch of this conversion appears at the end of this section)
- Issues with availability
- There are many awards across many languages and cultures, and no central location to review all of them
- Not all high-quality outlets win awards
- Currently requires humans because
- It requires judgement to know which awards are legitimate
- Most awards do not provide data feeds or have available scrapers
- Expected to become cheap because
- Machines can probably scrape the information if some way of centralizing awards listings becomes available
- Awards-granting organizations might start to publish details of their awards in a machine-readable format
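To illustrate the conversion step referenced above, here is a minimal Python sketch that maps winner records to this signal's CSV columns. The input shape is invented for this illustration; each award publishes winners in its own format, and a real converter would start from that format instead.

```python
# A minimal sketch of the conversion step. The input shape (a list of
# dicts with "outlet_site" and "year" keys) is invented for illustration.
import csv
import sys

PRIZE_SITE = "https://www.pulitzer.org/"  # main page URL of the award

winners = [
    {"outlet_site": "https://www.propublica.org/", "year": "2017"},
]

writer = csv.writer(sys.stdout)
writer.writerow(["site", "awardsYear", "prizeSite"])
for w in winners:
    writer.writerow([w["outlet_site"], w["year"], PRIZE_SITE])
```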
§ 2.4. Pulitzer Prize Recognition (pending)
Assessed by CredWeb CG to be: Interoperable, Reasonably verifiable.
| Definition | The website with main page URL [ ] was honored as part of the Pulitzer Prize awards process for the year [ ] in the Pulitzer category [ ] as the website of a prize finalist or winner under the name [ ] as shown at URL [ ] |
|---|---|
| Review Date | 2020-02-19 |
§ 2.4.1. Examples
§ 2.4.1.1. Example
This shows two organizations sharing the prize.
Statement: The website with main page URL https://www.nydailynews.com/ was honored as part of the Pulitzer Prize awards process for the year 2017 in the Pulitzer category Public Service as the website of a prize finalist or winner under the name New York Daily News, as shown at URL https://www.pulitzer.org/winners/new-york-daily-news-and-propublica. The website with main page URL https://www.propublica.org/ was honored as part of the Pulitzer Prize awards process for the year 2017 in the Pulitzer category Public Service as the website of a prize finalist or winner under the name ProPublica, as shown at URL https://www.pulitzer.org/winners/new-york-daily-news-and-propublica.

CSV:

```
site,awardsYear,pulitzerCategory,nameAccordingToPulitzer,evidence
https://www.nydailynews.com/,2017,Public Service,New York Daily News,https://www.pulitzer.org/winners/new-york-daily-news-and-propublica
https://www.propublica.org/,2017,Public Service,ProPublica,https://www.pulitzer.org/winners/new-york-daily-news-and-propublica
```

Turtle:

```
_:a cred:site <https://www.nydailynews.com/> .
_:a cred:awardsYear "2017"^^xs:gYear .
_:a cred:pulitzerCategory "Public Service" .
_:a cred:nameAccordingToPulitzer "New York Daily News" .
_:a cred:evidence <https://www.pulitzer.org/winners/new-york-daily-news-and-propublica> .
_:b cred:site <https://www.propublica.org/> .
_:b cred:awardsYear "2017"^^xs:gYear .
_:b cred:pulitzerCategory "Public Service" .
_:b cred:nameAccordingToPulitzer "ProPublica" .
_:b cred:evidence <https://www.pulitzer.org/winners/new-york-daily-news-and-propublica> .
```

JSON-LD:

```
{
  "@context": "https://jsonld.example",
  "@graph": [
    {
      "@id": "_:a",
      "awardsYear": "2017",
      "evidence": "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica",
      "http://www.w3.org/ns/credweb#nameAccordingToPulitzer": "New York Daily News",
      "http://www.w3.org/ns/credweb#pulitzerCategory": "Public Service",
      "site": "https://www.nydailynews.com/"
    },
    {
      "@id": "_:b",
      "awardsYear": "2017",
      "evidence": "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica",
      "http://www.w3.org/ns/credweb#nameAccordingToPulitzer": "ProPublica",
      "http://www.w3.org/ns/credweb#pulitzerCategory": "Public Service",
      "site": "https://www.propublica.org/"
    }
  ]
}
```
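Since this signal is assessed as reasonably verifiable, here is a minimal Python sketch of one check a consumer might run against the example above: fetch the evidence URL and confirm the honored names appear on the page. A real checker would also confirm the year and category, and a page rendered with JavaScript may defeat a plain fetch.

```python
# A minimal verification sketch: fetch the evidence URL from the example
# above and confirm each honored name appears in the page text. This is
# illustrative only, not a robust verifier.
import urllib.request

EVIDENCE = "https://www.pulitzer.org/winners/new-york-daily-news-and-propublica"
NAMES = ["New York Daily News", "ProPublica"]

# Some sites reject requests without a browser-like User-Agent.
req = urllib.request.Request(EVIDENCE, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    page = resp.read().decode("utf-8", errors="replace")

for name in NAMES:
    print(f"{name!r} found on evidence page: {name in page}")
```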
§ 2.4.2. Motivation
- Why is this signal hard to game?
- Winning or being a finalist for a Pulitzer Prize is extremely difficult, even for experienced and well-funded professionals
- Connection to quality
- The prize process reflects a mature and well-resourced effort to recognize quality.
- Outlets that have been awarded journalism awards, especially many over time, probably have internal quality controls and an organizational structure that supports good journalism.
- Potential side effects
- A bias towards more traditional outlets
- A bias towards older outlets that have had more time to accumulate awards
- A bias matching whatever biases are brought in by the Pulitzer Prize process and individual prize jury members
§ 2.4.3. Availability
- Methods for obtaining
- Visit pulitzer.org and browse for relevant entries
- NewsQ plans to offer a feed
- Use pulitzer.org's underlying JSON query API
- Issues with availability
- It covers only a small part of the news ecosystem; many high-quality sources have never won a Pulitzer Prize
- Expected to become cheap because
- The data is already available in JSON