
[WIP] Episode identification & deduplication

Decentralisation is a key aspect of the Open Podcast API specification. This means that clients need to be able to independently retrieve data from RSS feeds, which can cause duplication if episodes aren’t matched correctly. In a decentralised model, entities also don’t necessarily trust each other: a misbehaving entity could (accidentally) cause all episodes to be merged into one.

To enable a reliable and decentralised approach without duplicate episodes, three principles are adopted:

  • extensive provision of identifiers for episode matching - see the previous page Identifier fields
  • relying on clients to do episode matching - described below
  • using sync_ids to exchange episode data after the first exchange about an episode

The waterfall process to match episodes is run:

  • at the client side when refreshing a feed
  • at the server side when receiving a ‘new episode’ instruction from a client
  • at the client side when it joins an existing family (the ‘pull first, post later’ principle)

Clients and servers might also skip deduplication if a new episode is provided by a known and trusted entity (e.g. if server and client are from the same developer and have adopted shared logic).

Matching must happen only within a single feed, not across feeds. This is because the same episode_guid could (accidentally) be used in different feeds. There is one exception, however: for a remote item, the episode_guid is expected to be the same as in the RSS feed identified by the feedUrl or feedGuid attributes of the podcast:remoteItem element.

  • If a client or server supports the remoteItem tag, it is expected to treat the two episodes as a ‘duplicate’. (See also note on Tombstoning in the Overview page.)
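The per-feed scoping above can be sketched as a small helper that decides which feed an item’s episode_guid may be matched against. This is a minimal illustration, not part of the specification; the field names (`feed_url`, `remote_item`) are hypothetical.

```python
def matching_scope(item: dict) -> str:
    """Return the feed within which episode_guid matching is allowed.

    Normally this is the item's own feed. For a podcast:remoteItem,
    the episode belongs to the referenced feed, identified by the
    feedUrl or feedGuid attribute, so matching is scoped to that
    feed instead. Field names are illustrative, not normative.
    """
    remote = item.get("remote_item")
    if remote:
        # The remoteItem points at another feed; match within that feed.
        return remote.get("feedUrl") or remote.get("feedGuid")
    return item["feed_url"]
```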

The waterfall checks fields in this order:

  1. episode_guid
  2. enclosure_url
  3. Matching of at least 2 out of 3 relevant fields:
    1. publish_date
    2. episode_url
    3. title

In the waterfall process, the following principles apply:

  • Field data must be exact matches in order to proceed through the waterfall.
  • For each step:
    • If there is no match, consider all episodes again in the next step.
    • If there are one or multiple exact matches, consider (only) those matched episodes in the next step.
  • When the end of the waterfall is reached and
    • no match is found, then the episode is considered unique
    • one or more matches are found, then the matched episode(s) are considered duplicates and must be deduplicated (see deduplication endpoint)
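The waterfall described above can be sketched as follows. This is a literal, non-normative reading of the steps: the candidate pool narrows on a match, resets to all episodes on no match, and the outcome of the final step decides uniqueness. Field names follow the Identifier fields page, but the function itself is illustrative.

```python
def match_episode(new: dict, candidates: list) -> list:
    """Run the waterfall match for `new` against `candidates`, which
    must all come from the same feed. Returns the matched duplicates;
    an empty list means the episode is considered unique."""
    pool = candidates

    # Step 1: exact episode_guid match.
    matched = [e for e in pool
               if new.get("episode_guid")
               and e.get("episode_guid") == new["episode_guid"]]
    if matched:
        pool = matched  # narrow the pool; otherwise keep all episodes

    # Step 2: exact enclosure_url match.
    matched = [e for e in pool
               if new.get("enclosure_url")
               and e.get("enclosure_url") == new["enclosure_url"]]
    if matched:
        pool = matched

    # Step 3: at least 2 of the 3 relevant fields must match exactly.
    def fields_matching(e: dict) -> int:
        return sum(1 for f in ("publish_date", "episode_url", "title")
                   if new.get(f) and e.get(f) == new[f])

    return [e for e in pool if fields_matching(e) >= 2]
```

For example, in the ‘changed episode_guid, fixed typo in title’ scenario described further below, step 1 finds no match, step 2 narrows the pool via the unchanged enclosure_url, and step 3 confirms the duplicate via publish_date and episode_url.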

[K-NL: the Capabilities response/endpoint should probably also inform the client of episode matching capabilities during regular processing, in addition to declaring a deduplication endpoint.]

TO BE DESCRIBED - documentation should go to dedicated endpoint page.

Deduplication MUST be a per-user action. [K-NL: Why does it? The notes don’t explain which scenarios would require this. Probably because of what I noted under sync_id on the Overview page?]

The database architecture of servers is of course up to implementers, but it is an important consideration. Single-user (‘small web’) servers might keep a simple structure with one row per episode, while big instances would probably leverage multiple tables to store episode data efficiently.
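As one possible illustration of the single-user layout, a ‘small web’ server could keep all matching fields in one table. The schema below is a non-normative sketch using SQLite; column names mirror the identifier fields, and the `sync_id` column is assumed to hold the server-assigned identifier.

```python
import sqlite3

# One row per episode, scoped by feed. Larger instances would likely
# normalise this into separate feed/episode/user-state tables.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE episodes (
        feed_url      TEXT NOT NULL,   -- matching is scoped per feed
        sync_id       TEXT,            -- assigned by the server
        episode_guid  TEXT,
        enclosure_url TEXT,
        publish_date  TEXT,
        episode_url   TEXT,
        title         TEXT,
        UNIQUE (feed_url, episode_guid)
    )
""")
```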

Given the decentralised nature, there are certain sync scenarios which could lead to the need to match and deduplicate episodes. The processes above normally ensure that the following scenarios are handled gracefully.

1 episode data gets out of sync and has to be rematched


A user has two phones, both of which independently pull the RSS feed. Phone A dies right after it refreshed the feed. Phone B pulls the RSS feed as well, informs the server of a new episode, and receives a sync_id for the episode. The podcast publisher doesn’t follow protocol and gives the episode a new episode_guid while fixing a typo in the title. Phone B picks up both changes and tells the server to update its data too. Phone A comes back online and syncs with the server, not (yet) aware of the changed episode_guid and title. It tells the server to create a new episode.

  • If the server supports deduplication, it can identify the new episode and tell Phone A that the episode already exists, providing the updated fields together with the sync_id.
  • If the server does not support deduplication, it will create a new episode in the database, which is also passed on to Phone B. Phone B recognises that it is a duplicate and makes a call to the ‘deduplication’ endpoint. When that’s done, the server informs Phone A of the new situation.

2 new client with existing data joins server


A user has a phone and a tablet which both refresh feeds locally. The phone is already syncing with a server, but the tablet is not. The user links the tablet with the server for the first time. The server sends all information it has to the tablet, which then does deduplication, and informs the server if it has any more recent information that should be applied.