
irc-rss-feed-bot 1.0.0



irc-rss-feed-bot is a dockerized Python 3.11 bot that posts RSS/Atom feeds, as well as feeds scraped from HTML/JSON/CSV content, to IRC. It posts the entries of feeds in IRC channels, one entry per message. More specifically, it posts the title and shortened URL of each entry.



  • Multiple channels on an IRC server are supported, with each channel having its own set of feeds. For use with multiple servers, a separate instance of the bot process can be run for each server.
  • Entries are posted only if the channel has not had any conversation for a certain minimum amount of time, thereby avoiding the interruption of any preexisting conversations. This amount of time is 15 minutes for any feed which has a polling period greater than 12 minutes. There is however no delay for any feed which has a polling period less than or equal to 12 minutes as such a feed is considered urgent.
  • A SQLite database file records hashes of the entries that have been posted, thereby preventing them from being reposted.
  • Posted URLs are shortened using the da.gd service.
  • The hext, jmespath, and pandas DSLs are supported for flexibly parsing arbitrary HTML, JSON, and CSV content respectively. These parsers also support configurable recursive crawling.
  • Entry titles are formatted for neatness. Any HTML tags and excessive whitespace are stripped, all-caps are replaced, and excessively long titles are sanely truncated.
  • A TTL and ETag based compressed disk cache of URL content is used for preventing unnecessary URL reads. Any websites with a mismatched strong ETag are probabilistically detected, and this caching is then disabled for them for the duration of the process. Note that this detection is skipped for a weak ETag.
  • Encoded Google News and FeedBurner URLs are decoded.
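
The quiet-time rule in the list above can be illustrated with a minimal sketch. The function and constant names here are hypothetical and are not taken from the bot's source; only the 15 minute and 12 minute thresholds come from the description above.

```python
# Hypothetical names; only the thresholds are taken from the feature list.
URGENT_MAX_PERIOD_HOURS = 0.2  # 12 minutes
QUIET_SECONDS = 15 * 60        # 15 minutes


def required_quiet_seconds(period_hours: float) -> int:
    """Return how long a channel must have been quiet before posting."""
    # A feed polled every 12 minutes or less is considered urgent.
    if period_hours <= URGENT_MAX_PERIOD_HOURS:
        return 0
    return QUIET_SECONDS


def can_post(period_hours: float, seconds_since_last_message: float) -> bool:
    """Return True if an entry of the feed may be posted now."""
    return seconds_since_last_message >= required_quiet_seconds(period_hours)
```

For example, a feed with a period of 1 hour would wait for 15 minutes of silence, whereas a feed with a period of 0.2 hours would post immediately.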

For several more features, see the customizable global and feed-specific settings, and commands.


  • Repo: https://github.com/impredicative/irc-rss-feed-bot
  • Changelog: https://github.com/impredicative/irc-rss-feed-bot/releases
  • Image: https://hub.docker.com/r/ascensive/irc-rss-feed-bot


<FeedBot> [ArXiv:cs.AI] Concurrent Meta Reinforcement Learning → https://arxiv.org/abs/1903.02710v1
<FeedBot> [ArXiv:cs.AI] Attack Graph Obfuscation → https://arxiv.org/abs/1903.02601v1
<FeedBot> [InfoWorld] What is a devops engineer? And how do you become one? → https://da.gd/dvXh9
<FeedBot> [InfoWorld] What is Jupyter Notebook? Data analysis made easier → https://da.gd/yrCi
<FeedBot> [AWS:OpenData] COVID-19 Open Research Dataset (CORD-19): Full-text and metadata dataset of
            COVID-19 research articles. → https://registry.opendata.aws/cord-19



For software development purposes only, the project can be set up on Ubuntu as below.

make setup-ppa
make install-py
make setup-venv
make shell
make install
make test
make build


Configuration: secret

Prepare a private secrets.env environment file using the sample below.



The GITHUB_TOKEN variable is optional. Refer to the publish.github feature.

Configuration: non-secret

Prepare a version-controlled config.yaml file using the sample below. A full-fledged real-world example is also available.

host: irc.libera.chat
ssl_port: 6697
#ssl_verify: true
nick: MyFeedBot
admin: mynick!myident@myhost
alerts_channel: '#mybot-alerts'
#mirror: '#mybot-mirror'
#  github: MyGithubServiceAccountUsername/IrcServerName-MyBotName-live
#  new: all
      url: https://github.com/impredicative/irc-rss-feed-bot/releases.atom
      period: 12
      shorten: false
      url: https://registry.opendata.aws/rss.xml
        summary: true
      url: https://tools.cdc.gov/api/v2/resources/media/316422.rss
      redirect: true
      url: https://academic.oup.com/rss/site_6122/3981.xml
      mirror: false
      period: 12
          - ^Calendar\ of\ Events$
      url: https://www.ncbi.nlm.nih.gov/research/coronavirus-api/export
      pandas: |-
        read_csv(file, comment="#", sep="\t") \
        .assign(link=lambda r: "https://pubmed.ncbi.nlm.nih.gov/" + r["pmid"].astype("str")) \
      url: https://medicalxpress.com/rss-feed/search/?search=nutrition
      url: https://www.reddit.com/r/FoodNerds/new/.rss
      shorten: false
          pattern: ^https://www\.reddit\.com/r/.+?/comments/(?P<id>.+?)/.+$
          repl: https://redd.it/\g<id>
    ArXiv:cs.AI: &ArXiv
      url: http://export.arxiv.org/rss/cs.AI
      period: 1.5
      https: true
      shorten: false
      group: ArXiv:cs
        empty: false
          title: '^(?P<name>.+?)\.?\ \(arXiv:.+(?P<ver>v\d+)\ '
          title: '{name}'
          url: '{url}{ver}'
      <<: *ArXiv
      url: http://export.arxiv.org/rss/cs.NE
      <<: *ArXiv
      url: http://export.arxiv.org/rss/stat.ML
      group: null
      url: https://status.aws.amazon.com/rss/all.rss
      period: .2
      https: true
      new: none
          pattern: ^(?:Informational\ message|Service\ is\ operating\ normally):\ \[RESOLVED\]
          repl: '[RESOLVED]'
          id: /\#(?P<service>[^_]+)
          title: '[{service}] {title} | {summary}'
          url: '{id}'
      url: https://research.fb.com/publications/
      hext: |-
            <a href:link><h3 @text:title/></a>
            <div class="areas-wrapper"><a href @text:category/></div>
        <div><form class="download-form" action/></div>
          - ^(?:Facebook\ AI\ Research|Machine\ Learning|Natural\ Language\ Processing\ \&\ Speech)$
      url: https://www.infoworld.com/index.rss
      order: reverse
    j:MDPI:N:  # https://www.mdpi.com/journal/nutrients (open access)
      url: https://www.mdpi.com/rss/journal/nutrients
      www: false
      url: https://us-east1-ml-feeds.cloudfunctions.net/kdnuggets
      new: some
      url: https://libraries.io/pypi/scikit-learn/versions.atom
      new: none
      period: 8
      shorten: false
        - https://connect.medrxiv.org/medrxiv_xml.php?subject=Health_Informatics
        - https://connect.medrxiv.org/medrxiv_xml.php?subject=Nutrition
        read: false
      https: true
      url: https://www.reddit.com/r/MachineLearning/hot/.json?limit=50
      jmespath: 'data.children[*].data | [?score >= `100`].{title: title, link: join(``, [`https://redd.it/`, id])}'
      shorten: false
      url: https://www.reddit.com/r/wallstreetbets/hot/.json?limit=98
      jmespath: 'data.children[*].data | [?(not_null(link_flair_text) && score >= `50`)].{title: join(``, [`[`, link_flair_text, `] `, title]), link: join(``, [`https://redd.it/`, id]), category: link_flair_text}'
      emoji: false
      shorten: false
          - ^(?:Daily\ Discussion|Gain|Loss|Meme|Weekend\ Discussion|YOLO)$
      url: https://us-east1-ml-feeds.cloudfunctions.net/pwc/latest
      period: 0.5
      dedup: channel
      url: https://us-east1-ml-feeds.cloudfunctions.net/pwc/trending
      period: 0.5
      dedup: channel
      period: 0.2
          pattern: ^(?P<main_url>https://seekingalpha\.com/[a-z]+/[0-9]+).*$
          repl: \g<main_url>
      shorten: false
        "Daily calendar": \b(?i:economic\ calendar)\b
        "Daily prep": '^Wall\ Street\ Breakfast:\ '
        "Hourly status": ^On\ the\ hour$
        - https://seekingalpha.com/market_currents.xml
        - https://seekingalpha.com/feed.xml
        - https://seekingalpha.com/tag/etf-portfolio-strategy.xml
        - https://seekingalpha.com/tag/wall-st-breakfast.xml
      url: https://papers.ssrn.com/sol3/Jeljour_results.cfm?form_name=journalBrowse&journal_id=3526423&Network=no&lim=false&npage=1
        select: <a href:link href^="https://ssrn.com/abstract=" @text:title />
        follow: <a class="jeljour_pagination_number" @text:prepend("https://papers.ssrn.com/sol3/Jeljour_results.cfm?form_name=journalBrowse&journal_id=3526423&Network=no&lim=false&npage="):url/>
      period: 6
      url: https://www.talkrl.com/feed
      period: 8
        title: false
        summary: true
    YT:3Blue1Brown: &YT
      url: https://www.youtube.com/feeds/videos.xml?channel_id=UCYO_jab_esuFRV4b17AJtAw
      period: 12
      shorten: false
          bg: red
          fg: white
          bold: true
          pattern: ^https://www\.youtube\.com/watch\?v=(?P<id>.+?)$
          repl: https://youtu.be/\g<id>
      url: https://www.youtube.com/results?search_query=%22artificial+general+intelligence%22&sp=CAISBBABGAI%253D
      hext: <a href:filter("/watch\?v=(.+)"):prepend("https://youtu.be/"):link href^="/watch?v=" title:title/>
      period: 12
      shorten: false
        emptied: true
          - \bWikipedia\ audio\ article\b
      <<: *YT
      url: https://www.youtube.com/feeds/videos.xml?channel_id=UCSHZKyawb77ixDdsGog4iWA
          - \bAGI\b

Global settings

  • host: IRC server address.
  • ssl_port: IRC server SSL port.
  • ssl_verify: If false, the TLS/SSL certificate is not verified. Its default is true.
  • nick: This is a registered IRC nick. If the nick is in use, it will be regained. Ensure that the email verification of the registered nick, as applicable to many IRC servers, is complete. Without this email verification being completed, the bot can fail to receive the required event 900 and therefore fail to function.
  • admin: Administrative commands by this user pattern are accepted and executed. Its format is nick!ident@host. An example is JDoe11!sid654321@gateway/web/irccloud.com/x-*. A case-insensitive pattern match is tested for using fnmatch.
  • alerts_channel: Some but not all warning and error alerts are sent to this channel. Its default value is ##{nick}-alerts. The key {nick}, if present in the value, is formatted with the actual nick. For example, if the nick is MyFeedBot, alerts will by default be sent to ##MyFeedBot-alerts. Since a channel name starts with #, the name if provided must be quoted. It is recommended that the alerts channel be registered and monitored.
  • mode: This can for example be +igR for Libera and +igpR for Rizon.
  • mirror: If specified as a channel name, all posts across all channels are mirrored to this channel. This however doubles the time between consecutive posts in any given channel. Mirroring can be disabled individually for a feed by setting <feed>.mirror to false.
  • publish.github: This is the username and repo name of a GitHub repo, e.g. feedarchive/libera-feedbot-live. All posts are published to the repo, thereby providing a basic option to archive them. A new CSV file is written to the repo for each posted feed having one or more new posts. The following requirements apply:
    • The repo must exist; it is not created by the bot. It is recommended that an empty new repo is used. If the repo is of public interest, it can be requested to be moved into the feedarchive organization by filing an issue.
    • The GitHub user must have access to write to the repo. It is recommended that a dedicated new service account be used, not your primary user account.
    • A GitHub personal access token is required with access to the entire repo scope. The repo scope is used for making commits. The token is provisioned for the bot via the GITHUB_TOKEN secret environment variable.
  • log.irc: If true, low level IRC events are logged by miniirc. These are quite noisy. Its default is false.
  • once: If true, each feed is queued only once. It is for testing purposes. Its default is false.
  • tracemalloc: If true, memory allocation tracing is enabled. The top usage and positive-diff statistics are then logged hourly. It is for diagnostic purposes. Its default is false.
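
As an illustration of the admin and alerts_channel settings above, the following is a minimal sketch of a case-insensitive fnmatch test and of formatting the {nick} key. The helper names are hypothetical, and lowercasing both sides before matching is an assumption about how case-insensitivity could be achieved, not a description of the bot's actual code.

```python
from fnmatch import fnmatchcase

DEFAULT_ALERTS_CHANNEL = "##{nick}-alerts"  # Documented default value.


def is_admin(identity: str, admin_pattern: str) -> bool:
    """Case-insensitively match nick!ident@host against the admin pattern."""
    # Assumption: both sides are lowercased to make the match case-insensitive.
    return fnmatchcase(identity.lower(), admin_pattern.lower())


def alerts_channel(nick: str, configured: str = DEFAULT_ALERTS_CHANNEL) -> str:
    """Format the {nick} key, if present, with the actual nick."""
    return configured.format(nick=nick)
```

With the example pattern JDoe11!sid654321@gateway/web/irccloud.com/x-*, any identity that differs only in case or in the suffix after x- would match.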

Feed-specific settings

A feed is defined under a channel as in the sample configuration. The feed's key represents its name.

The order of execution of the interacting operations is: redirect, blacklist, whitelist, https, www, emoji, sub, format, shorten. Refer to the sample configuration for usage examples.
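
The ordering can be pictured with a minimal sketch covering only a subset of the operations (blacklist, whitelist, https, www, and sub on the URL). The function names and the shape of the config dictionary are hypothetical, and the use of re.sub for substitution is an assumption; the sketch only aims to show that filtering happens before the URL is rewritten.

```python
import re
from typing import Any, Iterator, Optional, Tuple


def leaf_patterns(node: Any) -> Iterator[str]:
    """Yield every leaf pattern of an arbitrarily nested dictionary or list."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_patterns(value)
    elif isinstance(node, list):
        for value in node:
            yield from leaf_patterns(value)
    elif node is not None:
        yield str(node)


def matches_any(patterns: Any, text: str) -> bool:
    """Return True if a search finds any of the patterns in the text."""
    return any(re.search(pattern, text) for pattern in leaf_patterns(patterns))


def process(title: str, url: str, config: dict) -> Optional[Tuple[str, str]]:
    """Return the (title, url) to post, or None if the entry is skipped."""
    blacklist = config.get("blacklist", {}).get("title")
    if blacklist and matches_any(blacklist, title):
        return None  # 1. blacklist
    whitelist = config.get("whitelist", {}).get("title")
    if whitelist and not matches_any(whitelist, title):
        return None  # 2. whitelist
    if config.get("https") and url.startswith("http://"):
        url = "https://" + url[len("http://"):]  # 3. https
    if config.get("www") is False:
        url = re.sub(r"^(https?://)www\.", r"\1", url)  # 4. www
    sub = config.get("sub", {}).get("url")
    if sub:
        url = re.sub(sub["pattern"], sub["repl"], url)  # 5. sub (assumed re.sub)
    return title, url
```

Because https and www run before sub, a sub.url.pattern in this sketch would need to match the already rewritten URL.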

YAML anchors and references can be used to reuse nodes. Examples of this are in the sample.

  • <feed>.url: This is either a single URL or a list of URLs of the feed. If a list, the URLs are read in sequence with an interval of one second between them.

These are optional and are independent of each other:

  • <feed>.alerts.empty: If true, an alert is sent if any source URL of the feed has no entries before their validation. If false, such an alert is not sent. Its default value is true.
  • <feed>.alerts.emptied: If true, an alert is sent if the feed has entries before but not after their validation. If false, such an alert is not sent. Its default value is false.
  • <feed>.alerts.read: If true, an alert is sent if an error occurs three or more consecutive times when reading or processing the feed, but no more than once every 15 minutes. If false, such an alert is not sent. Its default value is true.
  • <feed>.blacklist.category: This is an arbitrarily nested dictionary or list or their mix of regular expression patterns that result in an entry being skipped if a search finds any of the patterns in any of the categories of the entry. The nesting permits lists to be creatively reused between feeds via YAML anchors and references.
  • <feed>.blacklist.title: This is an arbitrarily nested dictionary or list or their mix of regular expression patterns that result in an entry being skipped if a search finds any of the patterns in the title. The nesting permits lists to be creatively reused between feeds via YAML anchors and references.
  • <feed>.blacklist.url: Similar to <feed>.blacklist.title.
  • <feed>.dedup: This indicates how to deduplicate posts for the feed, thereby preventing them from being reposted. The default value is feed (per-feed per-channel), and an alternate possible value is channel (per-channel).
  • <feed>.emoji: If false, emojis in entry titles are removed. Its default value is null.
  • <feed>.group: If a string, this delays the processing of a feed that has just been read until all other feeds having the same group are also read. This encourages multiple feeds having the same group to be posted in succession, except if interrupted by conversation. It is however possible that unrelated feeds of any channel get posted between ones having the same group. To explicitly specify the absence of a group when using a YAML reference, the value can be specified as null. It is recommended that feeds in the same group have the same period.
  • <feed>.https: If true, entry links that start with http:// are changed to start with https:// instead. Its default value is false.
  • <feed>.message.summary: If true, the entry summary (description) is included in its message. The entry title, if included, is then formatted bold. This is applied using IRC formatting if a style is defined for the feed, otherwise using unicode formatting. The default value is false.
  • <feed>.message.title: If false, the entry title is not included in its message. Its default value is true.
  • <feed>.mirror: If false, mirroring is disabled for this feed. Its default value is true, subject to the global-setting for mirroring.
  • <feed>.new: This indicates up to how many entries of a new feed to post. A new feed is defined as one with no prior posts in its channel. The default value is some which is interpreted as 3. The default is intended to limit flooding a channel when one or more new feeds are added. A string value of none is interpreted as 0 and will skip all entries for a new feed. A value of all will skip no entries for a new feed; it is not recommended and should be used sparingly if at all. In any case, future entries in the feed are not affected by this option on subsequent reads, and they are all forwarded without a limit.
  • <feed>.order: If reverse, the order of the entries is reversed.
  • <feed>.period: This indicates how frequently to read the feed, in hours, on average. Its default value is 1. Conservative polling is recommended. Any value below 0.2 is changed to a minimum of 0.2. Note that 0.2 hours is equal to 12 minutes. To make service restarts safer by preventing excessive reads, the first read is delayed by half the period. To better distribute the load of reading multiple feeds, a uniformly distributed random ±5% is applied to the period for each read.
  • <feed>.redirect: This indicates whether to substitute each entry URL with its redirect target. The default value is false.
  • <feed>.shorten: This indicates whether to post shortened URLs for the feed. The default value is true. The alternative value false is recommended if the URL is naturally small, or if sub or format can be used to make it small. If a "Blacklisted long URL" error is experienced for a reasonable website which should not be blacklisted, it can be reported here, using this issue as an example.
  • <feed>.style.name.bg: This is a string representing the name of a background color applied to the feed's name. It can be one of: white, black, blue, green, red, brown, purple, orange, yellow, lime, teal, aqua, royal, pink, grey, silver. The channel modes must allow formatting for this option to be effective.
  • <feed>.style.name.bold: If true, bold formatting is applied to the feed's name. Its default value is false. The channel modes must allow formatting for this option to be effective.
  • <feed>.style.name.fg: Foreground color similar to <feed>.style.name.bg.
  • <feed>.topic: This updates the channel topic with the short URL of a matching entry. It requires auto-op (+O) to allow the topic to be updated. The topic is divided into logical sections separated by | (<space><pipe><space>). For any matching entry, only its matching section in the topic is updated. Its value can be a dictionary in which each key is a section name and each value is a regular expression pattern. If a regular expression search matches an entry's title, the section in the topic is updated with the entry's short URL. The topic's length is not checked.
  • <feed>.whitelist.category: This is an arbitrarily nested dictionary or list or their mix of regular expression patterns that result in an entry being skipped unless a search finds any of the patterns in any of the categories of the entry. The nesting permits lists to be creatively reused between feeds via YAML anchors and references.
  • <feed>.whitelist.explain: This applies only to <feed>.whitelist.title. It can be useful for understanding which portion of a post's title matched the whitelist. If true, the first match of each posted title is italicized. This is applied using IRC formatting if a style is defined for the feed, otherwise using unicode formatting. For example, "This is a matching sample title". The default value is false.
  • <feed>.whitelist.title: This is an arbitrarily nested dictionary or list from which all leaf values are used. The leaf values are regular expression patterns. This results in an entry being skipped unless a search finds any of the patterns in the title. The nesting permits lists to be creatively reused between feeds via YAML anchors and references.
  • <feed>.whitelist.url: Similar to <feed>.whitelist.title.
  • <feed>.www: If false, entry links that contain the www. prefix are changed to remove this prefix. Its default value is null.
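
The arithmetic described for <feed>.period above can be sketched as follows. The function name is hypothetical, and applying the ±5% jitter before halving the delay of the first read is an assumption; the text above does not say how the two interact.

```python
import random
from typing import Optional

MIN_PERIOD_HOURS = 0.2  # 12 minutes, the documented minimum.


def read_delay_seconds(period_hours: float = 1.0, *, first_read: bool = False,
                       rng: Optional[random.Random] = None) -> float:
    """Return the number of seconds to wait before the next read of a feed."""
    rng = rng or random.Random()
    period = max(period_hours, MIN_PERIOD_HOURS)  # Values below 0.2 are raised.
    delay = period * 3600 * rng.uniform(0.95, 1.05)  # Uniform random ±5%.
    if first_read:
        delay /= 2  # Assumption: the jittered period is what gets halved.
    return delay
```

Under these assumptions, a configured period of 0.1 is treated as 0.2, giving a delay between roughly 684 and 756 seconds, or about half of that for the first read after a restart.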

For a non-XML feed, one of the following non-default parsers can be used. Multiple parsers cannot be used for a feed. The parsers are searched for in the alphabetical order listed below, and the first to be found is used. Each parsed entry must, at a minimum, return a title and a link. It can also return an optional summary (description) and zero or more values for category. The title can be a string or a list of strings.

  • <feed>.hext: This is a string representing the hext DSL for parsing a list of entry dictionaries from an HTML web page. Before using, it can be tested in the form here. Note that max_searches is set to 100_000 to protect against resource exhaustion.
  • <feed>.jmespath: This is a string representing the jmespath DSL for parsing a list of entry dictionaries from JSON. Before using, it can be tested in the form here.
  • <feed>.pandas: This is a string command evaluated using pandas for parsing a dataframe of entries. The raw content is made available to the parser as a file-like object named file. This parser uses eval which is unsafe, and so its use must be confirmed to be safe. The provisioned packages are json, numpy (as np), and pandas (as pd). The value requires compatibility with the versions of pandas and numpy defined in requirements.txt, noting that these version requirements are expected to be routinely updated.

For recursive crawling, the value of a parser can alternatively be:

  • <feed>.<parser>.select: This is the string which was hitherto documented as the value for <feed>.<parser>. The parser uses it to return the entries to post.
  • <feed>.<parser>.follow: This is an optional string which the parser uses to return zero or more additional URLs to read. The returned URLs can be a list of strings or a list of dictionaries with the key url. Crawling applies recursively to each returned URL. Each unique URL is read once. There is an interval of at least one second between the end of a read and the start of the next read. Care should nevertheless be taken to avoid crawling a large number of URLs.
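
The crawling behavior described by select and follow can be approximated with a minimal sketch. Here read, select, and follow are hypothetical callables standing in for the URL reader and the configured parser expressions, and the breadth-first order is an assumption; only the rules that each unique URL is read once and that reads are at least one second apart come from the text above.

```python
import time
from collections import deque
from typing import Any, Callable, Iterable, List, Optional


def crawl(start_url: str,
          read: Callable[[str], Any],
          select: Callable[[Any], Iterable[dict]],
          follow: Optional[Callable[[Any], Iterable[Any]]] = None,
          interval: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Collect entries from start_url and from every URL returned by follow."""
    seen = {start_url}
    queue = deque([start_url])
    entries: List[dict] = []
    is_first_read = True
    while queue:
        url = queue.popleft()
        if not is_first_read:
            sleep(interval)  # Wait between the end of a read and the next read.
        is_first_read = False
        content = read(url)
        entries.extend(select(content))
        if follow is None:
            continue
        for item in follow(content):
            # A returned URL can be a string or a dictionary with the key url.
            next_url = item["url"] if isinstance(item, dict) else item
            if next_url not in seen:  # Each unique URL is read once.
                seen.add(next_url)
                queue.append(next_url)
    return entries
```

The seen set is what keeps a paginated site, whose pages all link to each other, from being read repeatedly.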

Some sites require a custom user agent or other custom headers for successful scraping; such a customization can be requested by creating an issue.


The sample configuration above contains examples of these:

  • <feed>.format.re.title: This is a single regular expression pattern that is searched for in the title. It is used to collect named key-value pairs from the match if there is one.
  • <feed>.format.re.url: Similar to <feed>.format.re.title.
  • <feed>.format.str.title: The key-value pairs collected using <feed>.format.re.title and <feed>.format.re.url, both of which are optional, are combined along with the default additions of title, url, categories, and feed.url as keys. Any additional keys returned by the parser are also available. The key-value pairs are used to format the provided quoted title string. If the title formatting fails for any reason, a warning is logged, and the title remains unchanged. The default value is {title}.
  • <feed>.format.str.url: Similar to <feed>.format.str.title. The default value is {url}. If this is specified, it can sometimes be relevant to set shorten to false for the feed.
  • <feed>.sub.summary.pattern: This is a single regular expression pattern that if found results in the entry summary being substituted.
  • <feed>.sub.summary.repl: If <feed>.sub.summary.pattern is found, the entry summary is replaced with this replacement, otherwise it is forwarded unchanged.
  • <feed>.sub.title.pattern: Similar to <feed>.sub.summary.pattern.
  • <feed>.sub.title.repl: Similar to <feed>.sub.summary.repl.
  • <feed>.sub.url.pattern: Similar to <feed>.sub.summary.pattern. If a pattern is specified, it can sometimes be relevant to set shorten to false for the feed.
  • <feed>.sub.url.repl: Similar to <feed>.sub.summary.repl.
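
The interplay of format.re and format.str can be illustrated with a minimal sketch that uses the ArXiv patterns from the sample configuration. The function name, the config dictionary shape, and the example title are hypothetical; only the title and url keys of the documented defaults are included, and the fallback on failure follows the description of <feed>.format.str.title above.

```python
import logging
import re
from typing import Tuple

log = logging.getLogger(__name__)


def format_entry(title: str, url: str, config: dict) -> Tuple[str, str]:
    """Return the formatted (title, url) of an entry."""
    fmt = config.get("format", {})
    params = {"title": title, "url": url}  # Subset of the default keys.
    # Collect named groups from the optional title and url patterns.
    for key, text in (("title", title), ("url", url)):
        pattern = fmt.get("re", {}).get(key)
        match = re.search(pattern, text) if pattern else None
        if match:
            params.update(match.groupdict())
    result = {}
    for key, original in (("title", title), ("url", url)):
        template = fmt.get("str", {}).get(key, "{" + key + "}")
        try:
            result[key] = template.format_map(params)
        except (KeyError, IndexError, ValueError) as exc:
            log.warning("Unable to format %s %r: %r", key, original, exc)
            result[key] = original  # The value remains unchanged on failure.
    return result["title"], result["url"]


config = {"format": {
    "re": {"title": r"^(?P<name>.+?)\.?\ \(arXiv:.+(?P<ver>v\d+)\ "},
    "str": {"title": "{name}", "url": "{url}{ver}"},
}}
print(format_entry(
    "Concurrent Meta Reinforcement Learning. (arXiv:1903.02710v1 [cs.AI])",
    "https://arxiv.org/abs/1903.02710", config))
```

An entry whose title does not match the pattern has no name or ver key available, so in this sketch both its title and URL are left unchanged.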

Feed default settings

A global default value can optionally be set under defaults for some feed-specific settings, namely new and shorten. This value overrides its internal default. It facilitates not having to set the same value individually for many feeds.

Refer to "Feed-specific settings" for the possible values and internal defaults of these settings. Refer to the embedded sample configuration for a usage example.
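
The precedence between a feed-specific value, a global default, and the internal default can be sketched as below. The names are hypothetical; the internal defaults and the interpretation of none, some, and all are taken from the descriptions of <feed>.new and <feed>.shorten.

```python
from typing import Any, Optional

INTERNAL_DEFAULTS = {"new": "some", "shorten": True}
NEW_LIMITS = {"none": 0, "some": 3, "all": None}  # None means no limit.


def resolve_setting(name: str, feed_config: dict, global_defaults: dict) -> Any:
    """Return the feed's value, else the global default, else the internal one."""
    if name in feed_config:
        return feed_config[name]
    if name in global_defaults:
        return global_defaults[name]
    return INTERNAL_DEFAULTS[name]


def new_feed_limit(feed_config: dict, global_defaults: dict) -> Optional[int]:
    """Return how many entries of a new feed to post, or None for no limit."""
    return NEW_LIMITS[resolve_setting("new", feed_config, global_defaults)]
```

For example, with a global default of new: none, a feed that sets new: all would still post all entries, while a feed that sets nothing would post none.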


Commands can be sent to the bot either as a private message or as a directed public message. Private messages may however be prohibited for security purposes using the mode configuration. Public messages to the bot must be directed as MyBotNick: my_command.


Administrative commands are accepted from the configured admin. If admin is not configured, the commands are not processed. It is expected but not required that administrative commands to the bot will typically be sent in the alerts_channel. The supported commands are:

  • exit: Gracefully exit with code 0. The exit is delayed until any feeds that are currently being posted finish posting and being written to the database. If running the bot as a Docker Compose service, using this command with restart: on-failure will (due to code 0) prevent the bot from automatically restarting. Note that a repeated invocation of this command has no effect.
  • fail: Similar to exit but with code 1. If running the bot as a Docker Compose service, using this command with restart: on-failure will (due to a nonzero code) cause the bot to automatically be restarted.
  • quit: Alias of exit.


  • As a reminder, it is recommended that the alerts channel be registered and monitored.

  • It is recommended that the bot be auto-voiced (+V) in each channel. Failing this, messages from the bot risk being silently dropped by the server. This is despite the bot-enforced limit of two seconds per message across the server.

  • It is recommended that the bot be run as a Docker container using Docker ≥18.09.2, possibly with Docker Compose ≥1.24.0. To run the bot using Docker Compose, create or add to a version-controlled docker-compose.yml file such as:

version: '3.7'
services:
  irc-rss-feed-bot:
    container_name: irc-rss-feed-bot
    image: ascensive/irc-rss-feed-bot:<VERSION>
#    network_mode: host  # If having DNS name resolution issues.
    restart: on-failure
#    restart: always
    logging:
      options:
        max-size: 2m
        max-file: "5"
    volumes:
      - ./irc-rss-feed-bot:/config
    env_file:
      - ./irc-rss-feed-bot/secrets.env
    environment:
      TZ: America/New_York  # Select TZ database name from https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
  • In the above service definition in docker-compose.yml:

    • image: Use a specific versioned tag, e.g. 0.12.0.
    • volumes: Customize the relative path to the previously created config.yaml file, e.g. ./irc-rss-feed-bot. This volume source directory must be writable by the container using the UID defined in the Dockerfile; it is 999. A simple way to ensure it is writable is to run a command such as chmod -R a+w ./irc-rss-feed-bot once on the host.
    • env_file: Customize the relative path to secrets.env.
    • environment: Optionally customize the environment variable TZ to the preferred time zone as represented by a TZ database name. Note that the date and time are prefixed in each log message.
  • From the directory containing docker-compose.yml, run docker-compose up -d irc-rss-feed-bot. Use docker logs -f irc-rss-feed-bot to see and follow informational logs.



It is recommended that the supported administrative commands be used together with Docker Compose or a comparable container service manager to shut down or restart the service.


  • If config.yaml is updated, the container must be restarted to use the updated file.
  • If secrets.env or the service definition in docker-compose.yml is updated, the container must be recreated (and not merely restarted) to use the updated file.


  • A posts.v2.db database file is written by the bot in the same directory as config.yaml. This database file must be preserved with routine backups. After restoring a backup, before starting the container, ensure the database file is writable by running a command such as chmod a+w ./irc-rss-feed-bot/posts.v2.db.
  • The database file grows as new posts are made. For the most part this indefinite growth can be ignored. Currently, the standard approach for handling this, if necessary, is to stop the bot and delete the database file if it has grown unacceptably large. Restarting the bot after deleting the database will then create a new database file, and all configured feeds will be handled as new. This deletion is however discouraged as a routine measure.
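
To illustrate why the database grows with each post and why deleting it makes all feeds new again, here is a minimal sketch of hash-based deduplication using sqlite3. The table schema, the function names, and the choice of hashing the entry URL with SHA-256 are all assumptions made for illustration; the actual schema of posts.v2.db is not described here. The dedup argument mirrors the documented feed and channel values of <feed>.dedup.

```python
import hashlib
import sqlite3


def entry_hash(url: str) -> str:
    """Return a hash of the entry (assumption: the entry URL is hashed)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database, creating the hypothetical posts table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "channel TEXT NOT NULL, feed TEXT NOT NULL, hash TEXT NOT NULL, "
        "PRIMARY KEY (channel, feed, hash))")
    return conn


def is_posted(conn: sqlite3.Connection, channel: str, feed: str, url: str,
              dedup: str = "feed") -> bool:
    """Return True if the entry was already posted, per the dedup strategy."""
    digest = entry_hash(url)
    if dedup == "channel":  # Per-channel: any feed in the channel counts.
        query = "SELECT 1 FROM posts WHERE channel = ? AND hash = ? LIMIT 1"
        args = (channel, digest)
    else:  # Per-feed per-channel.
        query = ("SELECT 1 FROM posts "
                 "WHERE channel = ? AND feed = ? AND hash = ? LIMIT 1")
        args = (channel, feed, digest)
    return conn.execute(query, args).fetchone() is not None


def record_post(conn: sqlite3.Connection, channel: str, feed: str, url: str) -> None:
    """Record that the entry has been posted."""
    conn.execute("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)",
                 (channel, feed, entry_hash(url)))
    conn.commit()
```

In such a design one row is added per posted entry and none are removed, which is consistent with the indefinite growth noted above.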

Disk cache

  • An ephemeral directory /app/.ircrssfeedbot_cache is written by the bot in the container. It contains one or more independent disk caches. The size of each independent disk cache in this directory is limited to approximately 2 GiB. If needed, this directory can optionally be mounted as an external volume.
