Documentation
Introduction
At the time of writing, script4rss is somewhat useful but quite primitive. The creation of a filter will be described here as a small tutorial. The documentation applies to the 0.4x series.
Script4rss takes a description file as input. This file describes the name, link, etc. of the feed, the rules to convert the HTML, and the author, license, etc. of the generated script. As of version 0.4a, script4rss only partially checks the validity of the input file.
The file has a very simple syntax, consisting of a variable name, a colon, and the value.
- Strings and numbers are given literally.
- Booleans can be "true", "false", "yes", "no", "1", or "0" (without the quotes).
- Regular expressions should be written as they are used in Perl, i.e. between forward slashes; modifiers can be added after the closing slash.
- Comments start with a hash mark (#).
- Some values can extend over multiple lines. This is done by repeating the variable name and colon on each of these lines.
- Hash marks and backslashes need to be escaped with a backslash.
- Additional searches for keywords may be introduced. These can be used to keep track of things like the category, base URI, etc.
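The syntax rules above can be sketched in a few lines of Python. This is only an illustration, not the parser script4rss itself uses: the function names are made up for this example, and it assumes that a backslash only acts as an escape when it precedes a hash mark or another backslash, and that repeated names are joined with a newline.

```python
def strip_comment(line):
    """Cut the line at the first unescaped '#' and resolve \\# and \\\\ escapes.

    Assumption: any other backslash (e.g. inside a regex) is kept as-is.
    """
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "#\\":
            out.append(line[i + 1])  # escaped hash or backslash: keep it literally
            i += 2
            continue
        if ch == "#":
            break  # an unescaped hash starts a comment
        out.append(ch)
        i += 1
    return "".join(out)


def parse_description(text):
    """Return {name: value}; a name repeated on several lines gives a multi-line value."""
    values = {}
    for raw in text.splitlines():
        line = strip_comment(raw).strip()
        if not line or ":" not in line:
            continue  # blank line, comment-only line, or a [category] line
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not value:
            continue  # empty values are ignored
        if name in values:
            values[name] += "\n" + value  # assumption: lines are joined by a newline
        else:
            values[name] = value
    return values


def parse_boolean(value):
    """Map the six accepted spellings to True or False."""
    if value in ("true", "yes", "1"):
        return True
    if value in ("false", "no", "0"):
        return False
    raise ValueError("not a boolean: %r" % value)
```

Category names in square brackets are simply skipped here; a fuller sketch would start a new group of rules at each of them.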
Name | Type | Optional/Required |
# Information about the script | ||
script_name: | STRING | (required) |
script_author: | STRING | (required) |
license: | STRINGS | (required) |
comment: | STRINGS | (optional) |
# Information about the feed | ||
feed_title: | STRING | (required) |
feed_uri: | STRING | (required) |
feed_description: | STRING | (optional) |
feed_image_uri: | STRING | (optional) |
feed_interval: | TIME_STRING | (optional) |
# Modify the behavior of the script | ||
fix_html: | BOOLEAN | (optional) |
# Optional user-defined keywords | ||
keyword_search: | REGEX | (optional) |
keyword_match_start: | REGEX | (optional) |
keyword_match_end: | REGEX | (optional) |
keyword_match: | REGEX | (optional) |
keyword: | NUMBER or STRING WITH BACKREFS | (optional) |
keyword_prefix: | STRING | (optional) |
keyword_postfix: | STRING | (optional) |
# The actual search and match patterns | ||
# There may be multiple categories, each should start with a bracketed name | ||
[STRING] | | (required) |
search: | REGEX | (required) |
match_start: | REGEX | (optional) |
match_end: | REGEX | (optional) |
match: | REGEX | (required) |
title: | NUMBER or STRING WITH BACKREFS | (required) |
title_prefix: | STRING | (optional) |
title_postfix: | STRING | (optional) |
link: | NUMBER or STRING WITH BACKREFS | (required) |
link_prefix: | STRING | (optional) |
link_postfix: | STRING | (optional) |
description: | NUMBER or STRING WITH BACKREFS | (optional) |
description_prefix: | STRING | (optional) |
description_postfix: | STRING | (optional) |
An example
As an example, we will use the PLoS Biology magazine. This is an open-access scientific magazine for biologists, but the example should be instructive for creating filters for other sites. So let's start. We get our content from the page with the current issue, whose address, luckily for us, is always the same: http://www.plosbiology.org/plosonline/?request=get-issue. First create a plain-text description file. It's best to give your description files the extension .s4r, but at the moment script4rss doesn't care what you call them. We call our file plosbio2rss.s4r.
General information about the script
First, the information about the generated script is declared, like author, license, etc. A comment at the start of the script is generated from this information.
# About the script | ||
script_name | : | plosbiology2rss |
script_author | : | Pieter Edelman |
license | : | Released under the terms of the GNU General Public License (GPL) Version 2. |
license | : | See http://www.gnu.org/ for details. |
comment | : | PLoS Biology is an open-access scientific biology magazine |
comment | : | This script attempts to convert the page linking to the current issue to an RSS feed |
General information about the feed
Then, general information about the feed is declared.
# Information concerning the feed | ||
feed_title | : | PLoS Biology |
feed_uri | : | http://www.plosbiology.org/plosonline/?request=get-issue |
feed_description | : | Public Library of Sciences: Biology |
feed_image_uri | : | # There is a logo on the page, but it's not really suited for an RSS feed |
feed_interval | : | 7d |
If you know RSS, you know which elements are optional and which are required (script4rss tries to check this, but I can't guarantee it works). As you can see, script4rss ignores empty or commented values.
The update interval (feed_interval) can be given in the format XXwXXdXXhXXm to specify the number of weeks, days, hours, and minutes respectively. Other combinations are also possible, for example 1h40m if you want the feed to be updated every hour and forty minutes. It should be noted, however, that Liferea accepts at most 10,000 minutes as update interval, so 1 week is capped to 6 days, 22 hours, and 40 minutes :-P
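To make the arithmetic concrete, here is a small Python sketch that converts an interval string to minutes and applies the 10,000-minute cap. It is not taken from script4rss; the function name is made up, and it assumes the units always appear in the order weeks, days, hours, minutes, with any of them left out.

```python
import re

LIFEREA_MAX_MINUTES = 10000  # the cap mentioned in the text

# Each unit is optional, but the order w, d, h, m is assumed to be fixed.
_INTERVAL = re.compile(r"^(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?$")


def interval_to_minutes(spec, cap=LIFEREA_MAX_MINUTES):
    """Convert e.g. '1h40m' or '7d' to minutes, never returning more than cap."""
    m = _INTERVAL.match(spec.strip())
    if not m or not any(m.groups()):
        raise ValueError("not a valid interval: %r" % spec)
    weeks, days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    total = ((weeks * 7 + days) * 24 + hours) * 60 + minutes
    return min(total, cap)
```

With this sketch, 1h40m gives 100 minutes, while 7d (10,080 minutes) is cut back to 10,000 minutes, which is the 6 days, 22 hours, and 40 minutes mentioned above.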
Influence of the behaviour of the script
Unfortunately, if you take a close look at the HTML code, there are some unclosed tags. We can't rely on these as markers, as they may be fixed in the future (or indeed, in the same issue). That's why we want to filter everything through a cleanup function.
# The HTML of this site is not very good and requires fixing :( | ||
fix_html | : | true |
Matching news items
Now we get to the real stuff. You can specify multiple categories based on different HTML uses. Unfortunately, script4rss is not smart enough (yet) to figure out from the document which category an item belongs to, as would be needed here (Essay, Features, etc.). You do have to give every category a name between square brackets, however, even if there's only one.
# Matches: | ||
[Everything] | ||
search | : | /<a href=\".*?\" class=\"smallNav\"><img src=\"\/images\/icons\/.*?\" .*?><\/a><b>.*<\/b><br>/i |
match_end | : | /<\/div>/ |
match | : | /<a href=\"(.*?)\" class=\"smallNav\"><img src=\"\/images\/icons\/(.*?)" alt=".*?><\/a><b>(.*?)<\/b><br>(.*?)<br>(.*)<br><div class=\"smallNav\">/i |
title_prefix | : | Article |
title | : | 3 |
link | : | 1 |
description | : | <img src="http://www.plosbiology.org/images/icons/\2" />By \4<br />\5 |
As you can see, there are several regexes above. The first one is "search". This is a regex which is typical for the beginning of a news story. It does not need to match the actual beginning of the story; it can also be something that merely announces it (unfortunately, script4rss cannot read back lines yet). Script4rss reads everything which should be used in an item into a single line, which ends whenever "match_end" is found; in this case, a closing </div>. If the news item should begin later, "match_start" should be specified. Then the final match: this is a regular expression with a number of match groups (5 in this case). These saved groups are used for the construction of the final item, which can consist of a title, link, and description. Each of these can be specified by using the number of the match group and an optional "_prefix" and "_postfix" (as for the title), or by using backreferences in a string directly (as for the description). A combination is also possible.
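The flow described above (search, collect until match_end, match, then build the fields) can be imitated in Python as follows. This is a sketch under assumptions, not the code script4rss generates: the patterns and the HTML are simplified stand-ins for the PLoS example, the function names are made up, and prefix and postfix are assumed to be plainly concatenated.

```python
import re

# Simplified, hypothetical patterns in the spirit of the example above.
SEARCH = re.compile(r'<a href=".*?" class="smallNav">', re.I)
MATCH_END = re.compile(r'</div>')
MATCH = re.compile(r'<a href="(.*?)" class="smallNav"><b>(.*?)</b><br>(.*?)<br>', re.I)


def collect_items(lines):
    """Join the lines from a 'search' hit up to 'match_end' into one string per item."""
    items, current = [], None
    for line in lines:
        if current is None and SEARCH.search(line):
            current = []  # a new item starts on this line
        if current is not None:
            current.append(line.strip())
            if MATCH_END.search(line):
                items.append("".join(current))
                current = None
    return items


def build_field(template, match, prefix="", postfix=""):
    """A bare number selects a match group; otherwise \\N backreferences are expanded."""
    if template.isdigit():
        body = match.group(int(template))
    else:
        body = re.sub(r"\\(\d+)", lambda m: match.group(int(m.group(1))), template)
    return prefix + body + postfix


# Made-up HTML, spread over several lines like a real page might be.
html = [
    '<p>Some navigation</p>',
    '<a href="/article/1" class="smallNav"><b>Sample Title</b><br>',
    'A. Author<br>',
    '</div>',
]

for item in collect_items(html):
    m = MATCH.search(item)
    if m:
        print(build_field("2", m, prefix="Article: "))  # title from group 2 plus a prefix
        print(build_field("1", m))                      # link from group 1
        print(build_field(r"By \3", m))                 # description with a backreference
```

Note that the group numbers here belong to the simplified pattern, not to the five-group pattern of the real description file.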
You can get everything from SourceForge: http://sourceforge.net/projects/script4rss/
Pieter Edelman