Niwt - Nifty Integrated Web Tools
Niwt is a project begun by MicahCowan, one-time maintainer of GNU Wget.
Niwt is a tool for downloading resources from the web. It aims to (eventually) reproduce most of the functionality of GNU Wget, plus some additional features; in particular, it will support automatic connection restoration and recursive website fetching/mirroring. However, Niwt's design philosophy differs radically from Wget's: where Wget is comparatively monolithic, Niwt is built entirely around Unix pipelines, with facilities to easily swap out or extend every existing piece of functionality with an alternative (or additional) program that offers equivalent (or improved) functionality. It is meant to be "on-the-fly patchable", even by non-programmers. This is expected to involve a big trade-off: Niwt gives up the relative efficiency and low resource consumption that Wget enjoys, in exchange for extreme and relatively easy customization.
Wget is felt to be a very powerful and flexible tool for fetching files from the web, but it perhaps suffers from trying to do too many things at once. Rather than following the Unix development principle of "doing one thing and doing it well", it does several things, some of which might be said to be done in only a mediocre fashion. A primary goal of Niwt is to separate the various tasks performed by a tool like Wget into pipelines of distinct, user-replaceable programs, each of which takes responsibility for as little as possible. The overall result may not necessarily be viewed as "better" than Wget, which does its job quite well, and with much more efficiency than Niwt is ever likely to achieve. Niwt's design model is explicitly to go "to ridiculous extremes" in modularity, and both its strengths and weaknesses are primarily a result of that design philosophy.
Imagine being able to use grep to decide which links wget follows; or automatically extract tarballs via tar when they're downloaded (in a safe manner). Imagine being able to tell wget whether to follow a link based on which page it was found in. Or to use timestamps for one section of the website, but unconditionally pull the rest. Or follow links that were found in downloaded PDF files, not just HTML. Imagine being able to transform links that wget parses, before it follows them (perhaps to redirect to a mirror site?). These are the sorts of things that Niwt intends to make possible through its extensible design.
The Niwt project is (or will be) a weaving together of many of the ideas that MicahCowan either formed, was exposed to, or had discussions about during his time as GNU Wget's maintainer, as well as (of course) the existing strengths that Wget has to offer.
For more information, you can also hang out at the IRC channel #niwt @ irc.freenode.net, or check out the mailing list (http://addictivecode.org/mailman/listinfo/niwt-users/).
To download and install, see InstallingNiwt.
Niwt’s source code is free and open source software, and is available under the MIT (simple BSD-style) license. Unlike Wget, Niwt is not affiliated with the GNU Project.
Project Goals & Motivations
Some examples of facilities that Wget provides, but which could benefit from separation into distinct, user-replaceable programs with distinct responsibilities:
Parsing. Wget is capable of parsing HTML and CSS content; but it's not possible to specify that wget parse tags from emerging standards that it can't recognize, or that it use some heuristic to parse links from JavaScript content, etc. In addition, Wget can't parse links from content types other than HTML and CSS, such as XML, PDF, or text files. Allowing the user to specify a separate program for handling link-parsing could provide much more flexibility.
Accept/Reject rules. Wget provides some fairly fine-grained controls for deciding whether or not to follow a link. However, these facilities currently do not provide means to match against every portion of a URI; in particular, there is no way to conditionally reject links for downloading, based on a query-string portion (anything following a question-mark "?"). The job of accepting or rejecting links could easily be passed on to a grep-like tool; and in fact, such a tool could base its choices on more than just the link itself—it could select the link based on other information; perhaps something previously stored in a custom database.
Timestamping. Wget currently supports conditional downloading based on whether server content has changed relative to the last version previously downloaded. However, it doesn't save identifying information such as HTTP entity tags (ETags), which could be used to similar effect. In addition, a user could add capabilities to handle more exotic cases, such as noticing when two URIs have the same content length and MD5 checksum, and supplying a link between the files rather than re-downloading.
Debugging. Wget provides the --debug option, which gives very verbose traces of most of Wget's decision making. However, it often provides much more information than we're interested in, and yet sometimes omits information on the specific decision we care about. Since Niwt is just a pipeline composed of many separate programs, it's easy to grab just a specific portion of that pipeline, and see how it's affected by different inputs. This removes a lot of guesswork when we are having trouble understanding why Niwt isn't behaving as we expected it to.
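As a rough illustration of the accept/reject idea above, here is a minimal Python sketch of a grep-like link filter that rejects URLs based on their query string. It is not part of Niwt and does not use Niwt's actual interface; the function names and the rejected parameter names ("action", "sessionid") are hypothetical, chosen only for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical list of query parameters whose presence makes a link
# uninteresting (for example, wiki "edit" links or session-tagged URLs).
REJECTED_PARAMS = ("action", "sessionid")


def accept_link(url, rejected_params=REJECTED_PARAMS):
    """Return True if the link should be followed.

    A link is rejected when its query string (the part after "?")
    contains any of the rejected parameter names.
    """
    query = urlsplit(url).query
    params = parse_qs(query, keep_blank_values=True)
    return not any(name in params for name in rejected_params)


def filter_links(lines, rejected_params=REJECTED_PARAMS):
    """Yield accepted URLs from an iterable of lines, one URL per line."""
    for line in lines:
        url = line.strip()
        if url and accept_link(url, rejected_params):
            yield url
```

Fed one URL per line (for instance, by passing sys.stdin to filter_links), such a filter could sit between a link parser and a fetcher in the same way a grep invocation would.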
And here are some more exotic ideas about benefits that could be provided by such a model, that do not have any current equivalents in Wget:
Extensibility. Niwt is extremely easy to extend, as it provides a shell API for defining arbitrary new options and for modifying behavior (by modifying pipelines). Don't like the way Niwt handles certain things? You can replace pretty much any chunk of Niwt behavior with a scriptlet of your own to do things differently. Niwt can easily be coerced into doing pretty much anything.
Link transformation. Filters could be inserted to transform URIs: perhaps to recognize when the same resource is available via a closer mirror, to resolve URNs to URLs, or to expand user-customized shorthand forms of links into true URIs.
Content transformation. Filters could be added to the stream to perform transformations on content; for instance, arbitrary Content-Encoding values could be supported for automatic decoding (such as gzipped files).
Customized storage. Rather than saving files directly to disk, they could be stored directly into a database, or transformed on-the-fly for archival (zip? tar.gz? multipart/related (.mhtml)?).
Cookie imports. Users could customize cookie handling so that Niwt reads in cookie information from a browser profile.
Protocol translation. Users could add filters that transparently translate between HTTP and other protocols (such as FTP), extending Niwt's functionality to support anything desired.
Security management. Since separate responsibilities are partitioned into separate processes, the privileges of these separate processes may be fine-tuned to ensure that, if any of them turn out to be exploitable via malicious server responses, any risks can be severely limited.
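To make the link-transformation idea above more concrete, here is a minimal Python sketch of a filter that rewrites URLs to point at a mirror. It is only an illustration under stated assumptions, not Niwt code: the host names in MIRRORS and the function name to_mirror are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from an origin host to a closer mirror.
MIRRORS = {"ftp.example.org": "mirror.example.net"}


def to_mirror(url, mirrors=MIRRORS):
    """Return the URL rewritten to use a mirror host, if one is known.

    The scheme, port, path, query and fragment are preserved. Any
    user:password part of the URL is dropped in this simple sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host in mirrors:
        netloc = mirrors[host]
        if parts.port is not None:
            netloc = f"{netloc}:{parts.port}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)
```

URLs whose host has no entry in the mapping pass through unchanged, so a filter like this could be inserted into a pipeline without affecting other links.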
Of course, not everything is a bowl of cherries: there are marked trade-offs—especially in the area of performance.
Portability. Wget runs quite well on Windows, without the need for any artificial "Unix-like" environment such as Cygwin. Niwt's heavy reliance on shell pipelines, with lots of process forking, pretty much assures that Niwt will always remain Unix-specific; though it could still be run under an appropriate environment such as Cygwin or MSYS, or Microsoft's own Unix compatibility suite.
Resource consumption. Niwt will consume far greater resources than wget, which is relatively lean for what it does.
Data redundancy. Data will be repeatedly copied around between processes (and the kernel buffers), leading to another performance impact.
Context switches (time inefficiency). In order to split HTTP headers from payload, certain programs must read only a byte or two at a time until they reach the end of the headers, which results in a lot of expensive process context-switches. (In practice, though, headers make up a relatively small portion of the total traffic, and any losses in efficiency are expected to be of no real consequence in general, since the network transmissions themselves will tend to take much longer than any data processing.)
Fragility. Inevitably, more moving parts means more potential for breaking down. And of course, giving users more control over Niwt behavior means giving them more potential to break it. Since all these processes communicate via an HTTP-based protocol, if any one of these programs writes screwy data to the data stream, it'll break Niwt.
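The byte-at-a-time header reading mentioned under context switches above can be sketched as follows. This is a minimal Python illustration, not Niwt's implementation, and the function name read_header_block is hypothetical. The presumed reason for reading so little at a time is that anything read past the blank line would no longer be available to whichever program reads the payload next.

```python
import io


def read_header_block(stream):
    """Read an HTTP header block from a binary stream, one byte at a time.

    Stops right after the blank line that ends the headers, so the
    stream is left positioned at the first byte of the payload.
    Returns the header bytes, including the terminating blank line.
    """
    header = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            break  # end of input before the headers were complete
        header += byte
        # Accept CRLF line endings, and bare LF as a lenient fallback.
        if header.endswith(b"\r\n\r\n") or header.endswith(b"\n\n"):
            break
    return bytes(header)
```

For example, given io.BytesIO(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"), the function returns everything up to and including the blank line, and a subsequent read on the same stream yields only the payload.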
For these reasons, Niwt will clearly not serve the needs of every user that currently finds Wget useful, and Wget will continue to be a vital tool in many users' toolbelts. In fact, while it is hoped that Niwt will be of interest to a large number of users, it is certain that Wget will continue to meet needs that Niwt never can. Niwt makes some fairly extreme trades, primarily of efficiency and resource consumption, in return for great flexibility and customization.
See TryingOutNiwt for an overview of how Niwt is used, how it's designed, and what it can currently be made to do.
