[Rpm-metadata] Proposal: using a DBMS for package metadata
luca.barbieri at gmail.com
Sat Oct 2 14:06:25 EDT 2004
seth vidal wrote:
| What Luca described is close to something like rhn. You have a big RDBMS
| on the backend - an xmlrpc server that the client talks to get dep
| information as needed.
| The problems with that are simple - it's hard to get mirrors to put cgis
| of any type on their systems and it's even harder to make all this be
| done securely.
| It's perfectly reasonable to store your metadata in a db of some kind
| provided you control the only mirror or control ALL the mirrors.
Currently, the Fedora Development repository has a mean RPM size of
about 1 MB, a mean .hdr size of about 10 KB, and 100-200 mirrors
(counting them precisely requires avoiding double-counting mirrors
listed under both HTTP and FTP).
Assuming that most traffic is related to update+upgrade, this means
that with a scheme that only downloads metadata for packages that are
to be upgraded, the bandwidth used by metadata exchange is about 1/100
of that used for downloading the RPM packages themselves.
Thus, assuming that the current mirrors can sustain the package traffic,
it is possible that very few servers, or even a single one, could
sustain metadata traffic (RPMs would of course not be put in the DBMS).
| it can be potentially very fast, except that you have to design a
| communication protocol for getting the data from the server(s).
This can be sidestepped by simply opening the DBMS port to the world,
if a sufficiently secure DBMS exists; otherwise, a thin wrapper
sanitizing and limiting SQL queries would do (one may already have
been written).
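Such a wrapper could be as simple as a whitelist of parameterized
queries, so clients never send raw SQL at all. A minimal sketch,
assuming a hypothetical schema (the table and query names here are
illustrative, not an existing format), using sqlite3 as a stand-in for
whatever DBMS the server would actually run:

```python
# Hypothetical thin wrapper: clients pick a query by name and supply
# parameter values; arbitrary SQL is never accepted, so there is
# nothing to "sanitize".  Uses sqlite3 purely as a stand-in DBMS.
import sqlite3

# Whitelist: query name -> parameterized SQL (schema is illustrative).
ALLOWED_QUERIES = {
    "changed-since": "SELECT name FROM packages WHERE changed_time > ?",
    "who-provides": "SELECT name FROM provides WHERE file = ?",
}

def run_query(conn, name, params):
    sql = ALLOWED_QUERIES.get(name)
    if sql is None:
        raise ValueError("unknown query: %r" % name)
    return conn.execute(sql, params).fetchall()

# Demo with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (name TEXT, changed_time INTEGER)")
conn.executemany("INSERT INTO packages VALUES (?, ?)",
                 [("foo", 100), ("bar", 200)])
conn.execute("CREATE TABLE provides (name TEXT, file TEXT)")
conn.execute("INSERT INTO provides VALUES ('foo', '/usr/bin/foo')")
print(run_query(conn, "changed-since", (150,)))  # [('bar',)]
```

The point of the whitelist is that the server's worst case is bounded
by the most expensive query it deliberately chose to expose.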
Alternatively, accepting some increase of disk usage, which may not be
significant considering the above remarks about the relative size of
RPMs and metadata, common queries can be "baked" in the filesystem.
For instance, assuming that package updates are uniformly distributed
in time, it seems that by keeping several files covering
power-of-2-sized time intervals, a "changed-time > x" query can be
answered with at most twice the optimal bandwidth and between 2 and
lg(t) times the optimal server disk space, where lg is the base-2
logarithm and t is the repository lifetime expressed in units of the
smallest update delta time.
Queries of "which packages include the given file" can also be baked
trivially by creating a file for each packaged path, containing the
names of the packages that provide it. However, this will waste a lot
of disk space, pollute disk caches, possibly require hard drive seeks,
etc. Alternatively, more than one packaged path could be grouped into
a single filesystem file, but this will probably require more round
trips.
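The grouped variant of that trade-off might look like the following
sketch (the on-disk layout, naming scheme, and use of JSON are all
assumptions for illustration, not an existing repository format): one
lookup file per leading directory rather than per packaged path, which
cuts down on tiny files at the cost of each fetch carrying sibling
entries the client did not ask for.

```python
# Hypothetical baked "who provides this file" index, grouped by the
# directory portion of each packaged path.
import os, tempfile, json

def bake_file_index(pkg_files, root):
    """pkg_files: dict package -> list of paths it contains."""
    index = {}  # directory -> {path: [packages]}
    for pkg, paths in pkg_files.items():
        for path in paths:
            d = os.path.dirname(path)
            index.setdefault(d, {}).setdefault(path, []).append(pkg)
    for d, entries in index.items():
        out = os.path.join(root, d.strip("/").replace("/", "_") or "_root")
        with open(out, "w") as f:
            json.dump(entries, f)

def who_provides(root, path):
    """One round trip: fetch the lookup file for the path's directory."""
    d = os.path.dirname(path)
    out = os.path.join(root, d.strip("/").replace("/", "_") or "_root")
    with open(out) as f:
        return json.load(f).get(path, [])

root = tempfile.mkdtemp()
bake_file_index({"coreutils": ["/usr/bin/ls", "/usr/bin/cat"],
                 "bash": ["/usr/bin/bash"]}, root)
print(who_provides(root, "/usr/bin/ls"))  # ['coreutils']
```

The grouping granularity (per path, per directory, or coarser) is
exactly the disk-space-versus-round-trips knob discussed above.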
So, a dumb server requires accepting increased server disk space
usage, increased bandwidth usage, or increased latency, and also
extending the repository format for every new query one wants the
client to be able to run efficiently.
Actually, the dumb server approach might be better, but the alternative
might be worth considering.
Finally, even if a dumb server is chosen, trading more server disk space
for less bandwidth might be a better approach than the current ones:
using the aforementioned data structure for updating is an example of a
possible way to implement such a strategy.