README.txt

Last updated: 28 January 2008

***What is this?  

This is a small collection of scripts that will go get the pages changed on, added to, or deleted 
from a wiki project and update the full xml dump accordingly, to produce a current snapshot.

***System requirements:

linux (untested on other unix variants)
bash, awk, sed, grep and all the usual goodies
perl
curl

***How to install:

See the file INSTALL for details.

***How to run:

cd into the directory where this package was unpacked. 

If this is the first time running it, you will want a full copy of the XML dump 
for your project to start with.  See 
http://download.wikimedia.org/backup-index.html
for these dumps.  You will want the one that has "All pages, current versions only", including 
discussion pages. (pages-meta-current.*.xml)

*Copy* this file into last_full.xml (the file will be overwritten later with the new
snapshot).  If you edited config.txt to change the values for snapshot or snapshotdir,
move the file accordingly.
Look at the date of the xml dump; you are going to want to get everything from that date
through today.  So for example if your dump is dated Jan 13 and today is Jan 18, you will
want to get 6 days of data.  There may be some overlap with existing content; this is
ok, since the overlap ensures that you don't miss any changes.
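
For example, assuming you downloaded the dump into this directory and it uncompressed to
a file named something like pages-meta-current.xml (the real filename will include your
project name and the dump date, and you may need to decompress it first, for example
with bunzip2), the copy step would look something like
cp pages-meta-current.xml last_full.xml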

For the full run, type
./getrcs.sh today today-numdays

In our example, we would have 
./getrcs.sh today today-6

Now wait for a while.  The script will update you as it goes along.  It fetches
the (relevant part of the) rc logs, the move logs, the import logs, the upload logs, and
the delete logs.  It then retrieves all pages called for by those logs (except for the 
deletes :-) ).  Finally, it merges these pages into the last full dump and produces
a new current snapshot, which will be found in last_full.xml.

The next time you run this script, look at the date on which you generated the
last_full.xml file, and make the same calculation for the number of days. 
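
For example, if you generated last_full.xml on Jan 18 and today is Jan 25, that is
8 days counted inclusively (just as in the example above), so you would run
./getrcs.sh today today-8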

If you are running this script on more than one wiki, for example, you can use multiple
configuration files and give the command 
./getrcs.sh today today-6 my-config-file

You can also specify absolute timestamps, a base date that depends on the last run time, 
or hour increments instead of days. Type 
./getrcs.sh
for more information. 

**Other info

Temporary files live in ./tmp, and you can remove them when the run is finished.
They will be removed by the script itself on the next run, all except for the
files with the extension raw or raw.save; these you will want to clear out yourself
once in a while, as they are kept in case debugging is needed.
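
If you want to clear those out by hand, something like this should do it (assuming the
default tmp directory from config.txt and the extensions mentioned above):
rm -f tmp/*.raw tmp/*.raw.save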

This script is meant to be run no more than once a day, so the timestamps it
puts in filenames are the date.  If you need to run it more often because
you are on a very active project, change the file naming convention by editing
the variable "ext" near the top of the script.
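
For example, if "ext" is built from the current date, you could add the hour to it.  A
purely hypothetical sketch (the exact line in your copy of the script may look different):
ext=`date +%Y%m%d%H`    # one timestamp per hour instead of one per day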

**Still more info (What are all these scripts?)

sort.pl and uniq.pl             tiny scripts that replace sort and uniq on linux 
                                because they are busted for some utf8 characters in my 
                                locale, as I found out the hard way. 
merge-pages-main-and-export.pl  grabs the pages we exported from the live db
                                and folds them into the last full xml file.
merge-deletes.pl                deletes selected pages based on the retrieved
                                portion of the delete log.
do-links.sh                     makes symlinks to the getrcs.sh script in case 
                                you want to run certain phases of it in 
                                isolation.
symlinks (getchanges.sh etc)    allow you to run each phase of the script
                                separately; see the INSTALL file for more info.  

**A teeny bit more info

By default the script sleeps 5 seconds between requesting page exports, which are done 
in batches of 500 pages each, and 2 seconds between requests of log portions, which are 
500 lines each.
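
These defaults correspond to the pagesecs and logsecs settings in config.txt (see the
INSTALL file), i.e.
logsecs=2
pagesecs=5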

See TODO for things that probably should be... done.

**Copyright

This little mess is released for use under the GPL v3 or later, as well as under the GFDL 1.2 
or later; the reader may choose which one to use.

Copyright (C) Ariel T. Glenn 2008 (and all other editors of this page; please see
the history page for details).  Please improve it and share!  


INSTALL.txt

Last updated: 20 January 2008

***System requirements:

linux (untested on other unix variants)
bash, awk, sed, grep and all the usual goodies
perl
curl

***How to install:

Untar the file into a convenient location.

Edit the file config.txt and change the line 
wiki="en.wiktionary.org"
to contain the name of your project.

Change the line 
expurl='Special:Export'
to contain the name of your Special:Export page.

You can change the number of seconds between requests for pieces of
the various logs by editing the line
logsecs=2

You can change the number of seconds between requests for page
exports (500 pages each export) by editing the line
pagesecs=5

You can also change the temporary work directory by editing the line
tmp="tmp"

You can change the directory and the name where the last snapshot will
be found, by editing the lines
snapshot="last_full.xml"
and
snapshotdir="./"

You can change the name of the file where we put the start date for
this run by editing the line
lastrun="lastrun"

If, for example, you want to run the script on more than one wiki, you
can create multiple configuration files, one for each.
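
Putting all of the above together, a configuration file for a second wiki might look
something like this (the values below are only illustrative; adjust them for your
project):
wiki="de.wikisource.org"
expurl='Special:Export'
logsecs=2
pagesecs=5
tmp="tmp-dewikisource"
snapshot="last_full_dewikisource.xml"
snapshotdir="./"
lastrun="lastrun-dewikisource"
For a non-English project you may need to put the localized name of the Special:Export
page in expurl, as noted above.  You would then pass the name of this file as the third
argument to getrcs.sh, as described in the README.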


After that, run the command
./do-links.sh

This will create symlinks to other names you can use for invoking the script 
one piece at a time.  

**Note

You can run pieces of this script one at a time.  If you do a directory listing,
you will see that there are several symlinks; invoking the script through one of
these names runs the corresponding phase.

Generally, if you are going to do that, you should pass the same number of days
to each phase that you run separately, with the exception of the getpages and domerges
phases, where the number of days isn't actually used :-)

The phases are:

(1)
getchanges.sh  
getmoves.sh  
getimports.sh
getuploads.sh
getdeletes.sh  

(2)
getpages.sh  

(3)
domerges.sh  

The scripts in phase 1 should be run before phase 2 which should be run before
phase 3. 

The scripts in phase 1 retrieve the appropriate part of the specified log.
Titles of pages to retrieve or to delete are generated from these lists.

The script in phase 2 retrieves all pages (except for the deletes) that we put
together in phase 1.

The script in phase 3 merges these new pages into the old full dump and 
deletes any that need to be removed, checking timestamps to see which 
version is most current or whether the deletion was before a recreation
of the page (for example).
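
As a purely hypothetical example (assuming each symlink accepts the same arguments as
getrcs.sh itself), a run split into phases might look like
./getchanges.sh today today-6
./getmoves.sh today today-6
./getimports.sh today today-6
./getuploads.sh today today-6
./getdeletes.sh today today-6
./getpages.sh today today-6
./domerges.sh today today-6
(The number of days passed to getpages.sh and domerges.sh is ignored, as noted above.)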

Why would you want to do this?  Maybe you are debugging :-/
But, more likely, you may want to get all the log updates once a day but only 
build a snapshot once a week. (Depending on how large your project is, 
building the whole snapshot could take a long time.)
You would have to do some manual catting of files together for this, but it
might be useful.