Wednesday, July 25, 2012

Offline Wikipedia howto (with Apache, the MediaWiki CMS, PHP and MySQL)


Starting point –
Wikipedia allows downloading its databases for offline use. The formats range from static HTML (not updated) to SQL dumps and XML dumps.
The most tempting way is to use Wikipedia's official CMS (MediaWiki) and host it locally with Apache for internal use. PHP needs to be compiled with XML and session support. To import the database, PHP must also be built with its command-line interpreter.
MediaWiki uses PHP and an SQL backend (MySQL, PostgreSQL, or SQLite); which backend can be used depends on how PHP is configured.
Next, download the XML dumps; follow the above links to do so. As an alternative, follow –
Both point to the same place actually.
Download the XML dumps intended as backups (these may be incomplete). If you want a single XML file, download from

The keyword here is 'pages-articles'. The above file is one large archive of all articles; the other files named '*pages-articles*' each hold a part of the same articles.

These dumps don't have images, only text.

Next, start the SQL server and set the root password (how you do this depends on the OS). Then start the Apache server, open the wiki CMS in a browser, and start the setup.

You'll be asked for the username and password of the SQL account you set up with read/write access to the database.

Your wiki will soon be ready, and a configuration file in PHP (LocalSettings.php) will be offered for download; place it in the root directory of the installed CMS.

To import the XML database dump, you may use any of the several methods listed in

The most reliable but slowest method is to run the importDump.php script as root to import the pages.

The maintenance folder contains importDump.php along with many other scripts used for housekeeping and management. To import –

php importDump.php <path to XML dump>

e.g –

php importDump.php '/home/de/media_writeit!/temp(others)/enwiki-20120702-pages-articles1.xml-p000000010p000010000'
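Since the dump may come as several '*pages-articles*' parts, the same command can be repeated for each part. A minimal sketch, assuming the parts are plain XML files; the function name import_dumps and the variables PHP_BIN and MAINT_DIR are made up for this example and are not MediaWiki settings:

```shell
#!/bin/sh
# Hypothetical helper: import each dump part in turn with importDump.php.
# PHP_BIN and MAINT_DIR are assumed names used only by this sketch.
PHP_BIN="${PHP_BIN:-php}"
MAINT_DIR="${MAINT_DIR:-./maintenance}"

import_dumps() {
    for dump in "$@"; do
        # Progress note goes to stderr so it does not mix with script output.
        echo "Importing $dump" >&2
        # Stop at the first part that fails to import.
        "$PHP_BIN" "$MAINT_DIR/importDump.php" "$dump" || return 1
    done
}
```

Called from the root directory of the CMS as, for instance, import_dumps /path/to/part1.xml /path/to/part2.xml, it imports the parts one after another and stops if one of them fails.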

This will take a long time. As the script runs, it reports progress as the number of pages imported.

You may check the import by opening random pages, but all of these will be without images.
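To judge how far the import has come, it helps to know how many pages a dump file holds in the first place. A rough sketch, assuming an uncompressed XML dump in which every article starts with a <page> tag on its own line; count_pages is a name invented for this example:

```shell
#!/bin/sh
# Hypothetical helper: count the articles in an uncompressed XML dump.
count_pages() {
    # grep -c prints the number of lines containing an opening <page> tag;
    # closing </page> tags do not match this pattern.
    grep -c '<page>' "$1"
}
```

The number it prints can be compared with the page count that importDump.php reports while it runs.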

Next, the images have to be fetched straight from the Internet in a crawler-like fashion. You may use 'wikix', a C program that does this and arranges the images in a directory structure the wiki CMS can read. Wikix reads from stdin the same XML files you used to import pages into the database. On stdout it writes a bash script that fetches the images and puts them in a directory tree; this tree is meant to go in the 'images' directory of the place where you extracted the MediaWiki CMS.

E.g –

./wikix < ../../enwiki-20120702-pages-articles1.xml-p000000010p000010000.xml > scripts.sh

This makes a single script which uses curl to fetch the images.

Now the script has to be placed in the 'images' directory of the MediaWiki CMS and run.
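The two steps (generating the script, then running it inside the 'images' directory) can be combined. This is only a sketch: fetch_images and its three arguments are names made up here, and it assumes the generated script is run with bash from inside the images directory:

```shell
#!/bin/sh
# Hypothetical helper: generate the fetch script with wikix and run it
# inside the images directory of the CMS.
fetch_images() {
    wikix_bin=$1    # path to the compiled wikix binary
    dump=$2         # XML dump that was imported into the database
    images_dir=$3   # 'images' directory of the MediaWiki install

    # wikix reads the dump on stdin and writes the fetch script to stdout.
    "$wikix_bin" < "$dump" > "$images_dir/scripts.sh" || return 1

    # Run in a subshell so the caller's working directory stays unchanged.
    ( cd "$images_dir" && bash ./scripts.sh )
}
```

A call could look like fetch_images ./wikix /path/to/dump.xml /path/to/wiki/images, with the paths adjusted to your own setup.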

If you use the -p switch, many scripts are generated that are meant to be run in parallel to allow parallel fetching (via a single master script which calls them). With -p, nothing is written to stdout.

Download the source code of Wikix (copy-paste into text files) from –

It is recommended to place the compiled program in the images directory, but you can place it anywhere; just ensure that the directory structure holding the downloaded pictures ends up in the images directory of the CMS.

Once the images are downloaded (or partially downloaded) and placed in the images folder, run –

php ./rebuildImages.php --missing

This indexes the downloaded images and links them to the image references in the rendered pages, which makes the images available.
The script lies in the maintenance folder.
