Tuesday, July 31, 2012

Optical disks -- The most unreliable and expensive data storage medium.

When you think about optical disks, thousands of complications come to mind: from the choice of disk itself (+/-, DL, RAM, etc.) to doubts about the data actually written on it (data corruption), caused by countless possible combinations of problems: a dirty lens, bad mechanics, an old or faulty driver, a manufacturing defect, a software fault, the wrong writing speed, careless handling, a fingerprint or a scratch on the surface, and so on.

Given all these facts, it's a matter of chance whether a disk gets written properly at all; in my personal experience they mostly don't, and even when they do, they degrade quickly over time and are susceptible to things like fungus.

They don't have a stable shelf life either: five years at most, which means you have to rewrite all your disks every two years to make sure your data is still OK.

Well, after all this, considering the number of blank disks you waste and the data loss that occurs, disks ARE the most expensive and most unreliable medium.

A much better alternative is to buy hard disks for archival purposes or use tapes.

Wednesday, July 25, 2012

Offline Wikipedia howto (with Apache, MediaWiki CMS, PHP, MySQL)


Starting point –
Wikipedia allows downloading of its databases for offline use; these range from static HTML dumps (no longer updated) to SQL dumps to XML dumps.
The most tempting way is to use Wikipedia's official CMS (MediaWiki) and host it locally with Apache for internal use. You need to compile PHP with XML and session support; to import the database, you also need to build it with the CLI (command-line) interpreter.
MediaWiki uses PHP and some kind of SQL backend (MySQL, PostgreSQL, or SQLite); which backend can be used depends on how PHP is configured.
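For reference, a PHP build configured for this setup might look roughly like the following. This is only a sketch: the apxs path is an assumption for my system, and XML, session, and CLI support are enabled by default in most builds, so a stock PHP package may already be enough –

./configure --with-apxs2=/usr/bin/apxs2 --with-mysql --enable-xml --enable-session --enable-cli
make && make install
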
Next, you need to download the XML dumps; follow the links above to do so. As an alternative, follow –
Both point to the same place, actually.
Download the XML dumps intended as backups (so they may be incomplete). If you want everything as a single XML file, download it from –

The keyword here is 'pages-articles'. The file above is one large archive of all the articles; the other files with '*pages-articles*' in their names are parts of the full article set.

These dumps don't have images, only text.
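
For reference, the dumps are hosted at dumps.wikimedia.org, so grabbing the single big 'pages-articles' archive could look like this (the exact dump date and file names change with every dump run, so treat the URL purely as an illustration) –

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2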

Next, you need to get the SQL server started and set its root password (how you do this depends on the OS), then start the Apache server, open the wiki CMS in a browser, and start the setup.
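
On a typical Linux box, that first part might look something like this (service names and init commands vary a lot between distributions, and the password is just a placeholder) –

/etc/init.d/mysql start          # or: service mysqld start / systemctl start mysqld
mysqladmin -u root password 'NewRootPassword'
/etc/init.d/apache2 start        # or: apachectl start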

You'll be asked for the username/password of the SQL database account you set up with read/write access.
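
If the backend is MySQL, you can also create the database and the read/write account yourself beforehand; the database name, user name, and password below are placeholders I picked for illustration –

mysql -u root -p -e "CREATE DATABASE wikidb;"
mysql -u root -p -e "GRANT ALL PRIVILEGES ON wikidb.* TO 'wikiuser'@'localhost' IDENTIFIED BY 'wikipass';"
mysql -u root -p -e "FLUSH PRIVILEGES;"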

Your wiki will soon be ready, and a configuration file (LocalSettings.php) will be made available for download, which you have to place in the root directory of the installed CMS.
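
Assuming the CMS was extracted to /var/www/html/wiki and the browser saved the file under ~/Downloads (both paths are assumptions; adjust to your setup), putting it in place is just –

cp ~/Downloads/LocalSettings.php /var/www/html/wiki/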

To import the XML database dump, you can use one of several methods, as listed in –

The most reliable but slow method is to use the importDump.php PHP script, run as root, to import the pages.

The maintenance folder has this importDump.php script, along with many more scripts used for housekeeping and management. To import –

php importDump.php

e.g –

php importDump.php '/home/de/media_writeit!/temp(others)/enwiki-20120702-pages-articles1.xml-p000000010p000010000'

This is going to take a lot of time. As the script imports, it will show progress as a count of pages imported.

You may check the import by opening random pages, but all of these will be without images.

Next, the images have to be fetched straight from the Internet in a crawler-like fashion. You can use 'wikix', a C program which does exactly that and puts the images into a directory structure that the wiki CMS can read. wikix takes as input (on stdin) the same XML files you used to import the pages into the database; as output (on stdout) it generates a bash script which fetches the images and puts them in a directory tree (intended to be placed in the 'images' directory where you extracted the wiki CMS) that the MediaWiki CMS can read.

E.g –

./wikix < ../../enwiki-20120702-pages-articles1.xml-p000000010p000010000.xml > scripts.sh

This makes a single script which uses curl to fetch the images.

Now the script has to be placed in the 'images' directory of the MediaWiki CMS and run.
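
Continuing the example, and again assuming the CMS lives in /var/www/html/wiki (an assumed path) –

mv scripts.sh /var/www/html/wiki/images/
cd /var/www/html/wiki/images/
bash ./scripts.sh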

If you use the -p switch, many scripts will be made which are intended to be run in parallel (via a single master script which calls them) to allow parallel fetching. Nothing comes out on stdout when the -p switch is used; the scripts are written out as files instead.
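
So with the parallel option the invocation is the same as before, just without the stdout redirection; e.g –

./wikix -p < ../../enwiki-20120702-pages-articles1.xml-p000000010p000010000.xml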

Download the source code of wikix (copy-paste it into text files) from –

It's recommended to place the compiled program in the images directory, but you can place it anywhere; just ensure the directory structure holding the downloaded pictures ends up in the images directory of the CMS.
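
Building it is a plain C compile; assuming the pasted sources are self-contained .c files with no external dependencies (check whatever instructions accompany them), something like this should do –

gcc -O2 -o wikix *.c
mv wikix /var/www/html/wiki/images/    # optional; the binary itself can live anywhere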

Once the images are downloaded (or partially downloaded) and placed in the images folder, run –

php ./rebuildImages.php --missing

This is going to index the downloaded images and link them to the image references in the rendered pages, which makes the images available.
The script lives in the maintenance folder.