Searching Wikimedia for images

I’ve been making noodledig 2.0 for a while now. In the latest version, it will search Wikimedia and find decade-specific images based on music, TV or film. At the moment, I’ve got it searching for music-related images. It took a bit of faffing, but here’s how I’ve done it…

There are a few different URLs that can be searched, including http://en.wikipedia.org and https://commons.wikimedia.org. In the end, I found the latter easier to get meaningful data out of. The basic idea was to get the IDs for decade-related music pages, then use a concatenation of those to grab a load of image URLs.

Get the IDs of categories

http://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:1980s_music

This query returns the page ID of each member of the category ‘1980s music’, sub-categories included.
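As a rough sketch of that first step, the snippet below builds the category query and pulls the page IDs out of a decoded response. The function names are made up for illustration, and the cmtype=subcat and format=json parameters are my assumptions (they are not in the URL above); the sample response is trimmed and uses placeholder IDs.

```php
<?php
// Hypothetical helper: builds the categorymembers URL for a Commons category.
function buildCategoryMembersUrl(string $categoryTitle): string
{
    $params = [
        'action'  => 'query',
        'list'    => 'categorymembers',
        'cmtitle' => $categoryTitle,
        'cmtype'  => 'subcat', // assumption: only sub-categories are wanted
        'format'  => 'json',   // assumption: ask for JSON explicitly
    ];
    return 'https://commons.wikimedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// Hypothetical helper: collects the page IDs from a decoded API response.
function extractPageIds(array $response): array
{
    $members = $response['query']['categorymembers'] ?? [];
    return array_map('intval', array_column($members, 'pageid'));
}

// Trimmed sample response with placeholder IDs and titles.
$sample = json_decode(
    '{"query":{"categorymembers":['
    . '{"pageid":111,"ns":14,"title":"Category:Example A"},'
    . '{"pageid":222,"ns":14,"title":"Category:Example B"}]}}',
    true
);

echo buildCategoryMembersUrl('Category:1980s_music'), PHP_EOL;
print_r(extractPageIds($sample));
```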

Get image URLs from IDs

https://commons.wikimedia.org/w/api.php?action=query&pageids=19107090&prop=imageinfo&iiprop=url

This query takes that page ID and returns the image URLs for it via the imageinfo property. Multiple pages can be queried in one go by passing a pipe-separated list of IDs.
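Building the pipe-separated list is just an implode. Here is a minimal sketch; the function name is hypothetical, format=json is my addition, and the second ID is a placeholder rather than a real page.

```php
<?php
// Hypothetical helper: builds an imageinfo query for one or more page IDs.
function buildImageInfoUrl(array $pageIds): string
{
    $params = [
        'action'  => 'query',
        'pageids' => implode('|', $pageIds), // pipe-separated list of IDs
        'prop'    => 'imageinfo',
        'iiprop'  => 'url',
        'format'  => 'json', // assumption: ask for JSON explicitly
    ];
    // http_build_query URL-encodes the pipe as %7C, which the API accepts.
    return 'https://commons.wikimedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// 19107090 comes from the example above; 12345 is a placeholder.
echo buildImageInfoUrl([19107090, 12345]), PHP_EOL;
```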

Use a generator

The Wikimedia API also supports generators, where the output of a first query is fed straight into a second one.

https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmlimit=15&gcmtype=file&gcmpageid=2826951|2828640|2828960&continue=&prop=imageinfo&iiprop=url

This query uses ‘categorymembers’ as the generator, specifying three page IDs (corresponding to categories), and then returns the image URLs for the files found in each one.

A load of parameters

For my solution, I set up an array with all the mandatory parameters, looking like this…

$this->params = array(
    'action' => 'query',
    'generator' => 'categorymembers',
    'gcmlimit' => self::BATCH_PROCESS_THRESHOLD,
    'gcmtype' => 'file',
    'prop' => 'imageinfo', // get imageinfo
    'iiprop' => 'url|extmetadata|mediatype', // get the image url and extra metadata used on the page
    'format' => 'json',
    'iiurlwidth' => self::IMAGESIZE_THRESHOLD, // specify a thumbnail url to return
    'iiextmetadatafilter' => 'ObjectName|Categories|ImageDescription', // filter extra metadata by just these options
    'iiextmetadatalanguage' => 'en'
);

Details on most of these properties can be found in the MediaWiki API documentation. The only other bit of info that needs adding is the pipe-separated list of page IDs. To specify those, I used this…

'gcmpageid'     =>      $pageIds
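Put together, the fixed parameters plus the page IDs can be turned into a request URL with http_build_query. This is only a sketch: the class name, the ENDPOINT constant, the buildUrl method and both threshold values are my assumptions, not the real noodledig code.

```php
<?php
// Hypothetical class; the constant values below are guesses for illustration.
class ImageQuery
{
    const BATCH_PROCESS_THRESHOLD = 50; // assumed number of results per request
    const IMAGESIZE_THRESHOLD = 300;    // assumed thumbnail width in pixels
    const ENDPOINT = 'https://commons.wikimedia.org/w/api.php';

    private $params;

    public function __construct()
    {
        $this->params = array(
            'action' => 'query',
            'generator' => 'categorymembers',
            'gcmlimit' => self::BATCH_PROCESS_THRESHOLD,
            'gcmtype' => 'file',
            'prop' => 'imageinfo',
            'iiprop' => 'url|extmetadata|mediatype',
            'format' => 'json',
            'iiurlwidth' => self::IMAGESIZE_THRESHOLD,
            'iiextmetadatafilter' => 'ObjectName|Categories|ImageDescription',
            'iiextmetadatalanguage' => 'en',
        );
    }

    // Adds the pipe-separated page IDs and returns the full request URL.
    public function buildUrl($pageIds)
    {
        $query = $this->params + array('gcmpageid' => $pageIds);
        return self::ENDPOINT . '?' . http_build_query($query, '', '&');
    }
}

$query = new ImageQuery();
echo $query->buildUrl('2826951|2828640|2828960'), PHP_EOL;
```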

Using these parameters, I could get a list of thumbnail image URLs, with their descriptions and related categories, using PHP’s cURL. I didn’t bother to parse the result as SimpleXML, so it just comes back as JSON, which can be parsed using json_decode (pass true as the second argument to get arrays) and then accessed using bracket syntax ($data['something']['something-else']).
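For what it’s worth, here is roughly how the decoded JSON can be walked. It’s a sketch under a few assumptions: the function name is made up, the sample response is trimmed and uses placeholder example.org URLs, and I’m assuming the Categories value arrives as a pipe-separated string and that descriptions may contain HTML.

```php
<?php
// Hypothetical helper: flattens the API response into one entry per file.
function extractImages(string $json): array
{
    $data = json_decode($json, true); // true => associative arrays
    $images = [];
    foreach ($data['query']['pages'] ?? [] as $page) {
        $info = $page['imageinfo'][0] ?? null;
        if ($info === null) {
            continue; // skip pages that carry no file info
        }
        $meta = $info['extmetadata'] ?? [];
        $images[] = [
            'title'       => $page['title'] ?? '',
            // Prefer the thumbnail (present when iiurlwidth is set), else the full URL.
            'thumb'       => $info['thumburl'] ?? ($info['url'] ?? ''),
            // Assumption: descriptions can contain HTML, so strip the tags.
            'description' => strip_tags($meta['ImageDescription']['value'] ?? ''),
            // Assumption: categories come back as a pipe-separated string.
            'categories'  => array_values(array_filter(explode('|', $meta['Categories']['value'] ?? ''))),
        ];
    }
    return $images;
}

// Trimmed sample response with placeholder values.
$sample = '{"query":{"pages":{"101":{"pageid":101,"title":"File:Example.jpg",'
    . '"imageinfo":[{"thumburl":"https://example.org/thumb/Example.jpg",'
    . '"url":"https://example.org/Example.jpg","extmetadata":{'
    . '"ImageDescription":{"value":"<p>A band on stage</p>"},'
    . '"Categories":{"value":"1980s music|Concerts"}}}]}}}}';

print_r(extractImages($sample));
```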

CURLOPT_USERAGENT

To make queries work, you have to set this cURL option (if you’re using PHP), which sends a User-Agent header identifying your app. Details can be found in the Wikimedia User-Agent policy, but basically you need to specify the URL the request is coming from, a contact email address and version info about the app.
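Something along these lines should do it. Both function names are hypothetical, the app details are placeholders, and the "name/version (url; email)" layout is just one common way of packing those three bits of info into the string.

```php
<?php
// Hypothetical helper: formats an identifying User-Agent string.
function buildUserAgent(string $appName, string $version, string $siteUrl, string $email): string
{
    return sprintf('%s/%s (%s; %s)', $appName, $version, $siteUrl, $email);
}

// Hypothetical helper: fetches a URL with cURL and decodes the JSON body.
// Returns null if the request fails or the body is not valid JSON.
function fetchJson(string $url, string $userAgent): ?array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_USERAGENT      => $userAgent, // identify the app to Wikimedia
        CURLOPT_TIMEOUT        => 10,         // assumed timeout in seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $decoded = json_decode($body, true);
    return is_array($decoded) ? $decoded : null;
}

// Placeholder details; swap in your own site and contact address.
$userAgent = buildUserAgent('noodledig', '2.0', 'https://example.org', 'me@example.org');
echo $userAgent, PHP_EOL;
```

Calling fetchJson($url, $userAgent) with one of the query URLs above would then return the decoded response as an array.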

There you go

For my solution, I’m grabbing about 50 results and caching them using APC, then randomly selecting 5 entries to return to the user. Wikimedia does encourage users to cache data wherever possible so as not to hit their servers too much/often.
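The cache-then-pick step could look something like the sketch below. Note that it uses APCu (the successor to APC) rather than APC itself, and falls back to no caching when the extension isn’t loaded; the function names, cache key and one-hour TTL are all assumptions, and the loader just returns dummy numbers in place of real API results.

```php
<?php
// Hypothetical helper: picks up to $count random entries from $items.
function pickRandom(array $items, int $count): array
{
    if ($count <= 0) {
        return [];
    }
    if (count($items) <= $count) {
        return array_values($items);
    }
    $keys = (array) array_rand($items, $count);
    return array_values(array_intersect_key($items, array_flip($keys)));
}

// Hypothetical helper: returns cached results, calling $loader on a cache miss.
function cachedImages(string $cacheKey, callable $loader, int $ttl = 3600): array
{
    $useApcu = function_exists('apcu_fetch'); // skip caching if APCu is missing
    if ($useApcu) {
        $hit = apcu_fetch($cacheKey, $success);
        if ($success && is_array($hit)) {
            return $hit;
        }
    }
    $items = $loader();
    if ($useApcu) {
        apcu_store($cacheKey, $items, $ttl);
    }
    return $items;
}

// Dummy loader standing in for the real API call.
$images = cachedImages('noodledig_1980s_music', function () {
    return range(1, 50);
});
print_r(pickRandom($images, 5));
```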

That’s about it. Maybe of use to someone, who knows!

 
