How to programmatically get a favicon

I haven’t updated this blog or my favicon fetcher in a long time. To be honest, I have partly forgotten how the code fits together, so maybe I should rewrite it from scratch. But first let’s analyze what I actually did (so I can understand it again myself), written in a way that lets other programmers use it to build their own favicon fetchers or optimize this one further.

  1. Database Icon Table

Let’s start with the database. It contains one table, the “Icon Table”, which holds all the data this plugin needs. Since volumes are large (it will easily contain tens of thousands of URLs), the database demands of this plugin are actually quite big. This is also why I went for the client-server model: you would not want this plugin installed as part of every site; you would want one central location that holds your massive database, so you can query it from anywhere. In practice, and from experience, you will need a dedicated instance just for this favicon fetching.

| name | type | description |
| --- | --- | --- |
| id | bigint(20) | Auto-increment unique id. |
| uri | varchar(2048) | The full URL of a webpage (the “uri”). UNIQUE KEY: the unique key is the URI, i.e. the web address of a single webpage. |
| uri_hash | varchar(32) | A hashed string of this URI (unique). |
| favicon_uri | varchar(2048) | The full URL of a favicon (the “furi”). |
| favicon_uri_hash | varchar(32) | A hashed string of this FURI. |
| favicon_type | varchar(10) | The type, e.g. png, ico or jpg. This is used as an indicator only; we always check it ourselves, since the given extension often does not match the actual file type. |
| Favicon_filters | varchar(255) | A list of the filters that were applied to the icon, so we know what has happened to it. The plugin defines a set of filters that can be applied to an icon, e.g. convert to png (so we know afterwards that the original was e.g. an ico and that we now physically have a png). |
| Favicon_source | varchar(255) | Information on where we got the icon from. The plugin defines several sources, e.g. the Google service, any other public favicon service, or our own favicon fetcher services. |
| Favicon_default | tinyint(1) | When no icon is found, the plugin offers default image providers, e.g. the Gravatar service. If one is chosen we set this to true, so that we know we need to get the default icon for this uri/furi combination. |
| Favicon_parent | bigint(20) | The parent of the icon: the first time we find a unique icon, that record becomes the parent record. Every next time we find the same icon we don’t want to fetch it again, but link to the parent instead. |
| Comment | varchar(255) | Comments on the icon; can also be used to store structural tags for use by code. |
| itime | timestamp | The timestamp of when we wrote this record. |
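To make the table above concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not the plugin's actual PHP/MySQL code. The table name `icon`, the helper `insert_icon`, the lowercase column names and the choice of MD5 for the 32-character hashes are all assumptions made for illustration; the text only says the hash columns are varchar(32).

```python
import hashlib
import sqlite3


def uri_hash(uri: str) -> str:
    """Return a 32-character hex digest (MD5 is an assumption that fits varchar(32))."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


# Column names and types follow the table above; SQLite accepts the MySQL-style types.
SCHEMA = """
CREATE TABLE icon (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    uri              VARCHAR(2048) NOT NULL UNIQUE,
    uri_hash         VARCHAR(32)   NOT NULL UNIQUE,
    favicon_uri      VARCHAR(2048),
    favicon_uri_hash VARCHAR(32),
    favicon_type     VARCHAR(10),
    favicon_filters  VARCHAR(255),
    favicon_source   VARCHAR(255),
    favicon_default  TINYINT DEFAULT 0,
    favicon_parent   BIGINT REFERENCES icon(id),
    comment          VARCHAR(255),
    itime            TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""


def insert_icon(conn, uri, favicon_uri=None, parent=None):
    """Hypothetical helper: insert one record and return its id."""
    cur = conn.execute(
        "INSERT INTO icon (uri, uri_hash, favicon_uri, favicon_uri_hash, favicon_parent) "
        "VALUES (?, ?, ?, ?, ?)",
        (uri, uri_hash(uri), favicon_uri,
         uri_hash(favicon_uri) if favicon_uri else None, parent),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# The first page that shows an icon becomes the parent record ...
parent_id = insert_icon(conn, "http://example.com/", "http://example.com/favicon.ico")
# ... and a second page with the same icon only links to that parent.
child_id = insert_icon(conn, "http://example.com/about", parent=parent_id)
```

The second insert stores no favicon URL of its own; it only points at the parent through `favicon_parent`, which is the linking idea described for that column.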

We have a second table that stores requests, e.g. curl actions on sites. This is used for caching: we don’t want to overload our server by massively curling the Internet, and we don’t want to trouble external sites by spidering them over and over again.
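The idea of the request cache can be sketched as follows, in Python and with an in-memory dictionary standing in for the second database table. The class name `RequestCache`, the injected `fetch` function and the 30-day lifetime are assumptions for illustration; the layout of the real request table is not described here.

```python
import time
from typing import Callable, Dict, Tuple


class RequestCache:
    """Remember the result of each outgoing request for `ttl` seconds."""

    def __init__(self, fetch: Callable[[str], bytes], ttl: float = 30 * 24 * 3600):
        self._fetch = fetch   # the function that really goes out to the Internet
        self._ttl = ttl       # assumed lifetime of a cached request
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> bytes:
        now = time.time()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]               # served from the cache, no network traffic
        body = self._fetch(url)         # cache miss: do the request once
        self._store[url] = (now, body)
        return body


# Usage with a fake fetcher that records every real request it gets.
calls = []
cache = RequestCache(lambda url: calls.append(url) or b"icon-bytes")
first = cache.get("http://example.com/favicon.ico")
second = cache.get("http://example.com/favicon.ico")  # does not hit the fetcher again
```

Passing the fetcher in keeps the cache independent of how the request is made, so the same wrapper works for curl-style calls and for tests.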

  2. Favicon Class

The central object in the favicon plugin is the class Favicon, which lives in \includes\server\plugins\metadata_favicon\inc\. The idea was that it should be pretty self-contained. The instances are passed along in a factory kind of way, and I really wanted just one class so as not to overcomplicate things.

Before explaining the members of this class, first something about the FURI and the URI, because that is needed to understand the complexity behind it:

2.1 Database operations

One of the lessons learned was that, to fetch a unique and correct path to the icon, we need the “FURI” (Favicon Unique Resource Indicator). A normal scraper only needs to follow URIs, but fetching favicons is more complex: every URI you would follow in a normal scraper now has a 1:m relationship with a second URI, the FURI.

FURI and URI Database operations

As with a regular scraper, you have read and write operations on the database. Working out all the possible combinations was quite some work. Basically we want at most 2 operations on the database for each call to the member that interacts with it: one read, and one write if the read determines that a write is needed. Because of the potentially high number of operations this required some checks: e.g. an icon could be in the pipeline waiting to be written while a read operation looked for it, found that it did not exist yet, and triggered another update…

The READ operation on the database is based on a lookup of the URI, i.e. the link to a webpage. Basically it returns the record if it exists, or FALSE if it is not yet in our database (meaning that we have work to do). Notice that when it returns FALSE we start our pipeline process, which includes HTTP fetching, image operations, writing the images to disk and verification; if no icon can be found during that pipeline, it may well be that no write operation follows.

| Same record | URI | FURI | Result |
| --- | --- | --- | --- |
| | 0 | 0 | FALSE: when neither the URI nor the FURI is present, we return FALSE. |
| | 1 | 0 | FALSE: when the URI is present but there is no FURI, we return FALSE. Maybe the parent id is wrong (meaning it also has no parent with a FURI) or the icon data is missing: no favicon was found on the page and no default one was given. This means that on every next trigger another call to this site is made to try again, over and over. So at a certain moment you would want to mark this site as “frozen” for a month, but that requires a write to the database with no actual favicon URI. |
| | 0 | 1 | X: this cannot happen. We can’t have a FURI in the database without a URI, since the URI must be unique. |
| 1 | 1 | 1 | TRUE, same record: if we find both the URI and the FURI and they are in the same record, we simply return the record. We have found the favicon in the database, and the process can go on to check whether we also have it physically in the icon store on disk. |
| 0 | 1 | 1 | TRUE, another record: if we find both the URI and the FURI in the database but not in the same record, we combine the records and return them (e.g. a parent somewhere has the icon but this record does not, so we need both: this one for the URI and the other one for the FURI, since the other one carries the parent’s URI, not the URI we need). |
| | 1 | N | ERROR: if we have multiple favicon URLs for 1 URI, something got corrupted in the parent-child relationships. |
| | N | 1 | ERROR: if we find the same URI multiple times, something is seriously wrong, since URIs should be unique. |
| | N | 0 | ERROR: if we find the same URI multiple times, something is seriously wrong, since URIs should be unique. |
| | 0 | N | X: this cannot happen. We can’t have a FURI in the database without a URI, since the URI must be unique. |
| | N | N | ERROR: the database is probably corrupted. |
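The READ cases can be written down as one small decision function. This is a Python sketch of the table only, not the plugin's PHP code. The function name `read_outcome`, its parameters and the returned labels are made up for illustration, and any count of 2 or more stands for “N”.

```python
from typing import Optional


def read_outcome(uri_rows: int, furi_rows: int, same_record: Optional[bool] = None) -> str:
    """Classify a READ lookup from the number of rows matching the URI and the FURI."""
    if uri_rows > 1:
        return "ERROR"          # the same URI found more than once: URIs must be unique
    if uri_rows == 0:
        # A FURI without any URI cannot exist, since every record carries a unique URI.
        return "IMPOSSIBLE" if furi_rows >= 1 else "FALSE"
    if furi_rows > 1:
        return "ERROR"          # several FURIs for one URI: parent-child links are corrupted
    if furi_rows == 0:
        return "FALSE"          # URI known but no icon yet: the pipeline has to run
    return "TRUE_SAME_RECORD" if same_record else "TRUE_OTHER_RECORD"


# Usage: the URI and the FURI were both found, but in different records.
outcome = read_outcome(1, 1, same_record=False)
```

Checking the URI count first mirrors the fact that a READ always starts from the URI.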

The WRITE operation on the database follows after (a) the READ operation returned FALSE and (b) our favicon pipeline delivered something (the FURI if we are lucky, but it could just as well be a link to a default icon or the parent location). Unlike the READ operation, we do NOT first check on the URI (which is the input for a READ) but on the FURI, since at this stage we know more about the FURI, which may very well not be in the same record as the URI.

| Same record | URI | FURI | Action | Method in Favicon class |
| --- | --- | --- | --- | --- |
| | 0 | 0 | If we find neither a FURI nor a URI, it is simple: do a full insert. | icoFullInsert() |
| | 1 | 0 | If we don’t find a FURI but we have 1 URI in the database, update the record. We have found a new icon for this URL (maybe the owner changed the favicon, etc.). | updateUri($existing_uri_comment,$id) |
| | 0 | 1 | If we find a FURI, it was already in the database, but apparently not with this URI. So we insert a new record with a link to the id of the original URI/FURI record. | icoInsertWithLinkToFuriId($furlId) |
| 1 | 1 | 1 | If we find both in the same record, the code of this plugin was probably changed in the meantime, or some extra records were added to the database, so update the record. (Since the preceding read operation did not find it, this may mean the write operation was called directly to update the database.) | updateFeatures($uri,$furi) |
| 0 | 1 | 1 | If we find a FURI and we find the URI, but they are not in the same record, update the URI record with a link to the parent FURI record. Apparently the site owner added an icon that was not there before, and it is one that we already have. | updateUriWithParentFuri($uriID,$furiID,$uriComment) |
| | 1 | N | ERROR: if we get multiple FURIs returned, the parent-child relationship in the database is corrupted. | |
| | N | 1 | ERROR: if we find the same URI multiple times, something is seriously wrong, since URIs should be unique. | |
| | N | 0 | ERROR: if we find the same URI multiple times, something is seriously wrong, since URIs should be unique. | |
| | 0 | N | ERROR: if we find multiple FURIs and no URI, something is wrong: the parent-child relationships probably got corrupted. | |
| | N | N | ERROR: the database is probably corrupted. | |
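The WRITE table boils down to a dispatch on the same three inputs. The Python sketch below only returns the name of the Favicon method that the table lists for each case and does not call anything. The function name `write_method` and its parameters are hypothetical, and a count of 2 or more stands for “N”.

```python
from typing import Optional


def write_method(uri_rows: int, furi_rows: int, same_record: Optional[bool] = None) -> str:
    """Return the name of the Favicon method the WRITE table lists for this case."""
    if uri_rows > 1 or furi_rows > 1:
        return "ERROR"                        # every "N" case in the table is an error
    if uri_rows == 0 and furi_rows == 0:
        return "icoFullInsert"                # nothing known yet: full insert
    if uri_rows == 1 and furi_rows == 0:
        return "updateUri"                    # known page, new icon
    if uri_rows == 0 and furi_rows == 1:
        return "icoInsertWithLinkToFuriId"    # new page, icon we already have
    # Both found: either in one record, or in a URI record plus a parent record.
    return "updateFeatures" if same_record else "updateUriWithParentFuri"


# Usage: a new page whose icon is already stored under another record.
method = write_method(0, 1)
```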

Wp-Favicons Database Requirements

The Wp-Favicons plugin, which iconizes your blog-links, makes use of request caching. We need this for 2 reasons:

1. so that when you delete your favicon cache and want to regenerate your favicons, it will not go out and do thousands of requests again (assuming you have thousands of links on your weblog)

2. to validate your links and produce those cool little round icons indicating green, yellow, red, black or white links. Very handy to bring your 404 links etc… back to 0 in a short time and to provide your visitors some information about the links they click.

For me: I really wanted to have this. I even want the tool to clean my links automatically in some way, which will appear in the next version.

BUT… this of course has an impact on the storage on your server or shared hosting account.

1. To show icons to your users, we have to store them on disk, since embedded base64-encoded icons are not supported by all browsers. This means that for e.g. 10.000 outgoing links it will store 1.000 icons, which are small but each take a few KB (assuming you link multiple times to the same websites, since it stores an icon only if the URI it came from is unique). But… a CDN coupling might come in a next release.

2. To have an index of the icons on your disk we use a metadata cache, which is a database cache, so that is 10.000 records for 10.000 outgoing links.

3. To cache all your outgoing requests we store each request in the database, and ergo duplicate the icon or the webpages we scanned in the request. THIS is a biggie.

On my system the original database export was about 120 MB (for my personal multisite environment). After I installed the plugin the export became 825 MB. So this single plugin makes your database about 7 times as large. GRIN.

But yes, this plugin has high demands. If you have the disk space (let’s say 7 times your current database), it is no problem. If you are on a server with limited database disk space, you might have a problem and had better not use it.

For me, I really really want to have this one. It’s a heavy one, but one that is worth it.

But… for the next release I will try to lower the database size of the request cache. This will probably mean some lost functionality, but let’s see how far we get.

WP-Favicons will clean bad links and will show your visitors where they redirect to

The next version of WP-Favicons (still in the Trunk) will show you more information about your links:

- it will show visitors where they are redirected to if they e.g. have a short URL in front of them. Handy, because I at least like to know where I am clicking to, and for you: if you have always “misspelled” a URL, you now know that you should use the correct name.

- it will show 404, 500, etc. and possibly, in the next release, try to auto-clean them. If you are like me, Google Webmaster will report hundreds of broken links to you, and personally I am too lazy to clean them. The release after this one will probably contain some kind of functionality to clean up your links automatically. If you don’t want to auto-clean, it is at least handy to see at once which links are no longer good, so you can double-check them.

The screenshot shows the current state: no styling or icons yet, but the information is there. I decided that small, unobtrusive status indicator icons look the nicest.

Every HTTP code will get its own color in a categorized range (all 3xx will be yellowish) and status code 418 will be a teapot :)
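How such a categorized range could look, as a small Python sketch. Only the yellowish 3xx range and the 418 teapot come from the text above. The function name `status_color` and the assignment of green, red, black and white to the other ranges are assumptions for illustration.

```python
from typing import Optional


def status_color(code: Optional[int]) -> str:
    """Map an HTTP status code to an indicator color by its range."""
    if code is None:
        return "white"      # assumption: no response recorded at all
    if code == 418:
        return "teapot"     # from the text: 418 gets its own teapot icon
    if 200 <= code < 300:
        return "green"      # assumption
    if 300 <= code < 400:
        return "yellow"     # from the text: all 3xx are yellowish
    if 400 <= code < 500:
        return "red"        # assumption
    if 500 <= code < 600:
        return "black"      # assumption
    return "white"          # assumption: anything outside the known ranges


# Usage: a short URL that redirects gets a yellow indicator.
indicator = status_color(301)
```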

Of course you can select which functionality you want to use :)

Handling Redirects for WP Favicons with redirects set to 0 with WordPress wp_remote_get

For version 0.51 of WP Favicons I want to come nearer to handling 100% of the favicons out there. Measuring this is simple, because for every favicon the plugin does not handle correctly, one from Google or Geticon.org (as backup) will be used. So I can see from the statistics (in the information screen) how good the current code is. Ideally we would like to catch 100% of everything possible, and there is still some way to go: 80% is simple, but the closer you crawl to 100% the tougher it gets.

What I describe below is for the most part in the code in this file: http://plugins.trac.wordpress.org/browser/wp-favicons/trunk/includes/class-http.php (although locally I have a newer version)

Continue reading

WP-Favicons: favicons for links on your blog.

I have not blogged about it… but I made a plugin called WP-Favicons which you can download here: http://wordpress.org/extend/plugins/wp-favicons/.

 

DEMO:

I am using it on my own blog: http://edward.de.leau.net so you can check it out there as demo.

It used to be a little internal plugin that I used myself. I then had a question on StackExchange and dumped some code on the WordPress svn site, and I have improved it a little since.

It is now at version 0.5.0, which is public but has 1 big bug: the HTTP response codes are not written to the database, since I accidentally copied the same line twice, so the response message is written two times instead of the response message and the response code. GRIN. (This is fixed in 0.5.1.)

Nevertheless that is the code I am also using on my own blog, since, well… it does not disturb anything, but it will need a clean-up action during the upgrade to 0.5.1.

In principle it shows Favicons on your blog next to links on your blog. And it does it well. Especially version 0.5.1 which now even gets the icons from redirected links.

Version 0.5.1 will have some refactoring done since of course while coding it you get some new thoughts so that will take some time.

Maybe some highlights:

  • it is polite concerning activation, deactivation and uninstall of the plugin
  • it is almost ready for internationalization :)
  • it not only shows those Google S2 favicons that you see posted everywhere, but grabs icons itself from pages and roots and only then falls back on Google or geticon.org. I hope those fallback percentages will drop to 0 once the thing does what it should do, and you should get a much greater coverage of favicons than “just with Google”, which only covers a smaller part of the favicons out there. There is also a little statistics applet in the admin pages which shows you exactly those percentages
  • it has a diskcache and a real nice one: you can see the icon via the domain directory structure e.g. /cache/http/com/google/www/favicon.ico
  • it has a nice plugin framework that works alongside the Settings API of WordPress meaning: developers should be able to add stuff anywhere in the plugin
  • it supports image filters and includes a filter to convert everything (including .ico) to PNG format. So you can write your own image filter and process favicons into whatever you want
  • it supports placement by filters and specific styling for it. So e.g. you can replace in the post content or text widget and have the icons differently styled as long as there is a WordPress filter for it.
  • it supports default icons when really the site has no favicon, you can choose from the WordPress Identicons, MonsterID, etc… or add your own plugin. It supports both cached defaults and non cached defaults.
  • it supports exceptions such as do not process .zip files
  • Because it builds a database, you get a database of all your outgoing links or uhm… better said all outgoing links that are parsed by the plugin.
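The disk cache layout mentioned in the list above (/cache/http/com/google/www/favicon.ico) can be sketched like this in Python. The function name `cache_path` and the `root` parameter are hypothetical. The text gives only that one example path, so keeping just the file name of the icon and defaulting to favicon.ico are assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def cache_path(url: str, root: str = "/cache") -> str:
    """Build a cache path from the scheme and the reversed host labels of a URL."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError("not an absolute URL: %r" % url)
    labels = reversed(parts.hostname.split("."))   # www.google.com -> com, google, www
    # Assumption: only the file name is kept; an empty path falls back to favicon.ico.
    filename = posixpath.basename(parts.path) or "favicon.ico"
    return posixpath.join(root, parts.scheme, *labels, filename)


# Usage: reproduces the example path from the list above.
path = cache_path("http://www.google.com/favicon.ico")
```

Reversing the host labels groups all subdomains of one domain under the same directory, which is what makes the cache browsable by domain.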

But… while writing this I got some new thoughts. Maybe some of them will appear in version 0.5.1 :)

How to setup your WordPress PHP Development environment

If you develop themes or plugins for WordPress you might still be doing this from an editor and FTPing your changes to the server. This short guide shows you how to set up a more professional environment, and will serve as a link I can send around to people still developing PHP in an editor.

It’s not that hard: just follow these steps and you will have a cool, shiny new development environment set up:

Continue reading

Add a site specific theme in WP3 Multisite

To have some place to store screenshots to be able to answer this question I made this post. The question was:

What I’d like to do is get to the stage where I have 1 install that can accommodate the 10 sites, but each site can’t see the details/users/themes etc. of the other sites.

How to have themes that are only available to specific weblogs

Continue reading

Moving a weblog to a WP3 Multisite Weblog System (on MediaTemple)

At one point or another you will start thinking about moving several older weblogs, or even some domains you have, to one weblog system. This has the advantage that you only need to administer 1 WP installation (and the disadvantage that if one system fails they all fail).

I moved 20 smaller sites to one central weblog system. Those were pretty simple:

Continue reading