PHP script to fix server log file with records not in order

Okay, I doubt this post will be useful to anyone else, but I thought I might as well post it since I haven’t updated this blog in a while.

On ‘my’ web server I have some sites that are served by a shared Apache process. It stores daily logs for one week, rotating the log numbering each day. I have a cron job that runs once a week and concatenates the past week’s logs into a single file, which is then stored elsewhere.

Today I was trying to run awstats to generate stats from the last month’s worth of logs (I check my sites’ stats once every month). However, I realised that my cron job was concatenating the daily logs into the weekly log file in the wrong order. This resulted in awstats only picking up the first day from each weekly log, as the rest of the log file had requests with earlier timestamps. It seems that awstats works through the log file chronologically.

So, to get my stats I had to re-order the log files so that the entries were listed chronologically as they should be. To do this I decided to use PHP, mostly just because that is the language I am most familiar with.

The script simply loops through the log file, adding each line to an array, using the request timestamp as the key. Then the array can be sorted by the key, and output back to the file.

I ran the script using the PHP CLI, and found I had to set a large memory limit to avoid out of memory errors. The exact amount of memory needed will depend on how large your logs are.

<?php

//Usage: ~/path/to/php/bin/php -d memory_limit=128M ./whatever-you-name-this-file.php

$sites=array('xoogu', 'xoogu-static1',
			 'another-domain', 'another-domain-static1',
			 'domain3', 'domain3-blog');
$dates=array('20131215', '20131222', '20131229', '20140105' );

//loop through all the sites
foreach($sites as $site){
	//loop through all the dates
	foreach($dates as $date){
		//array to hold each line from the log file
		$records=array();
		//log file location
		$logFile="/path/to/logs-archive/$site/$site-access-$date.log";
		//open the logfile for reading
		$handle=fopen($logFile, 'r');
		//initialise a counter used to ensure we don't lose log entries that occurred at the same time
		$i=0;
		//read the log one line at a time
		while($str = fgets($handle)){
			//parse the date from the log entry
			$recordDate = new DateTime(substr($str, strpos($str, '[')+1, 26));
			//store the log entry by the date with a zero-padded counter concatenated on the end to avoid overwriting an existing record for a different request that occurred at the same time
			$records[$recordDate->format(DateTime::ATOM).sprintf('%08d', $i)]=$str;
			//increment the counter
			$i++;
		}
		//close the file
		fclose($handle);
		
		//sort our records to be in date order
		ksort($records);
		//rename the unsorted log file so we still have a copy if anything goes wrong
		rename($logFile, "$logFile-unsorted");
		//the original file has been renamed, so now create a new version of the file and open it
		$handle=fopen($logFile, 'w');
		//write the records in order to the file
		foreach($records as $record){
			fwrite($handle, $record);
		}
		//close the file
		fclose($handle);
		//output progress to screen
		echo "$site-access-$date.log sorted\n";
	}
}

A couple of points to make:

My weekly logs are stored in the format $site-access-$date.log, e.g. for this site the log for the week ending 5th January 2014 would be xoogu-access-20140105.log.

The format of the logs is like this:

208.115.113.87 - - [28/Dec/2013:03:15:45 +0000] "GET /robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; help@moz.com)"

So if your log uses a different format (possible), or you use a different structure for storing your logs (very likely), then you’d need to modify the script appropriately. However, I’d be pretty surprised if anyone other than me would have a need for this script. (If you do, leave a comment below). And I shouldn’t even need it any more now I’ve corrected my weekly log archiving cron job.
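
For reference, the corrected cron job simply concatenates the rotated daily logs from oldest to newest. Here is a minimal sketch of that, assuming logrotate-style numbering where access.log.7 is the oldest log and access.log.1 the newest (the real paths and names on my server differ):

#!/bin/bash
# Concatenate the past week's daily logs in chronological order
# (hypothetical paths - adjust to your own layout)
site='xoogu'
out="/path/to/logs-archive/$site/$site-access-$(date +%Y%m%d).log"
for n in 7 6 5 4 3 2 1
do
	f="/path/to/apache-logs/$site/access.log.$n"
	[ -f "$f" ] && cat "$f" >> "$out"
done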

How to automatically watermark or batch watermark photos using ImageMagick

In this article I’ll share a couple of ways to use ImageMagick to add a watermark to a photo. ImageMagick is a command line program, which is perfect for this job, as it means the process can be scripted.

I use it on my photography website so I can upload original size photos, then the images can automatically be resized and have a watermark added. I won’t go into detail on the image resizing in this article, as I have a feeling the article will already be quite long. But maybe I will cover that aspect in a future article.
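
To give a flavour of what’s involved, a single watermark can be applied from the command line with ImageMagick’s composite tool. This is just a minimal sketch with made-up filenames, not the exact commands I use on my site:

# Overlay watermark.png on photo.jpg in the bottom-right corner, at 50% opacity
composite -dissolve 50% -gravity southeast -geometry +10+10 watermark.png photo.jpg photo-watermarked.jpg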

Continue reading

Justified Horizontal Menu – CSS & Javascript Solutions

Recently I wanted to use a horizontal navigation menu on a website. I wanted the items in the menu to be evenly distributed across the width of the page, but also butting up against each other, with no space between each item. The other requirement was that the spacing between the text of each item should be equal.

Fluid width menu with justified items, each one with the same padding applied

Same menu on a smaller width screen, demonstrating the fluid sizing

Continue reading

Optimising MySQL queries – JOIN vs IN (subquery)

When searching for help with an SQL query I was writing recently, I read some advice that you should use JOINs rather than IN with a subquery where possible. I modified the particular query I was working on to use a JOIN instead of a subquery, and lo and behold, it was much faster.

So I decided to try rewriting some other queries I’d written using IN with a subquery. Rewritten to use JOINs instead, they should be much faster. But rather than just blindly rewriting the queries, I decided to run a few tests.
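
To illustrate the kind of rewrite I mean, here is a hypothetical example with made-up table and column names (not one of the actual queries I tested):

-- IN with a subquery
SELECT * FROM `orders`
WHERE `customer_id` IN (SELECT `id` FROM `customers` WHERE `country` = 'UK');

-- The same result using a JOIN (assuming customers.id is unique,
-- otherwise the JOIN can return duplicate rows)
SELECT `orders`.* FROM `orders`
JOIN `customers` ON `customers`.`id` = `orders`.`customer_id`
WHERE `customers`.`country` = 'UK';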

Continue reading

Space separated rel values – HTML vs. ATOM

Just a quick note regarding an issue I came across today. HTML allows space separated rel values to indicate link types. I was using this to indicate a URL was both self and canonical, i.e.

<link rel="self canonical" href="http://www.xoogu.com/2013/space-separated-rel-values-html-vs-atom/" />

I was using the same code in both the HTML and ATOM feed versions of my site, but it turns out that ATOM does not allow space separated rel values. So for ATOM you would require two links, like:

<link rel="self" href="http://www.xoogu.com/2013/space-separated-rel-values-html-vs-atom/" />
<link rel="canonical" href="http://www.xoogu.com/2013/space-separated-rel-values-html-vs-atom/" />

Or don’t include a canonical link in your ATOM feed.

Differences between Windows batch and Linux bash shell script syntax

Carrying on from my previous post about using scripts created in Windows on Linux, here are some comparisons to show the syntax differences between Windows batch scripts and bash scripts.

Comments

Windows batch script

REM some comment

or

:: some comment

Linux shell script

# some comment

For loop through files in directory

Windows batch script

FOR %%i IN (%1\*) DO (
REM do stuff here
)

Linux shell script

for i in "$1"/*
do
# do stuff here
done

Setting a variable to a number and then printing the variable

Windows batch script

SET /A varInteger=0
echo The value of varInteger is %varInteger%

Linux shell script

varInteger=0
echo "The value of varInteger is $varInteger"

Execute a command and store the output in a variable

Windows batch script

FOR /F "delims=" %%x IN ('command to be executed') DO SET result=%%x

Linux shell script

result=`command to be executed`

If statement

Windows batch script

IF %somevar% GEQ %someothervar% (
REM do something
)

Linux shell script

if [ $somevar -ge $someothervar ]
then
# do something
fi

(GEQ and -ge are the ‘greater than or equal to’ comparison operators; both bash and Windows batch scripts let you use a range of comparison operators: EQU / -eq, NEQ / -ne, LSS / -lt, LEQ / -le, GTR / -gt and GEQ / -ge.)

While loop

Windows batch script

SETLOCAL EnableDelayedExpansion
:loop
IF !somevar! GEQ %someothervar% GOTO endofloop
REM do something that increases the value of !somevar!
GOTO loop
:endofloop

Linux shell script

while [ ! $somevar -ge $someothervar ]
do
# do something that increases the value of $somevar
done

(If you’re wondering why I used !somevar! instead of %somevar% in the Windows script, it’s because with delayed expansion enabled the variable is expanded when the line is executed rather than when it is parsed, which matters when a variable is updated and then read again inside a loop. Note that !somevar! only works after SETLOCAL EnableDelayedExpansion. See this article for more info: Windows batch scripting: EnableDelayedExpansion.)
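
As a quick illustration of where delayed expansion really matters (a made-up example, not part of the original comparison): inside a parenthesised block the whole block is parsed in one go, so %count% below would always expand to its value from before the loop started, while !count! shows the updated value on each iteration.

SETLOCAL EnableDelayedExpansion
SET /A count=0
FOR %%i IN (a b c) DO (
	SET /A count+=1
	REM %count% would still expand to 0 here, !count! expands to 1, 2, 3
	echo !count!
)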

Check last command executed OK, if not print error message and exit

Windows batch script

somecommand
IF %ERRORLEVEL% NEQ 0 (
	echo somecommand failed, exiting
	GOTO :EOF
)
REM :EOF is a built-in label that jumps to the end of the batch file, so you don't need to define it yourself (alternatively, use EXIT /B 1)

Linux shell script

somecommand
if [ $? -ne 0 ]
then
	echo "somecommand failed, exiting"
	exit 1
fi

Zeropad a number to 4 digits

Windows batch script

SET zeropadded=000%somenumber%
REM Trim zeropadded to only four digits, from the end
echo 4 digit padded = %zeropadded:~-4%

Linux shell script

zeropadded=`printf "%04d" $somenumber`
echo "4 digit padded = $zeropadded"

Using scripts created in Windows on Linux

A few weeks ago I tried converting a Windows batch script to a Linux shell (bash) script. The syntax of Windows batch scripts and Linux shell scripts is quite different, but I also ran into a problem other than the syntax differences.

After modifying the script to use Linux syntax (e.g. proper loops instead of labels and GOTOs), I ran the script, but it returned the following message:

/bin/bash^M: bad interpreter: No such file or directory

A quick Google search revealed that the problem was the file using Windows line endings (\r\n), while Linux uses \n. Luckily there is a handy utility, found in most Linux distros, that you can use to convert files with Windows line endings to Linux line endings.
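
If you want to confirm that line endings are the culprit, the file utility will tell you - it reports CRLF line terminators for files with Windows line endings:

file myscriptfromwindows.sh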

In Ubuntu I used fromdos, though other distros might provide dos2unix instead. In a terminal, just type:

fromdos myscriptfromwindows.sh

Easy, eh?
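
If neither utility is available, you can strip the carriage returns with sed instead (GNU sed, as found on Ubuntu - it edits the file in place):

sed -i 's/\r$//' myscriptfromwindows.sh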

If you want to convert a file from Linux to Windows line endings instead, you can just use todos or unix2dos.

Selecting distinct merged results from multiple tables in MySQL

I had an issue recently where I needed to get a list of keywords for one of my websites. The website uses two databases, one for the blog, and a different one for the main part of the site. So I wanted to select the unique (distinct) keywords from the keyword tables on both databases.

This is the query I ended up with, which I will explain in a moment:

SELECT `name` COLLATE utf8_unicode_ci FROM `db_1`.`tags_table` UNION SELECT `Subject` COLLATE utf8_unicode_ci FROM `db_2`.`keywords_table` INTO OUTFILE '/tmp/tags.txt';

So, to start with, we need to get the results from both tables merged together. I did this by using two queries joined by a UNION.

SELECT `name` FROM `tags_table` UNION SELECT `Subject` FROM `keywords_table`

Because the tables are in different databases, we need to prefix each table name with its database name followed by a dot (.) when specifying the table to retrieve the records from. E.g. db_1.tags_table selects from the table tags_table in the database db_1. You will also need to be signed in to MySQL as a user that can access both databases (e.g. the root user) if you want to query across multiple databases.

SELECT `name` FROM `db_1`.`tags_table` UNION SELECT `Subject` FROM `db_2`.`keywords_table`

The two tables I needed to select from had different collations. So I also needed to specify the collation in the query to avoid an Illegal mix of collations error.

SELECT `name` COLLATE utf8_unicode_ci FROM `db_1`.`tags_table` UNION SELECT `Subject` COLLATE utf8_unicode_ci FROM `db_2`.`keywords_table`

The default behaviour of UNION is that duplicate rows are removed from the result. So there is no need to use UNION DISTINCT (though you can if you want). If you do want duplicate rows, use UNION ALL.

All that’s left to do is to get the result into a text file. To do this, use INTO OUTFILE at the end of the query, followed by the filename the results should be written to. The file must not already exist, and the mysql user (the system user the MySQL server runs as) must have write permission for the directory where you want to save the file. The easiest thing to do is to write the file to the /tmp directory, and then move it from there to wherever you want manually.

SELECT `name` COLLATE utf8_unicode_ci FROM `db_1`.`tags_table` UNION SELECT `Subject` COLLATE utf8_unicode_ci FROM `db_2`.`keywords_table` INTO OUTFILE '/tmp/tags.txt';

How to set up a W3TC mirror CDN for WP multisite with domain mapping

WordPress Multisite with the domain mapping plugin is a good way to run multiple WordPress sites. It means you don’t have to worry about keeping WordPress and plugins up to date for each individual site.

The W3 Total Cache plugin is one of the most popular WordPress plugins for speeding up WordPress websites. One of the things it offers is serving static content from a CDN (Content Delivery Network). You don’t have to use a real CDN, but can instead use a generic mirror CDN.

A generic mirror CDN doesn’t have all the benefits of a real CDN. However, because you use a different domain (or subdomain) for the CDN address, it does solve the problem of cookies being sent for static content requests. And unlike a real CDN, you don’t need to pay anything extra for it.

Setting up a static mirror

To start off, you’ll need to create a subdomain for your CDN. Using this site as an example, the main address is www.xoogu.com, so I would create a subdomain of static1.xoogu.com for use as the CDN. If your webhost uses cPanel, look for ‘Subdomains’ in the ‘Domains’ section.

When adding your subdomain, ensure that the document root is the same as the document root for the main website address.

If you use a main website address that is not prefixed by www. (or anything else), then you will need to use a completely different domain for the CDN. For example, if my main website address was xoogu.com (no www.), using static1.xoogu.com as a mirror CDN would be pointless.

This is because cookies set on the root domain are sent to all subdomains as well. So it is a good idea to always use the www. subdomain as your main website address to avoid this issue.

After setting up your static subdomain, you’ll need to edit the .htaccess file in your wordpress folder. Add the following:

#Set an environment variable if it is a static subdomain
SetEnvIfNoCase HOST ^static STATIC_DOMAIN
#Set long expires headers for static subdomains
Header set Cache-Control "max-age=29030400, public" env=STATIC_DOMAIN

This first sets an Apache environment variable called ‘STATIC_DOMAIN’ if the requested host starts with ‘static’.

Then the Header line sends a long-expires cache header, but only if the ‘STATIC_DOMAIN’ variable is set (i.e. for all requests to domains / subdomains starting with ‘static’).

On my wordpress .htaccess I have the following as well:

#Rewrite any non-existent files on the www. subdomain to index.php
RewriteRule ^index\.php$ - [L]
RewriteCond %{ENV:STATIC_DOMAIN} !=1
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

#Serve 404 to anyone requesting PHP files from a static subdomain
RewriteCond %{ENV:STATIC_DOMAIN} =1
RewriteRule \.php$ non-existant-file [L]

Now in the W3TC options in the WordPress admin area, you can enable the CDN option and choose Generic Mirror. W3TC will automatically rewrite all your URLs pointing to static content to point at your static subdomain instead. For example, if I included an image in this post with an src of /wp-content/uploads/some-image.jpg, W3TC would change that so when the page is served, the src points to http://static1.xoogu.com/wp-content/uploads/some-image.jpg.

Stop requests to /files/* being rewritten to ms-files.php

However, we’re not quite done yet, as WordPress multisite usually serves static files via PHP. Yes, it is madness. Thankfully there is a ticket on the WordPress Trac for fixing this, with a milestone of version 3.4.

In the meantime, you’ll need to fix this yourself. First check your .htaccess and look for a line like

RewriteRule ^([_0-9a-zA-Z-]+/)?files/(.+) wp-includes/ms-files.php?file=$2 [L]

This is the line that causes static files (with the address containing /files/) to be served by PHP. If you know your site doesn’t use /files/ for anything, then you don’t need to do any of the following. Otherwise, comment that line out by adding a pound sign (#) in front of it:

#RewriteRule ^([_0-9a-zA-Z-]+/)?files/(.+) wp-includes/ms-files.php?file=$2 [L]

Now you need to find out the IDs of each of your blogs. Look at the database for your WordPress site (e.g. using phpMyAdmin) and find the table called wp_blogs. Make a note of the blog_id values and their matching domains.
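
Alternatively, a quick query against the WordPress database will list them (assuming the default wp_ table prefix):

SELECT `blog_id`, `domain` FROM `wp_blogs`;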

Map blog_ids using .htaccess

Unless you can setup a rewrite map, you’ll need to manually add a line to your .htaccess file for each domain:

SetEnvIfNoCase HOST xoogu.com$ BLOG_ID=1
SetEnvIfNoCase HOST some-other-domain.com$ BLOG_ID=2
SetEnvIfNoCase HOST another-domain.com$ BLOG_ID=3
RewriteRule ^files/ wp-content/blogs.dir/%{ENV:BLOG_ID}%{REQUEST_URI} [L]

The SetEnvIfNoCase lines map the domains to the WordPress blog_ids. Then the rewrite rule rewrites the request to the correct directory.

Map blog_ids using a rewrite map

If your host allows rewrite maps, or you have access to configure Apache yourself, then you can do the following instead:

Create a plain text file called domainToBlogIdMap.txt and add your domains. As far as I’m aware, Apache only does exact matches for this kind of map, so you need to include both the www. and static subdomains.

www.xoogu.com 1
static1.xoogu.com 1
www.some-other-domain.com 2
static1.some-other-domain.com 2
www.another-domain.com 3
static1.another-domain.com 3

In your VirtualHost config for the site, add the rewrite map and the rewrite rule:

RewriteMap domainToBlogIdMap txt:/home/xoogu/path/to/domainToBlogIdMap.txt

RewriteRule ^files/ wp-content/blogs.dir/${domainToBlogIdMap:%{HTTP_HOST}}%{REQUEST_URI} [L]

Setting up a static mirror and mapping blog_ids in Nginx

If you’re using nginx you can do the following:

map $http_host $name {
	hostnames;

	*.xoogu.com			1;
	*.some-other-domain.com		2;
	*.another-domain.com		3;
}

server {
	listen		80;
	server_name	static1.xoogu.com static1.some-other-domain.com static1.another-domain.com;
	error_log	logs/wpms-error.log warn;
	access_log	logs/wpms-access.log main;

	root	/home/xoogu/path/to/wordpress;

	location ~* \.php$ {
		deny all;
	}

	location / {
		try_files	$uri /wp-content/blogs.dir/$name$uri =404;
		expires		max;
		access_log	off;
	}
}

In the first section we set up a map to translate the domains to the WordPress blog_ids. Then in the second section we set up our static subdomains. Set the root to the same as for the main site.

Then we have one location block to prevent people accessing PHP files via the static subdomains. The other location block serves the requested file if it exists; otherwise it tries the correct blog_id directory, and if the file doesn’t exist there either, it serves a 404 error.