How to get Google’s cached copy of a page and text version with parameters

If, for some reason, you can’t get the cached version the good old-fashioned way by typing cache:www.example.com into Google search (and the reason isn’t that the page is not indexed or carries the noarchive tag), use this URL:

http://webcache.googleusercontent.com/search?q=cache:http://www.example.com

If you want the text version of the cache, add &strip=1 to the end like so:

http://webcache.googleusercontent.com/search?q=cache:http://www.example.com&strip=1 
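
If you need this for a whole list of pages, the two URL patterns above are easy to script. Here is a rough JavaScript sketch, nothing official: the helper name buildCacheUrl and its textOnly flag are made up for illustration, and it simply glues the page URL onto the same webcache address shown above without any extra encoding.

```javascript
// Base address copied from the cache URL shown above
const CACHE_BASE = "http://webcache.googleusercontent.com/search?q=cache:";

// buildCacheUrl is a hypothetical helper name, not a Google API
function buildCacheUrl(pageUrl, textOnly = false) {
  const cacheUrl = CACHE_BASE + pageUrl;
  // &strip=1 asks for the text version of the cache
  return textOnly ? cacheUrl + "&strip=1" : cacheUrl;
}

console.log(buildCacheUrl("http://www.example.com"));
console.log(buildCacheUrl("http://www.example.com", true));
```

The second call prints the text-only variant, i.e. the same address with &strip=1 on the end.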

How to combine server logs (all files) using Windows command prompt

For SEO purposes, we typically have to analyze server logs to understand what the heck robots are actually doing on our site – sometimes, you’ll get a bunch of individual files from your hosting company, which makes getting all of the data needlessly laborious. For those of you Windows fans, this is how you can easily combine several log files into one file for easy importing into your favourite log analyzer, or even Excel.

Steps

1) Stick all of your server log files into one folder, copy the path to the folder (CTRL + C)


2) Click on the Start button, type CMD  (On Windows 8? Poor you, go get your start button back!)


3) Type in “cd” (without quotes), space bar, then right click in the window and choose Paste. For example, I put my .log files in C:\logfiles so my command would be cd C:\logfiles


4) Now we can combine all the files together. There are a few different ways of doing this, but I prefer to select the exact format of files I want to combine. Call me anal, meh. In my case, all of my files have a .log extension.


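5) I use the TYPE command because the COPY command might not work if the files are in use somewhere, i.e. it’s just easier. To combine all of my files into one, I’ll do this: type *.log > biglogfile.log. This means I’m selecting any file (*) with a .log extension and redirecting (>) the contents into one file (biglogfile.log). Go ahead and press enter and let the computer do its thang. One catch: biglogfile.log matches *.log too, so if Windows complains about that one file, send the output to a name that doesn’t match (for example type *.log > biglogfile.txt) and rename it afterwards.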


6) Boom…done. This works with any file type, and you can write the output to a different file format too – if you’re stuck, go see this thread
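
If you’d rather script it (or you’re not on Windows), the same idea can be sketched in a few lines of Node.js. Treat this as a rough sketch under assumptions: combineLogs is a made-up helper name, it assumes the folder holds plain files only, and it skips the output file so the result never gets appended to itself.

```javascript
const fs = require("fs");
const path = require("path");

// combineLogs is a hypothetical helper; folder and file names are up to you
function combineLogs(folder, outputName, extension = ".log") {
  // List the inputs first, skipping the output file in case it already exists
  const inputs = fs
    .readdirSync(folder)
    .filter((name) => name.endsWith(extension) && name !== outputName)
    .sort();

  const outputPath = path.join(folder, outputName);
  fs.writeFileSync(outputPath, ""); // start with an empty output file

  for (const name of inputs) {
    // Append each file's raw bytes, in sorted file name order
    fs.appendFileSync(outputPath, fs.readFileSync(path.join(folder, name)));
  }
  return inputs.length; // number of files combined
}

// Example call (C:\logfiles is the folder used in the steps above):
// combineLogs("C:\\logfiles", "biglogfile.log");
```

The return value is just the number of files that were glued together, handy as a quick sanity check against what your hosting company sent you.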


If any Mac users stumbled onto this article, I’m sorry but I’m allergic to Apple. I’m sure it’s easy enough…

Cross domain canonicals from Blogspot blog


Disclaimer:

I think Blogspot / blogger is a piece of cr*p. I don’t blame you for having a blogspot blog, but now that you’ve had to Google around to find a cross domain canonical fix, you know exactly how bad it is. For the love of {insert your preferred deity here}, DO NOT HOST ANYTHING OF ANY VALUE ON BLOGSPOT EVER AGAIN – sincerely, your friendly professional SEO, Dave.

Okay, so for some reason you need to create cross domain canonical tags from your blogspot blog to “wherever”, and you need to control this at page level. I am going to save you from hours of torture, hair loss, and potentially an aggravated trip to Mountain View.

I will assume the following:

  1. You’re not able to redirect your entire blogspot blog to a custom domain – directions here
  2. You have a complete list of all URLs and page titles from your blogspot blog
  3. You’ve exported your content and managed to load it onto a different domain – directions here
  4. You have admin access to the blogspot blog, obvious, but just making sure.
  5. You’ve got basic Excel skills and know how to Vlookup match your exported pages to your new pages via page titles

Passed all those?

Here’s the code for cross domain canonical in the template:

<b:if cond='data:blog.canonicalUrl == "http://whatever.blogspot.com/this-post"'><link href="http://whatever.com/blog/this-post" rel="canonical"/></b:if>

Easy right? Woohooo! Nay friend, nay.

Bullshit to watch out for #1 – The “canonicalURL”

whatever.blogspot.com works….so does:

  • whatever.blogspot.com.es
  • whatever.blogspot.co.uk
  • whatever.blogspot.in
  • etc…

Yep, Google decided to duplicate every single f*cking URL on international ccTLDs, regardless of whether you wanted it or not. So how do they solve the issue of ridiculous amounts of dupe content? Well, they rel canonical back to one version. Slick, and totally fu*king unnecessary in the first place.

There’s a bunch of old blog posts you’ll probably come across that mention using <b:if cond='data:blog.url… which wasn’t wrong at the time, but since some drunk at Google decided to auto-implement this geo-bullshit, well, that doesn’t work so well anymore. I tried for ages, and it basically ignored <b:if cond='data:blog.url every friggin time. Why? No clue, I’d have a better shot explaining why Taco Bell is still allowed to serve Americans grade F dog meat in their burritos.

You need to use the data:blog.canonicalUrl Blogger XML variable to get the cross domain canonical condition to fire on all ccTLDs. Don’t ask why, just do it and get back to drinking.
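
To picture what that means for your URL list: every country version of a post should collapse back to one address, which is handy when a crawl or export hands you a mix of ccTLD URLs. The JavaScript below is only an illustration, not Blogger’s actual logic; the helper name toBlogspotCom is made up, and it assumes the canonical version is the .blogspot.com one, as in the examples in this post.

```javascript
// Hypothetical helper: maps any blogspot country domain back to .blogspot.com
function toBlogspotCom(url) {
  const parsed = new URL(url);
  // Matches whatever.blogspot.in, whatever.blogspot.co.uk, whatever.blogspot.com.es, ...
  const match = parsed.hostname.match(/^(.+)\.blogspot\.[a-z]{2,3}(\.[a-z]{2})?$/);
  if (match) {
    parsed.hostname = match[1] + ".blogspot.com";
  }
  // Non-blogspot URLs are returned as parsed, with the host left alone
  return parsed.toString();
}

console.log(toBlogspotCom("http://whatever.blogspot.co.uk/this-post"));
console.log(toBlogspotCom("http://whatever.blogspot.com.es/this-post"));
```

Both calls print http://whatever.blogspot.com/this-post, so the ccTLD duplicates in your list end up as one row.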

Bullshit to watch out for #2 – The default “canonical”

You know what happens when you have 2 canonical instructions on one page? Google ignores both completely.

In your template, you’ll probably have this line of code:

<b:include data='blog' name='all-head-content'/>

You know what that means? It means Google is going to automatically insert whatever they want, because they know best. In this specific scenario it’s going to insert a few extra meta tags that you don’t give a sh*t about since you’re cross domain canonical’ing anyway, but most importantly, it will insert a default canonical tag, which cannot be there if you need to add your own custom canonical tag.

Get rid of it, it’s about as useful as flesh eating disease to you at this point. 

Bullshit to watch out for #3 – Homepage canonical

The homepage is special: you’ll need to add another if statement to handle it, preferably at the top. Pay very close attention to how I reference the blogspot.com URL. You must reference it with the trailing slash or it won’t work!

<b:if cond='data:blog.canonicalUrl == "http://whatever.blogspot.com/"'><link href="http://whatever.com/blog/" rel="canonical"/></b:if>

That’s it, add that to the final block of code in the final implementation section below.

Bullshit to watch out for #4 – Copy & Pasting my code

‘ ” and other characters get bastardized pretty quickly on different platforms.  The biggest culprits are single quotes (‘), just re-type them in okay?

The final implementation!

Now that you’ve gotten rid of that rancid <b:include data='blog' name='all-head-content'/> and done all the steps I’ve told you about in the “assumed” section at the top of this post, you’re now going to Excel the shit out of your current blogspot URLs to match your new domain’s URLs.

You should probably back up your template first.

In the <head> section of the template editor (html editor), add ALL of your if statements:

<b:if cond='data:blog.canonicalUrl == "http://whatever.blogspot.com/this-post"'><link href="http://whatever.com/blog/this-post" rel="canonical"/></b:if>

<b:if cond='data:blog.canonicalUrl == "http://whatever.blogspot.com/another-post"'><link href="http://whatever.com/blog/another-post" rel="canonical"/></b:if>

Save it, then go have a beer.
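
If your URL list is long and you’d rather not build the if statements by hand in Excel, they can also be generated from the old/new URL pairs. This is a rough JavaScript sketch, not a definitive tool: canonicalRule and buildCanonicalRules are made-up helper names, and it assumes your URLs contain no quotes, ampersands or angle brackets (it refuses them rather than guessing how to escape them).

```javascript
// canonicalRule and buildCanonicalRules are hypothetical helper names
function canonicalRule(oldUrl, newUrl) {
  for (const url of [oldUrl, newUrl]) {
    // Quotes, ampersands and angle brackets would need escaping in the template
    if (/['"&<>]/.test(url)) {
      throw new Error("URL needs manual escaping: " + url);
    }
  }
  return `<b:if cond='data:blog.canonicalUrl == "${oldUrl}"'>` +
    `<link href="${newUrl}" rel="canonical"/></b:if>`;
}

// pairs: array of [blogspotUrl, newUrl], e.g. pasted from your matched Excel sheet
function buildCanonicalRules(pairs) {
  return pairs.map(([oldUrl, newUrl]) => canonicalRule(oldUrl, newUrl)).join("\n");
}

const rules = buildCanonicalRules([
  ["http://whatever.blogspot.com/", "http://whatever.com/blog/"], // homepage, with trailing slash
  ["http://whatever.blogspot.com/this-post", "http://whatever.com/blog/this-post"],
]);
console.log(rules);
```

The output is one if statement per line, ready to paste into the <head> section. Re-type or at least double check the quotes after pasting, for the reasons in #4.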

If you have other questions, drop me a line below – I may/may not respond (just being honest, I treasure my free time).

Internal 301 to homepage treated as 404 by Google

Back in May 2013, during a Webmaster Central hangout with John Mueller, John confirmed that Google treats internal 301s to the homepage as 404s. That should mean that if you 301 internal pages to the root, they won’t pass PageRank. However, this can still be interpreted incorrectly, as many questions remain unanswered.

For example, if example.com/pageA has 100 external links and is subsequently 301’d to the root example.com, does that mean that the value of these links is also gone?

What’s your take on it?

The conversation starts around the 22nd minute in the video below.

How to extract title & meta data using Gdocs, Xpath and ImportXml

I’m pretty sure everyone knows I have an unhealthy obsession with Google docs, and the wonderful things it can achieve. I’ve actually switched from my beloved Microsoft suite to Gdocs full time. Just waiting for those clever Google engineers to raise the 400k cell limit on a spreadsheet.

This one goes out to Paul from Clixfuel.com who’s asked how to get important meta data from a webpage quickly using ImportXML.

Okay, so the first thing I’m going to say is to always do a quick Google for any Xpath around your topic, and then understand you’ll need to adapt it for Google docs – you just need to know what you need to change 🙂 I found a brilliant response here, which I’m actually going to use.

Normal Xpath like this doesn't work:
 /html/head/meta[@name="description"]/@content

Oooh, cool. Only one teeny, tiny problem – Google docs chokes on the double quotes in that Xpath. Most likely that’s because the Xpath sits inside a double-quoted string in the formula, so the inner double quotes cut the string short (I’m not a programmer, I’m a hack job). This is right, notice the single quotes around description:

/html/head/meta[@name='description']/@content

We don’t even need to step backwards to the <html> or <head> tag either! This will work too:

//meta[@name='description']/@content

Okay, let’s go ahead and pull out the title, meta description and meta keywords from a webpage:

=importxml("http://www.davidsottimano.com/how-to-extract-title-meta-data-using-gdocs-xpath-and-importxml/","//title")

=importxml("http://www.davidsottimano.com/how-to-extract-title-meta-data-using-gdocs-xpath-and-importxml/","//meta[@name='description']/@content")

=importxml("http://www.davidsottimano.com/how-to-extract-title-meta-data-using-gdocs-xpath-and-importxml/","//meta[@name='keywords']/@content")

Go check out the sheet to see it work!
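
If you’re building these formulas for lots of URLs, you can generate the formula text instead of typing it. A minimal JavaScript sketch, assuming a made-up helper name importXmlFormula; it doubles any double quotes, which is how a spreadsheet formula string carries a literal double quote, so an Xpath copied with double quotes doesn’t cut the formula short.

```javascript
// importXmlFormula is a hypothetical helper name, not part of Google docs
function importXmlFormula(url, xpath) {
  // Double any double quotes so they cannot end the formula string early
  const esc = (text) => text.replace(/"/g, '""');
  return `=importxml("${esc(url)}","${esc(xpath)}")`;
}

const pageUrl = "http://www.davidsottimano.com/how-to-extract-title-meta-data-using-gdocs-xpath-and-importxml/";
console.log(importXmlFormula(pageUrl, "//title"));
console.log(importXmlFormula(pageUrl, "//meta[@name='description']/@content"));
```

Paste the printed lines into cells and you get the same formulas as above.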

Hope that helps, leave any questions in the comments and I’ll get back to you.

Bulk ImportXml tool & source (Google docs spreadsheets)


A few of you have asked for a way to get around the limit of 50 importxml formulas in Google docs, so I’ve decided to release something publicly.

Click here to view the spreadsheet

Just make sure to sign in, then make a copy, then press the run button once to authorize the script. If the script doesn’t run, or isn’t there, see the section below.

How does it work?

Please keep in mind I AM NOT A PROGRAMMER, but I do ensure that my code works properly – so please be constructive with your feedback 🙂

The only way I could do this efficiently was to keep the ImportXml formula in the sheet and use a script to feed it one URL at a time. Setting the formula from the script never worked for me: I was never able to write importxml into a cell with the Range class’s setFormula method and then replace the formula with its value fast enough. Even when I did manage to copy the values and clear the importxml formula from the cell, it would either time out, result in errors or very rarely…work.

Another fun issue was that Google docs would store the results for importxml in cache, but would display #N/A when I ran through the first loop. WTF. Ok, so add in another loop and now it’s displaying the right results. Don’t ask, I have no idea, but it works.

The script isn’t authorizing, or it’s not there!

Yep, that can happen – here is the source code.

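function bulkXml() {

  var sheet = SpreadsheetApp.getActiveSheet();
  // inputBox returns text, so convert it to a number before using it in the loop
  var num = parseInt(Browser.inputBox("How many URLs do you need to scrape?"), 10);

  // Two passes: the first one tends to come back as #N/A, the second writes the real results
  for (var y = 0; y < 2; y++) {

    // URLs are read from column A, starting at row 2
    for (var x = 2; x - 2 < num; x++) {

      var url = sheet.getRange(x, 1).getValue();
      // Drop the URL into F2, then read whatever F3 currently shows
      sheet.getRange(2, 6).setValue(url);
      var xpathResult = sheet.getRange(3, 6).getValue();
      var counter = x - 1;
      sheet.getRange("C4").setValue(" PLEASE WAIT...CURRENTLY FETCHING " + counter + " OUT OF " + num);

      if (y === 1) {
        // Second pass only: write the result into column B next to its URL
        sheet.getRange(x, 2).setValue(xpathResult);
        sheet.getRange("C4").setValue("PROCESSED " + counter + " OUT OF " + num);
        SpreadsheetApp.flush();
      }

    }

  }

}

// Empties the URL and result columns (A2:B1000)
function clear() {
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange("a2:b1000").setValue("");
}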

Click on Tools > Script editor and copy paste into there. Make sure you save the script and then you should be good to go.

When I click on the button nothing happens!

I’ve assigned scripts to the buttons, but they sometimes get lost when you make a copy of the Google doc.

Right click on the Run button and you’ll see a drop down arrow in its top right corner. Click it, select Assign script, then enter: bulkXml