Bulk ImportXml tool & source (Google docs spreadsheets)

There’s been a few of you requesting a way to bypass the 50 importxml limit in Google docs so I’ve decided to release something publicly.

Click here to view the spreadsheet

Just make sure to sign in, then make a copy, then press the run button once to authorize the script. If the script doesn’t run, or isn’t there, see the section below.

How does it work?

Please keep in mind I AM NOT A PROGRAMMER, but I do ensure that my code works properly – so please be constructive with your feedback :)

The only way I could do this efficiently was to use a script to set up the ImportXml formula in the sheet. This means that I was never able to call importxml with the Sheet class, setFormula method and then replace the formula fast enough. Even if I did manage to copyvalues and clear the importxml formula from the cell, it would either timeout, result in errors or very rarely…work.

Another fun issue was that Google docs would store the results for importxml in cache, but would display N/A# when I ran through the first loop. WTF. Ok, so add in another loop and now it’s displaying the right results. Don’t ask, I have no idea, but it works.

The script isn’t authorizing, or it’s not there!

Yep, that can happen – here is the source code.

function bulkXml() {

  var sheet = SpreadsheetApp.getActiveSheet();
  var Num = Browser.inputBox("How many URLs do you need to scrape?");

  for (y=0;y<2;y++) {

    for (x=2;x-2 < Num;x++)  {

      var url = sheet.getRange(x,1).getValue();
      sheet.getRange(2,6).setValue(url);
      var xpathResult = sheet.getRange(3,6).getValue();    
      var counter = x -1;
      sheet.getRange("C4").setValue(" PLEASE WAIT...CURRENTLY FETCHING " + counter + " OUT OF " + Num);

      if (y===1){
        sheet.getRange(x,2).setValue(xpathResult);
        sheet.getRange("C4").setValue("PROCESSED " + counter + " OUT OF " + Num);
        SpreadsheetApp.flush();
      }

    }

  }

}

function clear() {
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange("a2:b1000").setValue("");

}

Click on Tools > Script editor and copy paste into there. Make sure you save the script and then you should be good to go.

When I click on the button nothing happens!

I’ve assigned scripts to the buttons, but they sometimes get lost when you make a copy of the Google doc

Right click on the Run button, in the top right you’ll see a drop down arrow. Select assign script, then enter: bulkXml

13 Comments

  • Amar Chahal says:

    Great tool. I couldn’t get it to work for large sets of URLs (1500+), though. I keep getting the error “please check the page, something went wrong”. However, when I do it manually using importXML, it works fine.

    • dsottimano says:

      It turns out that Google docs has a script timeout at around 6 minutes (or at least it keeps cutting out around that time). I managed to go through 950 URLs today. I’ll have another look at it and see if I can find the fix. Thanks for the heads up!

  • Hywel Morgan says:

    This tool is looks superb, just what I need, but I’m struggling to get anything out of it. Probably my incompetence. a) I can’t get it to do more than 4 results and b) the results just flash by in F2 and don’t seem to be recorded anywhere. What I’m I doing wrong? (Sorry if I appear to be a luddite!)

    • dsottimano says:

      You weren’t doing anything wrong – I made the mistake of setting the num variable to 4 in the spreadsheet. Fixed now.

  • ashj says:

    Hi am quite new to this can you please tell me how i can use this to extract info from other website to make a price tracker …thanks

  • Philip Diamond says:

    This tool is fantastic however I have two major questions: Question 1–where in the script is it actually pulling the XML? I don’t see how it actually does that, nor where it uses the contents of cell C2 to access the appropriate path. Question 2–I am currently using arrayFormula + Index to pull specific points of data out of an XML import, however your code does not allow for such data manipulation. For that reason I am looking to understand where in the code it actually pulls the XML so I can modify it appropriately, as mentioned in my first question. Thanks so much for the great script, and I really hope to hear back soon.

    Best,
    Phil

    • dsottimano says:

      Hi Phil,

      The data (importxml) call is in F3 (hidden column, just unhide it), and for every URL you need to scrape (a1:a??), I use =iferror(concatenate(importxml(F2,C2)),”Please check the page, something went wrong”). You could stick your index formula into that, and everything should be fine. Sorry about the delay, never got notified of the comment!

  • Jan Kadela says:

    Hi David, your script is a great help!!!!! The problem I have is that it keeps timing out and it doesn’t capture all the data for larger data sets. I cannot use JAVA, just a thought though. Would it be possible to write the script so that it would always take let’s say first 10 urls to process and then the other 10 etc. untill it completes the process? So it would be batching the load by 10 let’s say where there is no problem with fetching the data. Thank you

    • Jan Kadela says:

      I found this answer, would anyone be able to include it in the script, for it to work even better?
      **

      Figure out a way to split up your work so it takes less than 6 minutes, as that’s the limit for any script. On the first pass, you can iterate and store the list of files and folders in a spreadsheet and add a time-driven trigger for part 2.

      In part 2, delete each entry in the list as you process it. When there are no items in the list, delete the trigger.

      This is how I’m processing a sheet of about 1500 rows that gets spread to about a dozen different spreadsheets. Because of the number of calls to spreadsheets, it times out, but continues when the trigger runs again.

  • Alok says:

    Dav! You are very helpful. This page help me to setup my own spreadsheet that save lots of time and money too. Thanks a lot.

  • Gasdior Giovo says:

    Hey I have problem to show a multiple range results…is that possible? I mean to have the result from one link import in different cells not only in the one next to the link…thanks, great app

Leave a Reply

Your email address will not be published.