nic jansma | SOASTA | nicj.net | @nicj
Nic Jansma
SOASTA
Philip Tellis
SOASTA
SOASTA mPulse is a Real User Monitoring (RUM) tool for measuring page load performance
boomerang.js
from a CDN to our customers using
a script loader snippet they include in their HTMLboomerang.js
version (if they ask)<script>
nodes or async
IFRAME
(function(url){
// Section 1
var dom,doc,where,iframe = document.createElement('iframe');
iframe.src = "javascript:void(0)";
iframe.title = ""; iframe.role="presentation"; // a11y
(iframe.frameElement || iframe).style.cssText = "width: 0; height: 0; border: 0";
where = document.getElementsByTagName('script');
where = where[where.length - 1];
where.parentNode.insertBefore(iframe, where);
// Section 2
try {
doc = iframe.contentWindow.document;
} catch(e) {
dom = document.domain;
iframe.src="javascript:var d=document.open();d.domain='"+dom+"';void(0);";
doc = iframe.contentWindow.document;
}
doc.open()._l = function() {
var js = this.createElement("script");
if(dom) this.domain = dom;
js.id = "js-iframe-async";
js.src = url;
this.body.appendChild(js);
};
doc.write('<body onload="document._l();">');
doc.close();
})('http://some.site.com/script.js');
What happens when our customers think our third-party script is causing an issue on their site?
Aren't you supposed to be non-blocking?
Customer:
"Hi guys, So, I was running some WPT tests today and saw this… I thought that this was supposed to be non-blocking. If anyone sees this, they'll hang me up by my heels."
Screenshot #2 from our customer:
The customer shared their WebPageTest results URL, and we looked closer at the test pass
Out of the 9 runs, 2 showed what appeared to be boomerang.js blocking other downloads
Repro #1:
Repro #1 Larger view
Repro #2:
We looked at the 9 test runs, and found 3 more that had some sort of period where nothing happens
The tests show periods of time where the CPU is 100%, and bandwidth (bytes transferred) drops to 0 for 1-4 seconds.
Repro #1 and #2 show 100% CPU and no bandwidth for over a second:
Non-repros: boomerang.js loaded quickly, but two images appeared to "hang"
Non-repros: boomerang.js loaded much earlier, but other content appears to "hang"
Non-repros: Other content "hanging"
WebPageTest has many options for diving deeper
Gives a breakdown of main thread processing and timeline
libpcap
is a portable library for capturingtcpdump.org/manpages/tcpdump.1.html
tcpdump -nS
Higher-level analysis of tcpdump
Analyze PCAP (tcpdump) files in your browser
At this point, we downloaded the WebPageTest tcpdump files to dive deeper into the data
In all of the runs that showed a period of "no progress", we found zero network activity
(Repro #1)
We expect the OS network stack to continue TCP communications even if the browser was "blocked" on a script
(Repro #2)
The usefulness of Browser Dev Tools could be a talk on its own, but we'll give some highlights during our investigations
chrome://tracing
WebPageTest provides Chrome Traces
Repro #2:
Repro #2:
NetLog: Chrome’s network logging system
https://www.chromium.org/developers/design-documents/network-stack/netlog
{"params":{"load_flags":2163712,"method":"GET","priority":"LOWEST",
"url":"http://c.go-mpulse.net/boomerang/KQTS5-4NBTD-EYGLE-64UYR-S5892"},
"phase":1,"source":{"id":588,"type":1},"time":"5454412310","type":91},
{"phase":1,"source":{"id":588,"type":1},"time":"5454412310","type":93},
{"phase":2,"source":{"id":588,"type":1},"time":"5454412310","type":93},
{"phase":1,"source":{"id":588,"type":1},"time":"5454412310","type":101},
{"phase":2,"source":{"id":588,"type":1},"time":"5454412310","type":101},
{"phase":1,"source":{"id":588,"type":1},"time":"5454412310","type":102},
{"params":{"byte_count":1460},"phase":0,"source":{"id":275,"type":4},
"time":"5454412311","type":62},
{"phase":2,"source":{"id":529,"type":1},"time":"5454412311","type":143},
{"phase":1,"source":{"id":529,"type":1},"time":"5454412311","type":143},
{"params":{"byte_count":443},"phase":0,"source":{"id":275,"type":4},
"time":"5454412313","type":62},
{"phase":2,"source":{"id":529,"type":1},"time":"5454412313","type":143},
{"phase":1,"source":{"id":529,"type":1},"time":"5454412313","type":143},
{"phase":2,"source":{"id":275,"type":4},"time":"5454412313","type":37},
{"phase":2,"source":{"id":529,"type":1},"time":"5454412313","type":143},
It will be great to (re)prove that our script loader works even if our CDN is down, or if there are delays in the network
How can we do this? There are a couple tools that can help with Single Point of Failure (SPOF) testing
blackhole.webpagetest.org
drops all traffic
setDnsName c.go-mpulse.net blackhole.webpagetest.org
navigate our-customer.com
No issues with blocking our CDN c.go-mpulse.net
Let's try to do SPOF on our local machine as well
/etc/hosts
On Windows: C:\windows\system32\drivers\etc\hosts
blackhole.webpagetest.org == 72.66.115.13
72.66.115.13 apis.google.com
72.66.115.13 www.google-analytics.com
72.66.115.13 c.go-mpulse.net
Conclusion:
Conclusion:
/etc/hosts
, and FiddlerCustomer:
"I'm seeing pages that should match showing up in No Page Group again"
(No Page Group)
category that should have
matched a URLWe obviously work for SOASTA, and mPulse is our RUM product
RUM lets you view your real-world customer data from an aggregate level
What are the most common causes of (No Page Group)
?
iOS Mobile Safari sticks out:
RUM Waterfalls let you look at real-world individual page loads
One great use of Fiddler is to monitor external browser traffic without having Browser Dev Tools open (including mobile traffic!)
At this point, we sat with an iPad, reloading the home page hundreds of times to try to get a repro...
And tried...
And tried...
... an hour later, after trying many ways of loading the home page, we finally got a hit!
www.customer.com
in the address bar, but got a phone
call, so didn't hit Go
yetNow that we had a repro, we were able to narrow down the issue to a bug in Boomerang that didn't deal with pre-render state transitions properly.
The fix was pretty straightforward, but we needed to test it.
Fiddler allows you to inject your own content in place other live content on any host
We injected our fixed version into the customer' site, and validated that it worked
Conclusion:
Stop messing with my readyState
www.customer.com
and found that Boomerang wasn't reliably sending a Page Load beaconwindow.onload
and fire a beacon, but this wasn't happeningdocument.readyState == "loading"
even though window.onload
had firedwindow.pageshow
was firing before window.onload
-- window.onload
should be firstOur guess was that there was a script running on our customer's site that was messing with some of the document loading states, but had to prove it
One way is to fetch, unminify and analyze all of the site's JavaScript, but there are a couple easier ways if you want to use the Browser Dev Tools to work for you
We started out with a guess that something was changing window.onload
or
document.readyState
One way of modifying pre-existing DOM properties is via Object.defineProperty
Inject this in the page to find anyone using it:
Object.defineProperty = function(obj, prop, descriptor) {
debugger;
};
Chrome/IE/FF pretty-print (unminify) is the greatest thing
Chrome/IE/FF pretty-print (unminify) is the greatest thing
We also a similar change of document.readyState
Object.defineProperty(
document,
"readyState",
{
get:
function()
{
return document.someOtherReadyState;
}
});
Conclusion:
window.onload
and document.readyState
were intentional
by another third-party script for FEO optimizationPremature optimization is the root of all good intentions
We use NodeJS for many things at SOASTA:
Useful NodeJS NPM modules for command-line scripts:
jetty
: ANSI control sequencesfast-stats
: Statistical analysis of numeric datasetscli-table
: Tables for the command-linecommander
: Command-line argument parsingline-by-line
: Reads large files without buffering into memoryWe fetched gigabytes of dropped-beacon log files, and started doing some statistical analysis on the causes
We can break down the dropped-beacons data by dimensions to help guide us towards finding a repro:
NodeJS cli-table
output. By browser:
┌──────────────────────────────┬──────────┬──────────┐ │ URL │ Count │ % │ ├──────────────────────────────┼──────────┼──────────┤ │ IE/7.0 │ 1559 │ 66.65 │ ├──────────────────────────────┼──────────┼──────────┤ │ IE/9.0 │ 293 │ 12.53 │ ├──────────────────────────────┼──────────┼──────────┤ │ Safari/5.1.9 │ 283 │ 12.10 │ ├──────────────────────────────┼──────────┼──────────┤
NodeJS cli-table
output. By beacon type:
┌──────────────────────────────┬──────────┬──────────┐ │ URL │ Count │ % │ ├──────────────────────────────┼──────────┼──────────┤ │ xhr │ 2222 │ 95.00 │ ├──────────────────────────────┼──────────┼──────────┤ │ navigation │ 37 │ 1.58 │ ├──────────────────────────────┼──────────┼──────────┤ │ ... │ 7 │ 0.30 │ ├──────────────────────────────┼──────────┼──────────┤ │ Total │ 2339 │ 100 │ └──────────────────────────────┴──────────┴──────────┘
NodeJS cli-table
output. By URL:
┌───────────────────────────────────────┬──────────┬──────────┐ │ URL │ Count │ % │ ├───────────────────────────────────────┼──────────┼──────────┤ │ http://www.customer.com/api/foo │ 2187 │ 93.50 │ ├───────────────────────────────────────┼──────────┼──────────┤ │ http://www.customer.com/anotherurl │ 9 │ 0.38 │ ├───────────────────────────────────────┼──────────┼──────────┤
/api/foo
endpointWe sat our VirtualBox IE 9 browser on www.customer.com
for a while, watching XHRs
and beacons flow past
Both IE 9 Developer Tools and Fiddler showed something interesting:
IE 9 Developer Tools showing aborted requests to our injected <javascript>
that updates the token and timestamp:
Conclusion:
<javascript>
node that was fetching the updated CSRF token and timestamp<javascript>
node around
for a bit) via the same toolsThe many ways to send a beacon
... and the many ways to not send a beacon
We send the boomerang.js beacon to mPulse via several methods:
IMG
element with
a img.src
containing the payload in the query stringFORM
element and call form.submit()
on itWindows 10 and Edge had just been released, and a customer reported that their site was hanging in Edge on some pages, and that it no longer did when boomerang.js was removed from their site
We had tested Windows 10 Techincal Preview (the previous Edge build) thoroughly, but something in the final release was causing problems
Sure enough, loading customer.com
would hang Edge for up to
30 seconds.
Since the browser was hung, it was hard to use the Edge debugger
Time to dive into system-level tracing!
Available tracing:
Simple trace of system evenst:
xperf.exe -on latency -stackwalk profile
// [run scenario]
xperf.exe -stop -d myscenario.etl
Internet Explorer and Chrome both fire ETW events that you can overlay in the charts and see in the tables
Microsoft-IE events:
Microsoft-IEFRAME:
performance.mark("startTime1");
performance.mark("endTime1");
performance.mark("startTime2");
performance.mark("endTime2");
performance.measure("durationTime1", "startTime1", "endTime1");
performance.measure("durationTime2", "startTime2", "endTime2");
By using -stackwalk
on the command line, you can enable stacks on many events
Public symbol servers:
https://msdl.microsoft.com/download/symbols
http://symbols.mozilla.org/firefox
https://chromium-browser-symsrv.commondatastorage.googleapis.com/
More great tutorials on ETW, UIForETW, and xperf are available at: randomascii.wordpress.com
via Bruce Dawson @BruceDawson0xB
xperf -on latency -stackwalk profile
xperf -d repro.etl
The trace shows Edge spending nearly 100% CPU for over 70 seconds:
FORM
submission, was triggering this Edge hang200 OK
response, or204 No Content
response that was missing a Content-Length: 0
header.We were able to test different fixes across our test matrix (IE 6 - Edge, Chrome, Firefox, Safari, Mobile Safari, Android, Lynx, etc) using Fiddler