What’s the saying? “There are no dumb questions, only dumb answers”? Well, apparently that can also apply to JavaScript methodologies. This post is written to answer what I thought was a “dumb” question. It was a question that I posed to myself after reading Ingo Rammer’s tweet: “Wow .. the xlsx.js code is *exactly* how one shouldn’t parse XML. Ever. I thought we passed this line about five years ago …”
Now, I didn’t take offense at what Ingo said. In fact, I agreed with him.
Back when I first started writing XLSX.js, I knew that parsing the XML out as a string was not in keeping with current practices. These days we have DOMParser in all the major browsers, and the XMLDOM ActiveX object for older IE browsers. We can deal with these XML nodes as hierarchical, searchable objects. But… I figured I’d just stick with the string parsing anyway. I assumed that I would update it at some point, when I had more time, but initially I was looking to create a solution quickly and theorized that string processing was probably somewhat faster. Fast forward to a week ago, when a need arose that will require me to expand XLSX.js’ capabilities. As I will need to modify the reading and writing code anyway, I started by switching the read code over to DOMParser. Upon completion, I remembered Ingo’s tweet and wondered whether I should post a little comment on the blog explaining why I used string processing in the first place. In fact, I thought, maybe I’ll just throw both pieces of code into jsPerf to substantiate my theory of string processing being a little faster. It’s probably a little faster, but can it really be fast enough to justify staying with that method?
But, that’s a “dumb” question, right? Parsing XML using the DOM is more maintainable, it’s more extensible, it’s likely more stable. It’s certainly more common. The community has decisively gone this direction, either natively or via libraries like jQuery, so it must be the better solution in this instance. However, despite those valid points, the jsPerf provided a compelling rebuttal. My initial tests showed that string processing was 10x faster on Safari 5, and 41x faster on IE. Someone out there found the jsPerf on an Android device, and showed string processing to be 47x faster there. So, the performance gains are highly significant.
However, while DOMParser will only be called once, getElementsByTagName will be called an increasing number of times as the document gets bigger. My initial tests were with a fairly small document, so I wondered whether this performance difference would shrink as the document size increased. I created an Excel spreadsheet with two columns and one thousand rows, and placed the contents of the worksheet XML file in a preparation code variable. The DOM2 and String2 tests use the same code as the smaller tests, but they process a much larger file. In the large tests, string processing proved to be 7.5x faster on Safari, and 32x faster on IE. Firefox was an outlier, in that its overall performance on the large document was worse than that of the Android device. Therefore, its roughly equal performance between XML parsing and string processing may not be representative.
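To make the comparison concrete, here is a minimal sketch of the two styles of reading cell values from worksheet XML. It is not the actual XLSX.js or jsPerf code; the function names, the sample XML, and the simplifications (every row has an explicit closing tag, and `<v>` elements carry no attributes) are assumptions made for illustration.

```javascript
// Hypothetical sample: a tiny SpreadsheetML-style worksheet fragment.
var sheetXml =
  '<worksheet><sheetData>' +
  '<row r="1"><c r="A1"><v>10</v></c><c r="B1"><v>20</v></c></row>' +
  '<row r="2"><c r="A2"><v>30</v></c><c r="B2"><v>40</v></c></row>' +
  '</sheetData></worksheet>';

// String approach: walk the text with indexOf; no node tree is ever built.
// Assumes rows are written as <row ...>...</row> and values as <v>...</v>.
function readRowsFromString(xml) {
  var rows = [];
  var rowStart = xml.indexOf('<row');
  while (rowStart !== -1) {
    var rowEnd = xml.indexOf('</row>', rowStart);
    if (rowEnd === -1) { break; } // malformed input: stop rather than guess
    var values = [];
    var vStart = xml.indexOf('<v>', rowStart);
    while (vStart !== -1 && vStart < rowEnd) {
      var vEnd = xml.indexOf('</v>', vStart);
      if (vEnd === -1) { break; }
      values.push(xml.substring(vStart + 3, vEnd)); // 3 === '<v>'.length
      vStart = xml.indexOf('<v>', vEnd);
    }
    rows.push(values);
    rowStart = xml.indexOf('<row', rowEnd);
  }
  return rows;
}

// DOM approach: DOMParser runs once, but getElementsByTagName runs once
// per row, so the number of calls grows with the document.
function readRowsFromDom(xml) {
  var doc = new DOMParser().parseFromString(xml, 'text/xml');
  var rowNodes = doc.getElementsByTagName('row');
  var rows = [];
  for (var i = 0; i < rowNodes.length; i++) {
    var valueNodes = rowNodes[i].getElementsByTagName('v');
    var values = [];
    for (var j = 0; j < valueNodes.length; j++) {
      values.push(valueNodes[j].textContent);
    }
    rows.push(values);
  }
  return rows;
}

var rowsFromString = readRowsFromString(sheetXml);
// DOMParser is a browser API; skip the DOM version where it is unavailable.
var rowsFromDom = (typeof DOMParser !== 'undefined') ? readRowsFromDom(sheetXml) : null;
```

The string version trades robustness for speed: it only works because the file format is predictable, which is also why a real worksheet would need more care (for example, other element names that begin with `row`, or self-closing rows).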
Now, not all performance increases are noteworthy. A 3,200% performance increase for a loop that executes a total of 10 simple additions may not be of any real significance. In such a case, both will run so quickly that it is likely better to go with the more maintainable, extensible, stable, and conventional method. However, XLSX.js is not akin to such a simple loop. Spreadsheets can easily contain thousands of cells with data and formatting. Also, every instance of data or text formatting is associated with nodes describing said formatting. Now, this doesn’t necessarily mean a 1:1 relationship between the number of cells and the number of formatting-related nodes. However, every formatted cell will be associated with at least one formatting-descriptive node, regardless of whether that association is exclusive. Therefore, it is feasible (though somewhat unlikely) that a spreadsheet with 2,000 formatted cells could require the reading or writing of at least 6,000 nodes. In such a scenario, reduced processing time becomes a much larger consideration. Still, if processing a document in 3.1% of the time otherwise required does not seem significant, you may want to think of the capacity increase. For whatever amount of time you deem to be acceptable for a user to wait for their finished document/data, processing the XML as a string will allow you to work with a document 32x the size otherwise possible. Those 6,000 data or formatting nodes can become 192,000.
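The arithmetic behind those figures can be sketched in a few lines. The variable names are made up for illustration, and the calculation assumes that processing time scales roughly linearly with the number of nodes, which the tests above do not guarantee.

```javascript
// Figures quoted in the post: the 32x large-document result on IE,
// and the hypothetical 6,000-node spreadsheet.
var speedup = 32;
var nodesWithDom = 6000;

// Same document, less time: 1/32 of the original processing time.
var timeFraction = 1 / speedup;                 // 0.03125
var timePercent = (timeFraction * 100).toFixed(1) + '%';

// Same time budget, bigger document (assuming linear scaling).
var nodesWithString = nodesWithDom * speedup;   // 192000
```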
I realize that several factors influence the performance scaling of XLSX-file processing. However, my conclusion is that the benefits of string processing, in this instance, are quite clear. As a result, XLSX.js will stay with string processing until an alternative method with comparable performance is discovered. Additionally, the DOCX.js library that is about to be released will convert over to string processing at some point. You are free to disagree with this decision, and fork the projects to pursue a DOM-based solution. Heck, the jsPerf even has some code to get a person started. But, I believe the case for string processing is a strong one with significant legitimacy.
With that, I’ll leave you with this musing…
In preparation for writing this post, I spent a good amount of time looking for arguments against string processing. Blog posts arguing for a particular JavaScript methodology are not rare, and I fully expected to see a good list of reasons why string processing should be abandoned in favor of DOM-based methods. Surprisingly, I could not find any. I found forums telling people not to use string processing, largely without reasoning, and posts describing the benefits of native methods (e.g. getElementsByTagName) over jQuery. But, never could I find a good explanation for why string processing is wrong. And, I have a theory why: it’s not.
It is arguably less efficient, overall, in certain scenarios. For example, if one is processing a small document or XHR response, the maintainability, extensibility, and stability of the DOM-based methods would clearly win out. However, those scenarios are much less common these days. Ingo mentions the progression beyond string processing about five years ago, which was in the vicinity of the “JSON revolution”. More services started providing JSON, and then JSONP, causing people to migrate to the undeniable benefits. Performance for small bits of XML became less important as such XML became less common and browsers became faster. I would theorize that the average size of XML being worked with in JavaScript is increasing, as services continue to provide JSON data and HTML5/JavaScript continues to grow into a first-class development platform. Developers will be more inclined to explore interactions with XML-based file formats, and these files may be quite large. Therefore, I would conclude that string processing of XML documents is likely to increase in the near- to mid-term future. While it won’t be applicable in every instance, large files with well-written specifications can be read and written via string processing with far greater speed than current DOM-based methods. It wasn’t the answer I expected, but I guess the moral of the story is to not assume that a methodology is best just because it is most prevalent.
If I missed something in my tests or conclusions, please feel free to comment!
Stephen