Back when I first started writing XLSX.js, I knew that parsing the XML as a string was not in keeping with current practices. These days we have DOMParser in all the major browsers, and the XMLDOM ActiveX object for older IE browsers. We can deal with XML nodes as hierarchical, searchable objects. But… I figured I’d just stick with string parsing anyway. I assumed that I would update it at some point, when I had more time, but initially I was looking to create a solution quickly and theorized that string processing was probably somewhat faster. Fast forward to a week ago, when a need arose that will require me to expand XLSX.js’ capabilities. As I will need to modify the reading and writing code anyway, I started by switching the read code over to DOMParser. Upon completion, I remembered Ingo’s tweet and wondered whether I should post a little comment on the blog explaining why I used string processing in the first place. In fact, I thought, maybe I’d just throw both pieces of code into jsPerf to substantiate my theory of string processing being a little faster. It was probably a little faster, but could it really be fast enough to justify staying with that method?
But, that’s a “dumb” question, right? Parsing XML using the DOM is more maintainable, it’s more extensible, it’s likely more stable. It’s certainly more common. The community has decisively gone this direction, either natively or via libraries like jQuery, so it must be the better solution in this instance. However, despite those valid points, the jsPerf provided a compelling rebuttal. My initial tests showed that string processing was 10x faster on Safari 5, and 41x faster on IE. Someone out there found the jsPerf on an Android device, and showed string processing to be 47x faster there. So, the performance gains are highly significant.
However, while DOMParser will only be called once, getElementsByTagName will be called an increasing number of times as the document gets bigger. My initial tests were with a fairly small document, so I wondered whether the performance difference would shrink as the document size increased. I created an Excel spreadsheet with two columns and one thousand rows, and placed the contents of the worksheet XML file in a preparation code variable. The DOM2 and String2 tests use the same code as the smaller tests, but they process a much larger file. In the large tests, string processing proved to be 7.5x faster on Safari and 32x faster on IE. Firefox was an outlier, in that its overall performance on the large document was worse than that of the Android device. Therefore, its roughly equal performance between XML parsing and string processing may not be applicable.
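To make the two approaches concrete, here is a minimal sketch of what each might look like for a worksheet like the one described above. It is not the actual XLSX.js or jsPerf code: `buildSheetXml`, `readValuesFromString`, and `readValuesFromDom` are hypothetical names, and the generated XML is a simplified stand-in for a real worksheet file (no namespaces, shared strings, or styles).

```javascript
// Hypothetical helper: builds a simplified two-column worksheet fragment.
// Real worksheet XML carries namespaces and attributes omitted here.
function buildSheetXml(rows) {
  var xml = '<worksheet><sheetData>';
  for (var r = 1; r <= rows; r++) {
    xml += '<row r="' + r + '">' +
      '<c r="A' + r + '"><v>' + r + '</v></c>' +
      '<c r="B' + r + '"><v>' + (r * 2) + '</v></c>' +
      '</row>';
  }
  return xml + '</sheetData></worksheet>';
}

// String approach: scan the raw text for <v>...</v> pairs with a regex.
function readValuesFromString(xml) {
  var values = [];
  var re = /<v>([^<]*)<\/v>/g;
  var match;
  while ((match = re.exec(xml)) !== null) {
    values.push(match[1]);
  }
  return values;
}

// DOM approach: parse once, then query the node tree.
// Assumes a browser that provides DOMParser and textContent.
function readValuesFromDom(xml) {
  var doc = new DOMParser().parseFromString(xml, 'text/xml');
  var nodes = doc.getElementsByTagName('v');
  var values = [];
  for (var i = 0; i < nodes.length; i++) {
    values.push(nodes[i].textContent);
  }
  return values;
}

var sheetXml = buildSheetXml(1000);
var fromString = readValuesFromString(sheetXml);
console.log('String approach read ' + fromString.length + ' values');

// Only run the DOM version where DOMParser exists (i.e. in a browser).
if (typeof DOMParser !== 'undefined') {
  var fromDom = readValuesFromDom(sheetXml);
  console.log('DOM approach read ' + fromDom.length + ' values');
}
```

Both functions return the same list of cell values; the difference the jsPerf measures is how long each takes to get there. The regex version is only safe because the markup is machine-generated and predictable, which is the trade-off against the DOM version's robustness.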
Now, not all performance increases are noteworthy. A 32x performance increase for a loop that executes a total of 10 simple additions may not be of any real significance. In such a case, both versions will run so quickly that it is likely better to go with the more maintainable, extensible, stable, and conventional method. However, XLSX.js is not akin to such a simple loop. Spreadsheets can easily contain thousands of cells with data and formatting. Also, every instance of data or text formatting is associated with nodes describing said formatting. Now, this doesn’t necessarily mean a 1:1 relationship between the number of cells and the number of formatting-related nodes. However, every formatted cell will be associated with at least one formatting-descriptive node, regardless of whether that association is exclusive. Therefore, it is feasible (though somewhat unlikely) that a spreadsheet with 2,000 formatted cells could require the reading or writing of at least 6,000 nodes. In such a scenario, reduced processing time becomes a much larger consideration. Still, if processing a document in 3.1% of the time otherwise required does not seem significant, you may want to think of the capacity increase. For whatever amount of time you deem acceptable for a user to wait for their finished document or data, processing the XML as a string will allow you to work with a document 32x the size otherwise possible. Those 6,000 data or formatting nodes can become 192,000.
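For anyone who wants to check the arithmetic, here is a quick sketch. The 32x figure is the IE result from the large-document test and the 6,000 nodes come from the hypothetical spreadsheet above; the variable names are mine, and the calculation assumes processing time scales linearly with node count, which real documents may not honor.

```javascript
// Figures from the post: 32x speedup (IE, large document) and the
// hypothetical 6,000-node spreadsheet. Linear scaling is an assumption.
var speedup = 32;
var baselineNodes = 6000;

// Share of the DOM processing time that string processing needs.
var timeFraction = 1 / speedup;

// Nodes that fit in the same time budget with string processing.
var capacityNodes = baselineNodes * speedup;

console.log((timeFraction * 100).toFixed(1) + '% of the time');
console.log(capacityNodes + ' nodes in the same time budget');
```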
I realize that several factors influence the performance scaling of XLSX-file processing. However, my conclusion is that the benefits of string processing, in this instance, are quite clear. As a result, XLSX.js will stay with string processing until an alternative method with comparable performance is discovered. Additionally, the DOCX.js library that is about to be released will convert over to string processing at some point. You are free to disagree with this decision, and fork the projects to pursue a DOM-based solution. Heck, the jsPerf even has some code to get a person started. But I believe the case for string processing is a strong one with significant legitimacy.
With that, I’ll leave you with this musing…
If I missed something in my tests or conclusions, please feel free to comment!