Learning how to program in Objective-C and for the iPhone can be really frustrating sometimes. Although I am coming to grips with the language and its frameworks, I am finding certain seemingly simple tasks a bit of a chore.
For instance, as one of my own projects to help me learn Cocoa Touch, I am trying to parse HTML from a website (screenscraping). To handle any kind of XML processing on the iPhone there are, to my knowledge, only two libraries you can use. One is the dreaded libxml2 and the other is NSXMLParser.
Firstly, libxml2. What a nightmare! Although it is reported to be very memory efficient and fast, it is written in C, has the most confusing documentation on the planet and, to top it all off, its functions and types are named according to a different set of coding standards from the ones I am used to. This makes all the sample code I could find very difficult to read.
What I did find, however, was a couple of utility classes written by Marcus Zarra that make XPath querying with libxml2 a bit easier. But not easy enough.
Which brought me to NSXMLParser. If you are going to use NSXMLParser to screenscrape, you are asking for spaghetti code. NSXMLParser is a SAX-style parser. In other words, while parsing through an XML document it fires events at the start of an element, when it finds characters inside an element and at the end of an element. This works very well if you know what the XML is going to look like, but it is not ideal when working with HTML.
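To make the event model concrete, here is a minimal sketch using Java's built-in SAX API rather than NSXMLParser itself. The three callbacks play the same role as the NSXMLParser delegate methods for element start, found characters and element end. The `TitleCollector` class and the `<title>` element it looks for are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> element.
class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text can arrive in several chunks, so append instead of assigning.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && current != null) {
            titles.add(current.toString().trim());
            current = null;
        }
    }

    // Parses the given XML string and returns the collected titles in document order.
    static List<String> parse(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.titles;
    }
}
```

Even for a single element the handler has to remember where it is in the document. Every extra level of nesting you care about adds more of that state, which is where the spaghetti comes from.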
In the end, after a few hours of fiddling and many curse words, I got my code to work with NSXMLParser, even though it looked like a dog’s breakfast and would probably break the moment an extra tag was added to the HTML.
There is one other thing, however, that you can do: create a proxy for all the data that you send to the phone. This serves two purposes. The first is that you can use Java running on a web server to fetch, clean and extract the data from the website you are trying to scrape. I used a combination of JTidy and XPath to extract the data I needed from the relevant pages and convert it to objects which I can now serve to the iPhone. This gives an incredible amount of flexibility and allows for a marked improvement in performance, as the phone does not need to load large documents into memory in the background.
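As a rough illustration of the extraction step on the server, the sketch below runs an XPath query with the XPath API that ships with Java. It assumes the page has already been cleaned into well-formed markup (the JTidy step is left out) and that it carries no DOCTYPE or namespace declaration. The `HeadlineExtractor` class, the `headline` CSS class and the XPath expression are invented for the example and are not taken from my actual code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical extractor: pulls link text out of <div class="headline"> blocks.
class HeadlineExtractor {

    // Assumes cleanedMarkup is already well-formed XML (for example, tidied HTML).
    static List<String> extract(String cleanedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(cleanedMarkup)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // One expression describes what to fetch, wherever it sits in the page.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//div[@class='headline']/a", doc, XPathConstants.NODESET);

            List<String> headlines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                headlines.add(nodes.item(i).getTextContent().trim());
            }
            return headlines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract headlines", e);
        }
    }
}
```

Compared with the event-driven approach, the query says what to extract rather than how to walk the document, so an extra wrapper tag elsewhere in the page is less likely to break it.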
The other major benefit of using a proxy server is that if you offload processing to an external server and make your iPhone application as dumb as possible, you won’t need to update and resubmit your iPhone application to Apple whenever the processing changes, within reason of course. You can even resize images on the server and cache responses.
My application now uses NSXMLParser to work with simple XML files and I have a lot of control. Moral of the story: don’t try to do everything on the phone.
So what’s next? Well, I am going to finish my server-side code and then rewrite my iPhone application so that it works with the simpler XML from the server. Once all this is done I can start working on the more fun parts of my application, such as using the media frameworks for the iPhone, namely the camera, video and audio.