Pick your favourite parser; mine is the DOM parser. Extract the information by searching for the tag name, run it through some pre-processing such as regex filtering, and finally use Java I/O to write it out as a text file at the given path.
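The steps above could be sketched roughly as follows. This is only a minimal illustration, not the actual class from the task: the class name XmlToText, the whitespace-collapsing regex and the one-value-per-line output format are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: the name and the method layout are illustrative only.
public class XmlToText {

    // Parse the XML with the DOM parser and collect the text of every
    // element that has the given tag name.
    public static List<String> extract(InputStream xml, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(clean(nodes.item(i).getTextContent()));
        }
        return values;
    }

    // Example pre-processing step (an assumption): collapse runs of
    // whitespace into a single space and trim both ends.
    public static String clean(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    // Write one extracted value per line to the given path, as UTF-8.
    public static void write(Path target, List<String> lines) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }
}
```

A caller would open the .xml file as an InputStream, call extract with a tag name such as "topic", and pass the result to write together with the output path.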
XML parser – Several parsers come with the Java SE library. Which one is your favourite?
This time I was told to implement a Java class that takes an XML file as input, extracts selected information such as topic, content and timestamp, and packs it into a text file for further processing. Sounds boring, right? That was my first impression: ordinary business logic.
But the spec turned out to be much trickier than I thought, and that’s why I am dedicating a blog post to it.
After completing the task, I have drawn a few conclusions that may save you from an infinite loop of frustration.
– Encoding problem:
Different sources use different encodings, and you really have to be aware of them, because the Java parser reads .xml as UTF-8 by default (at least in my environment). Chinese characters encoded in Big5 or Big5-HKSCS (the Hong Kong edition) therefore come out as a bunch of unreadable text, which is not the result you want. So the encoding needs to be set explicitly.
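One way to set the encoding explicitly is to decode the bytes yourself and hand the parser a character stream. The sketch below assumes the caller already knows the source encoding; the class name EncodingAwareParser and its parameters are made up for the example.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: the name and the signature are illustrative only.
public class EncodingAwareParser {

    // Decode the raw bytes with the given charset (for example "Big5")
    // before the parser sees them. When an InputSource carries a character
    // stream, the parser reads it directly instead of guessing the encoding.
    public static Document parse(InputStream in, String charsetName) throws Exception {
        Reader reader = new InputStreamReader(in, Charset.forName(charsetName));
        InputSource source = new InputSource(reader);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source);
    }
}
```

For the Hong Kong edition, the charset name to pass on a typical JDK would be "Big5-HKSCS". Charset.forName throws an exception if the running JVM does not support the requested charset, so a wrong name fails loudly instead of silently producing garbage.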
– How the compiler reads the .java file:
During development of the class I hit a problem I did not expect. I tried to replace every occurrence of “新聞” with “news” (its direct translation). It worked perfectly fine locally, but nothing matched when the class ran on the server. Luckily my colleague reminded me that the compiler on the server side actually reads the .java source as ASCII. No wonder my code did not work: I had hardcoded the term “新聞” in the .java file, something like: text.replace("新聞", "news"); Because the source file is read as ASCII at the compile stage, “新聞” just becomes a bunch of meaningless bytes, while the source text is read properly thanks to the encoding set on the XML parser. So of course they cannot match.
Anyway, personally I have learned tons from my colleagues and the people around me. Thanks.
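A common way to make such a literal independent of how the compiler decodes the source file is to write it with Unicode escapes, so the .java file contains only ASCII. The sketch below is one possible approach, not necessarily the fix used here; the class name NewsLabel and its method are made up for the example.

```java
// Hypothetical helper: the name and the method are illustrative only.
public class NewsLabel {

    // "\u65b0\u805e" spells the two characters of the Chinese word for
    // "news" as Unicode escapes, so this file stays pure ASCII.
    private static final String NEWS_ZH = "\u65b0\u805e";

    // Replace every occurrence of the Chinese term with its English translation.
    public static String translate(String text) {
        return text.replace(NEWS_ZH, "news");
    }
}
```

Another option is to keep the Chinese characters in the source and tell the compiler which encoding the file uses, for example javac -encoding UTF-8, provided the file really is saved in that encoding.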