Converting Model-Glue Docs to HTML
Today I came across Ray Camden's Friday Puzzler - Helping the Model-Glue Team from March 13th and noticed that no solution had been posted yet. I gave it some thought, and at first I had visions of recursive CFCs parsing out each HTML tag with regular expressions. But then I realized the task wasn't to create a tool to parse the HTML out of any webpage, but to parse the HTML out of these specific webpages.
Digging into the source code of these pages generated by RoboHelp, I began to see a structure and a set of rules that appeared to be followed on each page. I discovered the navigation tree was composed of 19 webpages, each representing a folder, with links to the other folders and to the documents contained in that folder. My initial code looped through each of these files and stripped out anything above <body> and below </body> using the FindNoCase() function.
Next, I added code to loop through the resulting HTML looking for links. Again, the generated markup followed consistent rules, so I was able to pick out the links to folders by looking for target="_self". Any links that didn't refer to "_self" were the documents we were looking to "scrape" the content from.
At this point I cheated a little: I now had a list of all the documents and the folder structure they were stored in, so I manually created folders mirroring the structure on the website. I figured this was OK, as I was creating a one-time process, not a reusable application. Finally, getting back to coding, I looped through each document, stripping out anything above the <h1 tag and everything from the <script tag beneath the content onward.
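The marker-based stripping described above can be sketched as a small helper. This is only an illustration, not the code used for the puzzler: the function name extractBetween and the sample string are made up for the example, and it assumes the start marker appears before the end marker. The one detail worth showing is that Mid() takes a character count as its third argument rather than an end position.

```cfml
<!--- Hypothetical helper (not part of the original script): returns the text from
      startMarker up to, but not including, endMarker. Returns "" if either is missing. --->
<cffunction name="extractBetween" returntype="string" output="false">
  <cfargument name="html" type="string" required="true">
  <cfargument name="startMarker" type="string" required="true">
  <cfargument name="endMarker" type="string" required="true">
  <cfset var startPos = FindNoCase(arguments.startMarker, arguments.html)>
  <cfset var endPos = 0>
  <cfif startPos eq 0>
    <cfreturn "">
  </cfif>
  <!--- Search for the end marker only after the start marker --->
  <cfset endPos = FindNoCase(arguments.endMarker, arguments.html, startPos + Len(arguments.startMarker))>
  <cfif endPos eq 0>
    <cfreturn "">
  </cfif>
  <!--- Third argument of Mid() is a length, so subtract the positions --->
  <cfreturn Mid(arguments.html, startPos, endPos - startPos)>
</cffunction>

<!--- Made-up sample page, shaped like the document pages described above --->
<cfset sample = "<html><body><div>nav</div><h1>Title</h1><p>Text</p><script type=""text/javascript"">x()</script></body></html>">
<cfoutput>#HTMLEditFormat(extractBetween(sample, "<h1", "<script type="))#</cfoutput>
```

For the sample string, the helper should return the heading and paragraph only. Returning an empty string when a marker is missing avoids passing a negative count to Mid().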
I apologize that the Blogger template I'm using doesn't handle code very well. You can download the code, along with all of the docs, from the link below:
Download docs.model-glue.com.zip
<!--- I manually reproduced the folder structure of the Model-Glue docs on my local hard drive --->
<cfset pathRemote = "http://docs.model-glue.com/whgdata/">
<cfset pathLocal = "c:/docs.model-glue.com/whgdata/">
<!--- The navigation tree for the docs in RoboHelp is composed of 19 HTML files --->
<cfloop index="ptrTree" from="0" to="18">
  <cfset fileName = "whlstt#ptrTree#.htm">
  <cfset fileRemote = "#pathRemote##fileName#">
  <cfset fileLocal = "#pathLocal##fileName#">
  <!--- Call up each of the 19 HTML files that make up the navigation tree and loop through, finding each document --->
  <cfhttp url="#fileRemote#" method="get" resolveurl="yes" throwonerror="yes"></cfhttp>
  <cfif cfhttp.statusCode is "200 OK">
    <p><strong><cfoutput>#fileName#</cfoutput></strong><br /><cfflush>
    <cfset treeHTML = cfhttp.FileContent>
    <cfset ptrLink = 1>
    <!--- Loop through each link in the navigation tree looking for links to documents --->
    <cfloop condition="ptrLink lt len(treeHTML)">
      <cfset startLink = FindNoCase("<a href=", treeHTML, ptrLink)>
      <!--- Look for the closing tag only when an opening tag was found --->
      <cfset closeLink = 0>
      <cfif startLink gt 0>
        <cfset closeLink = FindNoCase("</a>", treeHTML, startLink)>
      </cfif>
      <cfif closeLink gt 0>
        <!--- endLink is the position of the final ">" of the closing tag --->
        <cfset endLink = closeLink + 3>
        <cfset tmpLink = mid(treeHTML, startLink, endLink - startLink + 1)>
        <!--- Found a link to a document, so parse out the URL and link title --->
        <cfif Not(FindNoCase("_self", tmpLink))>
          <cfset startURL = FindNoCase("http://", tmpLink)>
          <cfset endURL = FindNoCase(".htm", tmpLink, startURL)>
          <cfset startImg = FindNoCase("<img", tmpLink, endURL)>
          <cfset startTitle = FindNoCase(">", tmpLink, startImg) + 1>
          <cfset endTitle = FindNoCase("</a>", tmpLink, startTitle)>
          <cfset pageURL = mid(tmpLink, startURL, endURL - startURL + 4)>
          <cfset pageTitle = mid(tmpLink, startTitle, endTitle - startTitle)>
          <cfoutput><a href="#pageURL#">#pageTitle#</a></cfoutput><br /><cfflush>
          <!--- Call up the document and parse out just the HTML, throwing out the extra code --->
          <cfhttp url="#pageURL#" method="get" resolveurl="yes" throwonerror="yes"></cfhttp>
          <cfif cfhttp.statusCode is "200 OK">
            <cfset pageHTML = cfhttp.FileContent>
            <cfset startContent = FindNoCase("<h1", pageHTML)>
            <cfset endContent = FindNoCase("<script type=", pageHTML, startContent)>
            <cfset pageContent = Mid(pageHTML, startContent, endContent - startContent)>
            <!--- Write out the content HTML using the same folder structure --->
            <cfset pageLocal = Replace(Replace(Replace(pageURL, '/whgdata/../', '/'), ':80', ''), "http://", "c:\")>
            <cffile action="write" file="#pageLocal#" output="#pageContent#">
          <cfelse>
            <cfdump var="#cfhttp#">
            <cfabort>
          </cfif>
        </cfif>
        <!--- Update the pointer used for looping --->
        <cfset ptrLink = endLink + 1>
      <cfelse>
        <!--- No more complete links in this file, so end the loop --->
        <cfset ptrLink = len(treeHTML) + 1>
      </cfif>
    </cfloop>
    <!--- Parse out just the HTML in the navigation tree file, throwing out the extra code --->
    <cfset startHTML = FindNoCase("<body", treeHTML)>
    <cfset endHTML = FindNoCase("</body>", treeHTML, startHTML)>
    <!--- Mid() takes a character count, not an end position; the count includes the closing tag --->
    <cfset fileHTML = mid(treeHTML, startHTML, endHTML - startHTML + len("</body>"))>
    <!--- Write out the navigation tree file --->
    <cffile action="write" file="#fileLocal#" output="#fileHTML#">
    </p>
  <cfelse>
    <cfdump var="#cfhttp#">
    <cfabort>
  </cfif>
</cfloop>
<h1>Done!</h1>
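As a side note, the rule for telling folder links from document links can also be sketched with a regular expression instead of a position pointer. This is a rough alternative under stated assumptions, not the approach used above: it assumes REMatchNoCase() is available (ColdFusion 8 or later), and the sample markup, including the target="content" value and the image names, is invented for the example.

```cfml
<!--- Made-up navigation markup: one folder link and one document link --->
<cfset sampleNav = '<a href="whlstt1.htm" target="_self"><img src="folder.gif">Folder</a><a href="http://docs.model-glue.com/page.htm" target="content"><img src="doc.gif">Page</a>'>

<!--- Collect every opening anchor tag; [^>] also matches line breaks inside a tag --->
<cfset openTags = REMatchNoCase("<a\s[^>]*>", sampleNav)>

<!--- Keep only the tags that do not target "_self"; by the rule above these are documents --->
<cfset docTags = ArrayNew(1)>
<cfloop index="i" from="1" to="#ArrayLen(openTags)#">
  <cfif Not FindNoCase('target="_self"', openTags[i])>
    <cfset ArrayAppend(docTags, openTags[i])>
  </cfif>
</cfloop>

<cfoutput>#ArrayLen(docTags)# document link(s) found</cfoutput>
```

The pointer loop in the script has the advantage of keeping the link text alongside each URL, which this sketch does not attempt.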

