Accessing raw HTML of current page? (+ ABCpdf)

mbh_brett · Aug 29, 2004

I'm currently trying to get ABCpdf to work. Unfortunately, the "AddImageUrl" function just sits there doing nothing and eventually times out, sometimes requesting authorisation.

The other method is to pass ABCpdf a raw HTML string, a simple one being "<html><h1>Hello world</h1><p>Label</p></html>", for example. This would be passed to ABCpdf, the PDF would be generated, and everything would be just dandy.

But how do I get the raw HTML of the entire current page as a string? It has to be the current page exactly as it loads, because the page contains dynamic data, so I can't use a static string.

Thanks for any help. I'll continue my googling efforts.

mbh_brett · Aug 29, 2004

I have come across a process called "screen scraping"; this may have something to do with it.

...or maybe not, since that deals with content from other sites, making a new request instead of dealing with the current HTML output stream.

Bruce · Aug 29, 2004

I am not sure if this is related, but the AddImage function of ABCpdf is not supported on our server because of a security issue.

Please post your code.


B.

DiscountASP.NET
http://www.DiscountASP.NET

bluebeard96 · Aug 30, 2004

I think scraping would work. I'm not sure of the .net code, but it is fairly easy in regular ASP.

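---------
Dim myXML, myPageText
' ServerXMLHTTP makes a new server-side HTTP request, so give it an absolute URL
Set myXML = Server.CreateObject("MSXML2.ServerXMLHTTP")
myXML.open "GET", "http://www.example.com/your-dynamic-url.aspx", False ' False = synchronous
myXML.send
myPageText = myXML.responseText
Set myXML = Nothing

Response.Write myPageText
----------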

myPageText now contains the html of the dynamic page.
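
For the .NET side, a minimal C# sketch of the same idea could look like the following. It relies on `WebClient.DownloadString` from `System.Net`; the URL and the UTF-8 encoding are assumptions for illustration only, and `PageFetcher` and `FetchHtml` are hypothetical names. Like the ASP version, it makes a brand-new request, so the fetched page will not see the current visitor's session or login unless those are passed along separately.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Downloads the page at the given absolute URL and returns its HTML as a string.
    public static string FetchHtml(string url)
    {
        using (WebClient client = new WebClient())
        {
            // Assumption: the page is served as UTF-8. Adjust if your site differs.
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(url);
        }
    }

    public static void Main(string[] args)
    {
        // Hypothetical URL; pass your own dynamic page as the first argument.
        string url = args.Length > 0 ? args[0] : "http://localhost/your-dynamic-url.aspx";
        string html = FetchHtml(url);
        Console.WriteLine(html.Length + " characters downloaded");
    }
}
```

The returned string is what you would then hand to the PDF component in place of a hard-coded HTML literal.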

mbh_brett · Aug 30, 2004

Well isn't that just dandy...

At the moment I am getting the raw HTML by overriding the Render() method on the page. I am then attempting to replicate the ABCpdf example: I build a "Doc" from that HTML and put it in the Session so that it can be passed to the other page.

However, when I do this the PDF is "invalid" and doesn't load. This seems to happen as soon as I get anything from the Session, even if it's just the string "geh".

The "showdoc.aspx" file currently contains the following:

<%@ Page %>
<%@ Assembly Name="ABCpdf" %>
<%@ Import Namespace="WebSupergoo.ABCpdf4" %>
<%
Doc doc = (Doc)Session["pagehtml"];

byte[] theData = doc.GetData();
Response.ContentType = "application/pdf";
Response.AddHeader("content-length", theData.Length.ToString());
if (Request.QueryString["attachment"] != null)
    Response.AddHeader("content-disposition", "attachment; filename=MyPDF.PDF");
else
    Response.AddHeader("content-disposition", "inline; filename=MyPDF.PDF");
Response.BinaryWrite(theData);

Session.Remove("pagehtml");
%>

while the C# "report.aspx.cs" file contains:

protected override void Render(HtmlTextWriter writer)
{
    // Render the page into an in-memory buffer instead of straight to the response
    StringBuilder sb = new StringBuilder();
    StringWriter sw = new StringWriter(sb);

    HtmlTextWriter hWriter = new HtmlTextWriter(sw);
    base.Render(hWriter);

    //*** store to a string
    string PageResult = sb.ToString();

    string theText = "<h1>gah</h1><p>help??</p>"; // would be "PageResult" rather than the test string - if it worked.
    Doc doc = new Doc();
    doc.AddHtml(theText);
    Session.Add("pagehtml", doc);

    //*** Write the captured HTML back to the response
    writer.Write(PageResult);
}

I've also tried passing the string of text through the Session and building the Doc in the showdoc.aspx file, with the same result. Meanwhile, I have successfully Response.Write-d this same Session value.

Thanks for any help.

mbh_brett · Sep 12, 2004

I've had words with the boys over at ABCpdf and they are unaware of the AddImageURL function raising any security issues, on Win2003 in particular.

I have currently taken to the method of rendering the PDF from scratch - that is, basically rendering the page and all its separate elements again... but in PDF form (by using embedded C# in showdoc.aspx).

Clearly the AddImageURL solution is much nicer - point to the URL and voilà! Any thoughts on this problem, or will this remain a security issue for whatever reason?

Bruce · Sep 13, 2004

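Taken from ABCpdf's support page. The paragraph about the Windows 2003 Server Internet Explorer security policy (highlighted in red in the original post) is why AddImage wouldn't work. We cannot disable the default security policy on IE, as it ties into the rest of the policy.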

My URL rendering code is telling me that it's 'Unable to read file' or is producing a blank output. What gives?

First read the previous section carefully.

Firewalls and proxy servers can present particular problems for URL rendering because they may require some kind of logon. Your IIS user will most likely not have a logon. Windows Authentication can produce a challenge that your IIS user is unlikely to be able to respond correctly to.

Do ensure you can browse to the appropriate location using IE while logged on as Administrator.

If you're rendering a URL on your localhost then ensure that the IIS user has read access to this location.

Create a VBS script (or compiled .NET application) to mimic the effects of your code. Copy the following text into a text file and then rename it mytest.vbs. Double click to run.

Set d = CreateObject("ABCpdf4.Doc")
d.AddImageUrl "http://www.google.com/"
d.Save "google.pdf"
MsgBox "Finished"

If the google render code fails then try a file URL (eg "file://c:\mydoc.htm" or the code below) to eliminate network issues.

Set d = CreateObject("ABCpdf4.Doc")
d.AddImageHtml "<html><h1>Hello in HTML!</h1></html>"
d.Save "hello.pdf"
MsgBox "Finished"

You can download a set of sample scripts and .NET projects here...

If this fails you might like to consider an IE upgrade. We test using IE6 and - occasionally - we find that HTML rendering issues are related to old or inconsistent installations of IE.

Windows 2003 Server defaults to an Internet Explorer Security policy which may interfere with HTML rendering. You may have to modify or disable the policy to allow access to the pages you want to render.

If you still have problems please mail us telling us exactly what you've tried, the results you've had, your OS, SP and the version of IE installed on your server. You may wish to save your web page from IE and include it in your mail so we can see what you're trying to do.


mhariri · Mar 19, 2007

Does this mean that we cannot use this functionality in ABCpdf 5? Did anybody come up with an alternative? I tried reading a local HTML document using AddImageHtml but I am having similar problems. Any suggestions on how to create a good-looking report without defining the location of every word, table and picture?

Bruce · Mar 20, 2007

Sorry, I am not aware of any workaround.

Bruce

DiscountASP.NET
www.DiscountASP.NET

superwizbang · Apr 1, 2007

Imports System.Text.RegularExpressions
Imports System.Text
Imports System.Net

' Download the whole page and show its HTML in a literal control
Public Sub butFullScrapeWithHtml_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles butFullScrapeWithHtml.Click
    Dim client As New WebClient
    Dim pageData As [Byte]() = client.DownloadData(txtURL.Text)
    ' Note: ASCII decoding turns any non-ASCII character into "?"
    Dim pageHtml As String = Encoding.ASCII.GetString(pageData)
    litScrapePage.Text = pageHtml
End Sub

or

' Download the page and keep only the part between a start string and an end string
Public Sub butFullWSE_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles butFullWSE.Click
    Dim client As New WebClient
    Dim pageData As [Byte]() = client.DownloadData(txtURL.Text)
    Dim pageHtml As String = Encoding.ASCII.GetString(pageData)
    ' Non-greedy match: from the start string up to the first end string after it
    Dim Pattern As String = txtStartString.Text + "((.|\n)*?)" + txtEndString.Text
    Dim regexPattern As New Regex(Pattern, RegexOptions.Compiled)
    Dim matchPattern As Match = regexPattern.Match(pageHtml)
    litScrapePage.Text = matchPattern.Value
End Sub

if you need a start/stop point. Use your browser's "View Source" to see if there are elements which you want to omit, and only take the part of the page you need.

mhariri · Apr 2, 2007

Hi,

Thank you for your great post, but could you be a little more specific on how this ties into the complete process of creating a PDF document from an HTML page? I am not sure how to tie your code in.

Thanks

superwizbang · Apr 3, 2007

I posted the code a little off the cuff noting that you were interested in creating a document from scraping an HTML page. I merely posted code which I have used to do just that and as you see, loaded it into a literal control to display it. I am not sure of the .PDF conversion process (actually, something that I will look into in the coming weeks as this is something I would like to do on my own site for customer invoices).

In passing, I have run across code in the past which uses VS's Crystal Reports module to convert a document to .PDF if you do not have the Adobe libraries built in. I can't say that I know any more than that. Getting the HTML of a page is rather easy on the .NET platform with code similar to what I posted above. The second code snippet has a start/stop element, which is great for omitting useless code before/after the section that you might want.

An example that I have used this for is scraping basketball scores for loading into a fantasy sports website I built with a "live scoring" module. It visits a site, looks for a unique code string (txtStartString) and takes all elements from the page up to (txtEndString). I just look for the closest unique tag before and after my desired part of the page to scrape, and parse out the undesired junk. Obviously if you set txtStartString to "<td>", it will go until it finds the first "<td>" tag and take everything after that... which might be more HTML code than you really want...

In my code, replace the "txtURL.Text" (obviously a returned value from a textbox on my page) with your desired URL, or do as I have and place a textbox on your page. Place a button on the page and set its click event to the code above. The first example returns the whole page into a string variable. The second example uses a little regex to filter out undesired code. Add a literal to your page, look at the code returned and adjust as needed (wrap it as <html><body> pageHtml or matchPattern.Value </body></html>) and you should be in business. To look at the exact code returned, choose "View Source" in your browser to see what your parsing has returned.

Once you have the code looking like you would want, pass it to your widget.

-Ken
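
The start/stop extraction described above can be sketched in C# roughly as follows. This is an illustration under stated assumptions, not the code from the earlier post: `FragmentExtractor` and `Between` are hypothetical names, the sample HTML is made up, and unlike the VB.NET version it escapes the start and end strings with `Regex.Escape` (so characters such as `(` or `?` in them are matched literally) and returns only the text between the markers rather than the markers themselves.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FragmentExtractor
{
    // Returns the text between the first occurrence of 'start' and the
    // nearest 'end' that follows it, or null when no such span exists.
    public static string Between(string html, string start, string end)
    {
        // Escape both markers so they are treated as literal text, and use a
        // non-greedy group so the match stops at the first 'end' after 'start'.
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);

        // Singleline lets '.' match newlines, so the span may cover several lines.
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Made-up sample page for illustration.
        string html = "<html><body><div id=\"scores\">\n<b>Home 98</b>\n</div><p>footer</p></body></html>";

        string fragment = Between(html, "<div id=\"scores\">", "</div>");

        // Wrap the fragment so it is a complete document again before
        // handing it to whatever renders the HTML.
        Console.WriteLine("<html><body>" + fragment + "</body></html>");
    }
}
```

Picking markers that occur only once on the page keeps the match predictable; a generic marker such as "<td>" will match the first such tag it finds.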
