Accessing raw HTML of current page? (+ ABCpdf)

Discussion in 'ASP.NET / ASP.NET Core' started by mbh_brett, Aug 29, 2004.

Thread Status:
Threads that have been inactive for 5 years or longer are closed to further replies. Please start a new thread.
  1. I'm currently trying to get ABCpdf to work. Unfortunately, the "ImageURL" function just sits there doing nothing and eventually times out, sometimes requesting authorisation.

    So the other method is to pass ABCpdf a raw HTML string, a simple one being "<html><h1>Hello world</h1><p>Label</p></html>", for example. This would be passed to ABCpdf, the PDF generated, and everything would be just dandy.

    But... how do I get the raw HTML of the entire document as a string? It has to be the current page exactly as it loads, because it contains dynamic data, so a static method won't do.

    Thanks for any help. I'll continue my googling efforts.
     
  2. I have come across a process called "Screen Scraping", this may have something to do with it.

    ...or maybe not, since that deals with content from other sites, making a new request instead of dealing with the current HTML output stream.
     
  3. Bruce (DiscountASP.NET Staff)

    I am not sure if this is related to what you are doing, but the AddImage function of ABCpdf is not supported on our servers because of a security issue.

    Please post your code.

    quote:Originally posted by mbh_brett

    I have come across a process called "Screen Scraping", this may have something to do with it.

    ...or maybe not, since that deals with content from other sites, making a new request instead of dealing with the current HTML output stream.

    B.

    DiscountASP.NET
    http://www.DiscountASP.NET
     
  4. I think scraping would work. I'm not sure of the .net code, but it is fairly easy in regular ASP.

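    ---------
    Dim myXML, myPageText
    ' ServerXMLHTTP makes the request from the server, so the URL must be absolute
    Set myXML = Server.CreateObject("MSXML2.ServerXMLHTTP")
    ' Third argument False = synchronous: send returns once the response has arrived
    myXML.open "GET", "http://your-server/your-dynamic-url.aspx", False
    myXML.send
    myPageText = myXML.responseText
    Set myXML = Nothing

    ' Echo the fetched HTML back to the browser
    Response.Write myPageText
    ----------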

    myPageText now contains the html of the dynamic page.
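
    For the .NET side, a rough equivalent might look like the sketch below. It is only an illustration: it assumes System.Net.WebClient and its DownloadString method (available from .NET 2.0 onwards), and the class and method names (PageFetcher, FetchHtml) are made up for the example. Like the ASP version, it issues a fresh request from the server, so it does not carry the current visitor's session or cookies.

```csharp
using System;
using System.Net;

// Hypothetical helper; PageFetcher and FetchHtml are illustrative names only.
public static class PageFetcher
{
    // Downloads whatever HTML the given absolute URL produces and returns it
    // as a string, much like responseText in the ServerXMLHTTP example.
    public static string FetchHtml(string absoluteUrl)
    {
        using (WebClient client = new WebClient())
        {
            return client.DownloadString(absoluteUrl);
        }
    }
}
```

    Usage would be along the lines of string myPageText = PageFetcher.FetchHtml("http://your-server/your-dynamic-url.aspx"); where the URL is a placeholder.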
     
  5. Well isn't that just dandy...

    At the moment I am getting the raw HTML by overriding the Render() method on the page. I am then attempting to replicate the ABCpdf example: turn that HTML into a "Doc" and put it in the Session to pass to the other page.

    However, when I do this the PDF is "invalid" and doesn't load. It seems to happen as soon as I get anything from the Session, even if it's just the string "geh".
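
    For readers following along, the Render() approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this post: the class name ReportPage, the helper RenderToString and the Session key "PageHtml" are all assumptions, and it stores the HTML string in the Session rather than an ABCpdf Doc.

```csharp
using System;
using System.IO;
using System.Web.UI;

// Hypothetical code-behind class; ReportPage is an illustrative name.
public class ReportPage : Page
{
    protected override void Render(HtmlTextWriter writer)
    {
        // Let the page render itself into memory instead of the response.
        string pageHtml = RenderToString(base.Render);

        // Keep the markup for the page that builds the PDF ("PageHtml" is an assumed key).
        Session["PageHtml"] = pageHtml;

        // Still send the normal page to the browser.
        writer.Write(pageHtml);
    }

    // Runs a render callback against an in-memory writer and returns the markup.
    public static string RenderToString(Action<HtmlTextWriter> render)
    {
        using (StringWriter buffer = new StringWriter())
        using (HtmlTextWriter bufferWriter = new HtmlTextWriter(buffer))
        {
            render(bufferWriter);
            bufferWriter.Flush();
            return buffer.ToString();
        }
    }
}
```

    Because the override captures the output of the current request, the string contains the dynamic data exactly as the visitor sees it, with no second request involved.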

    The "showdoc.aspx" file currently contains the following:

     
  6. I've had words with the boys over at ABCpdf and they are unaware of the AddImageURL function raising any security issues, on Win2003 in particular.

    I have currently taken to the method of rendering the PDF from scratch - that is, basically rendering the page and all its separate elements again... but in PDF form (by using embedded C# in showdoc.aspx).

    Clearly the AddImageURL solution is much nicer - point to the URL and voilà! Any thoughts on this problem, or will this remain a security issue for whatever reason?
     
  7. Bruce (DiscountASP.NET Staff)

    The following is taken from ABCpdf's support page. The section in red (the paragraph on the Windows 2003 Server security policy) is why AddImage won't work. We cannot disable the default security policy on IE, as it ties into the rest of the policy.



    My URL rendering code is telling me that it's 'Unable to read file' or is producing a blank output. What gives?

    First read the previous section carefully.

    Firewalls and proxy servers can present particular problems for URL rendering because they may require some kind of logon. Your IIS user will most likely not have a logon. Windows Authentication can produce a challenge that your IIS user is unlikely to be able to respond correctly to.

    Do ensure you can browse to the appropriate location using IE while logged on as Administrator.

    If you're rendering a URL on your localhost then ensure that the IIS user has read access to this location.

    Create a VBS script (or compiled .NET application) to mimic the effects of your code. Copy the following text into a text file and then rename it mytest.vbs. Double click to run.

    Set d = CreateObject("ABCpdf4.Doc")
    d.AddImage "http://www.google.com/"
    d.Save "google.pdf"
    MsgBox "Finished"

    If the google render code fails then try a file URL (eg "file://c:\mydoc.htm" or the code below) to eliminate network issues.

    Set d = CreateObject("ABCpdf4.Doc")
    d.AddImage "<html><h1>Hello in HTML!</h1></html>"
    d.Save "hello.pdf"
    MsgBox "Finished"

    You can download a set of sample scripts and .NET projects here...

    If this fails you might like to consider an IE upgrade. We test using IE6 and - occasionally - we find that HTML rendering issues are related to old or inconsistent installations of IE.

    Windows 2003 Server defaults to an Internet Explorer Security policy which may interfere with HTML rendering. You may have to modify or disable the policy to allow access to the pages you want to render.

    If you still have problems please mail us telling us exactly what you've tried, the results you've had, your OS, SP and the version of IE installed on your server. You may wish to save your web page from IE and include it in your mail so we can see what you're trying to do.




    quote:Originally posted by mbh_brett

    I've had words with the boys over at ABCpdf and they are unaware of the AddImageURL function raising any security issues, on Win2003 in particular.

    I have currently taken to the method of rendering the PDF from scratch - that is, basically rendering the page and all its separate elements again... but in PDF form (by using embedded C# in showdoc.aspx).

    Clearly the AddImageURL solution is much nicer - point to the URL and voilà! Any thoughts on this problem, or will this remain a security issue for whatever reason?

    B.

    DiscountASP.NET
    http://www.DiscountASP.NET
     
  8. Does this mean that we cannot use this functionality in ABCpdf 5? Did anybody come up with an alternative? I tried reading a local HTML document using AddImageHtml but I am having similar problems. Any suggestions on how to create a good-looking report without defining the location of every word, table and picture?
     
  9. Bruce (DiscountASP.NET Staff)

  10.  
  11. Hi,

    Thank you for your great post, but could you be a little more specific on how this ties in to the complete process of creating a PDF document from an HTML page? I am not sure how to tie your code in.

    Thanks
     
  12. I posted the code a little off the cuff, noting that you were interested in creating a document by scraping an HTML page. I merely posted code which I have used to do just that and, as you can see, loaded the result into a literal control to display it. I am not sure of the PDF conversion process (actually, it is something that I will look into in the coming weeks, as I would like to do this on my own site for customer invoices).

    In passing, I have run across code in the past which uses VS's Crystal Reports module to convert a document to PDF if you do not have the Adobe libraries built in. I can't say that I know any more than that. Getting the HTML of a page is rather easy on the .NET platform with code similar to what you see below. The second code snippet that I posted has a start/stop element, which is great for omitting useless code before/after the section that you want.

    An example that I have used this for is scraping basketball scores to load into a fantasy sports website I built with a "live scoring" module. It visits a site, looks for a unique code string (txtStartString) and takes all elements from the page up to (txtStopString). I just look for the closest unique tag before and after my desired part of the page to scrape, and parse out the undesired junk. Obviously, if you set txtStartString to "<td>", it will go until it finds the first "<td>" tag and take everything after that... which might be more HTML code than you really want...

    In my code, replace "txtURL.Text" (obviously a value returned from a textbox on my page) with your desired URL, or do as I have and place a textbox on your page. Place a button on the page and set its click event to the code below. The first example returns the whole page into a string variable. The second example uses a little regex coding to filter out undesired code. Add a literal to your page, look at the code returned and adjust as needed (add <html><body> pageHTML/MatchPattern.Value </body></html>) and you should be in business. To see the exact code your parsing has returned, choose "view source" in your browser.

    Once you have the code looking like you would want, pass it to your widget.

    -Ken
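
    The start/stop filtering described in this post can be sketched roughly as below. This is not Ken's original snippet, just a minimal illustration under stated assumptions: the names HtmlSlice, Between, WrapInDocument, startMarker and stopMarker are invented for the example (the post calls the two markers txtStartString and txtStopString), and the page HTML is assumed to have been fetched into a string already.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; all names here are illustrative.
public static class HtmlSlice
{
    // Returns the text between the first startMarker and the next stopMarker,
    // or null when either marker cannot be found.
    public static string Between(string pageHtml, string startMarker, string stopMarker)
    {
        // Regex.Escape makes the markers match literally;
        // Singleline lets "." run across line breaks in the HTML.
        string pattern = Regex.Escape(startMarker) + "(.*?)" + Regex.Escape(stopMarker);
        Match match = Regex.Match(pageHtml, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Wraps a scraped fragment in a minimal document before it is passed on.
    public static string WrapInDocument(string fragment)
    {
        return "<html><body>" + fragment + "</body></html>";
    }
}
```

    As the post warns, a marker as common as "<td>" will match at its first occurrence, so it pays to pick the closest unique tag on each side of the section you want.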
     