IFX Mail Archive: RE: IFX> PDF/is Issue.

RE: IFX> PDF/is Issue.

From: Gail Songer (gsonger@peerless.com)
Date: Wed Mar 12 2003 - 11:20:41 EST


    But one is going to have to modify the existing writers in order to be
    PDF/is compliant so at least this argument shouldn't be an issue.

     

    -----Original Message-----
    From: Poysa, Kari [mailto:Kari.Poysa@usa.xerox.com]
    Sent: Wednesday, March 12, 2003 6:04 AM
    To: Hastings, Tom N; 'Rick Seeler'; 'Carl Kugler'
    Cc: ifx@pwg.org
    Subject: RE: IFX> PDF/is Issue.

     

    Tom, the Length being discussed here is actually the byte count of the
    streams of the Image XObjects that belong to the Page. So if the Page is
    composed of more than one image (a.k.a. banding), the sender does
    not need to cache even a full page's worth of compressed data in order
    to be able to write the Image XObject's stream length in the stream
    dictionary.

     

    Full PDF allows the writer to enter an indirect object reference into
    the required Length entry. This makes it easy to implement writers
    because the separate object for the length can be written after all of
    the image data has been written. The PDF files are then read in the
    reverse order starting from the end of the file. This works well if one
    has a file system to store the complete PDF file. So requiring the
    Length to be a direct value in the stream dictionary would most likely
    require existing writer software to be modified. One could not keep
    writing the same kind of files and claim that they are PDF/is compliant.
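
A minimal sketch of the two writer strategies contrasted above, assuming a stream dictionary reduced to the Length entry only (a real Image XObject dictionary carries more entries). The function names and object numbers are hypothetical, chosen for illustration.

```python
def stream_with_direct_length(obj_num: int, data: bytes) -> bytes:
    """Direct Length: the writer must hold all of `data` before emitting the dictionary."""
    head = b"%d 0 obj\n<< /Length %d >>\nstream\n" % (obj_num, len(data))
    return head + data + b"\nendstream\nendobj\n"


def stream_with_indirect_length(obj_num: int, len_obj_num: int, chunks) -> bytes:
    """Indirect Length: data is emitted chunk by chunk, the length object comes afterwards."""
    out = bytearray(
        b"%d 0 obj\n<< /Length %d 0 R >>\nstream\n" % (obj_num, len_obj_num)
    )
    total = 0
    for chunk in chunks:  # nothing has to be cached, only counted
        out += chunk
        total += len(chunk)
    out += b"\nendstream\nendobj\n"
    # The separate length object is written only after all of the data.
    out += b"%d 0 obj\n%d\nendobj\n" % (len_obj_num, total)
    return bytes(out)
```

With banding, each band would be one call to the first function, so the writer only needs to cache one band of compressed data at a time.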

     

        --- Kari ---

    -----Original Message-----
    From: Hastings, Tom N
    Sent: Tuesday, March 11, 2003 5:49 PM
    To: Poysa, Kari; 'Rick Seeler'; 'Carl Kugler'
    Cc: ifx@pwg.org
    Subject: RE: IFX> PDF/is Issue.

    Kari,

     

    I think you summed up the tradeoff between the Sender and the Receiver
    simply when you said:

     

    "If we require the reader to be able to cache a page's worth of
    uncompressed data, surely we can require the writer to cache a page's
    worth of compressed data [in order to determine the length and send that
    length in the stream]."

     

    I assume that PDF has the notion of a length for each page, right? So
    we require that the Sender put in a length field for each page of data
    at the front of each page of data. Can that length field be sent with
    the data in some manner, so that the Sender doesn't have to know the
    lengths of all of the pages before sending any?

     

    Tom

    -----Original Message-----
    From: Poysa, Kari [mailto:Kari.Poysa@usa.xerox.com]
    Sent: Friday, March 07, 2003 15:04
    To: 'Rick Seeler'; 'Carl Kugler'
    Cc: ifx@pwg.org
    Subject: RE: IFX> PDF/is Issue.

    Rick, I bet this solution can be implemented, but it does have some
    problems for the reader that unfortunately I did not see earlier. The
    difficulty really is whether we want to make life easy for the streaming
    writer or the reader.

     

    If the length follows the image stream, the reader must scan the
    filtered stream to find the end of the stream. This can make the reader
    implementation both cumbersome and slow, especially if the stream has to
    be fully decoded during the PDF file parsing, instead of simply
    extracting the correct amount of binary data and passing it to a
    separate decompression module. The PDF file parser would have to know
    details of the compressed streams, which should really be of no interest
    to the PDF file parser module, and this makes creating applications from
    3rd party components harder.

     

    In addition, if the reader attempts to decode the stream, how much data
    should be cached and decoded at a time? If the end of stream is not
    found at first attempt, one has to pass additional data to the decoder
    and continue decoding from where previous data ended. This can delay
    achieving robust implementations. The alternative, searching for the
    "endstream" text, is not 100% reliable (although very close) and is a
    wasted step since no decompression is achieved yet.
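
A minimal sketch of the reader side when the length precedes the stream, assuming a direct Length value, a dictionary that ends right after it, and a single LF before "endstream" (all simplifying assumptions; the function name is hypothetical). The parser extracts exactly Length bytes and never inspects the compressed data.

```python
import io
import re


def read_stream_with_direct_length(f) -> bytes:
    """Read one stream object from a forward-only byte source."""
    header = bytearray()
    while not header.endswith(b"stream\n"):
        byte = f.read(1)
        if not byte:
            raise ValueError("end of input before 'stream' keyword")
        header += byte
    match = re.search(rb"/Length\s+(\d+)\s*>>", header)
    if match is None:
        raise ValueError("Length is not a direct value")
    length = int(match.group(1))
    data = f.read(length)  # opaque bytes, handed on to a decompressor
    if len(data) != length:
        raise ValueError("stream shorter than declared Length")
    if f.read(len(b"\nendstream")) != b"\nendstream":
        raise ValueError("declared Length does not end at 'endstream'")
    return data


# The payload may contain the text "endstream" without confusing the reader.
payload = b"abc\r\nendstream junk"
raw = b"5 0 obj\n<< /Length %d >>\nstream\n" % len(payload)
raw += payload + b"\nendstream\nendobj\n"
assert read_stream_with_direct_length(io.BytesIO(raw)) == payload
```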

     

    This issue is really at the heart of what "streamable" means, and also
    has a big impact on what kind of low resource applications PDF/is can be
    used for. I think we should consider it a "MUST" for the writer to
    prefix the stream with its length, since the goal is to make the file
    format streamable especially at a low resource reader. If we require the
    reader to be able to cache a page's worth of uncompressed data, surely
    we can require the writer to cache a page's worth of compressed data.

     

    I do understand Ira McDonald's note about streaming writers (see separate
    email). Possibly the question of whether to prefix or postfix image
    streams with their lengths should be a negotiable capability between
    the sender and receiver?

     

        --- Kari ---

    -----Original Message-----
    From: Rick Seeler [mailto:rseeler@adobe.com]
    Sent: Thursday, March 06, 2003 2:37 PM
    To: 'Poysa, Kari'; 'Carl Kugler'
    Cc: ifx@pwg.org
    Subject: RE: IFX> PDF/is Issue.

    Kari,

     

    Yes, the stream length should precede the stream, if possible (this is
    allowed). But, in the case where the stream may be long, this may not
    be possible for the Producer. In that case, the length should be an
    indirect object reference to the length that should come immediately
    after the stream.

     

    As for your idea of scanning for an "endstream" that's followed by the
    size object: this still has the same problem as scanning for "endstream"
    alone; it just matches on more data and so has a smaller likelihood of
    a false occurrence.

     

    Given that, and what I discussed in my previous e-mail on this subject
    (to Rob Buckley), I think the best approach might be to:

    1) The Producer MUST always write the stream length of all 'Content
    Streams' and 'ICC Profile' streams immediately in the object dictionary
    (before the stream).

    2) When writing image streams, the Producer MAY either write the stream
    length before or after the stream, as they prefer.

    3) When the length of an image stream follows the stream (indirect
    object), the Consumer SHOULD decode the image stream to determine the
    stream length, when possible. But, the Consumer MAY (at its peril) scan
    for the 'endstream' marker.

     

    How does this sound as a solution?

     

     

    -Rick

    -----Original Message-----
    From: owner-ifx@pwg.org [mailto:owner-ifx@pwg.org] On Behalf Of Poysa,
    Kari
    Sent: Thursday, March 06, 2003 7:15 AM
    To: 'Carl Kugler'
    Cc: ifx@pwg.org
    Subject: RE: IFX> PDF/is Issue.

    In my opinion the goal should be to write the stream length immediately
    to the stream dictionary.

     

    Also, the likelihood that "endstream" exists in the data is small.
    We could also require that if a low resource streaming writer is not
    able to add the length directly into the stream dictionary, then the PDF
    object for the length MUST immediately follow the stream object. This
    way, the reader can scan for "endstream" (but of course only if the
    length was not in the stream dictionary) and make sure that it is the
    correct "endstream" by verifying that it is immediately followed by
    something that looks like a length object. Could reader implementers
    comment on this?
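
A rough sketch of the verification idea above, assuming the whole candidate region is already in a buffer and that "looks like a length object" means an optional "endobj" followed by "N G obj <integer> endobj". The regular expression and function name are illustrative assumptions, not taken from the PDF/is draft.

```python
import re

# Optional "endobj" closing the stream object, then a bare integer object.
LENGTH_OBJECT = re.compile(rb"\s*(?:endobj\s+)?\d+\s+\d+\s+obj\s+(\d+)\s+endobj")


def length_after_stream(buf: bytes, data_start: int) -> int:
    """Return the length declared by the object that follows 'endstream'.

    Occurrences of 'endstream' not followed by a length object are skipped.
    """
    keyword = b"endstream"
    pos = data_start
    while True:
        pos = buf.find(keyword, pos)
        if pos < 0:
            raise ValueError("no 'endstream' followed by a length object")
        match = LENGTH_OBJECT.match(buf, pos + len(keyword))
        if match:
            return int(match.group(1))
        pos += 1  # false hit inside the data, keep scanning


data = b"abc\r\nendstream junk"  # contains a misleading 'endstream'
buf = b"stream\n" + data + b"\nendstream\nendobj\n"
buf += b"12 0 obj\n%d\nendobj\n" % len(data)
n = length_after_stream(buf, len(b"stream\n"))
assert buf[7:7 + n] == data
```

This lowers the chance of a false match but does not remove it: binary data could still contain "endstream" followed by bytes that match the pattern.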

     

    I think introducing an additional filter like ASCII85 just for spotting
    the end of stream adds unnecessary complexity to both writer and reader,
    increases file sizes and also requires more memory and processing as the
    stream cannot be passed directly to a decompressor.

     

        --- Kari ---

    -----Original Message-----
    From: Carl Kugler [mailto:kugler@us.ibm.com]
    Sent: Wednesday, March 05, 2003 10:50 AM
    Cc: ifx@pwg.org
    Subject: RE: IFX> PDF/is Issue.

    I like the chunking approach. It is efficient, reliable, and has low
    overhead for reasonably sized chunks. Also fits well in a typical
    implementation that writes a chunk of data at a time.

            -Carl

     

    "Zehler, Peter" <PZehler@crt.xerox.com>
    Sent by: owner-ifx@pwg.org

    03/05/2003 05:00 AM

            
            To: "'Rick Seeler'" <rseeler@adobe.com>, ifx@pwg.org
            cc:
            Subject: RE: IFX> PDF/is Issue.

    Rick,
    Why not just increase the size of the length field signature? Could
    this be done by the addition of data or comments in the length object or
    by adding another object? I don't know PDF very well. I don't think we
    need 0% probability of confusion, just a statistically insignificant
    chance.
    Pete
      

    Peter Zehler
    XEROX
    Xerox Architecture Center
    Email: PZehler@crt.xerox.com
    Voice: (585) 265-8755
    FAX: (585) 265-8871
    US Mail: Peter Zehler

            Xerox Corp.
           800 Phillips Rd.
           M/S 128-30E
           Webster NY, 14580-9701

    -----Original Message-----
    From: Rick Seeler [mailto:rseeler@adobe.com]
    Sent: Tuesday, March 04, 2003 1:29 PM
    To: ifx@pwg.org
    Subject: IFX> PDF/is Issue.

    During prototyping of PDF/is the following problem arose:
      
    How does the Consumer know when the end of a data stream (See section
    3.2.7 of [pdf]) is reached? Normally, in a PDF, the Consumer would
    consult the stream length field. The problem here is where to put the
    length field. If the length were placed before the stream, the Consumer
    would know how long the stream is. This requires the Producer to know
    the stream's length before writing it to the Consumer. If, instead, the
    length were written at the end of the stream, this would solve the
    Producer's problem but the Consumer would not know how to find the
    length since they can't identify, 100% of the time, where the stream
    ends and where the length object is.
      
    An example will illustrate:
    First, the normal case...
      
    stream
    sdljfiwefnwfubrevurewliysnhr;hgawebfz;h;uwre (lots of binary data
    here)....
    84trhdvfyu7wgf4.nbdrgur4uaru4gb
    endstream
    12 0 obj
    3456 <- the length of the previous stream.
    endobj
      
    But, what if the data looked like this...
      
    stream
    sdljfiwefnwfubrevurewliysnhr;hgawebfz;h;uwre (lots of binary data
    here)....
    endstream <- the binary data could have a string of bytes
    that looked like this.
    84trhdvfyu7wgf4.nbdrgur4uaru4gb
    endstream
    12 0 obj
    4567 <- the length of the previous stream.
    endobj
      
    Of course, you could look at the bytes after the appearance of the word
    'endstream' to see if this is really the end of the stream; but you can
    always come up with a stream that could match your parsing algorithm's
    expectations (although with decreasing percentage of occurrence).
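
The failure case in the second example can be reproduced with a small sketch. It assumes a Consumer that simply stops at the first "endstream" it sees; the function name and the sample bytes are made up for illustration.

```python
def naive_stream_data(buf: bytes) -> bytes:
    """Return the bytes between 'stream' and the FIRST 'endstream' (unsafe)."""
    start = buf.index(b"stream\n") + len(b"stream\n")
    end = buf.index(b"\nendstream", start)
    return buf[start:end]


payload = b"\x00\x01\nendstream\x02\x03"  # binary data that mimics the keyword
buf = b"stream\n" + payload + b"\nendstream\n"
buf += b"12 0 obj\n%d\nendobj\n" % len(payload)

truncated = naive_stream_data(buf)
assert truncated == b"\x00\x01"  # the scan stopped inside the data
assert truncated != payload
```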
      
    Possible solutions:
    1) Write all data using ASCII85 encoding (See Section 3.3.2 of [pdf]).
    This will increase stream lengths by 25%. ASCII85 has a stream
    delimiter which would solve this problem -- the end of the stream can be
    known for certain and the length field can be placed after the stream.
    2) Require the Producer to write the stream length before any stream
    (the streams would stay binary). The Producer can use banding to break
    up large images into small enough chunks so the Producer can cache the
    stream before sending.
    3) Offer a combination of 1 & 2. The Producer would cache streams if
    possible, but may use ASCII85, if necessary.
    4) The Producer must make certain that no stream contains the series of
    bytes "\0D\0Aendstream" in its data. This is how the spec is
    defined currently -- but this may be too onerous for the Producer.
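
A small sketch of what solution 1 buys and costs, using Python's standard `base64.a85encode` as a stand-in for a PDF ASCII85 filter (an assumption: it uses the same base-85 alphabet and "z" shortcut, and the "~>" end-of-data marker is appended by hand here). The helper names are hypothetical.

```python
import base64

EOD = b"~>"  # ASCII85 end-of-data marker


def ascii85_stream(data: bytes) -> bytes:
    """Encode binary stream data so that its end can be found without a length."""
    return base64.a85encode(data) + EOD


def ascii85_unstream(encoded: bytes) -> bytes:
    """Decode up to the first end-of-data marker."""
    end = encoded.index(EOD)
    return base64.a85decode(encoded[:end])


data = b"\r\nendstream" + bytes(range(1, 190))  # 200 bytes of awkward data
encoded = ascii85_stream(data)

assert b"~" not in encoded[:-2]          # "~" is outside the ASCII85 alphabet
assert ascii85_unstream(encoded) == data
assert len(encoded) - len(EOD) == len(data) * 5 // 4  # 25% larger
```

Every 4 input bytes become 5 output characters, which is where the 25% growth comes from; the marker adds 2 more bytes per stream.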
      
    Any other ideas? I'm personally leaning toward solution #3.
      

    -Rick



    This archive was generated by hypermail 2b29 : Wed Mar 12 2003 - 11:19:40 EST