Tech Question for Network/PHP Guys

Discussion in 'PHP' started by Nick W, Nov 14, 2004.

  1. #1
    Hi all,

    Here's the scenario: if you wanted to check a page to see whether it was new (i.e. it had been updated since you last checked), what PHP functions would you be looking at?

    How could that be done using the absolute minimum of bandwidth? I'm really looking to get an idea of what scaling issues would be involved if I wanted to check hundreds of thousands of URLs an hour...

    Sorry it's vague, be vague in return :) just looking for some starting points for research into a project idea..

    thx..
     
    Nick W, Nov 14, 2004 IP
  2. THT

    #2
    You could use one of the various methods to get the page (cURL, file_get_contents) and compare it to the copy from the previous time you checked.

    Comparing the file size should be enough.
     
    THT, Nov 14, 2004 IP
  3. Nick W

    #3
    Sure, I know how to get the page (I like to use cURL), but as I said above, what is the best way to do this with minimal bandwidth usage?

    I really couldn't go and grab hundreds of thousands of pages an hour :)
     
    Nick W, Nov 14, 2004 IP
  4. xml

    #4
    cURL supports HTTP/1.1, I think, and it supports gzip-compressed pages on sites that enable compression.
     
    xml, Nov 14, 2004 IP
  5. digitalpoint

    #5
    If the server is sending the Last-Modified header, you can use the curl option to return the headers like so:

    curl_setopt($ch, CURLOPT_HEADER, TRUE);
    However, it looks like that returns the headers AND the content, and the content isn't needed. You may want to call cURL from a shell instead:

    shell_exec ('curl -I http://www.digitalpoint.com');
    as an example...
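For anyone who would rather stay inside PHP than shell out, here is a minimal sketch using the cURL extension's CURLOPT_NOBODY option, which makes the request a HEAD request so only headers come back. The function names fetch_headers() and parse_last_modified() and the 10-second timeout are assumptions made for illustration, not anything from this thread.

```php
<?php
// Sketch: request headers only, then pull out Last-Modified as a UNIX timestamp.
// fetch_headers() and parse_last_modified() are hypothetical helper names.

function fetch_headers(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: no body is transferred
    curl_setopt($ch, CURLOPT_HEADER, true);         // include the headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the output instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, tune for your crawl
    $raw = curl_exec($ch);
    return is_string($raw) ? $raw : null;           // null when the request failed
}

function parse_last_modified(string $rawHeaders): ?int
{
    // Header names are case-insensitive, hence the "i" flag.
    if (preg_match('/^Last-Modified:[ \t]*(.+?)[ \t\r]*$/mi', $rawHeaders, $m)) {
        $ts = strtotime($m[1]);
        return $ts === false ? null : $ts;
    }
    return null; // the server did not send Last-Modified
}
```

Usage would be along the lines of parse_last_modified(fetch_headers($url) ?? ''), comparing the result with the timestamp stored from the previous check. A null result means the server gave no usable Last-Modified value, so this check alone cannot tell you anything about that page.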
     
    digitalpoint, Nov 14, 2004 IP
  6. Nick W

    #6
    Thanks Shawn, are there any other ways to check last modified without actually downloading the page?
     
    Nick W, Nov 14, 2004 IP
  7. digitalpoint

    #7
    If you are checking with a PHP script that resides on the same server as the files you are checking, you can do this:

    filemtime($filename);
    That will give you a UNIX timestamp of the last time the file was modified.
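As a rough sketch of how that timestamp could drive a "has it changed?" check for local files: the helper name file_changed_since() is made up for this example, and treating a missing file as changed is an assumption you may want to reverse.

```php
<?php
// Sketch: decide whether a local file was modified after the last check.
// file_changed_since() is a hypothetical helper name.

function file_changed_since(string $filename, int $lastChecked): bool
{
    clearstatcache(true, $filename); // stat results are cached within a request
    if (!is_file($filename)) {
        return true;                 // assumption: a missing file counts as changed
    }
    $mtime = filemtime($filename);   // UNIX timestamp, or false on failure
    if ($mtime === false) {
        return true;                 // assumption: an unreadable mtime counts as changed
    }
    return $mtime > $lastChecked;
}
```

You would store time() after each run and pass it back in as $lastChecked on the next one.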
     
    digitalpoint, Nov 14, 2004 IP
  8. john_loch

    #8
    Instead of GET, POST, etc., use the request method designed exclusively for what you want to do: HEAD.

    Read more: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.4

    Excerpt: "This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification..."

    Regarding dated or altered content specifically, caching is also evaluated (by intermediaries) when the HEAD request is used:

    Excerpt: "If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale."

    In other words, the above are established standard practices for what you want to do :)
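To make the second excerpt concrete, here is a small sketch that compares the four fields it names between a stored HEAD response and a fresh one. It assumes the headers have already been parsed into arrays keyed by lowercase header name; headers_indicate_change() is a hypothetical name.

```php
<?php
// Sketch: compare stored header values with freshly fetched ones.
// Assumes both arrays are keyed by lowercase header name.

function headers_indicate_change(array $old, array $new): bool
{
    // The four fields named in the RFC 2616 excerpt.
    $validators = ['content-length', 'content-md5', 'etag', 'last-modified'];
    foreach ($validators as $name) {
        // Only compare fields that both responses actually sent.
        if (isset($old[$name], $new[$name]) && $old[$name] !== $new[$name]) {
            return true;
        }
    }
    return false; // no shared field differs
}
```

Note that a false result only means no shared field differs. If the server sends none of these fields, the headers say nothing either way and the page would have to be fetched to know.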

    Cheers,

    JL
     
    john_loch, Nov 14, 2004 IP
  9. sarahk

    #9
    Personally I think you are stuck getting the text. Consider a dynamic website. The Last-Modified check will give you the age of the file that generates the page, not the age of the content (which is probably held in a database). I can't see any way around it.

    Then you need to consider how to identify how much has changed.
    * Is a small change acceptable?
    * How many bytes?
    * But not if the change is just in a link? Must it be in the content?
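One possible way to answer those questions in code, as a sketch only: strip the markup so that a change to a link target alone is ignored, then count roughly how many bytes of visible text differ and compare that with a threshold. The names visible_text() and content_changed() and the 20-byte default are illustrative assumptions.

```php
<?php
// Sketch: compare visible text only, with a tolerance for small edits.
// visible_text(), content_changed() and the 20-byte default are assumptions.

function visible_text(string $html): string
{
    $text = strip_tags($html);                           // drops tags, including href values
    $text = preg_replace('/\s+/', ' ', $text) ?? $text;  // collapse whitespace runs
    return trim($text);
}

function content_changed(string $oldHtml, string $newHtml, int $minBytes = 20): bool
{
    $old = visible_text($oldHtml);
    $new = visible_text($newHtml);
    if ($old === $new) {
        return false; // only markup or link targets differ
    }
    // similar_text() returns the number of matching bytes; the rest counts as changed.
    $changed = max(strlen($old), strlen($new)) - similar_text($old, $new);
    return $changed >= $minBytes;
}
```

similar_text() gets slow on large pages, so for a crawl of this size you might compare an md5() of the visible text first and only run the byte count when the hashes differ.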

    Hope this helps - and good luck!

    Sarah
     
    sarahk, Nov 15, 2004 IP