About connecting to websites often

Oftentimes when you're writing a program that needs to connect to a website, you'll have to have it connect very frequently, if not every second.

So for instance, if you're writing some sort of feed where you want your program to check for topics of a particular type (e.g. topics in a particular subforum), or for replies to a particular topic, and then send information back to the user when something changes...

In such situations you want your program to give the user the information almost instantly, so it would have to connect to the website and check for changes at least once a minute.

So if I'm connecting to a website that often, maybe even every single second, CONTINUOUSLY for DAYS ON END, would the website consider that harmful and block me?

Take this forum for example. Would this forum get annoyed?



Also, to keep track of something on a website, do you need to download all of its HTML every single time? Or is there a more efficient way, say when you know exactly when the change is going to happen?
In general you can structure your program to keep a connection alive.
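As a sketch of that idea in Python's standard library: `http.client` lets you send several requests over one persistent (keep-alive) connection instead of reconnecting each time. The host name here is just a placeholder, and the connection object is injectable so the pattern is easy to test:

```python
import http.client

def fetch_many(paths, host="www.example.com", conn=None):
    """Fetch several paths over ONE persistent connection (HTTP keep-alive)
    instead of opening a new connection for every request."""
    conn = conn or http.client.HTTPSConnection(host, timeout=10)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path,
                         headers={"User-Agent": "feed-watcher/0.1"})
            resp = conn.getresponse()
            # The response must be read fully before the connection
            # can be reused for the next request.
            bodies.append(resp.read())
    finally:
        conn.close()
    return bodies
```

HTTP/1.1 keeps the connection open by default, so the savings come simply from reusing the same connection object rather than constructing a new one per request.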

Also, to keep track of something on a website, do you need to download all of its HTML every single time? Or is there a more efficient way, say when you know exactly when the change is going to happen?

Not in general, although this is necessary on occasion: if you're making a habit of scraping HTML for links, Perl is your friend.
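For the occasions when you do have to scrape links out of raw HTML, the same idea works in Python's standard library too. A minimal link collector (a sketch, not tailored to any particular site) might look like:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

For anything beyond pulling out links, a dedicated HTML library is usually worth it, since real-world markup is rarely this clean.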

Ideally, the server provides an API that can be called (usually after authentication) by issuing the proper HTTP requests. The result would be exactly what you ask for, often in JSON or some other format (XML, plain text).
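As an illustration of the API route, here is a sketch of asking a hypothetical JSON endpoint only for topics newer than the last one seen. The URL, the `after` query parameter, and the response shape (`{"topics": [{"id": ..., "title": ...}]}`) are all assumptions for the example, not a real site's API:

```python
import json
import urllib.request

API_URL = "https://example.com/api/topics"  # hypothetical endpoint

def fetch_new_topics(since_id, url=API_URL):
    """Ask the server only for topics newer than since_id."""
    req = urllib.request.Request(
        f"{url}?after={since_id}",
        headers={"Accept": "application/json",
                 "User-Agent": "feed-watcher/0.1"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_topics(resp.read().decode("utf-8"))

def parse_topics(payload):
    """Decode the assumed JSON body into (id, title) pairs."""
    data = json.loads(payload)
    return [(t["id"], t["title"]) for t in data["topics"]]
```

The point is that the response is exactly the data you asked for, instead of a whole HTML page you then have to pick apart.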

Obviously if you know when the change occurs, just wait until after the update to ask for the result. If only the server knows when the change occurs, you could either
a.) have the client poll the server reasonably often, or
b.) have the server notify the client when the change occurs, for example using WebSockets or HTTP callbacks (webhooks).
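Option (a) can be sketched as a simple generator that calls some fetch function on a fixed interval and reports only the values that actually changed. `fetch` here is any callable you supply; nothing about it is specific to one site:

```python
import time

def poll_for_change(fetch, interval=60.0, max_polls=None):
    """Call fetch() every `interval` seconds and yield each NEW value,
    skipping polls where nothing changed."""
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        value = fetch()
        if value != last:
            yield value
            last = value
        polls += 1
        time.sleep(interval)
```

Because the fetch function and the interval are parameters, the same loop works whether you poll once a second or once a minute, and it's trivial to test with a stub.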

So if I'm connecting to a website that often, maybe even every single second, CONTINUOUSLY for DAYS ON END, would the website consider that harmful and block me?
Many websites rate-limit requests and connections. Egregious violations can sometimes get you blocked (and your authentication revoked, etc.).
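A common way to stay on the right side of rate limits is exponential backoff: when a request fails or is refused, double the wait before retrying, up to some cap. A minimal sketch of the delay schedule:

```python
def backoff_delays(base=1.0, cap=300.0, retries=6):
    """Yield exponentially growing retry delays in seconds,
    doubling after each failure up to `cap`."""
    delay = base
    for _ in range(retries):
        yield delay
        delay = min(delay * 2, cap)
```

In a real client you'd `time.sleep()` on each yielded delay between retries, so repeated failures quickly back you off to the cap instead of hammering the server.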
I don't think you should worry about being blocked. Instead you should worry more about what the site owners want, even those that have no automatic blocking in place. Try not to be more of a burden than regular usage of that website would be, especially if you are going to share your program with others.

Instead of having multiple copies of the program all polling the same pages redundantly, you could set up a server that polls the website at a suitable rate, and then let the programs poll your server as much as you want.
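The core of that intermediary server is just a cache with a time-to-live: hit the upstream site at most once per interval, and serve every client request in between from the cached copy. A sketch (the clock is injectable purely so the behavior is easy to test):

```python
import time

class CachedFetcher:
    """Poll the upstream site at most once per `ttl` seconds; client
    requests in between are served from the cached copy."""
    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self.fetch = fetch
        self.ttl = ttl
        self.clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self.clock()
        if self._stamp is None or now - self._stamp >= self.ttl:
            # Cache is cold or stale: refresh from upstream.
            self._value = self.fetch()
            self._stamp = now
        return self._value
```

However many clients call `get()`, the upstream site sees at most one request per `ttl`, which is exactly the "one polite poller, many local consumers" arrangement described above.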
Peter, that's one more thing I hadn't considered. Okay, so this is all just theory for right now; by the way, I know next to nothing about scraping the web. I wanted to make a utility application for a specific game that would scrape information from a website and reorganize it for the user. Is there any way I could host something like this online, for free or for cheap? That way I wouldn't be a burden on the website.

By the way, where's a good starting point for web scraping? I know C++ is definitely not the thing for it.
