
05-21-2006, 09:11 AM
|
|
Neophyte
Join Date: Jul 2003
Posts: 1
|
|
FYI, Mobipocket is now owned by Amazon. You could try that route. In any case, the eNews scraper feature has now been abandoned by Mobipocket. It was a nice feature that never quite took off. With RSS now so prevalent, and with web services from both Microsoft and Google that mobilize web pages, the need for the scraper is much smaller than before.
The next version of the eNews creator only supports RSS, so it's a matter of time until the problem goes away... Until then your solution is probably the best one. If you support full text RSS no user would ever need to use the screen scraper anyway.

05-21-2006, 09:28 AM
|
|
Intellectual
Join Date: Jul 2007
Posts: 172
|
|
Hitting about every 2.5 seconds? That really is some dumbass programming.

05-21-2006, 01:36 PM
|
|
Swami
Join Date: Jun 2007
Posts: 4,909
|
|
As the author of HTTP filter & Mobipocket Web Companion Support Pack, I know the inner secrets of the Web Companion, the communication and the way it collects news quite well.
(Note that I don't know any 4.8+ versions of Web Companion. They may have switched protocol in the meantime. I don't even know if they have introduced a central server-based solution instead. When I, some 3 years ago, worked on the Support Pack, they denied any kind of central cache server because of the EU laws that don't allow for any kind of content caching.)
Indeed Web Companion may be a real pain in the back as it downloads new content quite often. This is why for example The Register banned Web Companion some 37-38 months ago.
There is a solution, though. Just ask users who still want to stick to Web Companion for collecting news to manually disable Web Companion's automatic download and sync only manually, to minimize the impact. Of course, it will still be impossible to distinguish between automated and manual downloads. To help with this, you could implement some kind of login; that is, MWC-based eNews download could be account-based. MWC is capable of this, even without my additional tools; see the POSTLogin section in the User Manual of my tools. And if a particular user takes too much bandwidth (because he or she doesn't disable the automatic news download), just ban that account.
RSS is indeed nice, but, especially when used with the right additional tools (for example, my pack), an advanced user can achieve FAR more with Web Companion than with anything else: for example, downloading and filtering forum pages in one go for offline reading, exactly as he or she wants. (That's only one of the additional features I've added to Web Companion.)
__________________
Microsoft MVP - Mobile Devices. You may want to check out my Smartphone & Pocket PC Magazine Expert Blog.

05-21-2006, 01:46 PM
|
|
Swami
Join Date: Jun 2007
Posts: 4,909
|
|
Quote:
|
Originally Posted by frankenbike
Hitting about every 2.5 seconds? That really is some dumbass programming.
|
Not really - the bot runs on users' desktop PCs (at least in older versions it did; I don't know whether the new version is centralized, but because of the stupid EU laws I don't think so). That is, if, say, 5000 users add a site to their MWC and synchronize content, say, 5 times a day, that makes 25000 hits a day in total.
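For a rough sense of scale, the arithmetic can be sketched as below. The user count and sync frequency are the illustrative figures from this post, not measurements, and `pages_per_sync` is an assumed extra parameter (each sync may well fetch more than one page).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_hits(users: int, syncs_per_day: int, pages_per_sync: int = 1) -> int:
    """Total requests per day when every user's desktop client hits the site directly."""
    return users * syncs_per_day * pages_per_sync


def mean_seconds_between_hits(hits_per_day: int) -> float:
    """Average gap between requests if they were spread evenly over a day."""
    return SECONDS_PER_DAY / hits_per_day


hits = daily_hits(users=5000, syncs_per_day=5)
print(hits)                                       # 25000
print(round(mean_seconds_between_hits(hits), 2))  # 3.46
```

So a hit every few seconds does not require a single badly behaved bot; a few thousand independent desktop clients are enough.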

05-21-2006, 01:50 PM
|
|
Swami
Join Date: Jun 2007
Posts: 4,909
|
|
Quote:
|
Originally Posted by francks
If you support full text RSS no user would ever need to use the screen scraper anyway.
|
Yup, if full RSS is offered (or a well-done RSS client that is also able to collect linked pages), RSS can be pretty good (particularly because there's no HTML layout markup overhead). In a lot of (advanced) cases, however, Web Companion offers unique capabilities.
(Not that I would use the latter any more - GPRS/EDGE is cheap/fast enough to be able to get news almost always online.)

05-21-2006, 04:03 PM
|
|
Executive Editor
Join Date: Aug 2006
Posts: 23,548
|
|
Quote:
|
Originally Posted by francks
If you support full text RSS no user would ever need to use the screen scraper anyway.
|
And if I support full text RSS, no user would have a reason to come to the site. There's no revenue model in full text RSS feeds on mobile devices.
__________________
Thanks for visiting our forums!

05-21-2006, 04:15 PM
|
|
Executive Editor
Join Date: Aug 2006
Posts: 23,548
|
|
Quote:
|
Originally Posted by Menneisyys
There is a solution, though. Just ask users that would still want to stick to Web Companion to collect news to manually disable Web Companion and only manually sync to minimize the impact.
|
No solution is viable if it relies on people voluntarily doing something because I ask them to.
Since you know so much about this, do you know if their eNews Client respects a robots.txt ban? I sure hope it does, because if all this traffic is coming from 5000 different MobiPocket users running the scraper on their desktop PCs, it will be impossible to do an IP address block.
I've now posted in their forum:
http://www.mobipocket.com/forum/viewtopic.php?t=1724

05-21-2006, 04:18 PM
|
|
Executive Editor
Join Date: Aug 2006
Posts: 23,548
|
|
I banned it yesterday evening, but looking at a log file analysis of the first few hours after midnight, the eNews Client is still hitting our server. :evil: So it seems they do not respect a robots.txt ban. I'm not surprised, considering how little respect they have for content publishers. So the question is, how do I stop them? Do you know how The Register stopped them? Do they have a blacklist of URLs that their client will not scrape that I can get myself added to?
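For reference, this is roughly what a robots.txt ban asks a client to do. robots.txt is purely advisory: the server cannot enforce it, so it only helps if the client checks it before fetching. The sketch below uses Python's standard robots.txt parser, and the agent token `ExampleNewsScraper` is a made-up placeholder, not the eNews Client's real identifier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one scraper by its agent token, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleNewsScraper
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/news.html"
print(parser.can_fetch("ExampleNewsScraper", url))  # False
print(parser.can_fetch("SomeBrowser", url))         # True
```

A client that never performs this check will keep showing up in the logs, which matches what the log analysis above suggests.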

05-21-2006, 04:45 PM
|
|
Swami
Join Date: Jun 2007
Posts: 4,909
|
|
Quote:
|
Originally Posted by Jason Dunn
I banned it yesterday evening, but in looking at a log file analysis of the first few hours after midnight, the eNews Client is still hitting our server. :evil: So it seems they do not respect a robots.txt ban. I'm not surprised, considering how little respect they have for content publishers. So the question is, how do I stop them? Do you know how The Register stopped them?
|
They didn't use an IP ban; they simply return nothing (except the Forbidden HTTP status code) when they detect the eNews Creator HTTP User-Agent. That is, requests from all IPs reach the server, but only those not using the eNews Creator-specific User-Agent were (are) served actual content.
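A minimal sketch of that kind of User-Agent block, written as a Python WSGI wrapper purely for illustration (this is not The Register's actual setup, and most sites would do it in the web server configuration instead). The substring in `BLOCKED_AGENT_SUBSTRINGS` is a placeholder; the real eNews Creator User-Agent string has to be taken from your own access logs.

```python
# Placeholder value; replace with the User-Agent substring seen in your own logs.
BLOCKED_AGENT_SUBSTRINGS = ("enews-creator-placeholder",)


def is_blocked(user_agent: str) -> bool:
    """True if the User-Agent header contains any blocked substring (case-insensitive)."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)


def block_scraper(app):
    """Wrap a WSGI app so blocked agents get an empty 403 instead of content."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain"), ("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    body = b"article content"
    start_response("200 OK",
                   [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
    return [body]


application = block_scraper(site)
```

The request still reaches the server, but the reply is only a status line and a few headers, so the bandwidth cost per blocked hit is tiny compared with serving the full page.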

05-21-2006, 04:53 PM
|
|
Swami
Join Date: Jun 2007
Posts: 4,909
|
|
Quote:
|
Originally Posted by Jason Dunn
Since you know so much about this, do you know if their eNews Client respects a robots.txt ban? I sure hope it does, because if all this traffic is coming from 5000 different MobiPocket users running the scraper on their desktop PC, it will be impossible to do an IP address block.
|
Dunno if the structure of the communication has changed in the last three years (I stopped working on the MWC Support Pack almost exactly three years ago, when I switched completely to online content access). I don't think much has changed, because the last time I installed the latest Mobi, it put almost exactly the same client on my desktop PC as the 3-4-year-old versions did.
These clients directly connect to the subscribed Web site, download entire HTML (not RSS or anything more machine-friendly) pages and parse the useful content out of them.
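To illustrate what "parsing the useful content out" of a full HTML page means, here is a small sketch using Python's standard HTML parser. It is not how MWC itself works internally; the page markup and the `article` class name are invented for the example.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects text found inside <div class="article"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # <div> nesting depth, counted from the article container
        self.chunks = []  # text fragments found inside the container

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1          # nested <div> inside the article
        elif ("class", "article") in attrs:
            self.depth = 1           # entering the article container

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


PAGE = """<html><body>
<div class="nav">Home | Forums</div>
<div class="article"><h1>Headline</h1><p>Body text.</p><div>Nested note.</div></div>
<div class="footer">Copyright</div>
</body></html>"""

extractor = ArticleTextExtractor()
extractor.feed(PAGE)
print(" ".join(extractor.chunks))  # Headline Body text. Nested note.
```

Note that the whole page, navigation and footer included, has to be downloaded before the article text can be cut out of it, which is why this approach costs the site far more bandwidth than a feed would.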
Most sites (see the example of The Register) defend themselves against Mobi clients by returning the simple, bandwidth-friendly Forbidden header when they encounter the User-Agent.
User-Agents, fortunately, can't be set in MWC - at least this was the case 3 years ago. That is, casual MWC/Mobi users have no way of making MWC collect articles from sites that ban it. Only by using a proxy server that offers transparent User-Agent spoofing (one that sits between MWC and the Web server and identifies itself to the Web server with a standard desktop IE header, completely overriding the User-Agent sent out by MWC) can anyone download anything from a protected Web site.