As the owner of a large website, I don't care what you think. I block by default and whitelist when I decide it's in my interest.
If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.
Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.
As the user of large websites, I don't care. I'm not going to read the TOS, and I will continue to scrape what I like since it makes my life more convenient. Like OP, when blocked I'll just drive my scraping through a web browser, which is what I've done for years on various sites that never provided APIs.
"As the user of large websites I don't care". Are you sure? Do you want your OK Cupid or LinkedIn profile to be crossposted on another website without your knowledge?
Putting it behind a signup page with terms that don't allow sharing is not "making it public".
And while in the US that may "just" be treated as unauthorized access, in the EU, if you make the data public it's also a violation of the Data Protection Directive, putting you at risk of prosecution in every EU country from which you have included data.
You may be right from a risk minimisation perspective. But for a lot of data the risk in the case of exposure is low enough that it is a totally valid risk management strategy to assume that legal protections will be a sufficient deterrent to prevent enough of the most blatant abuses.
Eh, not really. The Data Protection Directive doesn’t even apply here – if the first party (OKCupid) made it available to a third party (the scraper), then the first party can be held in violation, but not the third party.
If you have control of personally identifiable data, it's likely that at least some of the EU data protection rules will apply to you regardless of how you got it.
Yes but as you say, they apply regardless. More specifically, they apply to data that you have (and are storing), not the act of obtaining it.
As a private individual it's not hard to comply either, for private use. If you publish it, it becomes a different story, because it's PII. And, as soon as it's in possession of a company, they need to comply with more rules about securely storing it, etc. (this isn't enforced very well, though). Private individuals can't be held to that because there's (in theory) no legal way to check it.
Why is it a double standard? Google scraping usually benefits the site with increased traffic and revenue, in a way most other scraping does not. Saying "you can scrape me if it benefits me" isn't totally in keeping with the principles of the open web, but it's not hypocritical.
At the risk of stating the obvious, this is a double standard simply because there are two standards: one for Google and one for others. I can't speak for the poster you were replying to, but whilst I see it as logical self-interested behaviour by site owners, it still feels unfair.
There isn't: the function for this standard includes expected benefit as an input. Every standard has inputs, so that certainly isn't the quality for making something a double standard. The only remaining quality is how unfair it feels, so it would probably be better to just address that, since it is obviously the only thing you disagree about.
Your example is a case of discrimination, but the economic rationale is unquestionable. There is a tremendous upfront cost for new employees, who are not valuable contributors for some lengthy ramp up period and furthermore accrue experience over the course of employment. So the lifetime value curve for any given employee is typically skewed left.
My point was that when you call something a double standard, you're arguing two things of equal value have been judged differently under the same standard. But by acknowledging they've been judged differently, you're acknowledging that there is a judgement, a standard, that applies the same to both, and produces the results you object to. What you really object to is the fairness of the qualities checked by the standard.
Since the outcome of calling things that, vs calling them a double standard is the same, I think most people already know and have no trouble with this. My protests were worthless.
It could gain value if there were certain whitelisted judgable aspects (like expected value), and judgements that aren't based on things from the whitelist are considered outside the scope of a standard. Then, calling the standard unfair and calling it a double standard would have a different meaning (if only in some contrived way, since any aspect is just an argument away from the whitelist)
It's their site they can block whatever they want. The problem is the stupid far reaching conclusion that this is trespassing.
Even normal trespassing laws are way too overreaching (see how it is handled in the UK for a saner example) but now you have the amazing possibility of remote trespassing.
The fun part is that it's just a matter of someone hiding something that says you cannot access the site in a place that you have to access the site in order to read -- the ToS. Suing people over this is idiotic.
The real problem is the involvement of Govt, and this kind of absurdity regarding ToS, EULAs and so on, is something that has been going on for decades. If you have the money you can make Govt your personal watch dogs.
That's a terrible analogy. Your home is private, websites are not. The fact is that websites are posted online for all to see, so it's more like saying certain people at a park may take pictures while others are not allowed. That's unfair. If everyone could take pictures, it would be fair. Yes, someone with an old bright bulb camera might be annoying people, but nobody said "fair" meant all players would be nice or that having a "fair" policy would somehow be more beneficial to the website owner. It's not, that's why site owners are selective. So they have a double standard, but it's for their benefit, not that of the site visitors (be they human or bot).
How about the analogy of an art gallery disallowing photography? Is the gallery being hypocritical when they allow the local paper to take photos for publicity, or when they permit an archivist that has a known reputation to take photos for archival purposes?
You can still deal with the old bright bulb cameras: you can have rules which apply to everyone. So you can make a rule at the park that pictures are allowed, but only without flash, or that only digital cameras are allowed, or only digital cameras with the fake-shutter noises turned off, etc. As long as the rule applies to everyone equally, it's fair, even if you think the rule is silly.
For websites, it's not fair to have different rules for Google than others. What would be fair is some kind of rule about how often visitors can visit, how much they're allowed to download, etc.
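A rule about "how often visitors can visit" that applies to everyone equally could look something like the sketch below: a token bucket per client, with the same limits for every client and no allow-list. This is only an illustration; the class names, the choice of token bucket, and the numbers are my own assumptions, not anything a particular site is known to use.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Permits bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class UniformLimiter:
    """Hypothetical limiter: every client key (e.g. an IP) gets identical limits."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(client)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = UniformLimiter(rate=1.0, capacity=2.0)  # assumed limits
    for i in range(3):
        print("request", i, "allowed:", limiter.allow("203.0.113.7", now=0.0))
```

The point of the design is that the client's identity only selects which bucket to use; it never changes the limits, so Google and a hobbyist's scraper would be throttled identically.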
Personally, though, I think all this is total BS. Sites are open to the public, but they also serve the whims of their owners. If the site wants to prevent access to people from a certain IP range, that should be their right. If they don't want any scrapers, that should be their right too, or if they want to allow Google and not anyone else, that should also be their right. What isn't right is that they can use the government to enforce these arbitrary rules. If they want to block my scraper, that's fine, if they can do it on their end technologically. If they want to block my IP, they can do that too. But suing me or having the cops come to my door because they're too incompetent or lazy to do these things technologically is unacceptable. The role of government is not to enforce arbitrary policies made up by business owners.
To be fair, many companies which take anti-scraping seriously will also take inputs like geographic origin of a request into consideration when applying request throttling and filtering.
Google is basically algorithms built on top of a scraping service. It's unfair to competitors (and potential disruptors) to restrict access to data that Google can fetch without limits.
And all smart websites should include a ToS that says you are not allowed to access their data, so they can sue for trespassing anyone that they don't like selectively.
The overreach of government into this, and also into piracy (which I do not condone, but I think arresting people for it is way too much), is what makes me want the system to collapse under its own weight. Like some website suing members of Congress for visiting it while violating the ToS, in this case.
I also secretly wanted Oracle to win vs Google, so that cloning an API would be piracy, and that would extend to it being a crime to purchase pirated goods, which would make all clean room reverse engineering a criminal activity. That would lead to anyone who uses a PC without an authentic IBM BIOS (look up Phoenix BIOS) being arrested, in theory, so even the US president would fall under that. It would have been a glorious shitstorm if Oracle had won and IBM had taken that precedent to its logical conclusion: the computer world would have failed, and the law would either be made even more arbitrary or be fixed, but at least it would be shown how idiotic the state of affairs was.
Your idea about Oracle winning and society coming crashing to a halt is ridiculous and wouldn't have happened. Your flaw is believing that the law and the government will work with logical precision, so that a flaw in the law will, like an infinite recursive loop in programming code, cause complete disaster. It doesn't work that way. There's plenty of cases where the law is clearly broken (see civil forfeiture vs. the 4th Amendment to the US Constitution), yet nothing is done. That's because the government is run by humans, and they'll enforce things the way they want. Double standards happen all the time with law, and it takes big, expensive court cases to sort them out, and of course that only happens when some moneyed interest wants to fix it (which is why civil forfeiture is still a big thing--they're not going after extremely wealthy people or corporations with it). While IBM is certainly large enough to bring a big case like you suggested, the US government is far bigger and can simply invent a legal way of ignoring them, just as was done when the SCOTUS decided to rule in favor of using Eminent Domain to seize private property to hand over to commercial interests.
because i may also come to the point where i am a direct competitor to google, but i will never get there because i can't scrape any site like they can.
your next argument may very well be a very racist one with the very same excuse you used above.
And if you have some way to identify yourself as a potential competitor to google and not some jackass trying to scrape email addresses or spam comments forms, I'm all ears.
Worse is that Google tries to stop scraping. It's like they don't want anyone to see past the first page of results.
They can scrape your website, and then they prevent you from scraping your own data back.
The whole process is silly; it reflects the duct tape and chicken wire nature of the www.
No one should have to "scrape" or "crawl".
Data should be put into an open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.
This is to bridge the gap until we reach a more content-addressable system (cf. location-based).
Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.
"Crawling" should not be necessary.
No one should have to store HTML tags and other window dressing for data.
To give an example, there is a lot of free open source software mirrored all over the internet, mostly on ftp servers, but also on http, rsync, etc.
If you use Linux or BSD you probably are using some of this software. If you use the www, then you are probably accessing computers that use this software. If you drive a new Mercedes you are probably using some of this software. There are a lot of copies of this code in a lot of places.
Is that centralized? Does anyone hosting a mirror ("repository") "own" the software? Is it the same person or entity hosting every mirror?
Compare Google's copies of everyone else's data, also replicated in a lot of places around the world. Who "owns" this data?
Double standard? The difference is that Googlebot is built to be unobtrusive. I can easily build a scraper that will quickly DDoS a site. Take LinkedIn, for example: if they allow 10,000 people to send 100 scraping requests per second every day, then that is stolen bandwidth that LinkedIn has to pay for, and the scrapers get free data. The difference is that Google has standards that sites usually benefit from, not to mention that they let you disallow their bot. It just doesn't work the same way with some random developer building a scraper.
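The "disallow their bot" mechanism is robots.txt, and a scraper that wants to behave the same way can check it before fetching. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the example.com URLs, and the `my-scraper` user agent are all made up for illustration; the rules are fed in as text so nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything except /private/,
# every other user agent is disallowed entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly instead of downloading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    checks = [
        ("Googlebot", "https://example.com/jobs/123"),
        ("Googlebot", "https://example.com/private/x"),
        ("my-scraper", "https://example.com/jobs/123"),
    ]
    for agent, url in checks:
        print(agent, url, "allowed:", parser.can_fetch(agent, url))
```

Note that a file like this one is exactly the "one rule for Google, another for everyone else" setup being argued about in this thread, and that robots.txt is only a request: nothing in it technically stops a scraper that ignores it.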
I agree that Googlebot is well behaved. When it detects your site is slowing down, it will back itself off. Unfortunately, this is often to your detriment.
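For anyone writing their own crawler, "back off when the site slows down" can be approximated with something like the sketch below. To be clear, this is not Googlebot's actual algorithm, which I don't know; the function name, the one-second threshold, and the doubling and 10% recovery factors are all assumptions picked for illustration.

```python
def next_delay(
    current_delay: float,
    response_seconds: float,
    slow_threshold: float = 1.0,  # assumed: responses slower than this mean "site is struggling"
    min_delay: float = 0.1,
    max_delay: float = 60.0,
) -> float:
    """Return the pause (in seconds) to use before the next request.

    Doubles the delay after a slow response and shrinks it by 10% after a
    fast one, always clamped to the range [min_delay, max_delay].
    """
    if response_seconds > slow_threshold:
        proposed = current_delay * 2.0
    else:
        proposed = current_delay * 0.9
    return max(min_delay, min(max_delay, proposed))


if __name__ == "__main__":
    delay = 1.0
    # Simulated response times: the site gets slow, then recovers.
    for seconds in [0.2, 0.3, 2.5, 3.0, 0.4, 0.2]:
        delay = next_delay(delay, seconds)
        print(f"response took {seconds:.1f}s, next delay {delay:.2f}s")
```

Backing off quickly and recovering slowly is a deliberate choice here: a crawler that errs on the side of waiting costs its owner freshness, while one that errs the other way costs the site owner capacity.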
In my experience, on a large site, Google will often slurp as much as you let it, upwards of hundreds of pages per second.
That is a lot! It's also still an order of magnitude less than big content sites. Not taking anything away from what must be a successful website to get a consistent 300 pages/minute crawl rate, but only to illustrate magnitude.
I was curious, so I just checked the stats through webmaster tools. For the last 90 days, the low is 450,000 daily crawled pages, average is 650,000, and yesterday was the high of 1,130,000 (780 per minute). Ouch.
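For anyone who wants to check the per-minute figure, the conversion from daily crawled pages is just a division by the 1,440 minutes in a day. A quick sketch using the three daily numbers quoted above (the function name is mine):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440


def pages_per_minute(daily_pages: int) -> float:
    """Average crawl rate per minute, assuming crawling is spread evenly over the day."""
    return daily_pages / MINUTES_PER_DAY


if __name__ == "__main__":
    for label, daily in [("low", 450_000), ("average", 650_000), ("high", 1_130_000)]:
        print(f"{label}: {pages_per_minute(daily):.1f} pages/minute")
```

That works out to roughly 312, 451 and 785 pages per minute, so the "780 per minute" quoted for the peak day is in the right range. These are averages; the real crawl is unlikely to be perfectly even across the day.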
This particular site is top 5,000 Alexa. The content changes every minute, and Google is fast at picking up those changes. The last cache of the homepage was 7 minutes ago from Google.
There's definitely a correlation between my sites' Google rankings, their organic traffic, and their crawl rate. The other sites I run are Alexa top 30,000 and top 100,000. They all feature dynamically changing content, but Google is definitely using a higher crawl rate on my higher ranking sites. This isn't a surprise, though: Google has limited resources, like everyone, and they'll focus those resources in a way that provides the most benefit.
Edit: If you're talking about the correlation between daily ranking and daily crawl rate for an individual site, then no, I'm not aware of any patterns. For example, the graph is flat for organic traffic and total indexed pages, but the crawl rate jumps up and down as mentioned, and it doesn't appear to relate on a daily basis.