One of the great things about the social web is the culture of sharing that it fosters. One person writes a blog post; another quotes it, disagrees with some parts, corrects a passage or two, and adds some more information; a third synthesizes it all into a cool infographic. It's a little like the coolest potluck dinner in history.
But every great potluck dinner seems to attract the folks who have no intention of cooking a damn thing. They're there to gorge on as much Jell-o salad and chicken fingers as they can before someone notices they didn't bring any dishes themselves.
And in the Web 2.0 world, we call one particular breed of these people scrapers.
Scrapers use automated software to prowl the web for content, scoop it up and drop it in their own blogs - often without attributing it or linking back to the original source. Vendors of that software advertise it as a way of creating a fully-populated web site out of thin air - and gloss over the fact that it's essentially automated plagiarism.
One such site is the Bookmark Devil blog. Bookmark Devil itself is, well, a baffling site - something like a very low-traffic Digg - built, it appears, mainly to promote the author's search-engine optimization software.
But it also includes a blog, and that's where the scraping comes in.
The front page of their blog features an article that bears what we'll call a striking similarity to the Wikipedia entry on social bookmarking.
Scroll down a little, and you'll see the list of "last posts". And if you've been following our blog recently, many of those titles will look mighty familiar. They sure look familiar to Alex and me: we wrote them.
Each of those titles links to a page that gives no hint that the author was anyone other than Bookmark Devil. And each is a blog post lifted from this very site.
So who's getting hurt?
Like most authors, we're usually happy to see our work reaching a wider audience. That's why we blog, and why we publish a news feed.
But authors deserve credit for their work, and thinkers deserve credit for their ideas. (For that matter, we deserve to be held responsible if our ideas end up a little wonky.)
People who use search engines ought to be able to find what they're looking for without having to sift through a morass of link farms and spam blogs.
And when you do land on a page, you deserve to know you're dealing with the real thing... especially if you want to engage the writers in conversation, which is ultimately what makes the social web social.
How to tell if your blog is being scraped
Fortunately, scrapers usually want their work to be visible to search engines. And that makes those same search engines your allies in hunting down sites that are taking liberties with your content.
Services like Google Blog Search and Technorati allow you to create RSS feeds from searches on terms such as your name, your blog's name and your blog's URL - any of which a scraper can inadvertently include in a post they lift from your blog's news feed. Monitor those feeds (which you should be doing anyway, to see who's talking about you and your blog), check out any hits, and you'll know as soon as a scraper strikes.
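If you'd rather let a script do the first pass over those search feeds, here's a minimal sketch in Python. It assumes you've already downloaded the feed as plain RSS 2.0 text; `find_suspect_items`, `watch_terms` and `own_domain` are names made up for illustration, not part of any search service's API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def find_suspect_items(rss_xml, watch_terms, own_domain):
    """Return (title, link) pairs for feed items that mention a watch term
    but are hosted somewhere other than our own domain."""
    root = ET.fromstring(rss_xml)
    terms = [term.lower() for term in watch_terms]
    suspects = []
    for item in root.iter("item"):
        title = item.findtext("title") or ""
        link = (item.findtext("link") or "").strip()
        body = item.findtext("description") or ""
        text = f"{title} {body}".lower()

        # Skip hits on our own site (including subdomains).
        host = urlparse(link).netloc.lower()
        on_own_site = host == own_domain or host.endswith("." + own_domain)

        if not on_own_site and any(term in text for term in terms):
            suspects.append((title, link))
    return suspects
```

Anything it returns is only a lead: a hit could just as easily be someone quoting you fairly, so you still need to look at each page yourself.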
Been scraped before, or feeling especially vigilant? Conduct periodic manual searches on distinctive phrases in your most recent blog posts, and see if they're turning up somewhere they shouldn't.
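Once a search turns up a suspicious page, you can get a rough measure of how much of your post it contains. The sketch below is one simple way to do that, by comparing overlapping runs of words (sometimes called shingles). It assumes you've already pulled the plain text out of both pages; the function names and the default run length of eight words are arbitrary choices for illustration.

```python
import re


def shingles(text, size=8):
    """Return the set of lowercase word runs of the given length in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_ratio(original, suspect, size=8):
    """Fraction of the original's word runs that also appear in the suspect.

    Returns 0.0 when the original is shorter than one run.
    """
    own = shingles(original, size)
    if not own:
        return 0.0
    return len(own & shingles(suspect, size)) / len(own)
```

A ratio near 1 suggests the whole post was lifted; a low ratio is more likely a quotation. Where the line falls between the two is a judgment call the code can't make for you.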
When you do find hits, ask yourself if you're really being scraped, or if this qualifies as fair use (fair dealing in Canada). A lawyer can help you clear up any gray areas. One big question to ask: is the site that uses your content claiming it for its own, or acknowledging you as the writer and linking back to you? If the latter, they may still be violating your rights, but at least they aren't plagiarists.
How to fight back
Before we go any further: if you think you may want to seek compensation or pursue legal remedies, then stop reading and call a lawyer. They'll be able to advise you.
Still with us? Great.
Preventive medicine is the best kind. Add a copyright notice to your site, spelling out just what you do and don't permit. Depending on your needs, Creative Commons could have the solution for you.
Designing your blog template to add a byline and a link to your site to every post means that, even if it gets scraped, people will be able to see where the original article came from.
If you're feeling extra-geeky, add that attribution to just your news feed (so it doesn't bother visitors to your site). If you're feeling less-than-extra-geeky, FeedBurner lets you do that easily with its FeedFlare feature - look for the Attribution FeedFlare, and add it to your feed.
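For the extra-geeky route, here's a rough sketch of what feed-only attribution could look like in Python. This is not how FeedBurner's FeedFlare works; it's a hand-rolled illustration that assumes a plain RSS 2.0 feed, and `add_attribution`, `author` and `site_url` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html import escape


def add_attribution(rss_xml, author, site_url):
    """Append a byline and a link back to the original post to every
    item description in an RSS 2.0 feed, and return the new feed text."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        # Link to the post itself; fall back to the site if it has no link.
        link = (item.findtext("link") or site_url).strip()
        desc = item.find("description")
        if desc is None:
            desc = ET.SubElement(item, "description")
        footer = (
            f"<p>Originally published by {escape(author)} at "
            f'<a href="{escape(link, quote=True)}">{escape(link)}</a>.</p>'
        )
        # ElementTree re-escapes the HTML when the feed is serialized.
        desc.text = (desc.text or "") + footer
    return ET.tostring(root, encoding="unicode")
```

Because the footer lives only in the feed, visitors to your site never see it, but a scraper that republishes the feed wholesale carries it along. A scraper that strips HTML or rewrites links will defeat it, so treat it as one layer of protection rather than a guarantee.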
Okay - let's say you're being scraped. First, take a breath and decide what outcome you want. Do you want the content removed? Do you just want an attribution to you and a link to your site? Do you want something in between - say, attribution and a link, but also the removal of all but an excerpt?
Now look at the site. If it isn't jammed to the gills with banner ads, AdSense blocks and text links, there's actually a chance the site's owner doesn't understand that what they're doing is wrong.
Look for contact information, and drop them a polite but firm note to let them know you're unhappy, and to tell them what you want them to do about it. Give them a deadline - 48 hours is reasonable.
If they respond, great. They may ask for more time; use your judgement in deciding whether to give it to them. They may throw a fit - "don't you know everything on the Internet is free, dude?!" - in which case point to the notice on your site and restate your expectation for their action.
If that doesn't get you anywhere - if you receive no response or an unsatisfactory one - then you have a few options:
- You can call a lawyer, if you think it's really worth fighting over. (Chances are it isn't - especially if the scraper is located in a faraway country, as so many are.)
- You can decide it's not worth the fight, and chalk it up to human nature.
- You can leave comments on the posts on the scraper's site that indicate they are being republished without permission - but those are liable to be deleted by the scraper.
- You can report them to the search services (Technorati, for example, lets you report spam blogs through its troubleshooting form), and - if the service decides your complaint is well-founded - have them removed from those services' databases. If they're using your content to drive traffic and search results, you'll be hitting them where it hurts.
- Or you can publish a post that explains why scraping is wrong and tells people what they can do about it... and wait to see if it turns up on the scraper's site. (Hola!) Just be sure you don't link to them and boost their Google ranking - although mentioning their web address in text (e.g. "bookmarkdevil.com/socialbookmarking") should be fine.
Whatever you decide, you may want to use your social media search services to stay on the lookout for people linking to and discussing the scraper's posts. A comment on those third-party posts may be all it takes to correct the record and assert your rightful ownership over your content.
And remember to keep it in perspective. You don't want to get so obsessive over this stuff that you forget why you share your thoughts and creativity in the first place (otherwise, you're in danger of becoming a record company).




August 19, 2008 - 1:54pm