The Article Scraper (Part 1 of the BH MMS)
Here’s the first part of a four-part series about building your own blackhat money-making system (BH MMS).
The Article Scraper scrapes articles from sites and outputs them in SQL format, ready to be imported into your Engine database. First, let’s cover the Scraper’s user interface, then the usage details.

1. Table Name: The name of the table in which your articles will be stored. You’ll create a new table for every niche you target. For instance, you could create tables for: golf, playstation, mortgage, insurance, internet_marketing, furniture, jewellery, etc.
2. SQL Output File Name: The name of the output file with an SQL extension.
3. Maximum Pages Per URL: Limits the number of pages scraped from a specific site. Some sites contain forums, so capping the number of pages spidered stops the Scraper from crawling an entire forum, which could take days!
4. Maximum Page Size: Some sites have very large pages which would certainly slow down the Scraper, so this option limits the size of scraped pages.
5. URLs: A list of sites to be scraped. One URL per line with the ‘http’ prefix.
6. Start (button): After entering the above settings, the Start button starts the process of scraping. When the scraping process is complete and the SQL file created, a ‘finished’ dialog box is displayed.
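To make settings 3 to 5 concrete, here is a minimal Python sketch of how a URL list and the two limits might be applied. It is not the Scraper's own code (that is written in VB.NET). The function names `parse_url_list` and `apply_limits` are hypothetical, and skipping oversized pages (rather than truncating them) and measuring page size in bytes are both assumptions about behaviour the post does not specify.

```python
from urllib.parse import urlparse


def parse_url_list(text):
    """Return the URLs in text, one per line, requiring an http(s) prefix."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = urlparse(line)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"URL must start with http:// or https://: {line!r}")
        urls.append(line)
    return urls


def apply_limits(pages, max_pages, max_page_size):
    """Yield at most max_pages (url, body) pairs from one site.

    Pages whose UTF-8 body exceeds max_page_size bytes are skipped
    (an assumption; the real Scraper may handle them differently).
    """
    kept = 0
    for url, body in pages:
        if kept >= max_pages:
            break
        if len(body.encode("utf-8")) > max_page_size:
            continue
        kept += 1
        yield url, body
```

Here `pages` stands in for whatever the crawler returns for a single site; in this sketch skipped pages do not count towards the page limit.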
The Scraper can be downloaded here; unzip it, then click Setup to install. If you have problems with the SpiderXlib ActiveX control, download it from here and install it.
Your firewall may block the Scraper from accessing the Web. In that case you might get an ‘Invalid URL’ message, which means either that a specified URL is invalid or that the Scraper cannot reach the Web. If this happens, modify your firewall settings.
Here’s the Article Scraper source code (written in VB.NET). Note there is little error checking, and I’ve kept it as simple as possible.
If you start playing with the Scraper soon, you’ll be able to use it properly when the other parts of the BH MMS are published. Simply create a MySQL database and add a table with two fields, ID and articles: ‘ID’ is auto-increment and the primary key, while ‘articles’ is a text field to hold the articles. Then scrape a few sites and import the articles into the database.
Once you’ve got some experience with the Scraper, your goal is to scrape several thousand articles per niche, then import them into the Engine database. This will provide the Engine algorithm with enough content to generate hundreds of thousands (or even millions) of unique posts. I’ll explain more when I publish Part 2.
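As a rough illustration of the two-field table and the kind of SQL file you would import, here is a small Python sketch. It is not the Scraper's actual output format: the helper names `sql_quote` and `build_sql`, the `INT UNSIGNED` and `TEXT` column types, and the backslash-style escaping (which assumes MySQL's default SQL mode) are all assumptions made for the example.

```python
def sql_quote(text):
    """Wrap text in single quotes, escaping backslashes and quotes.

    Assumes MySQL's default SQL mode, where backslash is an escape character.
    """
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_sql(table, articles):
    """Return SQL that creates the niche table and inserts each article."""
    # Allow only letters, digits and underscores in the table name.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unsafe table name: {table!r}")
    lines = [
        f"CREATE TABLE IF NOT EXISTS `{table}` (",
        "  `ID` INT UNSIGNED NOT NULL AUTO_INCREMENT,",
        "  `articles` TEXT NOT NULL,",
        "  PRIMARY KEY (`ID`)",
        ");",
    ]
    for article in articles:
        lines.append(
            f"INSERT INTO `{table}` (`articles`) VALUES ({sql_quote(article)});"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write a tiny example file that could be imported into MySQL.
    with open("golf.sql", "w", encoding="utf-8") as handle:
        handle.write(build_sql("golf", ["First article text.", "It's the second."]))
```

The resulting file can be loaded with the mysql command-line client or phpMyAdmin's import feature. If you insert rows from a program rather than a file, parameterized queries are the safer choice than hand-escaping.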
Update: I recommend you install the SpiderXlib ActiveX control here before running the Scraper (and obviously you must have .NET 2.0 installed as well). Let me know how you get on.
This post (and the knowledge of forthcoming posts) has made my day. Gonna start scraping away with this tool. Looking forward to the rest of the BHMMS posts
You da man, Brad!
Hi Brad,
Regarding your post:
” TDNAM for Cheap Domains with PageRank”
The download link to the PrChecker file is defunct.
Can you point me to a valid link?
Thank you for the BH MMS.
Almost immediately when I start the Article Scraper, it’s giving an Unhandled Exception. It says that it “attempted to read or write protected memory. This is often an indication that other memory is corrupt”.
I’ve tried this on 2 Windows XP SP2 machines and it’s doing the same thing on both.
Any idea…? Thanks again for this tool.
I actually wasn’t able to get it to run either, due to a COM error. However, Brad, drop me a message when you can - I’m writing a tool to assist with this.
The link for the scraper is not working.
I too have encountered the Unhandled Exception issue.
I’m unable to replicate the ‘Unhandled Exception’ error. However, download and install the spider activeX control then try again (and let me know): http://www.chilkatsoft.com/SpiderActiveX.asp
I’d downloaded & installed the spider ActiveX control before running Article Scraper, and got the Unhandled Exception.
Googling it seems it has something to do with .NET 2.0 ?? *shrug*
I found a .NET 2.0 Hotfix that seemed to have something to do with this (KB923028) but that didn’t seem to help.
@Donovan: I’ve updated the link to the prchecker.zip file.
@Chris B: I’ll look into this over the next few days - thanks for the error dump.
thanks Brad. made my day also
Just an update on what I’ve tried:
-Uninstalled The Article Scraper
-Downloaded the .NET 2.0 Framework from MS & reinstalled it
-Rebooted
-Re-ran the SpiderLib ActiveX
-Installed The Article Scraper
-Still getting the same Unhandled Exception
Thanks Brad.
same error issues here also.
Working great for me. Thank you Brad!
Just wondering if there’d been any update on the “Unhandled Exception” error that several people are getting. I’m gonna try it on another XP SP2 machine this weekend and see if it happens to a 3rd machine for me.
For anyone who’s gotten it to work - what’s the configuration (hardware/OS/.NET version) that you’re using?
No luck on 4 different XP SP2 computers that I’ve tried.
Anyone else?
will add it to the collection
Chris, pls solve unhandled problem of the program! Pls
No luck using windows xp sp1 or sp2, any ideas?
Sorta hung out to dry on this one it seems. I’d love to get this working, but I’ve tried multiple systems with no luck. Any new suggestions, Brad? Thanks!
Damn.. Thought it was going to work. I got the error at first. Downloaded the spiderxlib. Then I get this error: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at SPIDERXLib.SpiderClass.CrawlNext()
The status bar moves but nothing happens
*nudge*
I found your blog tonight and I’m busy reading every post and downloading everything.
I just wanted to say - you have the best captchas
any tips to find URLs to scrape?
Working great for me. Thank you Brad!
It’s amazing and working great for me, though I found some errors that I should be able to clear up easily. Some links do not work. This article scraper will be very useful to me. Thanks, Brad!