The Article Scraper (Part 1 of the BH MMS)

Here’s the first part of a four-part series about building your own blackhat money-making system (BH MMS).

The Article Scraper scrapes articles from sites and outputs them in SQL format, ready to be imported into your Engine database. First, let’s cover the Scraper user interface, then the usage details.

Article Scraper UI

1. Table Name: The name of the table in which your articles will be stored. You’ll create a new table for every niche you target. For instance, you could create tables for: golf, playstation, mortgage, insurance, internet_marketing, furniture, jewellery, etc.

2. SQL Output File Name: The name of the output file with an SQL extension.

3. Maximum Pages Per URL: This limits the number of pages scraped from a specific site. Some sites contain forums, so limiting the number of pages spidered prevents the Scraper from scraping an entire forum, which could take days!

4. Maximum Page Size: Some sites have very large pages which would certainly slow down the Scraper, so this option limits the size of scraped pages.

5. URLs: A list of sites to be scraped. One URL per line with the ‘http’ prefix.

6. Start (button): After entering the above settings, the Start button starts the process of scraping. When the scraping process is complete and the SQL file created, a ‘finished’ dialog box is displayed.
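
The six settings above can be pictured as a small configuration object. The sketch below is in Python rather than the Scraper's VB.NET, and every name in it (ScraperSettings, validate, within_limits) is hypothetical. It only illustrates the checks the UI description implies; it is not how the Scraper itself is written, and the byte unit for page size is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ScraperSettings:
    """Hypothetical mirror of the UI fields described above."""
    table_name: str
    output_file: str
    max_pages_per_url: int
    max_page_size: int  # assumed to be in bytes; the post does not state the unit
    urls: list = field(default_factory=list)

    def validate(self):
        """Return a list of problems; an empty list means the settings look usable."""
        problems = []
        if not self.table_name.replace("_", "").isalnum():
            problems.append("table name should use only letters, digits and underscores")
        if not self.output_file.lower().endswith(".sql"):
            problems.append("output file name needs an .sql extension")
        if self.max_pages_per_url < 1:
            problems.append("maximum pages per URL must be at least 1")
        if self.max_page_size < 1:
            problems.append("maximum page size must be at least 1")
        if not self.urls:
            problems.append("at least one URL is required")
        for url in self.urls:
            # The UI asks for one URL per line, including the 'http' prefix.
            if not url.startswith(("http://", "https://")):
                problems.append(f"URL is missing the http prefix: {url}")
        return problems


def within_limits(settings, pages_done, page_bytes):
    """True if another page of this size may still be processed for the current URL."""
    return (pages_done < settings.max_pages_per_url
            and page_bytes <= settings.max_page_size)
```

A page counter per URL plus a size check like within_limits is one simple way to get the behaviour described in items 3 and 4: the crawl of a site stops once the page cap is reached, and oversized pages are skipped.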

The Scraper can be downloaded here; unzip it, then click Setup to install. If you have problems with the SpiderXlib ActiveX control, download it from here and install it.
Your firewall may block the Scraper from accessing the Web, in which case you might get an ‘Invalid URL’ message. This message means either that a specified URL is invalid or that the Scraper cannot access the Web. If this happens, modify your firewall settings.

Here’s the Article Scraper source code (written in vb.net). Note there is little error checking, and I’ve kept it as simple as possible.

If you start playing with the Scraper soon, you’ll be able to use it properly when the other parts of the BH MMS are published. Simply create a MySQL database and add a table with two fields, ID and articles: ‘ID’ is auto-increment and the primary key, while ‘articles’ is a text field to hold the articles. Then scrape a few sites and import the articles into the database.
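
To make the table layout and the SQL file concrete, here is a minimal Python sketch that builds the kind of file described: a table with an auto-increment ID primary key and a text field named articles, followed by one INSERT per article. The function names (sql_quote, build_sql) are hypothetical, and including a CREATE TABLE statement in the file is an assumption; the Scraper's actual output may differ. The escaping is deliberately simple and assumes MySQL's default SQL mode, where backslash is an escape character.

```python
def sql_quote(text):
    """Escape a string for a single-quoted MySQL literal (backslash first, then quote)."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_sql(table_name, articles):
    """Return the text of an .sql file: one CREATE TABLE plus one INSERT per article."""
    lines = [
        f"CREATE TABLE IF NOT EXISTS `{table_name}` (",
        "  ID INT NOT NULL AUTO_INCREMENT,",
        "  articles TEXT NOT NULL,",
        "  PRIMARY KEY (ID)",
        ");",
    ]
    for article in articles:
        # ID is omitted so that MySQL assigns it through AUTO_INCREMENT.
        lines.append(
            f"INSERT INTO `{table_name}` (articles) VALUES ({sql_quote(article)});"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sql("golf", ["It's a par 3.", "Second article."]))
```

A file like this can be imported through phpMyAdmin's Import tab or with the mysql command-line client.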

Once you’ve got some experience using the Scraper, your goal is to scrape several thousand articles per niche, then import them into the Engine database. This will provide the Engine algorithm with enough content to generate hundreds of thousands (or even millions) of unique posts. I’ll explain more when I publish Part 2.

Update: I recommend you install the SpiderXlib ActiveX control here before running the Scraper (and obviously you must have .NET 2.0 installed as well). Let me know how you get on.



24 Comments so far

  1. Chris B on October 9th, 2007

    This post (and the knowledge of forthcoming posts) has made my day. Gonna start scraping away with this tool. Looking forward to the rest of the BHMMS posts :) You da man, Brad!

  2. Donovan on October 9th, 2007

    Hi Brad,

    Regarding your post:

    “TDNAM for Cheap Domains with PageRank”

    The download link to the PrChecker file is defunct.

    Can you point me to a valid link?

    Thank you for the BH MMS.

  3. Chris B on October 9th, 2007

    Almost immediately when I start the Article Scraper, it’s giving an Unhandled Exception. It says that it “attempted to read or write protected memory. This is often an indication that other memory is corrupt”.

    I’ve tried this on 2 Windows XP SP2 machines and it’s doing the same thing on both.

    Any idea…? Thanks again for this tool.

  4. Bofu2U on October 9th, 2007

    I actually wasn’t able to get it to run either, due to a COM error. However, Brad, drop me a message when you can - I’m writing a tool to assist with this. :)

  5. gordon gekko on October 9th, 2007

    The link for the scraper is not working.

  6. Donovan on October 9th, 2007

    I too have encountered the Unhandled Exception issue.

  7. Brad on October 9th, 2007

    I’m unable to replicate the ‘Unhandled Exception’ error. However, download and install the spider activeX control then try again (and let me know): http://www.chilkatsoft.com/SpiderActiveX.asp

  8. Chris B on October 9th, 2007

    I’d downloaded & installed the spider ActiveX control before running Article Scraper, and got the Unhandled Exception.

    Googling it seems it has something to do with .NET 2.0 ?? *shrug*

    I found a .NET 2.0 Hotfix that seemed to have something to do with this (KB923028) but that didn’t seem to help.

  9. Brad on October 9th, 2007

    @Donovan: I’ve updated the link to the prchecker.zip file.

    @Chris B: I’ll look into this over the next few days - thanks for the error dump.

  10. jeff on October 9th, 2007

    thanks Brad. made my day also

  11. Chris B on October 9th, 2007

    Just an update on what I’ve tried:
    -Uninstalled The Article Scraper
    -Downloaded the .NET 2.0 Framework from MS & reinstalled it
    -Rebooted
    -Re-ran the SpiderLib ActiveX
    -Installed The Article Scraper
    -Still getting the same Unhandled Exception

  12. Donovan on October 10th, 2007

    Thanks Brad.

  13. jeff on October 10th, 2007

    same error issues here also.

  14. MORO on October 11th, 2007

    Working great for me. Thank you Brad!

  15. Chris B on October 12th, 2007

    Just wondering if there’d been any update on the “Unhandled Exception” error that several people are getting. I’m gonna try it on another XP SP2 machine this weekend and see if it happens to a 3rd machine for me.

    For anyone who’s gotten it to work - what’s the configuration (hardware/OS/.NET version) that you’re using?

  16. Chris B on October 16th, 2007

    No luck on 4 different XP SP2 computers that I’ve tried.

    Anyone else?

  17. roguespammer on October 16th, 2007

    will add it to the collection

  18. Aslan on October 18th, 2007

    Chris, pls solve unhandled problem of the program! Pls

  19. Skino on October 30th, 2007

    No luck using windows xp sp1 or sp2, any ideas?

  20. Chris B on October 30th, 2007

    Sorta hung out to dry on this one it seems. I’d love to get this working, but I’ve tried multiple systems with no luck. Any new suggestions, Brad? Thanks!

  21. B on November 8th, 2007

    Damn.. Thought it was going to work. I got the error at first. Downloaded the spiderxlib. Then I get this error: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
    at SPIDERXLib.SpiderClass.CrawlNext()

    The status bar moves but nothing happens

  22. Chris B on November 9th, 2007

    *nudge*

    :)

  23. vingold on November 11th, 2007

    I found your blog tonight and I’m busy reading every post and downloading everything.

    I just wanted to say - you have the best captchas

  24. Tim Elfelt on November 18th, 2007

    any tips to find URLs to scrape?
