WebDew

PHP Web Scraper: the Easiest Way to Parse Web Pages


Developers sometimes have to scrape webpages to get the information they need. Sometimes this is easy, and other times it is very difficult. In this article we're going to write a simple PHP web scraper that can save you a lot of time and money.

Web scraping is the process of extracting data from a webpage. The two most popular techniques for parsing webpages are:

However, today we will look only at the first technique because it's the easiest one.

PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser is an extremely simple library for parsing webpages, and its only requirement is PHP 5 or later.

Now it's time to download the library. You can place it wherever you want, but make sure that your PHP script can load the file with an include or require statement.

The beauty of this library is its simplicity. Here are a few example uses of the PHP Simple HTML DOM Parser:

// Don't forget to include this library!
require_once 'simple_html_dom.php';

// Option 1. Get the DOM from a string of HTML
$webpage = '<div id="foo"><a href="/about">About</a></div>';
$html = str_get_html($webpage);

// Option 2. Get the DOM from a given URL
$html = file_get_html('https://webdew.tech/');

// Find all links ("a" tags) and print their HREF attributes
foreach($html->find('a') as $e)
    echo $e->href.'<br />';

// Find all images and print their SRC attributes
foreach($html->find('img') as $e)
    echo $e->src.'<br />';

// Find the DIV element with the ID of "foo"
foreach($html->find('div#foo') as $e)
    echo $e->innertext;

// Find all paragraphs with "testClass" class
foreach($html->find('p.testClass') as $e)
    echo $e.'<br />';

// Find all nested "li" tags inside of the "ul" list
foreach($html->find('ul li') as $e)
    echo $e->innertext.'<br />';

The above code snippet shows only the most useful methods. For the full list of methods of the PHP Simple HTML DOM Parser, see the official manual and the API reference.
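The snippets above echo results directly. In a real script you will often want the values in an array instead, and you will want to handle the case where no DOM could be built. The following is a minimal sketch of that idea, assuming PHP 7 or later and that simple_html_dom.php sits next to the script. extract_links() is a hypothetical helper written for this article, not part of the library.

```php
<?php
// Assumed layout: simple_html_dom.php sits next to this script.
$library = __DIR__ . '/simple_html_dom.php';
if (is_file($library)) {
    require_once $library;
}

/**
 * Return the unique href values of all links in an HTML string.
 * extract_links() is a hypothetical helper, not part of the library.
 */
function extract_links(string $markup): array
{
    // Nothing to parse with if the library could not be loaded.
    if (!function_exists('str_get_html')) {
        return [];
    }

    // str_get_html() can return false, for example for an empty string.
    $dom = str_get_html($markup);
    if (!$dom) {
        return [];
    }

    $links = [];
    foreach ($dom->find('a') as $anchor) {
        // Skip "a" tags that have no href attribute.
        if ($anchor->href) {
            $links[] = $anchor->href;
        }
    }

    $dom->clear(); // free the memory held by the parser

    // Remove duplicates and re-index the array.
    return array_values(array_unique($links));
}

print_r(extract_links('<a href="/a">A</a> <a href="/a">A again</a> <a>no href</a>'));
```

Returning an array keeps the parsing step separate from the output step, so the same helper can feed a loop, a database insert, or a JSON response.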

Example: How to Parse Articles with a PHP Web Scraper

This example does a very simple thing: the script goes to the WebDew.tech homepage, follows the link to each recent article, and prints the heading and paragraphs it finds there.

// Don't forget to include this library!
require_once 'simple_html_dom.php';

// Get the DOM from a given URL
$html = file_get_html('https://webdew.tech/');

// Step 1. Find all article links
foreach($html->find('main article.preview h2 a') as $link) {

    // Step 2. Get the DOM of the article that the link points to
    $article = file_get_html($link->href);

    // Skip articles that could not be downloaded or parsed
    if (!$article) {
        continue;
    }

    // Step 3. Display the first H1 tag of the article
    echo $article->find('article h1', 0);

    // Step 4. Display each paragraph of the article
    foreach($article->find('article p') as $paragraph) {
        echo $paragraph;
    }

    // Free the memory used by this article's DOM
    $article->clear();

}
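The loop above passes each href straight to file_get_html(), which only works when the links on the page are absolute URLs. If a site uses relative links, they have to be resolved against the page address first. Here is a simplified sketch of that step, assuming PHP 7 or later. resolve_url() is a hypothetical helper, and it does not normalise "../" or "./" segments.

```php
<?php
/**
 * Resolve an href found on a page against that page's URL.
 * Simplified sketch: "../" and "./" segments are not normalised.
 */
function resolve_url(string $base, string $href): string
{
    // Already absolute, e.g. "https://example.com/post".
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'https';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative, e.g. "//cdn.example.com/post".
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    // Root-relative, e.g. "/blog/post".
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Document-relative, e.g. "post.html": keep the directory of the base.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $href;
}

echo resolve_url('https://webdew.tech/', '/blog/post'), "\n";
```

Inside the loop, the call would then become file_get_html(resolve_url('https://webdew.tech/', $link->href)).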

PHP Web Scraper Doesn't Work on Localhost?

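If you get the message “Warning: file_get_contents(): stream does not support seeking”, the cause is usually an older release of the library running on PHP 7.1 or later, where file_get_html() passes a negative offset to file_get_contents(). In that case, try the following code: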

// This call can trigger the warning, so don't use it
$html = file_get_html('https://webdew.tech/');

// Download the page yourself and parse the string instead
$html = str_get_html(file_get_contents('https://webdew.tech/'));
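The file_get_contents() workaround can be taken one step further by adding a timeout and checking whether the download succeeded before parsing. The sketch below assumes PHP 7.1 or later. fetch_page() is a hypothetical helper, and the timeout and User-Agent values are illustrative only.

```php
<?php
/**
 * Download a page and return its markup, or null on failure.
 * The timeout and User-Agent values are illustrative assumptions.
 */
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for the HTTP stream wrapper used by file_get_contents().
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/1.0',
        ],
    ]);

    // "@" hides the PHP warning; the return value is checked instead.
    $markup = @file_get_contents($url, false, $context);

    return $markup === false ? null : $markup;
}

// Usage (requires simple_html_dom.php to be loaded for str_get_html()):
// $markup = fetch_page('https://webdew.tech/');
// $html   = $markup === null ? false : str_get_html($markup);
```

Checking for null before calling str_get_html() lets the script tell a failed download apart from a page that could not be parsed.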

PHP Web Scraper: Conclusion

In this article we've shown you how to find a specific element or tag within the Document Object Model (DOM). In most situations this library will be all you need. Any further data processing is up to you, because only you know why you need a PHP web scraper.
