How to find a div in PHP - Fixing mistakes and finding solutions together with Examum.ru

This is my code:

<?php
    include('simple_html_dom.php');
    $html = file_get_html('http://www.google.com/search?q=BA236',false);
    $title=$html->find('div#ires', 0)->innertext;
    echo $title;
?>

It outputs all results of the Google search page for the query "BA236".

The problem is that I don't need all of them; the information I need is inside a div that has no id, class, or any other attribute.

The div I need is inside the first

<div class="g">

on the page, so maybe I should try something like this:

<?php
    include('simple_html_dom.php');
    $html = file_get_html('http://www.google.com/search?q=BA236',false);
    $title=$html->find('div[class=g], 0')->innertext;
    echo $title;
?>

But the problem is that when I load the page, it shows nothing except this:

Notice: Trying to get property of non-object in
C:xampphtdocs…simpletest2.php on line 4

So how can I get the div I'm searching for, and what am I doing wrong?

Edit:

Solution:

<?php
    include('simple_html_dom.php');
    $html = file_get_html('http://www.google.com/search?q=BA236',false);
    $e = $html->find("div[class=g]");
    echo $e[0]->innertext;
?>

Or:

<?php
    include('simple_html_dom.php');
    $html = file_get_html('http://www.google.com/search?q=BA236',false);
    $title=$html->find('div[class=g]')[0]->innertext;
    echo $title;
?>
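
Both variants above assume that at least one div matches; if the page contains no such element, indexing the result triggers the same kind of notice again. A minimal sketch of a guard follows. The helper name first_inner_html is hypothetical (it is not part of Simple HTML DOM) and it relies only on the innertext property used above; stand-in objects are used so the sketch runs without the library.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the inner HTML
// of the first match, or a default value when nothing was found.
function first_inner_html($found, string $default = ''): string
{
    // find() without an index returns an array; take its first item, if any.
    if (is_array($found)) {
        $found = $found[0] ?? null;
    }
    // find() with an index returns null when there is no such element.
    if (!is_object($found)) {
        return $default;
    }
    return (string) $found->innertext;
}

// Stand-in object so the sketch runs without the library.
$node = new stdClass();
$node->innertext = '<b>BA236</b>';

echo first_inner_html([$node]), "\n";           // array result, as from find('div[class=g]')
echo first_inner_html($node), "\n";             // single node, as from find('div[class=g]', 0)
echo first_inner_html([], 'no match'), "\n";    // empty result
echo first_inner_html(null, 'no match'), "\n";  // element not found
```

With the real library the call would look like first_inner_html($html->find('div[class=g]', 0), 'no match').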


IN BRIEF

PHP Simple HTML DOM is an HTML parsing library for PHP. The manual is translated here purely for convenience (some important details are easy to miss when reading it in a foreign language).

Quick start

Get HTML elements

// Create a DOM from a URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';

Modify HTML elements

// Create a DOM from a string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

Extract content from HTML

// Dump the contents (without tags) of the HTML
echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot

// Create a DOM from a URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

Quick way

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from an HTML file
$html = file_get_html('test.htm');

Object-oriented way

// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');

// Load HTML from a URL
$html->load_file('http://www.google.com/');

// Load HTML from an HTML file
$html->load_file('test.htm');

How to find HTML elements

Basics

// Find all links, returns an array of element objects
$ret = $html->find('a');

// Find the (N)th link, returns an element object or null if not found (zero-based)
$ret = $html->find('a', 0);

// Find the last link, returns an element object or null if not found (zero-based)
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute equals foo
$ret = $html->find('div[id=foo]');

Advanced

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all links and images
$ret = $html->find('a, img');

// Find all links and images that have a "title" attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> elements with class=hello
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center inside table tags
$es = $html->find('table td[align=center]');

Nested selectors

// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
    foreach($ul->find('li') as $li)
    {
        // do something...
    }
}

// Find the first <li> in the first <ul>
$e = $html->find('ul', 0)->find('li', 0);

Attribute filters
The following operators are supported in attribute selectors:

Filter	Description
[attribute]	Matches elements that have the specified attribute.
[!attribute]	Matches elements that don't have the specified attribute.
[attribute=value]	Matches elements that have the specified attribute with a certain value.
[attribute!=value]	Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]	Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]	Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]	Matches elements that have the specified attribute and it contains a certain value.
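
To make the comparison operators in the table concrete, here is a plain-PHP illustration of how they can be read for an attribute that is present. This is only a sketch: attr_filter_matches is a hypothetical name, not a Simple HTML DOM function; the comparison is case-sensitive by assumption; and the sample values (including the example.com URL) are made up.

```php
<?php
// Plain-PHP illustration of the comparison operators from the table above.
// attr_filter_matches() is a hypothetical helper, not part of Simple HTML DOM,
// and the comparison is case-sensitive by assumption.
function attr_filter_matches(string $value, string $op, string $expected): bool
{
    switch ($op) {
        case '=':  // exact value
            return $value === $expected;
        case '!=': // any other value
            return $value !== $expected;
        case '^=': // starts with
            return strncmp($value, $expected, strlen($expected)) === 0;
        case '$=': // ends with
            return $expected === ''
                || substr($value, -strlen($expected)) === $expected;
        case '*=': // contains
            return $expected === '' || strpos($value, $expected) !== false;
    }
    throw new InvalidArgumentException("Unsupported operator: $op");
}

var_dump(attr_filter_matches('1_message_id', '$=', '_message_id'));
var_dump(attr_filter_matches('http://example.com/a', '^=', 'http://'));
var_dump(attr_filter_matches('http://example.com/a', '*=', 'example'));
var_dump(attr_filter_matches('foo', '=', 'bar'));
```

The first three calls report a match and the last one does not, mirroring selectors such as div[id$=_message_id] or a[href^=http://].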

Text and comments

// Find all text blocks
$es = $html->find('text');

// Find all comment (<!--...-->) blocks
$es = $html->find('comment');

How to access the attributes of an HTML element

Get, set and remove attributes

// Get an attribute (if the attribute has no value, e.g. checked, selected..., it returns true or false)
$value = $e->href;

// Set an attribute (if the attribute has no value, e.g. checked, selected..., set it to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Determine whether an attribute exists
if(isset($e->href))
    echo 'href exist!';

Magic attributes

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

Attribute name	Usage
$e->tag	Read or write the tag name of the element.
$e->outertext	Read or write the outer HTML text of the element.
$e->innertext	Read or write the inner HTML text of the element.
$e->plaintext	Read or write the plain text of the element.

Examples

// Extract the contents of the HTML
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element by setting its outertext to an empty string
$e->outertext = '';

// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element
$e->outertext = '<div>foo</div>' . $e->outertext;

How to traverse the DOM tree

Examples

// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;

// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

Methods

You can also call these methods using camelCase names.

Returns	Method	Description
mixed	$e->children ( [int $index] )	Returns the Nth child object if index is set, otherwise returns an array of children.
element	$e->parent ()	Returns the parent of the element.
element	$e->first_child ()	Returns the first child of the element, or null if not found.
element	$e->last_child ()	Returns the last child of the element, or null if not found.
element	$e->next_sibling ()	Returns the next sibling of the element, or null if not found.
element	$e->prev_sibling ()	Returns the previous sibling of the element, or null if not found.

How to dump the contents of a DOM object

Quick way

// Dump the internal DOM tree back into a string
$str = $html;

// Print it!
echo $html;

Object-oriented way

// Dump the internal DOM tree back into a string
$str = $html->save();

// Dump the internal DOM tree back into a file
$html->save('result.htm');

How to customize parsing behavior

Callback function

// Write a function with the parameter "$element"
function my_callback($element) {
    // Hide all <b> tags
    if ($element->tag=='b')
        $element->outertext = '';
}

// Register the callback function by its name
$html->set_callback('my_callback');

// The callback function is invoked when the DOM is dumped
echo $html;


Answer by Joy Winters


I retrieve the code with curl and create a simple html dom object:

$cl = curl_exec($curl);  
$html = new simple_html_dom();
$html->load($cl);

Then I wanted to add the content of the div into an array called divs:

$divs = $html->find('div[.ClearBoth Box]');

Like this:

Array
(
    [0] => simple_html_dom_node Object
        (
            [nodetype] => 1
            [tag] => br
            [attr] => Array
                (
                    [class] => ClearBoth
                )

            [children] => Array
                (
                )

            [nodes] => Array
                (
                )

            [parent] => simple_html_dom_node Object
                (
                    [nodetype] => 1
                    [tag] => div
                    [attr] => Array
                        (
                            [class] => SocialMedia
                        )

                    [children] => Array
                        (
                            [0] => simple_html_dom_node Object
                                (
                                    [nodetype] => 1
                                    [tag] => iframe
                                    [attr] => Array
                                        (
                                            [id] => ShowFacebookButtons
                                            [class] => SocialWeb FloatLeft
                                            [src] => http://www.facebook.com/plugins/xxx
                                            [style] => border:none; overflow:hidden; width: 250px; height: 70px;
                                        )

                                    [children] => Array
                                        (
                                        )

                                    [nodes] => Array
                                        (
                                        )

Here is an example of the source code at the site:

<div class="ClearBoth Box">
          <div>
<i class="Icon SmallIcon ProductRatingEnabledIconSmall" title="gute peppige Qualität: Sehr empfehlenswert"></i>
<i class="Icon SmallIcon ProductRatingEnabledIconSmall" title="gute peppige Qualität: Sehr empfehlenswert"></i>
<i class="Icon SmallIcon ProductRatingEnabledIconSmall" title="gute peppige Qualität: Sehr empfehlenswert"></i>
<i class="Icon SmallIcon ProductRatingEnabledIconSmall" title="gute peppige Qualität: Sehr empfehlenswert"></i>
<i class="Icon SmallIcon ProductRatingEnabledIconSmall" title="gute peppige Qualität: Sehr empfehlenswert"></i>

              <strong class="AlignMiddle LeftSmallPadding">gute peppige Qualität</strong> <span class="AlignMiddle">(17.03.2013)</span>
          </div>
          <div class="BottomMargin">
            gute Verarbeitung, schönes Design,
          </div>
        </div>

Answer by Ana Andrade

I now simply want to find, in the source code, the content of a div with the class ClearBoth Box.


The right code to get a div with class is:

$ret = $html->find('div.foo');
//OR
$ret = $html->find('div[class=foo]');

Answer by Russell Malone


Finding elements by tag name

// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find all anchors and images, returns an array of element objects
$ret = $html->find('a, img');

// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find last anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', -1);

Finding elements by class name or id

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

Finding elements by attribute

// Find all <div> with the id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute equals foo
$ret = $html->find('div[id=foo]');

// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

Finding descendants

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find Nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> which class=hello
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');

Finding nested elements

// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
       foreach($ul->find('li') as $li)
       {
             // do something...
       }
}

// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);

Finding text blocks and comments

// Find all text blocks
$es = $html->find('text');

// Find all comment (<!--...-->) blocks
$es = $html->find('comment');

Answer by Cassius Lane

Check the source of the webpage and find out whether the hyperlinks follow some kind of pattern. If you look closely, you will find that all of them have class="postlink". This makes extracting them a piece of cake. The code further below shows how to filter HTML elements based on attribute values.

Data can be obtained from three main sources: a URL, a static file, or an HTML string. Use the following code to create a DOM from each of them.

<?php
 
include('simple_html_dom.php');
 
//to parse a webpage
$html = file_get_html("http://nimishprabhu.com");
 
//to parse a file using relative location
$html = file_get_html("index.html");
 
//to parse a file using absolute location
$html = file_get_html("/home/admin/nimishprabhu.com/testfiles/index.html");
 
//to parse a string as html code
$html = str_get_html("<html><head><title>Cool HTML Parser</title></head><body><h2>PHP Simple HTML DOM Parser</h2><p>PHP Simple HTML DOM Parser is the best HTML DOM parser in any programming language.</p></body></html>");
 
//to fetch a webpage in a string and then parse
$data = file_get_contents("http://nimishprabhu.com"); // or you can use curl too, like me :)
// Some manipulation with the $data variable, for e.g.
$data = str_replace("Nimish", "NIMISH", $data);
//now parsing it into html
$html = str_get_html($data);
 
?>

Suppose you want to find every image or every hyperlink on a webpage. We will use the find function to extract this information from the object. Here's how to do it using Simple HTML DOM Parser:

<?php
 
include('simple_html_dom.php');
 
$html = file_get_html('http://nimishprabhu.com/');
 
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
 $links[] = $a->href;
}
print_r($links);
 
//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
 $images[] = $img->src;
}
print_r($images);
 
//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
 $headlines[] = $header->plaintext;
}
print_r($headlines);
?>

Suppose you want to get the names of all input fields on a webpage, for example http://nimishprabhu.com/chrome-extension-hello-world-example.html. On that page there is a comment form with input fields. Please note that the comment box is a textarea element and not an input element, so it will not be detected. To detect the rest of the visible as well as hidden fields, you can use the following code:

<?php
 
include('simple_html_dom.php');
 
$url = 'http://nimishprabhu.com/chrome-extension-hello-world-example.html';
 
$html = file_get_html($url);
 
foreach($html->find('input') as $input) {
 echo $input->name.'<br />';
}
 
// Output for above script :
// author
// email
// url
// submit
// comment_post_ID
// comment_parent
 
?>

<?php
 
include('simple_html_dom.php');
 
$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
 
$html = file_get_html($url);
$links = array();
foreach($html->find('a[class="postlink"]') as $a) {
 $links[] = $a->href;
}
 
print_r($links);
 
?>

There is something worth noting here, you can use “.” and “#” prefixes to filter class and id attributes respectively. So the above code will work without any change if you use the filter as :

foreach($html->find('a.postlink') as $a)

Consider the above example where we are extracting all links from the post. Say you want to find only the links of the sub forums in the community. If you notice all of them begin with http://www.phpbb.com/community/viewforum.php. So let’s filter the hyperlinks using “starts with” filter to fetch only the links starting with http://www.phpbb.com/community/viewforum.php

<?php
include('simple_html_dom.php');
 
$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
 
$html = file_get_html($url);
$links = array();
foreach($html->find('a[href^="http://www.phpbb.com/community/viewforum.php"]') as $a) {
 $links[] = $a->href;
}
 
print_r($links);
 
?>

Similarly, say if you want to find all links containing phpbb.com then you can filter using “contains” filter as follows :

foreach($html->find('a[href*="phpbb.com"]') as $a)

Suppose you know only the ending of an attribute value. Let's say, for example, you are scraping a webpage which contains numerous div elements whose id attributes look like this:
<div id="1_message_id">content here</div>
<div id="2_message_id">content here</div>
and so on.
Then you can find such div elements using the "ends with" filter as follows:

foreach($html->find('div[id$="_message_id"]') as $div)

Let’s say you want to change the value of attribute of particular element. For e.g. if you wished to change all the hyperlinks having class=postlink to class=topiclink, you can do so as follows :

<?php
 
include('simple_html_dom.php');
 
$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
 
$html = file_get_html($url);
 
foreach($html->find('a.postlink') as $a) {
 $a->class = 'topiclink';
}
 
echo $html;
?>

Note that the numbering of elements starts from 0 and not 1. Thus the first element will be found at 0th location. Let’s assume that you want to extract the hyperlink of the 3rd link with class postlink on a webpage, you can use the following approach :

<?php
include('simple_html_dom.php');
$url = 'https://www.phpbb.com/community/viewtopic.php?f=46&t=543171';
$html = file_get_html($url);
echo $html->find('a.postlink',2)->href;
?>

If you wish to clear the inner contents of the div with id as content, you can do so as follows :

$html->find('div#content',0)->innertext = '';

If you wish to append text to existing content, you can do so as follows :

$appendcode = '<p>This is the text to append to existing innertext</p>';
$html->find('div#content',0)->innertext .= $appendcode;

In order to prepend text to existing content, you can use the following code:

$prependcode = '<h2>Nice article below</h2>';
$html->find('div#content',0)->innertext = $prependcode . $html->find('div#content',0)->innertext;

Say you have an existing div with id content, now you made a wrapper div and want to enclose the content div in the wrapper div. Here’s how you do it :

$html->find('div#content',0)->outertext = '<div id="wrapper">' . $html->find('div#content',0)->outertext. '</div>';

Last but definitely not least: handling the memory leak issue. Once you start using this script extensively, you may encounter memory exhausted errors and wonder what is wrong with your script. The problem might be due to not handling the memory leak issue. I will not go into detail about what a memory leak is or how it is caused. To handle it, don't forget to clear the $html variable and unset it once it is no longer required.

$html->clear();
unset($html);

Answer by Ada Correa

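EDIT2:
As this is a bug in the DOM parser (tested on version 1.5), there is no simple way of doing this.
A workaround I could think of: find all the elements with the first class, then iterate over them to keep the ones that also have the second class (class3 in this case):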

$find = $html->find(".class1");
$ret = array();
foreach ($find as $element) {
    if (strpos($element->class, 'class3') !== false) {
        $ret[] = $element;
    }
}
$find = $ret;
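
One caveat with the loop above: strpos() is a substring test, so 'class3' would also match an element whose class is 'class30'. A minimal sketch of a whole-token check follows; has_all_classes is a hypothetical helper name, not something the library provides.

```php
<?php
// Hypothetical helper: true when the class attribute contains every required
// class as a whole, whitespace-separated token.
function has_all_classes(string $classAttr, array $required): bool
{
    $tokens = preg_split('/\s+/', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    return count(array_diff($required, $tokens)) === 0;
}

var_dump(has_all_classes('class1 class3', ['class1', 'class3']));  // both classes present
var_dump(has_all_classes('class1 class30', ['class1', 'class3'])); // class30 is not class3
```

Inside the loop, the strpos() condition could then be replaced by has_all_classes((string) $element->class, ['class1', 'class3']).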

Simple answer (should work according to html spec):

find(".class1.class2")

This will look for any type of element (div, img, a, etc.) that has both class1 and class2. To restrict the element type, put the tag name at the beginning, without a dot:

find("div.class1.class2")

With a space between the two classes, the selector becomes a descendant selector: under standard CSS rules it matches elements with class2 that are nested inside an element with class1, not a single element that carries both classes:

find(".class1 .class2")

will match

<div class="class1">
  <div class="class2">this will be returned</div>
</div>

but, under standard CSS rules, not

<div class="class1 class2">this will not be returned</div>

edit:
I tried your code and found that the solutions above do not work.
The solution that does work however is as follows:

$html->find("div[class=class1 class2]")

Answer by Everett Shepard

I am new to HTML DOM parsing with PHP. There is a page with several elements that have different content but the same class. When I try to fetch the content, I only get the content of the last div. Is it possible to get the content of all the divs that have the same class? Please have a look at my code:

<?php
    include(__DIR__."/simple_html_dom.php");
    $html = file_get_html('http://campaignstudio.in/');
    echo $x = $html->find('h2[class="section-heading"]',1)->outertext; 
?>

Answer by Aliyah Waller

PHP Fast Simple HTML DOM Parser is a fast, low-memory HTML DOM parser with syntax like PHP Simple HTML DOM Parser. Find tags on an HTML page with selectors just like jQuery.

composer require dimabdc/php-fast-simple-html-dom-parser

Answer by Edwin Bryan

$e->children ( [int $index] ): returns the Nth child object if index is set, otherwise returns an array of children.
find ( string $selector [, int $index] ): finds elements by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
find ( string $selector [, int $index] ): finds children by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
$e->getElementsById ( $id [, $index] ): equivalent to $e->find ( "#$id" [, int $index] ).

<?php
include('simplehtmldom/simple_html_dom.php');

// Create DOM from URL or file
$html = file_get_html('http://coursesweb/');

// Find all links, and their text
foreach($html->find('a') as $elm) {
  echo $elm->href .' ('.$elm->plaintext. ')<br/>';
}
?>

<?php
include('simplehtmldom/simple_html_dom.php');

// Create a DOM object from a string
$html = str_get_html('<div><img src="image1.jpg" alt="Img1" class="cls" /><br/>
 <img src="image2.png" alt="Img2" /></div><p>Some text</p>
 <img src="image3.gif" alt="Img3" class="cls" />');

// Find all images with class="cls"
foreach($html->find('img.cls') as $elm) {
  echo $elm->src. '<br/>';
}
?>

<?php
include('simplehtmldom/simple_html_dom.php');

// Create a DOM object from a string
$html = str_get_html('<nav><ul>
 <li id="idli1" class="cls">List 1</li><li>List 2</li><li class="cls">List 3</li>
 </ul></nav>');

// Get the id of the first LI in UL, and change its content
$idli = $html->find('li', 0)->id;
if($idli) echo 'First LI id: '. $idli;
$html->find('ul li', 0)->innertext = '<b>PHP Simple HTML DOM</b>';
echo $html;
?>

<?php
include('simplehtmldom/simple_html_dom.php');

// Create a DOM object from a HTML file
$html = file_get_html('test.htm');

// Write a function with parameter "$elm"
function changeCls($elm) {
  // if LI with class="cls", change the class
  if ($elm->tag=='li' && $elm->class=='cls') {
    $elm->setAttribute('class', 'class_2');
  }
} 
$html->set_callback('changeCls');
echo $html;
?>

Good day, dear Habr readers. This post is about PHP Simple HTML DOM Parser, a joint project by S. C. Chen and John Schlick (links on sourceforge).

The idea of the project is to provide a tool for working with HTML code using jQuery-like selectors. The original idea belongs to Jose Solorzano and was implemented for PHP 4. This project is a more advanced version based on PHP 5+.

This overview gives brief excerpts from the official manual, as well as an example parser for twitter. In fairness, a similar post already exists on habrahabr, but in my opinion it contains too little information. If this topic interests you, read on.

Getting the HTML code of a page

$html = file_get_html('http://habrahabr.ru/'); // works with https:// too

User Fedcomp left a useful comment about file_get_contents and 404 responses. The original script returns nothing when the requested page responds with 404. To fix this, I added a check based on get_headers. The modified script can be downloaded here.
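
The modified script itself is not reproduced in this text, so the following is only a sketch of what a get_headers-based check might look like. The helper name http_status_from_headers is hypothetical, and canned header arrays are used so the sketch runs without network access.

```php
<?php
// Hypothetical helper: extracts the final HTTP status code from the array
// returned by PHP's get_headers(). Redirects add several status lines, so the
// last status line wins. Returns null when no status line is present.
function http_status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Canned header arrays in the shape get_headers() returns.
$notFound = ['HTTP/1.1 404 Not Found', 'Content-Type: text/html'];
$redirect = ['HTTP/1.1 301 Moved Permanently', 'Location: /new', 'HTTP/1.1 200 OK'];

var_dump(http_status_from_headers($notFound));
var_dump(http_status_from_headers($redirect));
```

In a real script one would call get_headers($url), treat a false return value as a failure, and call file_get_html($url) only when the status is 200.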

Finding elements by tag name

foreach($html->find('img') as $element) { // select all img tags on the page
    echo $element->src . '<br>'; // print the src attribute of each one, line by line
}

Modifying HTML elements

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>'); // read HTML from a string (file_get_html() reads from a file)
$html->find('div', 1)->class = 'bar'; // assign the class "bar" to the div with index 1
$html->find('div[id=hello]', 0)->innertext = 'foo'; // write the text foo into the div with id="hello"

echo $html; // prints <div id="hello">foo</div><div id="world" class="bar">World</div>

Getting the text content of an element (plaintext)

echo file_get_html('http://habrahabr.ru/')->plaintext;

This article does not aim to be exhaustive documentation for the script; a detailed description of all its features can be found in the official manual. If the community is interested, I will gladly translate the whole manual into Russian. For now, here is the twitter parser example promised at the beginning of the article.

Example: parsing messages from twitter

require_once 'simple_html_dom.php'; // parsing library
$username = 'habrahabr'; // twitter user name
$maxpost = 5; // number of posts
$html = file_get_html('https://twitter.com/' . $username);
$articles = array(); // collected messages
$i = 0;
foreach ($html->find('li.expanding-stream-item') as $article) { // select the li of every message
    $item['text'] = $article->find('p.js-tweet-text', 0)->innertext; // message text as HTML
    $item['time'] = $article->find('small.time', 0)->innertext; // message time as HTML
    $articles[] = $item; // store in the array
    $i++;
    if ($i == $maxpost) break; // stop the loop
}

Printing the messages

for ($j = 0; $j < count($articles); $j++) {
    echo '<div class="twitter_message">';
    echo '<p class="twitter_text">' . $articles[$j]['text'] . '</p>';
    echo '<p class="twitter_time">' . $articles[$j]['time'] . '</p>';
    echo '</div>';
}

Thank you for your attention. I hope this was easy to follow and not too heavy.

PHP Fast Simple HTML DOM Parser

PHP Fast Simple HTML DOM Parser is a fast, low-memory HTML DOM parser with syntax like PHP Simple HTML DOM Parser.

Installation

To install, run:

composer require dimabdc/php-fast-simple-html-dom-parser

Quick start

require_once "vendor/autoload.php";
use FastSimpleHTMLDom\Document;

// Create DOM from URL
$html = Document::file_get_html('https://habrahabr.ru/interesting/');

// Find all post blocks and collect them in a separate array
$posts = [];
foreach($html->find('div.post') as $post) {
    $item['title']   = $post->find('h1.title', 0)->plaintext;
    $item['hubs']    = $post->find('div.hubs', 0)->plaintext;
    $item['content'] = $post->find('div.content', 0)->plaintext;
    $posts[] = $item;
}

print_r($posts);

How to create an HTML DOM object

// Create a DOM object from a string
$html = new Document('<html><body>Hello!</body></html>');

// Create a DOM object from a string
$html = new Document();
$html->loadHtml('<html><body>Hello!</body></html>');

// Create a DOM object from an HTML file
$html = new Document();
$html->loadHtmlFile('test.htm');

// Create a DOM object from a URL
$html = new Document(file_get_contents('https://habrahabr.ru/interesting/'));

How to find HTML DOM elements

Basics

// Find all anchors; returns an array of element objects
$ret = $html->find('a');

// Find the Nth anchor (zero-based); returns an element object, or null if not found
$ret = $html->find('a', 0);

// Find the last anchor; returns an element object, or null if not found
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute equals foo
$ret = $html->find('div[id=foo]');

Commonly used

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images that have a "title" attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors

// Find all <li> inside <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> inside <table> elements with class=hello
$es = $html->find('table.hello td');

// Find all <td> tags with attribute align=center inside <table> tags
$es = $html->find('table td[align=center]');
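
For comparison, the same descendant queries can be written as XPath against PHP's bundled DOM extension. This is only a sketch using `DOMDocument` and `DOMXPath` rather than the parser described here, and the sample markup is made up for illustration.

```php
<?php
// Made-up sample markup, for illustration only.
$markup = '<table class="hello"><tr><td align="center">a</td><td>b</td></tr></table>'
        . '<ul><li>one</li><li>two</li></ul>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// 'ul li': every <li> that is a descendant of a <ul>
$listItems = $xpath->query('//ul//li');

// 'table.hello td': every <td> inside a <table> whose class list contains "hello"
$helloCells = $xpath->query(
    '//table[contains(concat(" ", normalize-space(@class), " "), " hello ")]//td'
);

// 'table td[align=center]': every <td align="center"> inside a <table>
$centerCells = $xpath->query('//table//td[@align="center"]');

echo $listItems->length, ' ', $helloCells->length, ' ', $centerCells->length, "\n";
```

The `//` axis is what makes these descendant rather than direct-child queries, mirroring the space in the CSS selectors above.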

Nested selectors

// Find all <li> in <ul> 
foreach($html->find('ul') as $ul) 
{
       foreach($ul->find('li') as $li) 
       {
             // do something...
       }
}

// Find first <li> in first <ul> 
$e = $html->find('ul', 0)->find('li', 0);

Attribute filters

Filter	Description
[attribute]	Matches elements that have the specified attribute.
[!attribute]	Matches elements that don’t have the specified attribute.
[attribute=value]	Matches elements that have the specified attribute with a certain value.
[attribute!=value]	Matches elements that don’t have the specified attribute with a certain value.
[attribute^=value]	Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]	Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]	Matches elements that have the specified attribute and it contains a certain value.
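
To make the string-matching rows of the table concrete, here is a minimal sketch in plain PHP. The helper `attributeMatches()` is hypothetical and is not the library's code. It covers only the presence, `=`, `^=`, `$=` and `*=` filters, and it assumes case-sensitive comparison.

```php
<?php
// Hypothetical helper illustrating the filter table; not the library's code.
// $attrs maps attribute names to values, e.g. ['href' => '/docs/index.html'].
function attributeMatches(array $attrs, string $name, string $op = '', string $value = ''): bool
{
    if (!array_key_exists($name, $attrs)) {
        return false; // every filter covered here requires the attribute to exist
    }
    $actual = (string) $attrs[$name];

    switch ($op) {
        case '':   // [attribute]
            return true;
        case '=':  // [attribute=value]
            return $actual === $value;
        case '^=': // [attribute^=value]: starts with
            return strncmp($actual, $value, strlen($value)) === 0;
        case '$=': // [attribute$=value]: ends with
            return $value === '' || substr($actual, -strlen($value)) === $value;
        case '*=': // [attribute*=value]: contains
            return $value === '' || strpos($actual, $value) !== false;
        default:
            throw new InvalidArgumentException("Unsupported operator: $op");
    }
}
```

For example, `attributeMatches(['href' => '/docs/index.html'], 'href', '$=', '.html')` corresponds to the selector `a[href$=.html]`.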

Text and comments

// Find all text blocks 
$es = $html->find('text');

// Find all comment (<!--...-->) blocks 
$es = $html->find('comment');

Accessing attributes

Getting, setting, and removing attributes

// Get an attribute (for a valueless attribute such as checked or selected, this returns true or false)
$value = $e->href;

// Set an attribute (for a valueless attribute such as checked or selected, set it to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Check whether an attribute exists
if(isset($e->href)) 
        echo 'href exists!';

"Magic" attributes

// Example
$html = str_get_html('<div>foo <b>bar</b></div>'); 
$e = $html->find('div', 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

Attribute Name	Usage
$e->tag	Read or write the tag name of element.
$e->outertext	Read or write the outer HTML text of element.
$e->innertext	Read or write the inner HTML text of element.
$e->plaintext	Read or write the plain text of element.
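
As a point of comparison, roughly the same three views of an element can be read with PHP's bundled DOM extension. This sketch uses `DOMDocument` rather than the parser described here. The variable names are illustrative, and the mapping to `outertext`, `innertext` and `plaintext` is an analogy, not the library's implementation.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div>foo <b>bar</b></div>');
$div = $doc->getElementsByTagName('div')->item(0);

// Analogue of outertext: serialize the node itself.
$outer = $doc->saveHTML($div);

// Analogue of innertext: serialize each child node and join the pieces.
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

// Analogue of plaintext: the concatenated text of all descendants.
$plain = $div->textContent;

echo $outer, "\n", $inner, "\n", $plain, "\n";
```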

Tricks

// Extract the text content from the HTML
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element by setting its outertext to an empty string
$e->outertext = '';

// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element
$e->outertext = '<div>foo</div>' . $e->outertext;

Traversing the DOM tree

// Example
echo $html->find('#div1', 0)->children(1)->children(1)->children(2)->id;
// or 
echo $html->getElementById('div1')->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

Method	Description
`mixed` $e->children([int $index])	Returns the Nth child object if index is set, otherwise return an array of children.
`Element` $e->parent()	Returns the parent of element.
`Element` $e->first_child()	Returns the first child of element, or null if not found.
`Element` $e->last_child()	Returns the last child of element, or null if not found.
`Element` $e->next_sibling()	Returns the next sibling of element, or null if not found.
`Element` $e->prev_sibling()	Returns the previous sibling of element, or null if not found.

API reference

DOM methods and properties

Name	Description
`void` __construct([string|Element $html])	Creates the DOM object, optionally from the given HTML.
`string` plaintext	Returns the contents extracted from HTML.
`mixed` find (string $selector [, int $index])	Find elements by the CSS selector. Returns the Nth element object if index is set, otherwise return an array of object.

Element methods and properties

Name	Description
`string` [attribute]	Read or write the element's attribute value.
`string` tag	Read or write the tag name of element.
`string` outertext	Read or write the outer HTML text of element.
`string` innertext	Read or write the inner HTML text of element.
`string` plaintext	Read or write the plain text of element.
`mixed` find (string $selector [, int $index])	Find children by the CSS selector. Returns the Nth element object if index is set, otherwise, return an array of object.

Traversing the DOM tree

Name	Description
`mixed` $e->children([int $index])	Returns the Nth child object if index is set, otherwise return an array of children.
`element` $e->parent()	Returns the parent of element.
`element` $e->first_child()	Returns the first child of element, or null if not found.
`element` $e->last_child()	Returns the last child of element, or null if not found.
`element` $e->next_sibling()	Returns the next sibling of element, or null if not found.
`element` $e->prev_sibling()	Returns the previous sibling of element, or null if not found.

camelCase equivalents

Method	Equivalent
string $e->getAttribute($name)	string $e->attribute
void $e->setAttribute($name, $value)	void $e->attribute = $value
bool $e->hasAttribute($name)	bool isset($e->attribute)
void $e->removeAttribute($name)	void $e->attribute = null
element $e->getElementById($id)	mixed $e->find("#$id", 0)
mixed $e->getElementsById($id [,$index])	mixed $e->find("#$id" [, int $index])
element $e->getElementByTagName($name)	mixed $e->find($name, 0)
mixed $e->getElementsByTagName($name [, $index])	mixed $e->find($name [, int $index])
element $e->parentNode()	element $e->parent()
mixed $e->childNodes([$index])	mixed $e->children([int $index])
element $e->firstChild()	element $e->first_child()
element $e->lastChild()	element $e->last_child()
element $e->nextSibling()	element $e->next_sibling()
element $e->previousSibling()	element $e->prev_sibling()
