Finding Children Nodes With Beautiful Soup

Web scraping requires an understanding of how web pages are structured. To extract the information you need, you have to identify the tags that hold it and then the attributes of those tags.

This article is for programmers, data analysts, scientists, or engineers who already know how to extract content from web pages using BeautifulSoup. If you are new to the library, I advise you to go through a BeautifulSoup tutorial for beginners first.

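Now we can proceed. I assume you already have the library installed. If not, you can install it with the command below. The examples in this article parse HTML with the "lxml" parser, which is a separate package, so it is installed here as well:

pip install beautifulsoup4 lxml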

Since we are extracting data from HTML, we need a basic HTML page to practice on. For this article, we will use the HTML snippet below, assigned to a variable using Python's triple quotes.

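# The text of the two unordered-list items is a placeholder.
sample_content = """<html>
<head>
<title>LinuxHint</title>
</head>
<body>
<p>To make an unordered list, the ul tag is used:</p>
<ul>
   <li>Item One</li>
   <li>Item Two</li>
</ul>
<p>To make an ordered list, the ol tag is used:</p>
<ol>
   Here's an ordered list
   <li>Number One</li>
   <li>Number Two</li>
</ol>
<p>Linux Hint, 2018</p>
</body>
</html>"""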

Now that we have sorted that, let's move right into working with the BeautifulSoup library.

We are going to call a couple of methods and attributes on our BeautifulSoup object. First, however, we need to parse our string using BeautifulSoup and assign the result to an “our_soup” variable.

from bs4 import BeautifulSoup as bso
our_soup = bso(sample_content, "lxml")

Henceforth, we will work with the “our_soup” variable and call all of our attributes and methods on it.

On a quick note, if you do not already know what a child node is, it is a node (tag) that sits directly inside another node. In our HTML snippet, for example, the li tags are child nodes of both the “ul” and the “ol” tags.

Here are the methods we will look at:

findChild():

The findChild method returns the first tag found inside an HTML element. For example, our “ol” and “ul” tags each contain two child tags, but findChild returns only the first one.

It is an older alias of the find method, so by default it searches all descendants of the element and not only its direct children; pass recursive=False to restrict the search to direct children. In our snippet the li tags contain no other tags, so the first tag found is also the first child.

The returned object is of the type bs4.element.Tag. We can extract the text from it by reading its text attribute.

Here's an example:

first_child = our_soup.find("body").find("ol")
print(first_child.findChild())

The code above would print the following:

<li>Number One</li>

To get the text from the tag, we read its text attribute:

print(first_child.findChild().text)

This prints the following:

Number One
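
Since the search depth is easy to overlook, here is a minimal sketch of it. It uses find, the current name for findChild, on a small hypothetical snippet (not the article's sample) in which the first li wraps its text in a b tag. I am also assuming the built-in "html.parser" here so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <li> contains a nested <b> tag.
html = "<ol><li><b>Number One</b></li><li>Number Two</li></ol>"

# "html.parser" is an assumption made so the sketch needs no extra package.
ol = BeautifulSoup(html, "html.parser").find("ol")

first_tag = ol.find(True)                    # first tag of any name inside <ol>
first_bold = ol.find("b")                    # default search reaches grandchildren
direct_bold = ol.find("b", recursive=False)  # looks at direct children only

print(first_tag)    # <li><b>Number One</b></li>
print(first_bold)   # <b>Number One</b>
print(direct_bold)  # None
```

The b tag is a grandchild of the ol tag, so only the default (recursive) search finds it.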
findChildren():

We have taken a look at the findChild method and seen how it works. The findChildren method works in a similar way; however, as the name implies, it doesn't stop at the first match but collects all of the tags inside an element.

It is an older alias of find_all, so it also searches every level below the element by default, and with recursive=False it returns only the direct child tags. The tags come back in a list, so you can access the tag of your choice using its index number.

Here's an example:

first_child = our_soup.find("body").find("ol")
print(first_child.findChildren())

This would print the child nodes in a list:

[<li>Number One</li>, <li>Number Two</li>]

To get the second child node in the list, the following code would do the job:

print(first_child.findChildren()[1])

This prints the following:

<li>Number Two</li>
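
With nested lists, the difference between "all tags inside" and "direct children" becomes visible. The sketch below uses find_all, the current name for findChildren, on a hypothetical nested list that is not part of the article's sample, and again assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical nested list, used only to show the difference in search depth.
html = "<ul><li>Fruit<ul><li>Apple</li></ul></li><li>Bread</li></ul>"

# "html.parser" is an assumption made so the sketch needs no extra package.
outer = BeautifulSoup(html, "html.parser").find("ul")

all_items = outer.find_all("li")                      # every <li> at any depth
direct_items = outer.find_all("li", recursive=False)  # only <li> directly under the outer <ul>

print(len(all_items))        # 3
print(len(direct_items))     # 2
print(direct_items[1].text)  # Bread
```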
Those are the two methods covered in this article. However, it doesn't end there. Attributes can also be read on our BeautifulSoup objects to get the child, children, or descendant nodes of an HTML element.

contents:

While the findChildren method does the straightforward job of extracting the child tags, the contents attribute does something a bit different.

The contents attribute returns a list of everything directly inside an HTML element, text included. Text appears in the list as strings (bs4.element.NavigableString objects) and tags appear as bs4.element.Tag objects.

Here's an example:

first_child = our_soup.find("body").find("ol")
print(first_child.contents)

This prints the following:

["\n   Here's an ordered list\n   ", <li>Number One</li>, '\n', <li>Number Two</li>, '\n']

As you can see, the list contains the text that comes before each child node, the child nodes themselves, and the text that comes after them.

To access the second child node, all we need to do is use its index in this list, which is 3 because the strings are counted too:

print(first_child.contents[3])

This would print the following:

<li>Number Two</li>
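
Counting past the strings by hand is fragile, so one option is to keep only the Tag objects before indexing. This is a minimal sketch and not the only way to do it; it rebuilds just the ordered list from the sample and assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Only the ordered list from the sample, rebuilt here so the sketch is self-contained.
html = "<ol>\n   Here's an ordered list\n   <li>Number One</li>\n   <li>Number Two</li>\n</ol>"

# "html.parser" is an assumption made so the sketch needs no extra package.
ol = BeautifulSoup(html, "html.parser").find("ol")

# Keep the tags and drop the strings that sit between them.
tags_only = [node for node in ol.contents if isinstance(node, Tag)]

print(len(ol.contents))   # 5
print(len(tags_only))     # 2
print(tags_only[1].text)  # Number Two
```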
children:

Here is an attribute that does almost the same thing as the contents attribute. However, it has one small difference that matters when you only need to loop over the nodes once.

The children attribute also yields the text before a child node, the child node itself, and the text after it. The difference is that it hands them out one at a time through an iterator (what the Beautiful Soup documentation calls a generator) instead of returning a list.

Let's take a look at the following example:

first_child = our_soup.find("body").find("ol")
print(first_child.children)

The code above prints only the representation of the iterator object, that is, its type and a memory address (the address on your machine will differ from run to run), not the nodes themselves. To see the nodes, we can convert the iterator into a list.

We can see this in the example below:

first_child = our_soup.find("body").find("ol")
print(list(first_child.children))

This gives the following result:

["\n   Here's an ordered list\n   ", <li>Number One</li>, '\n', <li>Number Two</li>, '\n']
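
Converting to a list is handy for printing, but the iterator can also be consumed directly. The sketch below, again on a rebuilt copy of the ordered list and assuming the built-in "html.parser", pulls the first child that is a tag and stops there.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Only the ordered list from the sample, rebuilt here so the sketch is self-contained.
html = "<ol>\n   Here's an ordered list\n   <li>Number One</li>\n   <li>Number Two</li>\n</ol>"

# "html.parser" is an assumption made so the sketch needs no extra package.
ol = BeautifulSoup(html, "html.parser").find("ol")

# next() stops at the first match, so later children are never visited.
first_li = next(child for child in ol.children if isinstance(child, Tag))

print(first_li.text)  # Number One
```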

descendants:

While the children attribute only covers the content directly inside a tag, that is, the text and nodes on the first level, the descendants attribute goes deeper.

The descendants attribute walks through all of the text and nodes nested inside the child nodes. So it doesn't return only child nodes; it returns grandchild nodes as well.

Besides the tags themselves, it also returns the text inside each tag as a separate string.

Just like the children attribute, descendants produces its results lazily, as a generator.

We can see this below:

first_child = our_soup.find("body").find("ol")
print(first_child.descendants)

This prints only the representation of the generator object. As seen earlier, we can then convert it into a list:

first_child = our_soup.find("body").find("ol")
print(list(first_child.descendants))

We would get the list below:

["\n   Here's an ordered list\n   ", <li>Number One</li>, 'Number One', '\n', <li>Number Two</li>, 'Number Two', '\n']
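
In the sample, the only grandchildren are strings. To see a grandchild tag show up, here is a small sketch on a hypothetical snippet (not the article's sample) where one li contains a b tag. It compares the tag names seen through children and through descendants, and assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: the first <li> contains a nested <b> tag.
html = "<ol><li><b>Number</b> One</li><li>Number Two</li></ol>"

# "html.parser" is an assumption made so the sketch needs no extra package.
ol = BeautifulSoup(html, "html.parser").find("ol")

child_tags = [node.name for node in ol.children if isinstance(node, Tag)]
descendant_tags = [node.name for node in ol.descendants if isinstance(node, Tag)]

print(child_tags)       # ['li', 'li']
print(descendant_tags)  # ['li', 'b', 'li']
```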

Conclusion

There you have it: five different ways to access child nodes in HTML elements. There are other ways as well; however, with the methods and attributes discussed in this article, you should be able to access the child nodes of any HTML element.
