Web scraping using Beautiful Soup & Python: Wikipedia (I)

A challenging task can feel chaotic at first, but the most straightforward approach to a problem is usually to simply make our best effort. As the well-known quote says: the world belongs to those who dare to dream.

So here I am, trying to learn in very little time how to use BeautifulSoup (and Python) through simple, easy examples, until I am ready to take on more complex projects.


What is Beautiful Soup?

A Python library that lets you perform web scraping by parsing HTML and XML documents.


Simple example: Web scraping Wikipedia

# import the library we use to open URLs
import urllib.request

# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

# import pandas to manipulate data
import pandas as pd

# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/Serie_A"

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

# collect every <table> element on the page
all_tables = soup.find_all("table")

# empty lists to hold the text from each of the five columns
A = []
B = []
C = []
D = []
E = []

# the table we need is all_tables[1] (indices start at 0, so this is the second <table> on the page)
for row in all_tables[1].find_all('tr'):
    cells = row.find_all('td')
    # keep only the rows that contain exactly five data cells
    if len(cells) == 5:
        A.append(cells[0].text.strip())
        B.append(cells[1].text.strip())
        C.append(cells[2].text.strip())
        D.append(cells[3].text.strip())
        E.append(cells[4].text.strip())

# build a DataFrame from the first column, then attach the other four
df = pd.DataFrame(A, columns=['Team'])
df['Home City'] = B
df['Stadium'] = C
df['Capacity'] = D
df['2018-2019_Season'] = E

Results

The complete table from Wikipedia, without images or odd symbols, is now in your Python environment, ready to be analyzed!
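If you want to verify the result, here is a minimal sketch (assuming the column names from the example above, and that the capacities were scraped as text such as "60,000") that turns the scraped strings into something you can compute with:

# a quick sanity check of the scraped table
print(df.head())
print(df.shape)

# 'Capacity' arrives as text, so strip any thousands separators and
# convert it to numbers; errors='coerce' turns leftovers into NaN
df['Capacity'] = pd.to_numeric(df['Capacity'].str.replace(',', ''), errors='coerce')
print(df['Capacity'].mean())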


General considerations

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
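As a minimal sketch of what "navigating, searching, and modifying" looks like in practice (the HTML snippet here is made up purely for illustration):

from bs4 import BeautifulSoup

# a tiny, made-up HTML snippet just for illustration
html = '<html><body><p class="team">Juventus</p><p class="team">Inter</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# navigating: walk down the tree by tag name
print(soup.body.p.text)                # Juventus

# searching: find every tag matching a name/attribute filter
for p in soup.find_all('p', class_='team'):
    print(p.text)                      # Juventus, Inter

# modifying: change the tree, then serialize it back to HTML
soup.body.p['class'] = 'champion'
print(soup.prettify())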


Libraries that you need

urllib, BeautifulSoup, and pandas. urllib ships with Python; the other two can be installed with pip install beautifulsoup4 pandas, plus pip install lxml for the parser used below.

# import the library we use to open URLs
import urllib.request

# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

# import pandas to manipulate data
import pandas as pd


Using lxml

page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

When you parse the page, you can choose between three different parsers.

Why would you prefer one parser over the others? From the docs' summarized table of advantages and disadvantages:

  1. html.parser – BeautifulSoup(markup, "html.parser")
    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2)
    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
  2. lxml – BeautifulSoup(markup, "lxml")
    • Advantages: Very fast, Lenient
    • Disadvantages: External C dependency
  3. html5lib – BeautifulSoup(markup, "html5lib")
    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
    • Disadvantages: Very slow, External Python dependency

So, to sum up why you would prefer one parser over the others:

  • html.parser – built-in – no extra dependencies needed
  • html5lib – the most lenient – better use it if HTML is broken
  • lxml – the fastest

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
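As a minimal sketch of how the leniency differences play out in practice (the broken markup below is made up, and html5lib must be installed for the last line to run), you can feed the same malformed HTML to each parser and compare the trees they build:

from bs4 import BeautifulSoup

# deliberately broken markup: unclosed <li> tags and a stray </p>
markup = "<ul><li>Juventus<li>Inter</p></ul>"

# each parser repairs the broken HTML in its own way, so the resulting
# trees (and therefore your scraping results) can differ
print(BeautifulSoup(markup, "html.parser"))
print(BeautifulSoup(markup, "lxml"))
print(BeautifulSoup(markup, "html5lib"))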