Web Scraping Handbook

Kawshik Kumar Paul
Nov 24, 2020

Getting Data from website HTML

Learned From:

https://www.codewithharry.com/videos/python-web-scraping-tutorial-in-hindi

My source code:

https://paste.ubuntu.com/p/dfpwWwJjsG/

View the code at this link, because Medium does not preserve indentation.

Set up a virtual environment and activate it (the activate command below is for Windows).

virtualenv myenv
.\myenv\Scripts\activate

Install the packages

pip install requests
pip install bs4
pip install html5lib

Start by typing and running this. It is the hello world of web scraping.

import requests
from bs4 import BeautifulSoup

url = "https://codewithharry.com/"
r = requests.get(url)
htmlContent = r.content
print(htmlContent)

The output will be messy: raw, unformatted HTML returned as bytes.
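A real request can fail or hang, so it may help to wrap the download in a small function. The sketch below is only one possible approach: the function name fetch_html and the 10 second default timeout are my own assumptions, not part of the original code.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its raw bytes (hypothetical helper)."""
    # timeout stops the call from waiting forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises an HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content


# Example usage (needs network access):
# htmlContent = fetch_html("https://codewithharry.com/")
```

With this, a broken link or server error surfaces as an exception instead of silently handing an error page to the parser.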

Now we will parse the HTML source code to view it in a well-structured form.

soup = BeautifulSoup(htmlContent, 'html.parser')
print(soup)

The output will be parsed, meaning organized. This is the source code of that HTML page.
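To try the parsing step without downloading anything, you can feed BeautifulSoup a string directly. This is a minimal sketch; the inline HTML is a made-up stand-in for a real page. It also shows prettify(), which returns the parsed tree as an indented string.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page
html = "<html><head><title>Demo Page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns the tree as a string with one tag per line, indented
pretty = soup.prettify()
print(pretty)
```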

Let's view the title of the HTML page.

title = soup.title
print(title)
print(type(title))

Let's view the data types. The title is not a normal string; these are data types provided by BS4.

print(title.string)
print(type(title.string))

print(type(soup))
print(type(title))
print(type(title.string))
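The three types printed above can be checked on a small example. This sketch uses a made-up inline page; it also shows that the title string can be converted to a plain Python string with str().

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical inline HTML for illustration
html = "<html><head><title>Demo Page</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
title = soup.title

print(type(soup).__name__)          # BeautifulSoup
print(type(title).__name__)         # Tag
print(type(title.string).__name__)  # NavigableString

# NavigableString is a subclass of str, so str() gives a plain string
plain_title = str(title.string)
print(plain_title)                  # Demo Page
```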

To get all paragraphs from the HTML page, use find_all(). find() returns only the first match; find_all() returns all of them.

paras = soup.find_all('p')
print(paras)

The output will be a list of every <p> tag on the page.
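The difference between find() and find_all() is easiest to see on a tiny document. A minimal sketch, assuming a made-up snippet with two paragraphs and no table:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML for illustration
html = "<body><p>First</p><p>Second</p></body>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")          # a single Tag, or None if nothing matches
all_paras = soup.find_all("p")  # a list-like result, possibly empty
missing = soup.find("table")    # there is no <table> in this snippet

print(first)           # <p>First</p>
print(len(all_paras))  # 2
print(missing)         # None
```

Because find() can return None, it is worth checking the result before reading attributes or text from it.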

To find anchor tags <a href=…></a>

anchors = soup.find_all('a')
print(anchors)

Getting the first paragraph

print(soup.find('p'))

Getting the class of the first paragraph

print(soup.find('p')['class'])

Getting the id of the first paragraph

print(soup.find('p')['id'])
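Square-bracket access raises a KeyError when the attribute is missing, which happens often on real pages. The sketch below, on a made-up paragraph with two classes and no id, shows the safer .get() alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: two classes, no id
html = '<p class="lead big">Intro</p>'
soup = BeautifulSoup(html, "html.parser")
para = soup.find("p")

classes = para["class"]    # class is multi-valued, so this is a list
print(classes)             # ['lead', 'big']

para_id = para.get("id")   # .get() returns None instead of raising
print(para_id)             # None

try:
    para["id"]             # the attribute is absent, so this raises
except KeyError:
    print("no id attribute")
```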

Find all the elements with the class name lead

print(soup.find_all('p', class_='lead'))

Scroll back in the terminal if the output is not immediately visible.
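A class_ search matches any element that has the given class among its classes, not only elements whose class attribute is exactly that value. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs carry the lead class, one does not
html = '<p class="lead">A</p><p class="lead big">B</p><p>C</p>'
soup = BeautifulSoup(html, "html.parser")

leads = soup.find_all("p", class_="lead")
lead_texts = [p.get_text() for p in leads]
print(lead_texts)  # ['A', 'B']
```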

Getting the text of a paragraph tag (here, the first one)

print(soup.find('p').get_text())

Getting the text of the full page (without tags, only text)

print(soup.get_text())
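On a full page, get_text() often returns text with stray whitespace and no gaps between blocks. It accepts separator and strip arguments that help with this. A minimal sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML with extra spaces inside the first paragraph
html = "<div><p> Hello </p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.get_text()
print(repr(raw_text))   # ' Hello World'

# strip=True trims each piece of text; separator is placed between pieces
clean_text = soup.get_text(separator=" | ", strip=True)
print(clean_text)       # Hello | World
```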

Getting all links of this page

anchors = soup.find_all('a')
for link in anchors:
    print(link.get('href'))

Getting all links of this page (BETTER)

anchors = soup.find_all('a')
all_links = set()
for link in anchors:
    href = link.get('href')
    # Skip anchors without an href and placeholder '#' links
    if href and href != '#':
        linkText = "https://codewithharry.com" + href
        all_links.add(linkText)
print(all_links)
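Prepending the site address works only when every href is a site-relative path. As an alternative (not the approach in the code above), the standard library's urljoin handles relative and absolute links alike. The sketch below uses made-up anchors to show the idea.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://codewithharry.com/"

# Hypothetical anchors covering the common cases
html = """
<a href="/videos">Videos</a>
<a href="#">Top</a>
<a href="https://example.com/page">External</a>
<a>No href</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = set()
for link in soup.find_all("a"):
    href = link.get("href")
    # Skip anchors without an href and placeholder '#' links
    if not href or href == "#":
        continue
    # urljoin resolves relative paths and leaves absolute URLs unchanged
    all_links.add(urljoin(base_url, href))

print(sorted(all_links))
# ['https://codewithharry.com/videos', 'https://example.com/page']
```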

Comment related parsing

Find an element by id

navbarSupportedContent = soup.find(id='navbarSupportedContent')
print(navbarSupportedContent)

Find the children of an element (which is searched by id)

navbarSupportedContent = soup.find(id='navbarSupportedContent')
# This prints an iterator object, not the child elements themselves
print(navbarSupportedContent.children)

.contents is similar to .children, but there are some differences.

navbarSupportedContent = soup.find(id='navbarSupportedContent')
print(navbarSupportedContent.contents)

Print content elements

for elem in navbarSupportedContent.contents:
    print(elem)

Print children elements

for elem in navbarSupportedContent.children:
    print(elem)

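Difference between .contents and .children

.contents = a list of the element's direct children

.children = an iterator over the same direct children (it can be turned into a list manually)

A for loop over .children is a good fit when you only need to visit each child once. Use .contents when you need a real list, for example to index or count the children. In current BS4 versions .children simply iterates over the list that .contents returns, so the memory difference between the two is small.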
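The two attributes can be compared side by side on a small element. This is a sketch on a made-up list with the id menu, not on the real navbar:

```python
from bs4 import BeautifulSoup

# Hypothetical element for illustration
html = "<ul id='menu'><li>Home</li><li>Blog</li></ul>"
soup = BeautifulSoup(html, "html.parser")
menu = soup.find(id="menu")

# .contents is a list, so it supports len() and indexing
print(type(menu.contents).__name__)  # list
print(len(menu.contents))            # 2
print(menu.contents[0])              # <li>Home</li>

# .children is an iterator: loop over it, or wrap it in list() to keep it
child_names = [child.name for child in menu.children]
print(child_names)                   # ['li', 'li']
```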

To view the strings of an element (searched by id previously), use strings. This will show the strings without the tags.

for item in navbarSupportedContent.strings:
    print(item)

For a cleaner view, use stripped_strings. It strips leading and trailing whitespace from each string and skips strings that are only whitespace.

for item in navbarSupportedContent.stripped_strings:
    print(item)
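The effect of stripped_strings is clearer when the two results are collected into lists. A minimal sketch, assuming a made-up element with the id nav:

```python
from bs4 import BeautifulSoup

# Hypothetical element with newlines and extra spaces between the tags
html = "<div id='nav'>\n  <a>Home</a>\n  <a> Blog </a>\n</div>"
soup = BeautifulSoup(html, "html.parser")
nav = soup.find(id="nav")

raw = list(nav.strings)             # includes the whitespace between tags
clean = list(nav.stripped_strings)  # whitespace trimmed, blanks skipped

print(len(raw))
print(clean)  # ['Home', 'Blog']
```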

To see the parent of an element:

print(navbarSupportedContent.parent)

To see all the parents of the element, that is, every ancestor up the tree:

for item in navbarSupportedContent.parents:
    print(item)

To get the names of the parents (the tag names):

for item in navbarSupportedContent.parents:
    print(item.name)
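Walking up from a nested element shows what the parents loop visits. In this sketch (made-up HTML, element id x chosen for illustration) the last ancestor is the BeautifulSoup object itself, whose name is [document].

```python
from bs4 import BeautifulSoup

# Hypothetical nested document
html = "<html><body><div><p id='x'>Hi</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
para = soup.find(id="x")

print(para.parent.name)  # div

ancestor_names = [ancestor.name for ancestor in para.parents]
print(ancestor_names)    # ['div', 'body', 'html', '[document]']
```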

To get the next or previous sibling of an element

print(navbarSupportedContent.next_sibling)
print(navbarSupportedContent.next_sibling.next_sibling)
print(navbarSupportedContent.previous_sibling)
print(navbarSupportedContent.previous_sibling.previous_sibling)
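The reason for calling next_sibling twice is that the whitespace between two tags is itself a sibling. The sketch below, on a made-up list, shows this and also find_next_sibling(), which skips straight to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical list with a newline between the items
html = "<ul>\n<li id='a'>A</li>\n<li id='b'>B</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="a")

# The immediate sibling is the newline text, not the next <li>
print(repr(first.next_sibling))           # '\n'
print(first.next_sibling.next_sibling)    # <li id="b">B</li>

# find_next_sibling() ignores the text nodes in between
next_item = first.find_next_sibling("li")
print(next_item)                          # <li id="b">B</li>
```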

Select an element by its id. # is the CSS selector prefix for an id, the same syntax used in stylesheets and in JavaScript's querySelector.

elem = soup.select('#loginModal')
print(elem)

To select elements by class name, put a dot (.) before the class name.

elem = soup.select('.modal-footer')
print(elem)
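select() always returns a list, even for an id. When only one element is expected, select_one() returns the first match or None. A minimal sketch on made-up markup that reuses the two selectors above:

```python
from bs4 import BeautifulSoup

# Hypothetical modal markup for illustration
html = '<div id="loginModal"><div class="modal-footer">Close</div></div>'
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#loginModal")         # a list with one element
footer = soup.select_one(".modal-footer")  # a single Tag, or None

print(len(by_id))         # 1
print(footer.get_text())  # Close

# Selectors can be combined: a footer nested inside the modal
nested = soup.select("#loginModal .modal-footer")
print(len(nested))        # 1
```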

For Documentation
