Getting Data from website HTML

Learned From:

https://www.codewithharry.com/videos/python-web-scraping-tutorial-in-hindi

My Source Code:

https://paste.ubuntu.com/p/dfpwWwJjsG/

View the code at this link, because Medium doesn't preserve indentation.

Set up a virtual environment and activate it.

Install the packages

To begin, type this code and run it. This is the hello world of web scraping.

The output will be messy: raw HTML with no formatting.
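A minimal sketch of that first step, assuming the standard library's urllib.request. The helper name fetch_html and the URL in the comment are placeholders for this sketch, and the original code may well use a different HTTP library.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Example call (placeholder URL, needs network access):
# html_content = fetch_html("https://example.com")
# print(html_content)
```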

Now we will parse the HTML source code to view it in a readable structure.

The output will be parsed, meaning organized and indented. This is the source code of that HTML page.
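The parsing step could look roughly like this. The short HTML string stands in for the downloaded page, and "html.parser" (the parser that ships with Python) is an assumption; other parsers can be swapped in.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
html_content = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Build the parse tree from the raw HTML.
soup = BeautifulSoup(html_content, "html.parser")

# prettify() returns the document as a string with nested tags indented.
pretty = soup.prettify()
print(pretty)
```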

Let's view the title of the HTML page.
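A small sketch of reading the title, using a made-up one-line document in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html_content, "html.parser")

title = soup.title
print(title)         # the whole <title> tag
print(title.string)  # only the text inside it
```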

Let's view the data types. These are not plain strings; they are types provided by BS4.
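To check this yourself, type() can be called on each object. This is only a sketch with a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>Demo</title>", "html.parser")
title = soup.title

print(type(soup))          # the BeautifulSoup class (the whole document)
print(type(title))         # the Tag class
print(type(title.string))  # the NavigableString class (a str subclass)
```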

To get paragraphs from the HTML page: find() returns the first match, and find_all() returns all matches.

The output of find_all() is a list of every <p> tag on the page.
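A short sketch of the two calls side by side, with three made-up paragraphs:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = "<p>First</p><p>Second</p><p>Third</p>"
soup = BeautifulSoup(html_content, "html.parser")

first_para = soup.find("p")     # first match only (None if nothing matches)
all_paras = soup.find_all("p")  # every match, in document order
print(first_para)
print(len(all_paras))
```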

To find anchor tags <a href=…></a>
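The same find_all() call works for anchors. The links below are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<a href="/home">Home</a><p>text</p><a href="/about">About</a>'
soup = BeautifulSoup(html_content, "html.parser")

anchors = soup.find_all("a")  # only the <a> tags, the <p> is ignored
print(anchors)
```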

Getting the first paragraph

Getting the first element's class

Getting the first element's id
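The three lookups above (first paragraph, its class, its id) might be written like this. The class and id values are made up for the sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<p id="intro" class="lead big">First</p><p>Second</p>'
soup = BeautifulSoup(html_content, "html.parser")

first_para = soup.find("p")
print(first_para)            # the first <p> tag
print(first_para["class"])   # class can hold several values, so this is a list
print(first_para.get("id"))  # .get() returns None if the attribute is missing
```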

Find all the elements with the class name lead
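A sketch of the class search. Note the trailing underscore in class_, which is needed because class is a reserved word in Python:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<p class="lead">A</p><p>B</p><p class="lead big">C</p>'
soup = BeautifulSoup(html_content, "html.parser")

# Matches any element whose class list contains "lead".
leads = soup.find_all(class_="lead")
print(leads)
```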

Scroll up in the terminal if the output has already scrolled past.

Getting the text of a paragraph tag (here, the first one)
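For example, with a made-up paragraph that contains a nested tag:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = "<p>Hello <b>world</b>!</p><p>Second</p>"
soup = BeautifulSoup(html_content, "html.parser")

# get_text() joins the text of the tag and everything nested inside it.
text = soup.find("p").get_text()
print(text)
```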

Getting the text of the full page (without tags, only text)
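Calling get_text() on the whole soup does this. The separator and strip arguments in the second call are optional extras, shown here as one way to keep pieces of text from running together:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = "<html><body><h1>Title</h1><p>Body text</p></body></html>"
soup = BeautifulSoup(html_content, "html.parser")

page_text = soup.get_text()  # all text, tags removed
print(page_text)

# Put each piece of text on its own line and trim surrounding whitespace.
page_lines = soup.get_text(separator="\n", strip=True)
print(page_lines)
```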

Getting all links of this page
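A plain loop over the anchors is enough for a first attempt. The links are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<a href="/home">Home</a><a href="#">Top</a><a href="/home">Home again</a>'
soup = BeautifulSoup(html_content, "html.parser")

links = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")  # None if the tag has no href
    links.append(href)
    print(href)
```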

Getting all links of this page (BETTER)
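One possible improvement, assuming "better" means skipping placeholder "#" links, removing duplicates, and turning relative paths into absolute URLs. The base URL is a placeholder, and urljoin comes from the standard library:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com"  # placeholder base URL
html_content = (
    '<a href="/home">Home</a><a href="#">Top</a>'
    '<a href="/home">Home again</a><a href="/about">About</a>'
)
soup = BeautifulSoup(html_content, "html.parser")

all_links = set()  # a set drops duplicate links automatically
for anchor in soup.find_all("a"):
    href = anchor.get("href")
    if href and href != "#":
        all_links.add(urljoin(base_url, href))

for link in sorted(all_links):
    print(link)
```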

Parsing HTML comments
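A sketch of how a comment shows up in the parse tree. BS4 gives it its own type, Comment, so it can be told apart from normal text. The markup is made up:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: a paragraph that holds only a comment.
markup = "<p><!-- this is a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.p.string
print(comment)                       # the comment text, without the <!-- --> markers
print(type(comment))                 # the Comment class
print(isinstance(comment, Comment))  # True
```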

Find an element by its id
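For example, with a made-up id of "navbar":

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<div id="navbar"><a href="/">Home</a></div><div id="content">Text</div>'
soup = BeautifulSoup(html_content, "html.parser")

navbar = soup.find(id="navbar")  # None if no element has this id
print(navbar)
```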

Find the children of an element (one found by its id)
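A small sketch, assuming a list with the made-up id "menu". Printing .children directly shows an iterator object, so it is wrapped in list() to see the elements:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<ul id="menu"><li>One</li><li>Two</li></ul>'
soup = BeautifulSoup(html_content, "html.parser")

menu = soup.find(id="menu")
print(menu.children)              # an iterator object, not the elements themselves
child_tags = list(menu.children)  # materialize it to inspect the children
print(child_tags)
```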

.contents is similar to .children, but there are some differences.
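The same made-up "menu" element viewed through .contents, which is an ordinary list:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<ul id="menu"><li>One</li><li>Two</li></ul>'
soup = BeautifulSoup(html_content, "html.parser")

contents = soup.find(id="menu").contents
print(contents)       # a list of the direct children
print(contents[0])    # lists support indexing
print(len(contents))  # and len()
```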

Print the elements of .contents

Print the elements of .children
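Both can be printed with a for loop, and both loops visit the same elements in the same order. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<ul id="menu"><li>One</li><li>Two</li></ul>'
soup = BeautifulSoup(html_content, "html.parser")
menu = soup.find(id="menu")

from_contents = []
for item in menu.contents:   # loop over the list
    print(item)
    from_contents.append(str(item))

from_children = []
for child in menu.children:  # loop over the iterator
    print(child)
    from_children.append(str(child))
```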

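Difference between .contents and .children

.contents = a list of the element's direct children, so it supports indexing and len()

.children = an iterator over those same children; it can be walked only once (save it with list() if you need it again)

A for loop over .children is a good fit when you only need to visit each child once. Note that Beautiful Soup already keeps the child list in memory, so choosing .children over .contents saves little memory or time in practice.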
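The practical difference is easiest to see by walking each one twice. This is a sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<ul id="menu"><li>One</li><li>Two</li></ul>'
soup = BeautifulSoup(html_content, "html.parser")
menu = soup.find(id="menu")

children = menu.children
first_pass = [c.get_text() for c in children]
second_pass = [c.get_text() for c in children]  # iterator is already used up

contents = menu.contents
list_pass_1 = [c.get_text() for c in contents]
list_pass_2 = [c.get_text() for c in contents]  # a list can be reused freely

print(first_pass, second_pass)
print(list_pass_1, list_pass_2)
```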

To view the strings of an element (the one found by id earlier), use .strings. This shows the text without the tags.
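A sketch with a made-up element. repr() is used when printing so the surrounding spaces stay visible:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; note the spaces around "Hello".
html_content = '<div id="box"><p> Hello </p><p>World</p></div>'
soup = BeautifulSoup(html_content, "html.parser")
box = soup.find(id="box")

all_strings = list(box.strings)  # every piece of text inside the element
for text in all_strings:
    print(repr(text))
```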

For a cleaner view, use stripped_strings. It trims the whitespace around each string and skips strings that are only whitespace.
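The same kind of made-up element, this time with line breaks and indentation between the tags:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with extra whitespace.
html_content = '<div id="box">\n  <p> Hello </p>\n  <p>World</p>\n</div>'
soup = BeautifulSoup(html_content, "html.parser")
box = soup.find(id="box")

clean = list(box.stripped_strings)
print(clean)
```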

To see the parent of an element:
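For instance, with made-up ids "wrapper" and "para":

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<div id="wrapper"><p id="para">Text</p></div>'
soup = BeautifulSoup(html_content, "html.parser")

para = soup.find(id="para")
parent = para.parent  # the tag that directly contains this one
print(parent)
```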

To see all the parents of the element, meaning every ancestor above it:
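A sketch using .parents, which walks upward from the element to the top of the document. The markup is made up:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<html><body><div id="wrapper"><p id="para">Text</p></div></body></html>'
soup = BeautifulSoup(html_content, "html.parser")

para = soup.find(id="para")
ancestors = list(para.parents)  # nearest parent first, whole document last
print(len(ancestors))
```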

To get the parents' names (actually their tag names):
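Each ancestor has a .name attribute. The last entry is the soup object itself, which BS4 names "[document]". A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<html><body><div id="wrapper"><p id="para">Text</p></div></body></html>'
soup = BeautifulSoup(html_content, "html.parser")

para = soup.find(id="para")
names = [ancestor.name for ancestor in para.parents]
print(names)
```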

To get the next or previous sibling of an element
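A sketch with three made-up list items. The tags are written with no whitespace between them on purpose: in real pages, the line break between two tags is itself a text node, so it can show up as the sibling.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, no whitespace between the <li> tags.
html_content = '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html_content, "html.parser")

middle = soup.find(id="b")
next_item = middle.next_sibling          # the element right after
previous_item = middle.previous_sibling  # the element right before
print(next_item)
print(previous_item)
```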

Select an element by its id. # is the CSS selector prefix for an id, the same syntax that document.querySelector uses in JavaScript.
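A sketch with select(), which takes a CSS selector and returns a list of matches. The id "login" is made up:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<div id="login">Login</div><div class="modal">Modal</div>'
soup = BeautifulSoup(html_content, "html.parser")

matches = soup.select("#login")  # "#" means "match by id"
print(matches)
```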

To select elements by class name. A dot (.) before the name matches a class.
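The same select() call with a dot prefix. The class name "modal" is made up:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html_content = '<div class="modal">One</div><p>Skip</p><span class="modal">Two</span>'
soup = BeautifulSoup(html_content, "html.parser")

modals = soup.select(".modal")  # "." means "match by class", any tag name
print(modals)
```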

For Documentation

Written by

CS Undergrad | BUET
