Webscraping Indeed Job Portal

webscraping with python
webscraping
code
ETL
python
Author

Aakash Basnet

Published

February 3, 2024

Building URL

After inspecting Indeed's job listing pages with the browser developer tools, I found a pattern in the URL query string: the job title search goes in the q parameter and the location in the l parameter. We can use this to build the URL. The link printed by the code below takes you to the Indeed page listing Python developer jobs in Dallas, TX.

Code
import pandas as pd
import requests
import time

from selenium import webdriver
from selenium.webdriver.common.by import By



def url_builder(job_title, location):
    # Spaces in the query values are replaced with "+"
    job_title = "+".join(job_title.split(" "))
    location = "+".join(location.split(" "))
    base_url = "https://www.indeed.com/jobs"
    # q = job title search term, l = location
    query_str = f"?q={job_title}&l={location}"
    url = f"{base_url}{query_str}"

    return url

print(url_builder(job_title="python developer", location="Dallas, TX"))
https://www.indeed.com/jobs?q=python+developer&l=Dallas,+TX
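
Joining on "+" by hand works for simple inputs, but a job title containing characters such as `&` or `+` (for example "C++ developer") would corrupt the query string. As a minimal sketch of an alternative, assuming the same `q` and `l` parameters, the standard library's `urllib.parse.urlencode` can do the escaping. The name `url_builder_encoded` is hypothetical and not part of the original code. Note that this version percent-encodes the comma as `%2C`; servers normally decode that back to a comma, but whether Indeed treats both forms identically is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def url_builder_encoded(job_title, location):
    """Build a search URL, letting urlencode escape spaces and punctuation."""
    # urlencode turns spaces into "+" and percent-encodes reserved characters
    query = urlencode({"q": job_title, "l": location})
    return f"{BASE_URL}?{query}"


print(url_builder_encoded("python developer", "Dallas, TX"))
# https://www.indeed.com/jobs?q=python+developer&l=Dallas%2C+TX
```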

Scraping the Indeed page with Selenium

The script below scrapes the listings for a given job title and location. It uses the Selenium web driver to automate the browser: on each results page it reads every job card, then clicks the ‘next’ button in the pagination bar, and it stops when no ‘next’ button can be found.

Code


from selenium.common.exceptions import WebDriverException


def get_data(job_title, location):
    url = url_builder(job_title=job_title, location=location)
    driver = webdriver.Chrome()
    jobs = []
    count = 1
    try:
        driver.get(url)
        time.sleep(5)

        has_next = True
        while has_next:
            # Crude fixed wait so the job cards have time to render
            time.sleep(10)
            cards = driver.find_elements(By.CLASS_NAME, 'cardOutline')
            for card in cards:
                # Distinct names, so the function arguments are not overwritten
                title_element = card.find_element(By.CLASS_NAME, 'jobTitle')
                job_title_text = title_element.text
                job_id = title_element.find_element(By.TAG_NAME, 'a').get_attribute('data-jk')
                card_location = card.find_element(By.CLASS_NAME, 'company_location').text
                job_description = card.find_element(By.CLASS_NAME, 'underShelfFooter').text

                try:
                    # The first line is taken as the pay rate, the rest as tags such as "Full-time"
                    pay, *metadata = card.find_element(By.CLASS_NAME, 'heading6').text.split('\n')
                except WebDriverException:
                    pay = 'NA'
                    metadata = []

                jobs.append({
                    'job_title': job_title_text,
                    'location': card_location,
                    'description': job_description,
                    'pay rate': pay,
                    'metadata': metadata,
                    'job_id': job_id,
                    'job_url': f"https://www.indeed.com/viewjob?jk={job_id}",
                })

            try:
                driver.find_element(By.CSS_SELECTOR, "[data-testid='pagination-page-next']").click()
                count += 1
            except WebDriverException:
                # No clickable 'next' button: this was the last page
                print(f"Ending at page {count}")
                has_next = False
    finally:
        # Always shut the browser down, even if scraping fails midway
        driver.quit()

    return jobs
Code
job_data = get_data(job_title='python developer', location='Fort Worth,TX')
Ending at page 2
Code

df = pd.DataFrame(job_data)
df.head(40)
job_title location description pay rate metadata job_id job_url
0 Python Programmer who enjoys helping people sm... Lifecorp\nArlington, TX Python Programmer who enjoys helping people sm... $70,000 - $100,000 a year [Full-time, Choose your own hours] ff1fe7ff0d3d04ad https://www.indeed.com/viewjob?jk=ff1fe7ff0d3d...
1 Software Developer lead4ward\nTexas As part of a small, focused team, you’ll provi... $100,000 - $120,000 a year [Full-time, Monday to Friday, +1] 6c7f467bada6cc22 https://www.indeed.com/viewjob?jk=6c7f467bada6...
2 Software Developer Dream Entertainment Labs\nDallas, TX 75252 Projects will center around family entertainme... $60,000 - $75,000 a year [Full-time] c32a429e080a48f0 https://www.indeed.com/viewjob?jk=c32a429e080a...
3 Python Developer || Dallas, TX (Local only) ||... ANB Sourcing LLC\nDallas, TX 75201 Mid-level (5 or more years) in Python Developm... [] d5dfbda8e8a781b1 https://www.indeed.com/viewjob?jk=d5dfbda8e8a7...
4 Python Developer E-Business International Inc\nPlano, TX 75024 Create Golang based microservices and librarie... [] a0caa2b4d714913d https://www.indeed.com/viewjob?jk=a0caa2b4d714...
5 Python FullStack Developer with Node Inclusion Cloud\nDallas, TX Provide technical guidance and support to juni... [] ec46c96556d4c31d https://www.indeed.com/viewjob?jk=ec46c96556d4...
6 Python Developer _ (Local to Dallas, TX)- Onsi... ANB Sourcing LLC\nDallas, TX 75201 Dallas, TX (Onsite job)-- Local only.\nEmploye... [] fbe285b925e08a10 https://www.indeed.com/viewjob?jk=fbe285b925e0...
7 Python Developer InfoQuest Consulting Group Inc.\nFort Worth, TX Duration & Type: 6 months Contract with a majo... [] e3232c051e279108 https://www.indeed.com/viewjob?jk=e3232c051e27...
8 Python Developer Tek Ninjas\nFort Worth, TX 76120 Position: Python Developer Location: Fort Wort... [] a002a17ed558d9a8 https://www.indeed.com/viewjob?jk=a002a17ed558...
9 Python Developer Qatalys Software Technologies\nIrving, TX Analyzes business and technical requirements t... [] a2a8cbd917cbf446 https://www.indeed.com/viewjob?jk=a2a8cbd917cb...
10 Lead Software Engineer Capital One\n3.9\nPlano, TX 75023 These advancements will then allow high-qualit... Pay information not provided [Full-time] 92491b6cd1e2158f https://www.indeed.com/viewjob?jk=92491b6cd1e2...
11 Python Developer- $100-110K Fults & Associates\nGrapevine, TX 76051 Need someone who works a lot with Software.\nM... [] b308d7d83046bef9 https://www.indeed.com/viewjob?jk=b308d7d83046...
12 Sr. Python Developer Amazee Global Ventures Inc\nPlano, TX Experience in leading development teams and me... $55 - $60 an hour [Full-time, +1, Monday to Friday] 85e34afe849476f4 https://www.indeed.com/viewjob?jk=85e34afe8494...
13 Java Python Developer Infosys\nRichardson, TX Experience in python development and libraries... Pay information not provided [] fbf2ba53d3fc0f3e https://www.indeed.com/viewjob?jk=fbf2ba53d3fc...
14 Python Sr Developer Inclusion Cloud\nDallas, TX Active participation in agile (scrum) developm... Pay information not provided [] b4fe0b9fcc1b77d5 https://www.indeed.com/viewjob?jk=b4fe0b9fcc1b...
Code
df.shape
(15, 7)
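
In the output above, the `location` column mixes the company name, an optional rating, and the place, separated by newlines (for example `Capital One\n3.9\nPlano, TX 75023`). The following is a small sketch of how each record could be split before building the DataFrame. It assumes the first line is always the company and the last line is always the place, which matches the 15 rows shown here but is not guaranteed by Indeed's markup. `split_location` is a hypothetical helper, not part of the original post.

```python
def split_location(raw):
    """Split a scraped 'company / optional rating / place' string into parts."""
    parts = [p.strip() for p in raw.split("\n") if p.strip()]
    if not parts:
        return {"company": None, "rating": None, "place": None}

    company = parts[0]
    place = parts[-1] if len(parts) > 1 else None

    rating = None
    # Any middle line that parses as a number is treated as the rating
    for middle in parts[1:-1]:
        try:
            rating = float(middle)
        except ValueError:
            pass

    return {"company": company, "rating": rating, "place": place}


# Two records copied from the scraped output above
sample = [
    {"job_title": "Lead Software Engineer", "location": "Capital One\n3.9\nPlano, TX 75023"},
    {"job_title": "Software Developer", "location": "lead4ward\nTexas"},
]

# Merge the parsed fields into each record; the raw 'location' value is kept
cleaned = [{**job, **split_location(job["location"])} for job in sample]
for row in cleaned:
    print(row["company"], "|", row["rating"], "|", row["place"])
```

The same comprehension could be applied to `job_data` before calling `pd.DataFrame(job_data)`.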

Rotating Proxies

Proxies need to be rotated so that the scraper is not detected by the anti-scraping tools that servers use. To do this, we scrape a list of freely available proxy IP addresses and test them using multithreading, which filters out the proxies that do not work. Later on, the working proxies are used to make the requests.

Code



from io import StringIO


def extract_proxies():
    print("Extracting proxies...")
    proxy_url = "https://www.us-proxy.org/"
    r = requests.get(proxy_url, timeout=30)
    r.raise_for_status()
    # Wrap the HTML in StringIO: passing a literal string to read_html is deprecated
    dfs = pd.read_html(StringIO(r.text))
    # The first table on the page holds the proxy list
    df = dfs[0]
    print(df.shape)
    return df


proxies_df = extract_proxies()
proxies_df.head(20)
Extracting proxies...
(200, 8)
IP Address Port Code Country Anonymity Google Https Last Checked
0 104.225.220.233 80 US United States elite proxy yes no 3 secs ago
1 23.254.231.55 80 US United States elite proxy yes no 3 secs ago
2 50.217.226.42 80 US United States anonymous no no 4 secs ago
3 50.223.239.185 80 US United States anonymous no no 4 secs ago
4 50.174.145.15 80 US United States anonymous no no 4 secs ago
5 50.174.214.220 80 US United States anonymous no no 4 secs ago
6 50.217.226.46 80 US United States anonymous no no 4 secs ago
7 50.200.12.83 80 US United States anonymous no no 4 secs ago
8 50.168.72.118 80 US United States anonymous no no 4 secs ago
9 50.207.199.85 80 US United States anonymous no no 4 secs ago
10 50.174.214.219 80 US United States anonymous no no 4 secs ago
11 68.185.57.66 80 US United States anonymous no no 4 secs ago
12 50.221.74.130 80 US United States anonymous no no 4 secs ago
13 50.174.145.8 80 US United States anonymous no no 4 secs ago
14 50.173.140.151 80 US United States anonymous no no 4 secs ago
15 50.168.72.115 80 US United States anonymous no no 4 secs ago
16 50.170.90.27 80 US United States anonymous no no 4 secs ago
17 50.168.163.176 80 US United States anonymous no no 4 secs ago
18 50.223.246.226 80 US United States anonymous no no 4 secs ago
19 50.207.199.84 80 US United States anonymous no no 4 secs ago
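
The table above is only a list of candidates, and free proxies are often dead or slow at any given moment. The following is a rough sketch of the testing and rotation steps described at the start of this section, not the author's implementation. It assumes each proxy is written as `ip:port`, uses `https://httpbin.org/ip` as an arbitrary test URL, and relies only on the standard library (`urllib.request.ProxyHandler`) so it runs without extra dependencies. The names `is_working`, `filter_working`, and `rotate` are hypothetical.

```python
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "https://httpbin.org/ip"  # assumption: any lightweight endpoint would do


def is_working(proxy, timeout=5):
    """Return True if a request routed through `proxy` ('ip:port') succeeds."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Any failure (timeout, refused connection, bad response) means "unusable"
        return False


def filter_working(proxies, check=is_working, max_workers=20):
    """Test proxies concurrently and keep those for which `check` returns True."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with proxies
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


def rotate(proxies):
    """Return an endless round-robin iterator over the given proxies."""
    return itertools.cycle(proxies)


# Example wiring (commented out because it needs network access):
# candidates = (proxies_df["IP Address"] + ":" + proxies_df["Port"].astype(str)).tolist()
# proxy_pool = rotate(filter_working(candidates))
# proxy = next(proxy_pool)  # use a different proxy for each request
```

Threads suit this task because each check spends nearly all of its time waiting on the network. The `check` argument is injectable so the filtering logic can be exercised without real proxies.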