Webscraping Indeed Job Portal

webscraping with python
webscraping
code
ETL
python
Author

Aakash Basnet

Published

February 3, 2024

Building URL

After inspecting Indeed's job listing pages with the browser developer tools, I found a pattern in the URL query string for each job title and location search. We can use this pattern to build the URL. The link printed by the code below takes you to the Indeed page listing Python developer jobs in Dallas, TX.

Code
def url_builder(job_title, location):
    # Join words with "+" to match the pattern in Indeed's search URLs
    job_title = "+".join(job_title.split(" "))
    location = "+".join(location.split(" "))
    base_url = "https://www.indeed.com/jobs"
    query_str = f"?q={job_title}&l={location}"
    url = f"{base_url}{query_str}"

    return url

print(url_builder(job_title="python developer", location="Dallas, TX"))
https://www.indeed.com/jobs?q=python+developer&l=Dallas,+TX
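
The same kind of URL can also be assembled with the standard library's `urllib.parse.urlencode`, which escapes spaces, commas, and other special characters. The sketch below is an alternative rather than the approach used in this post, and the function name `build_search_url` is made up for illustration. Because it encodes the comma as `%2C`, the printed URL differs slightly in form from the one above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(job_title, location):
    """Build an Indeed search URL with percent-encoded query parameters."""
    # urlencode turns spaces into "+" and escapes characters such as ","
    query = urlencode({"q": job_title, "l": location})
    return f"{BASE_URL}?{query}"


print(build_search_url("python developer", "Dallas, TX"))
# https://www.indeed.com/jobs?q=python+developer&l=Dallas%2C+TX
```

Encoding matters most for search terms that contain characters such as `+` or `&`, which would otherwise change the meaning of the query string.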

Scraping the Indeed page with Selenium

The script below scrapes the listings for a given job title and location. It uses the Selenium web driver to automate the scraping: the driver clicks the ‘next’ button in the pagination bar until it reaches the last page of results.

Code
from selenium import webdriver
from selenium.webdriver.common.by import By

import time

def get_data(job_title, location):

    url = url_builder(job_title=job_title, location=location)
    driver = webdriver.Chrome()
    driver.get(url)

    jobs = []
    has_next = True
    count = 1
    while has_next:

        # Give the page time to load before reading the job cards
        time.sleep(2)
        cards = driver.find_elements(By.CLASS_NAME, 'cardOutline')
        for card in cards:
            # Use new names here so the function arguments are not overwritten
            title_element = card.find_element(By.CLASS_NAME, 'jobTitle')
            job_title_text = title_element.text
            job_id = title_element.find_element(By.TAG_NAME, 'a').get_attribute('data-jk')
            job_location = card.find_element(By.CLASS_NAME, 'company_location').text
            job_description = card.find_element(By.CLASS_NAME, 'underShelfFooter').text

            try:
                # The first line is the pay rate; the remaining lines are metadata
                pay, *_metadata = card.find_element(By.CLASS_NAME, 'heading6').text.split('\n')
            except Exception:
                pay = 'NA'
                _metadata = []

            jobs.append({
                'job_title': job_title_text,
                'location': job_location,
                'description': job_description,
                'pay rate': pay,
                'metadata': _metadata,
                'job_id': job_id,
                'job_url': f"https://www.indeed.com/viewjob?jk={job_id}",
            })

        try:
            # Go to the next page; stop when the button cannot be found or clicked
            driver.find_element(By.CSS_SELECTOR, "[data-testid='pagination-page-next']").click()
            count += 1
        except Exception:
            print(f"Ending at page {count}")
            has_next = False

    # quit() ends the whole browser session, not just the current window
    driver.quit()
    return jobs
Code
job_data = get_data(job_title='python developer', location='Fort Worth,TX')
Ending at page 40
Code
import pandas as pd
df = pd.DataFrame(job_data)
df.head(40)
job_title location description pay rate metadata job_id job_url
0 Software Developer - AI Trainer (Contract) DataAnnotation\n4.6\nRemote in Dallas, TX You will work with the chatbots that we are bu... $40 an hour [Contract, 1 to 40 hours per week, Choose your... cd937da8c0f30efd https://www.indeed.com/viewjob?jk=cd937da8c0f3...
1 Python Developer Robert Half\n3.9\nPlano, TX 75024 Periodically exercise code-review and band tog... $76 - $88 an hour [Temp-to-hire] aeb409d79f39f2e1 https://www.indeed.com/viewjob?jk=aeb409d79f39...
2 Back End Developer IWP Services, LLC\nHybrid remote in Fort Worth... Integration of user-facing elements developed ... $100,000 - $110,000 a year [Full-time, +1, Monday to Friday, Employee sto... dc3e79c2af261bbe https://www.indeed.com/viewjob?jk=dc3e79c2af26...
3 DevOps Engineer (309801) Internal Data Resources\n3.7\nRemote in Coppel... Our client is looking for a DevOps Engineer to... $55 - $63 an hour [Full-time, 40 hours per week, 8 hour shift] 1ebb8e02e8eb7b07 https://www.indeed.com/viewjob?jk=1ebb8e02e8eb...
4 Python Developer Emonics LLC\nFort Worth, TX 76107 \n(Arlington... 8.Coordinating with front-end developers..\nTo... $70,000 - $130,000 a year [Full-time, +1] 8d0dfc501ff3fb89 https://www.indeed.com/viewjob?jk=8d0dfc501ff3...
5 Software Engineer III (API / Scripting / Python) JPMorgan Chase & Co\n3.9\nPlano, TX 75024 Your collaboration will be crucial in advancin... Pay information not provided [Full-time] b7109d2b1ce04d16 https://www.indeed.com/viewjob?jk=b7109d2b1ce0...
6 Software Developer Boston Enterprises Investment Group LLC\nDeSot... The ideal candidate will be passionate about d... $68,000 - $77,000 a year [Full-time, +1, 8 hour shift] 8e5cb88d329ff384 https://www.indeed.com/viewjob?jk=8e5cb88d329f...
7 Principal Artificial Intelligence / Machine Le... Raytheon\n3.9\nRichardson, TX 75082 In this role, you will work with data scientis... $96,000 - $200,000 a year [Full-time] ab702435728c8649 https://www.indeed.com/viewjob?jk=ab702435728c...
8 Python Developer Qatalys Software Technologies\nIrving, TX Analyzes business and technical requirements t... [] a2a8cbd917cbf446 https://www.indeed.com/viewjob?jk=a2a8cbd917cb...
9 Software Engineer - Mid-Career (HYBRID TELEWORK) Lockheed Martin Corporation\nFort Worth, TX Design, modify, develop, write, and implement ... Pay information not provided [Full-time, 4x10] 91cdaee932850a17 https://www.indeed.com/viewjob?jk=91cdaee93285...
10 Full Stack Software Engineer (hybrid) Raytheon\n3.9\nRichardson, TX 75082 Experience with the Linux shell scripting, pyt... $77,000 - $163,000 a year [Full-time] 536f5c4b6cd40276 https://www.indeed.com/viewjob?jk=536f5c4b6cd4...
11 Lead Software Developer McKesson\nIrving, TX 75039 \n(Freeport/Hackber... Java, Python, VB Scripts, Linux, SQL developer... $140,000 - $221,900 a year [] ec178dacdbfc1516 https://www.indeed.com/viewjob?jk=ec178dacdbfc...
12 Java AWS Full Stack Developer CGI Group, Inc.\n3.6\nTexas We are seeking an AWS/JAVA Full Stack Develope... Pay information not provided [Full-time] 2e5a7704804e0172 https://www.indeed.com/viewjob?jk=2e5a7704804e...
13 Python Developer NLB Technology Services\nFort Worth, TX 76102 Minimum 5 Years of relevant experience in Pyth... [] ad783a919af67c7f https://www.indeed.com/viewjob?jk=ad783a919af6...
14 Python Developer (AWS) Integrated Technology Strategies, Inc.\nDallas... You will participate and effectively contribut... Pay information not provided [] cb1fe8a6b26f986f https://www.indeed.com/viewjob?jk=cb1fe8a6b26f...
15 Experienced Software Engineer Java / Python (F... JPMorgan Chase & Co\n3.9\nPlano, TX 75024 Depending on the team that you join, you could... Pay information not provided [Full-time] d2b8758d3e73f691 https://www.indeed.com/viewjob?jk=d2b8758d3e73...
16 Front End Web Developer PCI Enterprises\n4.3\nDallas, TX 2+ years experience with Javascript, HTML5 & C... $65,000 - $75,000 a year [Full-time, Monday to Friday, +1, Bonus opport... 8f182ee73fc19368 https://www.indeed.com/viewjob?jk=8f182ee73fc1...
17 Application Developer- Java /Python HCL Tech\nIrving, TX US Citizen or GC Holder required due to ITAR R... $50.20 - $60.46 an hour [Full-time, Monday to Friday, +1] a1434f80b52b2d66 https://www.indeed.com/viewjob?jk=a1434f80b52b...
18 Engineering Aide - Programmer Lockheed Martin Corporation\nFort Worth, TX Description:• Perform code analysis of existin... Pay information not provided [Full-time, 4x10] b5c3dac533e8837f https://www.indeed.com/viewjob?jk=b5c3dac533e8...
19 Web Developer Topgolf\nDallas, TX 75231 \n(Northeast Dallas ... Design, develop, test, debug, and implement hi... [] 7b274886df7b33e0 https://www.indeed.com/viewjob?jk=7b274886df7b...
20 Python Developer Kinetix Trading Solutions Inc\nIrving, TX 75039 Organize with end users, business analysts, an... [] 026deb02c0cf998d https://www.indeed.com/viewjob?jk=026deb02c0cf...
21 Embedded Firmware Developer - IoT Apps WAC Lighting\nCedar Hill, TX 75104 Develop well-tested, efficient, and maintainab... Pay information not provided [] aa2f1edfd956689a https://www.indeed.com/viewjob?jk=aa2f1edfd956...
22 Middle/Senior Python Developer ScienceSoft USA Corporation\nRemote in Dallas, TX Write high-quality, reusable and documented co... [] 792d6bcda6668825 https://www.indeed.com/viewjob?jk=792d6bcda666...
23 Python Developer The Beneficient Company Group USA LLC\nDallas,... Strong collaboration skills to work effectivel... [] 4a341dd2109d5e28 https://www.indeed.com/viewjob?jk=4a341dd2109d...
24 Junior Developer CALL BOX\nDallas, TX 75231 \n(Northeast Dallas... Create and shape new products from the ground ... [] 8ee9d97e2585c552 https://www.indeed.com/viewjob?jk=8ee9d97e2585...
25 Python Developer MRoads\nDallas, TX JOB TYPE: Contract(12+ months).\nTeam is respo... [] 018d5de40ecce815 https://www.indeed.com/viewjob?jk=018d5de40ecc...
26 Sr. PYTHON DEVELOPER AGILEWIT SOLUTIONS\nLewisville, TX 75067 Experience with project management and deliver... Pay information not provided [] 46c798ebcda00d7d https://www.indeed.com/viewjob?jk=46c798ebcda0...
27 FullStack Software Python/Node/React JS Developer Elm Street Technology LLC\nFrisco, TX 75033 As a FullStack Developer, you will develop and... [] 1b46309123b68e9e https://www.indeed.com/viewjob?jk=1b46309123b6...
28 Software Developer Minol M T R Lp\nAddison, TX 75001 The developer would leverage technical experti... Pay information not provided [Full-time] 3a0a3c9b2ac3af15 https://www.indeed.com/viewjob?jk=3a0a3c9b2ac3...
29 Software Web Developer Lockheed Martin Corporation\nFort Worth, TX The candidate should be a well-rounded softwar... Pay information not provided [Full-time, 4x10] d282a472f4692ed2 https://www.indeed.com/viewjob?jk=d282a472f469...
30 Python Developer Robert Half\n3.9\nPlano, TX 75024 Frequently complete code-review and join force... $76 - $88 an hour [Temp-to-hire] 9485e1c7aca04dda https://www.indeed.com/viewjob?jk=9485e1c7aca0...
31 DevOps Automation Engineer I GM Financial\n3.6\nHybrid remote in Arlington,... Finally, the engineer will be responsible for ... $76,400 - $141,300 a year [Full-time, On call, Bonus opportunities] 73216f1f3f181f05 https://www.indeed.com/viewjob?jk=73216f1f3f18...
32 Principal Artificial Intelligence / Machine Le... Raytheon\n3.9\nRichardson, TX 75082 In this role, you will work with data scientis... $96,000 - $200,000 a year [Full-time] ab702435728c8649 https://www.indeed.com/viewjob?jk=ab702435728c...
33 Full Stack Engineer PCI Enterprises\n4.3\nDallas, TX 2+ years experience using any major backend pr... $65,000 - $80,000 a year [Full-time, Monday to Friday, +1, Bonus opport... 693494f2bd167ac3 https://www.indeed.com/viewjob?jk=693494f2bd16...
34 Hadoop Developer Matlen Silver\n3.3\nPlano, TX Mid Level Hadoop Developer (4-6 years of exper... $55 - $60 an hour [Contract, Hourly pay] 48784392d6893ec6 https://www.indeed.com/viewjob?jk=48784392d689...
35 Quantitative Risk, Model Developer- Market Ris... Citi\n3.9\nIrving, TX 75061 Developed communication and diplomacy skills a... $125,760 - $188,640 a year [Full-time] 0b51d40d49ef8a0a https://www.indeed.com/viewjob?jk=0b51d40d49ef...
36 Software Developer Seerist, Inc\nHybrid remote in Dallas, TX Coding knowledge and experience with several l... Pay information not provided [Full-time] 251f54f954e7dead https://www.indeed.com/viewjob?jk=251f54f954e7...
37 Python Engineer CybeCys\nPlano, TX Reqs Master’s degree* in Information Systems o... Pay information not provided [] feb351ab034b044c https://www.indeed.com/viewjob?jk=feb351ab034b...
38 Cloud Engineer CVS Health\nIrving, TX Cloud Engineer of Data Engineering will own th... $72,100 - $175,100 a year [Full-time] 350b76a3bad7fff1 https://www.indeed.com/viewjob?jk=350b76a3bad7...
39 Azure .NET Lead Developer Vichara\nDallas, TX 75211 \n(Oak Cliff area) Work with development teams to provide estimat... $150,000 - $180,000 a year [] f74e65eb35fc2a33 https://www.indeed.com/viewjob?jk=f74e65eb35fc...
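
In the table above, the `location` column holds the company name, an optional rating, and the place, separated by newlines. A minimal sketch of how that text could be split is shown below. The helper name `split_company_location` and the output keys are hypothetical, and the code assumes that a rating, when present, is the second line and looks like `3.9`.

```python
import re


def split_company_location(raw):
    """Split the scraped 'location' text into company, rating, and place."""
    lines = [line.strip() for line in raw.split("\n") if line.strip()]
    company = lines[0] if lines else ""
    rest = lines[1:]

    rating = None
    # Assumption: a rating is a bare number such as "3.9" on the second line
    if rest and re.fullmatch(r"\d+(\.\d+)?", rest[0]):
        rating = float(rest[0])
        rest = rest[1:]

    # Whatever is left (city, ZIP code, neighborhood) is treated as the place
    place = " ".join(rest)
    return {"company": company, "rating": rating, "place": place}


print(split_company_location("Robert Half\n3.9\nPlano, TX 75024"))
# {'company': 'Robert Half', 'rating': 3.9, 'place': 'Plano, TX 75024'}
```

Applied to every row, this would give separate columns for company, rating, and place, which are easier to filter and group than the combined text.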
Code
df.shape
(518, 7)
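
The first 40 rows already contain a repeated `job_id` (`ab702435728c8649` appears at index 7 and again at index 32), so the 518 rows probably overstate the number of distinct postings. The sketch below shows one way to drop repeats from the list returned by `get_data` before building the DataFrame. The helper name `dedupe_jobs` is hypothetical, and keeping the first occurrence is an assumption about which copy to prefer.

```python
def dedupe_jobs(jobs):
    """Return jobs with repeated job_id values removed, keeping the first seen."""
    seen = set()
    unique = []
    for job in jobs:
        if job["job_id"] in seen:
            continue
        seen.add(job["job_id"])
        unique.append(job)
    return unique


sample = [
    {"job_id": "ab702435728c8649"},
    {"job_id": "aeb409d79f39f2e1"},
    {"job_id": "ab702435728c8649"},  # same posting scraped a second time
]
print(len(dedupe_jobs(sample)))
# 2
```

With pandas, `df.drop_duplicates(subset="job_id")` should give the same result directly on the DataFrame.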

Rotating Proxies

Proxies need to be rotated so that the anti-scraping tools used by servers do not detect the scraper. To do this, we will scrape a list of freely available IP addresses and test them using multithreading to filter out the ones that do not work. Later, we will use the working proxies to make requests.

Code
from io import StringIO

import pandas as pd
import requests


def extract_proxies():
    print("Extracting proxies...")
    proxy_url = "https://www.us-proxy.org/"
    r = requests.get(proxy_url)
    # Wrap the HTML in StringIO; passing a literal string to read_html is deprecated
    dfs = pd.read_html(StringIO(r.text))
    df = dfs[0]
    print(df.shape)
    return df

proxies_df = extract_proxies()
proxies_df.head(20)
Extracting proxies...
(200, 8)
IP Address Port Code Country Anonymity Google Https Last Checked
0 104.225.220.233 80 US United States elite proxy yes no 3 secs ago
1 23.254.231.55 80 US United States elite proxy yes no 3 secs ago
2 50.217.226.42 80 US United States anonymous no no 4 secs ago
3 50.223.239.185 80 US United States anonymous no no 4 secs ago
4 50.174.145.15 80 US United States anonymous no no 4 secs ago
5 50.174.214.220 80 US United States anonymous no no 4 secs ago
6 50.217.226.46 80 US United States anonymous no no 4 secs ago
7 50.200.12.83 80 US United States anonymous no no 4 secs ago
8 50.168.72.118 80 US United States anonymous no no 4 secs ago
9 50.207.199.85 80 US United States anonymous no no 4 secs ago
10 50.174.214.219 80 US United States anonymous no no 4 secs ago
11 68.185.57.66 80 US United States anonymous no no 4 secs ago
12 50.221.74.130 80 US United States anonymous no no 4 secs ago
13 50.174.145.8 80 US United States anonymous no no 4 secs ago
14 50.173.140.151 80 US United States anonymous no no 4 secs ago
15 50.168.72.115 80 US United States anonymous no no 4 secs ago
16 50.170.90.27 80 US United States anonymous no no 4 secs ago
17 50.168.163.176 80 US United States anonymous no no 4 secs ago
18 50.223.246.226 80 US United States anonymous no no 4 secs ago
19 50.207.199.84 80 US United States anonymous no no 4 secs ago
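
The testing and rotation steps described at the start of this section could look roughly like the sketch below. It is not the author's implementation: the function names, the test URL `http://httpbin.org/ip`, the timeout, and the worker count are all assumptions. It uses the standard library's `urllib` so that it runs on its own; the same idea works with `requests.get(url, proxies=...)`.

```python
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed test endpoint; any stable URL that answers quickly would do
TEST_URL = "http://httpbin.org/ip"


def check_proxy(proxy, timeout=5):
    """Return True if a request routed through `proxy` ("ip:port") succeeds."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Timeouts, refused connections, and bad responses all count as failures
        return False


def filter_working_proxies(proxies, check=check_proxy, max_workers=20):
    """Test proxies concurrently and return the ones that pass, in input order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of the input list
        results = list(executor.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


def proxy_rotation(working_proxies):
    """Return an iterator that cycles through the working proxies forever."""
    return itertools.cycle(working_proxies)


# Offline demo with a stand-in check function (no network needed)
demo = filter_working_proxies(
    ["104.225.220.233:80", "23.254.231.55:80"],
    check=lambda proxy: proxy.startswith("23."),
)
print(demo)
# ['23.254.231.55:80']
```

To test the scraped list, the addresses can be built from the table above, for example with `(proxies_df["IP Address"] + ":" + proxies_df["Port"].astype(str)).tolist()`, and passed to `filter_working_proxies` with the default `check_proxy`. Threads suit this task because each check spends most of its time waiting on the network. Calling `next()` on the iterator from `proxy_rotation` then gives a different proxy for each request; with an empty list it has nothing to yield, so check that at least one proxy passed first.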