Combining ChatGPT and Housing Data to Enhance the Home-Finder's Experience
Building a simple web-scraper that emails houses daily (with a little bit of added ChatGPT magic)
I've always been passionate about using the latest technology to solve real-world problems. Recently, I started experimenting with a project that's relevant to a new stepping stone in my life: moving home. This article delves into the journey of creating a Node application that scrapes estate agent sites and integrates ChatGPT, providing a unique solution for homebuyers.
Please note, this is a simple prototype, built as a proof of concept for my own personal development.
The Challenges of Online House-Hunting
The primary issue with property listing sites like Rightmove and Zoopla is that they frequently miss out on the best properties, especially those exclusive to specific estate agents. Manually browsing through smaller estate agent websites is time-consuming. I wanted to develop a solution that automates the search process, saving time and helping me discover properties that might otherwise be missed.
The second issue is that these platforms lack personalisation, making it difficult to know whether a property is in the right location without extensive research on Google Maps or sifting through falsely advertised information. I wanted to develop a system that delivers daily email updates with new properties, each scored against my specific preferences.
Developing the Scraper
To get started, I created an array of objects to hold the necessary data for scraping. Each object contained:
The URL of the page to scrape
The Estate Agent Name (as a reference)
The class name of the address to be scraped
The container class holding all relevant house details
const sites = [
  {
    url: "https://www.examplesite.com/detached/location/?max-price=X&min-bedrooms=3&min-price=X",
    estateAgent: "Estate Agent Name",
    addressClass: ".address",
    container: ".container",
  },
];
Using Playwright to scrape the necessary data
Using Playwright, I then launched a headless Chromium instance to loop through each URL and extract the required information.
const { chromium } = require("playwright");

async function scrapeWebsite(sites) {
  try {
    const browser = await chromium.launch({
      headless: true,
      args: ["--no-sandbox", "--disable-setuid-sandbox"],
    });
    const page = await browser.newPage();
    // This is where you loop through the sites and target the elements
    // on the page that contain the data you want to scrape (see the
    // sketch below).
    await browser.close();
  } catch (error) {
    console.error("Scraping failed:", error);
  }
}
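As a rough illustration, here is a minimal sketch of the loop that would slot in where the comment sits above. It assumes each listing's address can be read from elements matching the addressClass in the site config; the $$eval extraction is just one way to do it, not necessarily how a production scraper should work.

    // A minimal sketch of the per-site loop. Assumes the address text
    // lives in elements matching site.addressClass from the config.
    const results = [];
    for (const site of sites) {
      await page.goto(site.url);
      const addresses = await page.$$eval(site.addressClass, (els) =>
        els.map((el) => el.textContent.trim())
      );
      addresses.forEach((address) =>
        results.push({ estateAgent: site.estateAgent, address })
      );
    }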
Prompting
Prompting for this project involved creating a set of criteria against which ChatGPT could review each property, for example: 'being detached', 'having ample space', 'proximity to good schools and transport links', and so on. Once the data was scraped, I pushed it to a simple database, and it was subsequently fed into a template literal that was sent to the OpenAI endpoint.
const prompt = `Please review the following house based on these criteria:
- Detached
- Unique and interesting properties
- Spacious and lots of room
- Be near a good school and have decent transport links
- Be near shops, parks or towns
- Be in a good area with low crime rates
- Be in a good condition with no major repairs needed
- Ideally be around £x or less, but can be more if it's a unique property
- A new kitchen and two toilets would be a bonus
- 4 bedrooms would be ideal but 3 is acceptable
- A garden would be nice but not essential
- A garage would be nice but not essential
Here are the details of the house:
Estate Agent: ${property.estateAgent}
Address: ${property.address}
Price: £${property.price}
Additional context: ${property.context}
Please provide a simple, easy-to-read 50-80 word review for this house. Start the review with a score of 0-100 based on the criteria.`;
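For context, here is a minimal sketch of how a prompt like this can be sent to OpenAI's chat completions endpoint. It assumes the official openai Node package and an OPENAI_API_KEY environment variable; the helper name and token cap are illustrative rather than taken from my actual code.

    const OpenAI = require("openai");

    // Reads OPENAI_API_KEY from the environment by default.
    const openai = new OpenAI();

    // Illustrative helper: sends the prompt and returns the model's review.
    async function reviewProperty(prompt) {
      const completion = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: prompt }],
        max_tokens: 200, // a 50-80 word review fits comfortably in this cap
      });
      return completion.choices[0].message.content;
    }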
Storing the data
For the sake of UX, it was important to try to avoid duplicates in the data. The address was used as a unique identifier to prevent the same property from being sent more than once (in retrospect, formatting and using the postcode might have been a better choice). I extracted the necessary information and stored it temporarily in a Firestore database.
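A minimal sketch of that dedupe-and-store step, assuming the firebase-admin SDK with application default credentials; the collection name and the way the address is sanitised into a document ID are illustrative:

    const admin = require("firebase-admin");

    admin.initializeApp(); // uses GOOGLE_APPLICATION_CREDENTIALS by default
    const db = admin.firestore();

    // Illustrative helper: use the (sanitised) address as the document ID,
    // so the same property can only ever be written once.
    async function storeIfNew(property) {
      const id = property.address.toLowerCase().replace(/[^a-z0-9]+/g, "-");
      const ref = db.collection("properties").doc(id);
      const snapshot = await ref.get();
      if (snapshot.exists) return false; // already seen - skip it
      await ref.set(property);
      return true;
    }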
The results of combining GPT with house data
I'm using the latest 'light' model of OpenAI's chat API as of August 2024, GPT-4o. I was genuinely impressed by its ability to provide detailed information about local schools, crime rates, and amenities - details sometimes missed (or deliberately ignored) by estate agents. ChatGPT could effectively understand the specified criteria and generate a comprehensive ranking system (0-100) for each listing.
The result was a comprehensive list of properties sent to my email address daily.
(Note: the image omits any information or imagery that might single out a particular property, for the sake of privacy.)
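To give an idea of the plumbing behind the daily send, here is a rough sketch assuming node-cron for scheduling and nodemailer for email; the schedule, SMTP settings, and addresses are all placeholders rather than my real configuration.

    const cron = require("node-cron");
    const nodemailer = require("nodemailer");

    // Placeholder SMTP transport - credentials come from the environment.
    const transporter = nodemailer.createTransport({
      host: process.env.SMTP_HOST,
      auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS },
    });

    // Every morning at 8am: scrape, review, dedupe, then email the digest.
    cron.schedule("0 8 * * *", async () => {
      const listings = await scrapeWebsite(sites); // from the earlier snippet
      // ...send each listing to ChatGPT for a review, skip duplicates...
      await transporter.sendMail({
        from: process.env.SMTP_USER,
        to: "me@example.com", // placeholder recipient
        subject: "Today's new properties",
        html: listings
          .map((l) => `<p>${l.address} (${l.estateAgent})</p>`)
          .join(""),
      });
    });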
Issues and scalability
One challenge was dealing with client-side rendered images and content. Initially, I attempted to fetch each site's HTML with simple HTTP requests. However, many estate agent websites render their images and content client-side, complicating the scraping process. Playwright solved this by driving a headless Chromium browser, so the page's JavaScript actually ran and the image data could be fetched accurately.
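A small detail worth calling out: with client-rendered pages, it helps to explicitly wait for the listing container to appear before reading anything. A two-line sketch, reusing the container selector from the site config shown earlier:

    // Inside the per-site loop: wait for the client-rendered container
    // (from the site config) to exist before scraping anything.
    await page.goto(site.url);
    await page.waitForSelector(site.container);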
To scale this tool into a fully fledged application would require a fairly substantial effort to accurately scrape information from all of the different estate agents across the UK. One of the issues with scraping is that manually targeting DOM elements can be unreliable. This is made worse when websites obfuscate their class names with randomly hashed strings, as some front-end build systems do, making elements even harder to target. To avoid manually targeting DOM elements, it would be interesting to search instead for content on the page that will always be consistent - a postcode, for example - and use a regular expression to locate it. You could use that element as an 'anchor' point and find the relevant house details in sibling or parent nodes, as sketched below.
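Here is a rough sketch of that anchor idea. The postcode regex covers common UK shapes but isn't exhaustive, and the choice of container elements to walk up to is a guess for illustration rather than something from my prototype:

    // Match common UK postcode shapes, e.g. "SW1A 1AA" or "M1 1AE".
    const POSTCODE = /[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}/i;

    // Illustrative helper: find leaf elements containing a postcode, then
    // walk up to a plausible listing container whose text should also
    // hold the price and other details.
    async function findListingsByPostcode(page) {
      return page.evaluate((pattern) => {
        const regex = new RegExp(pattern, "i");
        return [...document.querySelectorAll("*")]
          .filter((el) => el.children.length === 0 && regex.test(el.textContent))
          .map((el) => el.closest("article, li, div"))
          .filter(Boolean)
          .map((el) => el.innerText);
      }, POSTCODE.source);
    }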
Practical Application & Findings
ChatGPT was able to score properties against my preferences accurately, based on the content scraped from estate agents
It was extremely useful for getting a more contextual understanding of properties (where they were located, their surroundings, etc.)
I feel that companies could leverage AI not just to generate content, but also to create a tailored user experience more easily.
I was surprised to see that none of the larger real-estate agencies use AI to enhance the user experience. I feel that combining personalisation and AI can radically enhance a user's experience by tailoring it without having to create exhaustive 'buckets' of users and manually assign content and imagery to them. An automated system that provides relevant information to its users could be extremely beneficial.
Please note, it's always important to check a website's Terms of Service before targeting it for web-scraping - and, where possible, seek permission from the website owner. Where one exists, an API should be the preferred way to access data, and if in doubt, carry out some due diligence before performing any activity that could potentially land you in hot water.