Hrishi's Blog

Steps involved in the query tagger

Input: “software-engineer-jobs-in-san-francisco”

Tokenization & segmentation: Split on hyphens -> [‘software’, ‘engineer’, ‘jobs’, ‘in’, ‘san’, ‘francisco’]
Entity recognition: Detect that “software engineer” is a job title and “san francisco” is a location
Normalization: Convert ‘software engineer’ to a canonical form such as ‘software-engineer’, following whichever normalization scheme the system uses to de-duplicate equivalent variants
ID resolution/lookup: Map recognized entities to database IDs (e.g., jobId: 345, locationId: 4562)
JSON structuring: Collect the above results into the structured output.
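
To make the five steps concrete, here is a minimal sketch in Python. It is not the actual tagger: it assumes entity recognition can be done by greedy longest-match against two small in-memory dictionaries, whereas a real system would more likely use a trained model and a database. The names JOB_TITLES, LOCATIONS, recognize_entities and tag_query are hypothetical, and the IDs are simply the example values from the list above.

```python
import json

# Hypothetical lookup tables standing in for a database or a trained model.
JOB_TITLES = {"software engineer": 345}
LOCATIONS = {"san francisco": 4562}


def tokenize(slug):
    """Step 1: split the URL slug on hyphens."""
    return slug.split("-")


def recognize_entities(tokens, max_len=3):
    """Step 2: greedy longest-match of token n-grams against the dictionaries."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in JOB_TITLES:
                entities.append(("job", phrase))
                i += n
                break
            if phrase in LOCATIONS:
                entities.append(("location", phrase))
                i += n
                break
        else:
            i += 1  # no entity starts here (e.g. "jobs", "in")
    return entities


def normalize(phrase):
    """Step 3: one possible canonical form: lowercase, hyphen-separated."""
    return phrase.lower().replace(" ", "-")


def tag_query(slug):
    """Steps 4 and 5: resolve IDs and collect everything into one dict."""
    result = {}
    for kind, phrase in recognize_entities(tokenize(slug)):
        table = JOB_TITLES if kind == "job" else LOCATIONS
        result[kind] = normalize(phrase)
        result[kind + "Id"] = table[phrase]
    return result


print(json.dumps(tag_query("software-engineer-jobs-in-san-francisco")))
```

Dictionary matching is used here only because it keeps the example short. It cannot handle titles or places it has never seen, which is one reason a learned classifier is worth looking at.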

We will look at how a simple text classifier works in the next couple of slides.