My work in the DIG Group for the Memex Summer Hack 2016

MEMEX HACKATHON


Hosted in Arlington, starting July 10 and running for four weeks.


Python, Big Data & Human Trafficking

MEMEX Hackathon

The Memex task is to build a query-answering system for human trafficking using crawl data provided at the workshop, implementing all elements of the pipeline: extraction, data cleaning and normalization, clustering, indexing, and query answering.

Assumptions

Long Tail Crawls

The data will consist of HTML pages from a large collection of web sites, so there is no time to write extractors for every web site.

SPARQL Questions

The questions will be provided in SPARQL using a schema that is different from ours, and most likely richer than ours, so it will be necessary to combine structured queries with information retrieval.


Tasks

Page Classification

We assume that the crawls will contain irrelevant pages, so our pipeline needs a page classifier to identify human trafficking pages.
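A minimal sketch of how such a classifier could be built with scikit-learn is shown below, assuming a small hand-labeled sample of pages is available; the file name, field names, and model choice are illustrative, not part of the actual pipeline.

```python
# Hypothetical sketch: train a simple relevance classifier on labeled page text.
# The labeled file "labeled_pages.jsonl" and its fields are assumptions.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_labeled_pages(path):
    """Yield (text, label) pairs from a JSON-lines file of labeled pages."""
    with open(path) as f:
        for line in f:
            obj = json.loads(line)
            yield obj["text"], obj["relevant"]  # 1 = HT-related, 0 = irrelevant

texts, labels = zip(*load_labeled_pages("labeled_pages.jsonl"))

classifier = make_pipeline(
    TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),
    LogisticRegression(),
)
classifier.fit(texts, labels)

# Pages predicted as irrelevant would be dropped before extraction.
is_relevant = classifier.predict(["sample page text ..."])[0]
```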

Extraction

We will use a combination of structured extraction and text extraction to extract information from all pages.

Data Cleaning, Schema Mapping

Use Karma to do data cleaning, normalization and mapping to a common schema, allowing us to consume extractions from any team regardless of the schema.

Entity Resolution

Create product/provider clusters that identify the characteristics of the products/providers being advertised

Search Engine

Implement a flexible, table-driven method to map SPARQL queries to a combination of structured and information retrieval queries

Plug-and-Play Extractors

Build two indices using DIG and lattice extractions to evaluate the influence of extraction in search performance

Augmentation

Investigate approaches to augment original queries with additional information (like query expansion) to enable more accurate answers
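The sketch below illustrates one simple form of augmentation, expanding query literals with synonyms from a hand-built dictionary; the dictionary contents and the expansion rule are assumptions used for illustration, not a committed design.

```python
# Hypothetical sketch of keyword expansion for query literals.
# The synonym dictionary is illustrative; a real system might use
# embeddings or curated domain lexicons instead.
SYNONYMS = {
    "blonde": ["blond", "golden"],
    "latina": ["hispanic", "south american"],
}

def expand_literal(literal):
    """Return the original literal plus any known synonyms."""
    expansions = [literal]
    expansions += SYNONYMS.get(literal.lower(), [])
    return expansions

print(expand_literal("Blonde"))  # ['Blonde', 'blond', 'golden']
```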

Evaluation

Evaluate a range of query mapping schemes spanning the spectrum from pure information retrieval to pure structured query. Our hypothesis is that a combination of structured and information retrieval will produce the best results.

Extraction Pipeline

The extraction pipeline will include two types of extractors:

  • Semi-structured extractors that leverage repeated structures among pages in a web site to identify and extract information.
  • Text extractors that identify and extract specific pieces of information from text.

 

Semi-Structured Extractors (Inferlink)

We are developing a semi-structured extraction pipeline that will allow us to define high quality extractors for a web site in 10 minutes or less per web site.

Given a set of pages from a web site, the extraction process will proceed as follows:

1. Tool will automatically identify one or multiple templates for pages in a top-level domain, assign pages to templates, and automatically infer rules to extract data from each page.

  • Input: directory containing a collection of CDR objects from a web site
  • Output: several subdirectories, one per template, containing pages for that template, a rules file to extract data, and a JSON-lines file with extractions for the pages assigned to the template

2. Tool will automatically identify the schema for the extractions in each template:

  • Tool automatically identifies and removes irrelevant data
  • Tool automatically identifies and normalizes wanted fields (posted-date, age, telephone, email, web site, city/state/country, ethnicity, hair color, eye color, and possibly others)
  • Tool automatically identifies extractions from sponsored ads and removes them
  • Tool lumps all remaining fields into a text field

3. User curates a set of rules for a web site in 10 minutes or less:

  • User can quickly page through the cleaned-up, schema-mapped extractions from step 2 to choose the best one, and curate it:
    • User removes unwanted data
    • User renames attributes that were not mapped to the schema correctly
  • Tool applies the curated rules to all pages in the directory for a top-level domain, creating a new file of extractions
  • User loads the extractions in Karma to do a final check on the generated rules, performing one of the following actions:
    • User can accept the rules, which then will be used to extract from every page from the top-level domain
    • User tweaks the rules once more
    • User discards the rules

When used in production, the rules have a failure-protection mode: they are designed to produce no extractions when fewer than 50% of the rules are able to produce extractions, so they produce no data when applied to pages whose template differs significantly from the template expected by the rules.
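A minimal sketch of this coverage check is shown below, assuming a simplified rule representation (one callable per field); the real Inferlink rule format is different.

```python
# Hypothetical sketch of the failure-protection check: accept a page's
# extractions only when at least half of the rules produced a value.
# The rule/extraction representation here is assumed, not Inferlink's format.
def apply_rules_with_protection(rules, page, min_coverage=0.5):
    extractions = {name: rule(page) for name, rule in rules.items()}
    produced = [name for name, value in extractions.items() if value]
    if len(produced) / len(rules) < min_coverage:
        return None  # template mismatch: emit no data rather than bad data
    return {name: extractions[name] for name in produced}
```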

Text Extractors

We have extractors for the following types of information:

  • Person name: extracts women's names using a dictionary and simple regex rules
  • Email: identifies and extracts well-formed and malformed email addresses
  • Phone: identifies and extracts well-formed and obfuscated phone numbers (see the sketch after this list)
  • Hair color: identifies, extracts, and normalizes hair colors to the Wikipedia list of hair colors
  • Eye color: identifies, extracts, and normalizes eye colors to the Wikipedia list of eye colors
  • Ethnicity: identifies ethnicities and nationalities, normalizing them to a few categories
  • Price: identifies escort prices, optimized to recognize hourly prices
  • City/state/country: dictionary-based extraction with spelling correction and normalization to GeoNames
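As an illustration of the text extractors, the sketch below shows how obfuscated phone numbers could be recognized by normalizing spelled-out digits before matching; the patterns are simplified assumptions, not the production extractor.

```python
# Simplified sketch of a phone extractor that tolerates common obfuscations
# (spelled-out digits and the letter "o" used as zero are reduced to digits
# before matching). The patterns are illustrative, not the production code.
import re

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "o": "0", "oh": "0",
}

def extract_phones(text):
    # Replace spelled-out digits, then strip separators between digits.
    tokens = re.split(r"(\W+)", text.lower())
    normalized = "".join(DIGIT_WORDS.get(tok, tok) for tok in tokens)
    normalized = re.sub(r"[\s\.\-\(\)\*_]+", "", normalized)
    # Match 10-digit US-style numbers with an optional leading 1.
    return re.findall(r"1?\d{10}", normalized)

print(extract_phones("call me at three 1 oh - 555 .. 01 99"))  # ['3105550199']
```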

Hybrid Extractor

The extraction pipeline behaves as follows (a sketch of this dispatch logic follows the list):

  1. Given a page, determine the top-level domain (TLD) and apply the TLD-specific rules if available
  2. If the rules produce extractions (i.e., more than 50% of columns), accept the extractions
  3. If the rules produce no extractions, or no rules are defined for the TLD, then
    • Apply algorithm to remove boilerplate and sponsored material, and generate text for the main content of the page
    • Apply text extractors to the page
  4. Complement extractions with title and description extractions from Tika
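The sketch below outlines this dispatch logic; the helper functions passed in as parameters are hypothetical stand-ins for the Inferlink rules, the boilerplate remover, the text extractors, and Tika, not their actual APIs.

```python
# Sketch of the hybrid extraction dispatch described above.
# rules_by_tld, remove_boilerplate, run_text_extractors, and
# tika_title_description are hypothetical stand-ins for the real components.
from urllib.parse import urlparse

def extract_hybrid(page, rules_by_tld, remove_boilerplate,
                   run_text_extractors, tika_title_description,
                   min_coverage=0.5):
    """Extract a page, preferring TLD rules and falling back to text extraction."""
    tld = urlparse(page["url"]).hostname  # the web site's domain ("TLD" in this write-up)
    rules = rules_by_tld.get(tld)

    extractions = None
    if rules:
        candidate = {name: rule(page["raw_content"]) for name, rule in rules.items()}
        filled = {k: v for k, v in candidate.items() if v}
        # Step 2: accept rule output only if more than half of the columns are filled.
        if len(filled) / len(rules) > min_coverage:
            extractions = filled

    if extractions is None:
        # Step 3: strip boilerplate/sponsored content, then run the text extractors.
        main_text = remove_boilerplate(page["raw_content"])
        extractions = run_text_extractors(main_text)

    # Step 4: complement with Tika title and description extractions.
    extractions.update(tika_title_description(page["raw_content"]))
    return extractions
```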

Knowledge Graph Building Pipeline 

The pipeline to build the knowledge graph after extraction is the same pipeline we use in the production code, enhanced to support the question answering evaluation:

  • Include several levels of textual information to support multiple levels of fall-back to information retrieval techniques when extractions are missing or noisy (see the sketch after this list)
  • Include indexing support for counting queries and fuzzy searching 
  • Include provider/product entity resolution
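To illustrate the multiple textual fall-back levels, the sketch below shows an assumed shape for an indexed knowledge-graph document; the field names and values are illustrative, not the actual DIG schema.

```python
# Illustrative shape of an indexed knowledge-graph document with several
# textual levels to fall back on when a structured extraction is missing or
# noisy. Field names and values are assumptions, not the actual DIG schema.
ad_document = {
    "doc_id": "cdr-0001",                                # hypothetical identifier
    "phone": ["3105550199"],                             # structured extraction
    "city": "Los Angeles",                               # normalized to a GeoNames entry
    "provider_cluster": "cluster-42",                    # entity resolution output
    "ad_text": "cleaned main content ...",               # first fall-back level
    "title_description": "Tika title/description ...",   # second fall-back level
    "raw_text": "full page text with boilerplate ...",   # last-resort level
}
```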

Search Engine

The search engine will take as input a SPARQL query written in a subset of SPARQL 1.1 (subset still TBD), and return as a result the answers to the SPARQL query. The format of the answers is still TBD.

We envision a spectrum of search engines:

  • Pure information retrieval: a search component that uses information retrieval techniques on the pages and does not use extractions.
  • Structured: a search component that views the corpus as a database and answers queries by translating input queries into structured queries in the underlying database built using extractions from the page corpus.

Our goal is to evaluate the two extremes of the spectrum as baselines, and one point in the middle, which we call hybrid, that uses both information retrieval and structured queries.

The list below describes the two ends of the search spectrum (ir and st) as well as the plug-ins that implement alternative strategies for producing hybrid queries. We will attempt to implement a composable query translation engine that lets us produce a variety of search engines. For example, the simplest hybrid search engine would be st+text.

  • baseline: ElasticSearch querystring with the SPARQL query against the CDR index.
  • ir: a combination of ElasticSearch queries that target a field containing the main content of a WebPage.
  • st: assemble structured ES queries where each clause translates literals in the SPARQL query into term queries in ElasticSearch, so that extractions must match the query terms exactly. When the index does not have an extraction mentioned in the input query, the constraint is dropped from the query.
  • +text: queries target structured extractions, like st, but target the ad-text field when the index does not have an extraction.
  • +layered: like +text, but use other fields besides ad-text to enhance recall.
  • +flex: same as st, but in addition use analyzed fields for terms to allow partial matches and provide fuzzy term matching using Levenshtein distance.
  • +expand: do keyword expansion on the input literals to enhance recall.
  • +rerank: re-rank the output results to implement re-ranking strategies that cannot be implemented in ES as part of the query translator.

The search engine will be table-driven, mapping each Class/Property pair used in the input queries to a list of strategies for converting the SPARQL clause into ES queries. During the workshop we will implement an initial version of the table-driven search engine and as many strategies as time allows.
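A minimal sketch of what the strategy table and the per-clause translation could look like is given below; the table entries, field names, and ES query bodies are assumptions used only to make the idea concrete.

```python
# Hypothetical sketch of the table-driven translation. Each (class, property)
# pair maps to an ordered list of strategies; the first applicable strategy
# wins. Table contents and field names are illustrative, not our schema.
STRATEGY_TABLE = {
    ("Ad", "phone"): ["st", "text"],
    ("Ad", "hairColor"): ["flex", "text"],
    ("Ad", "description"): ["ir"],
}

def translate_clause(cls, prop, literal, index_fields):
    for strategy in STRATEGY_TABLE.get((cls, prop), ["ir"]):
        if strategy == "st" and prop in index_fields:
            # exact term match against the structured extraction
            return {"term": {prop: literal}}
        if strategy == "flex" and prop in index_fields:
            # analyzed field with fuzzy (Levenshtein-based) matching
            return {"match": {prop: {"query": literal, "fuzziness": "AUTO"}}}
        if strategy == "text":
            # fall back to the ad-text field when the extraction is missing
            return {"match": {"ad_text": literal}}
    # pure information-retrieval fall-back over the main page content
    return {"match": {"raw_text": literal}}

def translate_query(clauses, index_fields):
    """Combine per-clause ES queries into a single bool query."""
    return {"query": {"bool": {"must": [
        translate_clause(c, p, lit, index_fields) for (c, p, lit) in clauses
    ]}}}
```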

Evaluation

We would like to submit several search engines for evaluation, testing the two extremes of the query spectrum as well as one or several hybrid configurations.

Our hypotheses, in decreasing order of confidence, are:

  • ir is a robust baseline and will be hard to beat
  • st will perform poorly
  • st+text will perform better than ir, but |st+text – ir| > |ir – st|
  • +flex, +layered will provide only marginal improvements
  • +expand will be useful in some queries and may harm others
  • no bets on +rerank (we probably won't get to it)

In addition, it would be interesting to run a +lattice strategy where we replace or augment our extractors with the lattice extractors. The short timeline may make it impractical for us to implement this strategy, but we would like to work on it during the last week while the other results are being evaluated.

Persona Matching

Apply our entity resolution algorithm: Unsupervised Entity Resolution on Multi-type Graphs