Phone Number Matcher


A pure Python utility to extract phone numbers from URLs or text

This library is compatible with Spark; see here for more details.

The phone number matcher aims to extract phone numbers from URLs and text for the DIG project.

Problem

Our DIG project needs a phone number extractor in Python that pulls phone numbers from URLs or text. Precision is more important than recall, and the extractor must be compatible with Spark.

Solution

phone-number-matcher is a Python utility that extracts phone numbers from URLs or text. Because precision is more important than recall, a phone number validator based on Google’s libphonenumber runs at the end of the extraction process.

It contains five components:

  • Preprocessor: removes digits that cannot belong to a phone number
  • Tokenizer: removes punctuation and tokenizes the original content
  • Cleaner: corrects misspelled number words and replaces numeral words with digits
  • Extractor: extracts phone numbers
  • Validator: checks whether an extracted phone number is valid
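
The five stages above can be sketched as a chain of small functions. This is only an illustration of the idea, not the library's actual code: every function name, the word table, the date rule in the preprocessor, and the simplified North American validity check (used here in place of libphonenumber so the example needs only the standard library) are assumptions.

```python
import re

# Hypothetical table of numeral words (plus two misspellings) mapped to digits.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "fiv": "5", "six": "6", "seven": "7", "sevn": "7",
    "eight": "8", "nine": "9",
}


def preprocess(text):
    """Drop digit groups that cannot be phone numbers (here: ISO-style dates)."""
    return re.sub(r"\b\d{4}-\d{2}-\d{2}\b", " ", text)


def tokenize(text):
    """Replace punctuation with spaces and split into lowercase tokens."""
    return re.sub(r"[^0-9a-zA-Z]+", " ", text).lower().split()


def clean(tokens):
    """Replace known numeral words with digits; other tokens pass through."""
    return [NUMBER_WORDS.get(tok, tok) for tok in tokens]


def extract(tokens):
    """Join adjacent all-digit tokens and keep runs of 10 or 11 digits."""
    candidates, run = [], ""
    for tok in tokens + [""]:  # the empty sentinel flushes the final run
        if tok.isdigit():
            run += tok
        else:
            if len(run) in (10, 11):
                candidates.append(run)
            run = ""
    return candidates


def validate(number):
    """Simplified stand-in for libphonenumber: a rough North American check."""
    digits = number[1:] if len(number) == 11 and number.startswith("1") else number
    return len(digits) == 10 and digits[0] in "23456789" and digits[3] in "23456789"


def match(text):
    """Run all five stages and return only the candidates that validate."""
    return [n for n in extract(clean(tokenize(preprocess(text)))) if validate(n)]
```

Putting the validator last means the earlier stages can be permissive about formatting, while anything that does not look like a real number is still rejected, which favours precision over recall.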

Spark Support

This library can be used in a Spark environment, and can process around 55 GB of data within 25 minutes.
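
As a rough idea of how a pure Python extractor could be plugged into Spark, the sketch below wraps extraction in a generator that handles one partition of records at a time. The names `extract_numbers` and `extract_partition`, and the `(doc_id, text)` record layout, are hypothetical; `extract_numbers` is a toy placeholder rather than this library's API. The PySpark calls are left as comments so the example runs without a cluster.

```python
import re


def extract_numbers(text):
    """Toy placeholder for the real extraction call (assumption, not the library API)."""
    joined = "".join(re.sub(r"[^0-9]+", " ", text).split())
    return [joined] if len(joined) == 10 else []


def extract_partition(records):
    """Yield (doc_id, number) pairs for one partition of (doc_id, text) records."""
    for doc_id, text in records:
        for number in extract_numbers(text):
            yield doc_id, number


# With PySpark installed, the same generator could be applied per partition:
#   from pyspark import SparkContext
#   sc = SparkContext(appName="phone-number-matcher-demo")
#   rdd = sc.parallelize([("doc1", "call 323-555-0142")])
#   results = rdd.mapPartitions(extract_partition).collect()

# Local run without Spark, using a plain list as one partition.
local_results = list(extract_partition([("doc1", "call 323-555-0142"), ("doc2", "no number")]))
```

Because the function takes an iterator and yields results, it can be tested locally on a list and then passed unchanged to `mapPartitions`.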

Ideation

The phone number formats in the original dataset are highly irregular, so punctuation cannot be relied on in a regular expression to extract useful information. (Samples of these irregular formats can be found here.) Instead, I strip punctuation and clean the data before extraction, and add an extra validation step afterwards.
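
A small comparison may make the motivation concrete. The sample strings below are invented for illustration (they are not taken from the dataset), and both the pattern `STRICT` and the helper `normalize_then_match` are hypothetical names.

```python
import re

# A punctuation-dependent pattern: matches only the NNN-NNN-NNNN layout.
STRICT = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def normalize_then_match(text):
    """Drop every non-digit character, then look for 10-digit runs."""
    digits = re.sub(r"\D+", "", text)
    return re.findall(r"\d{10}", digits)


# Invented examples of the same number written in irregular ways.
samples = ["323-555-0142", "3 2 3*555*01 42", "(323)...555~0142"]

strict_hits = [bool(STRICT.search(s)) for s in samples]
loose_hits = [normalize_then_match(s) for s in samples]
```

Here the strict pattern finds only the first sample, while the normalize-first approach recovers the same digits from all three. The loose approach will also accept digit runs that are not phone numbers, which is why a validation step after extraction is needed.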