Tuesday, October 27, 2020

Data Is Like Oil or Sunlight. Process It First

 

Data has often been called the new oil that businesses need to run. IBM CEO Ginni Rometty addressed that metaphor at the World Economic Forum in Davos in 2019: “I think the real point to that metaphor,” Rometty said, “is value goes to those that actually refine it, not to those that just hold it.”

A different view of data came from Alphabet CFO Ruth Porat. “Data is actually more like sunlight than it is like oil because it is actually unlimited,” she said during a panel discussion in Davos. “It can be applied to multiple applications at one time. We keep using it and regenerating.”

An article entitled “Are data more like oil or sunlight?”, published in The Economist in February 2020, highlighted both aspects of data: it is considered the “most valuable resource,” yet at the same time it can be a public asset that people should share and make the most of collectively.

AI is booming, yet the data labeling behind it is inefficient

Many industries are actively embracing AI as part of their structural transformation. From autonomous driving to drones, from medical systems that assist in diagnosis to digital marketing, AI has made more and more areas efficient and intelligent.

Turing Award winner Yann LeCun has noted that developers need labeled data to train AI models, and that higher-quality labeled data yields more accurate AI systems, from both a business and a technology perspective. LeCun is one of the godfathers of deep learning and the inventor of convolutional neural networks (CNNs), one of the key elements that have spurred a revolution in AI over the past decade.

Facing this AI blue ocean, a large number of data providers have poured in. These data service companies recruit large numbers of data labelers, train them on each specific task, and distribute the workload across different teams. Alternatively, they subcontract the labeling project to smaller data factories that in turn recruit people to process the divided datasets. The subcontractors, or data factories, are usually located in India or China, where labor is cheaper. Once the subcontractors complete the first data inspection, they collect the labeled datasets and pass them on to the final data service provider, which runs its own inspection once again before delivering the results to the AI team.

Complicated, right? Unlike the AI industry it serves, this traditional process is inefficient: it takes longer and carries higher overhead, much of which is wasted in the secondary and tertiary distribution stages. ML companies are forced to pay high prices, yet the small teams doing the actual labeling hardly benefit.

ByteBridge: an automated data annotation platform to empower AI

ByteBridge.io has made a breakthrough with its automated data labeling platform, built to empower data scientists and AI companies in an effective and engaging way.

With a completely automated data service system, ByteBridge.io has developed a mature and transparent workflow. On ByteBridge’s dashboard, developers can create projects by themselves and monitor progress in real time on a pay-per-task model, with clear time and price estimates.

ByteBridge.io focuses on application scenarios such as autonomous driving, retail, agriculture and smart households. It is dedicated to providing the best data solutions for AI development and unlocking the full value of data. “We focus on addressing practical issues in different application scenarios for AI development through one-stop, automated data solutions. The data labeling industry should treat technology as its core competitiveness, with efficiency and cost advantages,” said Brian Cheong, CEO and founder of ByteBridge.io.

It is undeniable that data has become a rare and precious social resource. Whatever metaphor we choose, be it new gold, oil, currency or sunlight, raw data is meaningless at first. It needs to be collected, cleaned and labeled before it becomes a valuable good. ByteBridge.io has recognized the power of data and aims to provide the best data labeling service to accelerate the development of AI with accuracy and efficiency.

Monday, October 26, 2020

Better Data for Smarter Chatbots

Chatbots, computer programs that interact with users through natural language, have become extraordinarily popular thanks to technological advances. Among the various types of chatbots, the need for conversational AI chatbots has become acute, as they facilitate human-computer interaction through messaging applications, phones, websites and mobile apps. A chatbot is, in fact, just one typical example of an AI system, many of which are powered by machine learning.


Not-so-intelligent Chatbots

According to a 2019 survey by Usabilla, 54% of respondents said they would prefer a chatbot to a human customer support representative if it saved them 10 minutes. Moreover, 59% of consumers in a PwC survey said they want more humanized experiences with chatbots. Yet although customers appreciate the efficiency of AI solutions, many industries still deploy not-so-intelligent chatbots that cannot handle even basic conversations.

Most chatbots we see today are based on machine learning: they are built to understand human language and to improve through training. With machine learning, a computer system learns by being exposed to many examples: the training dataset. The chatbot’s algorithms extract and store patterns from each input. In this way, a chatbot uses training data to understand user behavior and present the most applicable conversation for a personalized experience.

If not properly considered and developed, chatbots can fail in many ways. For example, when a customer starts a conversation with the word “howdy”, but the training dataset only contains the greetings “hello” and “hi”, the chatbot has no clue how to respond.
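The failure mode above can be sketched in a few lines. This is a deliberately naive, hypothetical matcher (not any real chatbot’s implementation): it recognizes only the exact greetings present in its training data, so an unseen variant like “howdy” falls through to the fallback reply.

```python
# Hypothetical sketch: an exact-match intent lookup only recognizes
# greetings that appeared in its training data.

TRAINING_GREETINGS = {"hello", "hi"}  # the only greeting examples in the dataset

def respond(message: str) -> str:
    """Reply to a greeting seen in training data, else fall back."""
    if message.strip().lower() in TRAINING_GREETINGS:
        return "Hello! How can I help you today?"
    return "Sorry, I didn't understand that."  # "howdy" lands here

print(respond("hi"))     # recognized greeting
print(respond("howdy"))  # unseen greeting: the bot has no clue
```

A richer, better-labeled dataset (or a model that generalizes beyond exact matches) is what closes this gap.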

The quality of training data is key

Ben Virdee-Chapman, head of product at Kairos.com, once said that “the quality of the training data that allows algorithms to learn is key.” Preparing a training dataset for a chatbot is not easy. For a customer service chatbot, a dataset containing a massive amount of discussion text between customers and human support agents needs to be collected, cleaned and labeled so that NLP models can understand it and the AI-enabled chatbot can communicate with people.
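To make the collect-clean-label step concrete, here is a hypothetical example of what annotated chatbot training data might look like. The utterances and intent labels are invented for illustration; real label schemas vary by project. A simple per-intent count is the kind of sanity check reviewers run, since class balance affects model quality.

```python
from collections import Counter

# Hypothetical labeled examples: each cleaned customer utterance is
# annotated with an intent label the NLP model learns to predict.
labeled_examples = [
    {"text": "hi, my order hasn't arrived yet", "intent": "order_status"},
    {"text": "howdy! can I change my shipping address?", "intent": "update_address"},
    {"text": "I want my money back", "intent": "refund_request"},
    {"text": "hello, do you ship to Canada?", "intent": "shipping_info"},
]

# Sanity check: how many examples exist per intent label.
counts = Counter(example["intent"] for example in labeled_examples)
print(counts.most_common())
```

In a real project, thousands of such examples per intent would be needed, which is exactly why labeling quality and throughput matter.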

Conversational AI agents such as Alexa and Siri are built with manually annotated data. The accuracy of the ML models can be constantly improved by manually transcribed and annotated data. However, large-scale manual annotation is usually expensive and time-consuming. Thus, abundant and useful datasets are valuable assets for chatbot development.

Manual annotation gives a chatbot a competitive advantage and differentiates it from its competitors. AI and ML companies are seeking high-quality datasets to train their algorithms, and the choice among labeling services can have an enormous impact on training data quality and on the time and cost required.

Chatbots need sufficient data to understand human intention. Traditional data providers collect text data or transcribe audio data offline from all available sources and upload it in bulk to some software, which incurs unnecessary communication costs. Moreover, data quality is often not guaranteed. Thus, obtaining task-oriented text datasets and getting them annotated at scale remains a bottleneck for developers.

ByteBridge.io is one of the leading data service companies aiming to transform the data labeling industry. With a unique and user-friendly platform, ByteBridge.io enables users to complete data labeling tasks conveniently online. Moreover, the blockchain-driven data company replaces the traditional model with an innovative and precise consensus mechanism, which dramatically improves working efficiency and accuracy.

Partnered with over 30 communities speaking different languages across the globe, ByteBridge.io now provides data collection and annotation services covering languages such as English, Chinese, Spanish, Korean, Bengali, Vietnamese, Indonesian, Turkish, Arabic, Russian and more. With rich access to contractors worldwide, it ensures training data quality while expanding its service to a wider range of locations. ByteBridge’s high-quality data collection and labeling service has empowered industries such as healthcare, retail, robotics and self-driving, making it possible to integrate AI into these fields.

Chatbots are evolving and becoming increasingly sophisticated in an endeavor to mimic how people talk. Good chatbot applications can not only enhance the customer experience but also improve operational efficiency by reducing costs. To succeed, developers need the right datasets for training and optimizing the chatbot system.
