PolyU X RADICA BigDatathon 2018 - UBS Challenge (26-27 May 2018)

May 27, 2018

It was a great honour to present to David Ong, Executive Director of UBS AG, for the UBS Challenge on 27 May 2018. With only 12 hours of working time, it was difficult to finish the task. The event was a valuable chance to learn our strengths and weaknesses and to learn from others. To our surprise, we received the Best Presentation Award, and our classification had the highest accuracy among all the competitors!

A large amount of content is generated every day on social media channels such as Weibo in China. For marketing purposes, UBS would like to know which content interests netizens most in different cities of China. With the results, UBS could focus on hosting more relevant client outreach events based on those interests.

Over 6,000 Weibo posts (unstructured data), extracted from 8 KOLs (key opinion leaders) between 1 Feb and 30 Apr 2018, were provided. These posts should first be grouped into categories of interest for easier analysis.

Challengers are required to:

A) Classify these Weibo posts into 13 categories of interest according to the category IDs below:

0. Stock

1. Bond

2. Oil

3. Gold

4. Real Estate

5. Chinese Art (painting/ drawing/ calligraphy)

6. Western Art (painting/ drawing/ calligraphy)

7. Jewelry

8. Artifacts

9. Golf

10. Car

11. Overseas Education

12. Young Children Education

B) Rank the 13 interest categories (a) within each city and (b) across the cities according to their popularity, based on the social network's likes, retweets and comments and the city of the commentators of each post. The most popular interest should be ranked first.

C) Present business insights that add further value to UBS's business as a marketing solution.

Analysis & Solutions:

Data Preprocessing:

Remove Duplication of Data

7,655 posts were provided by the organizer. However, some records were duplicated or had missing values. As time was limited, low-quality records were deleted to obtain results quickly.
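
As a rough illustration of this cleaning step, the sketch below drops posts with missing or empty text and keeps only the first copy of each duplicated text. The record layout (`id` and `text` fields) and the function name `clean_posts` are assumptions made for the example, not the format of the challenge data.

```python
def clean_posts(posts):
    """Drop posts with missing/empty text and exact duplicate texts.

    `posts` is a list of dicts with hypothetical "id" and "text" keys.
    The first occurrence of each text is kept.
    """
    seen = set()
    cleaned = []
    for post in posts:
        text = (post.get("text") or "").strip()
        if not text:
            continue  # missing or blank text: treated as low quality
        if text in seen:
            continue  # exact duplicate of an earlier post
        seen.add(text)
        cleaned.append({**post, "text": text})
    return cleaned


# Made-up sample records, for illustration only
posts = [
    {"id": 1, "text": "黄金价格上涨"},
    {"id": 2, "text": "黄金价格上涨"},  # duplicate of id 1
    {"id": 3, "text": None},            # missing value
    {"id": 4, "text": "   "},           # blank text
    {"id": 5, "text": "高尔夫公开赛"},
]
cleaned = clean_posts(posts)
kept_ids = [p["id"] for p in cleaned]
print(kept_ids)
```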

Text Preprocessing

Stopword removal: words or phrases that carry little meaning, such as "but" and "and", were removed.
Punctuation, white space and other special characters were removed.
Sentences were tokenized into sequential phrases (n-grams) such as bi-grams and tri-grams. This retained the word order, which is important for interpreting the full meaning of a sentence.
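
The three steps above can be sketched roughly as follows. The stopword list is a tiny made-up sample, and the input is assumed to be already split into tokens (Chinese text would first need a word segmenter, which is not shown here). The helper names `clean_tokens` and `ngrams` are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"but", "and", "the", "的", "了", "和"}

# Matches every character that is not a letter, digit or underscore
NON_WORD = re.compile(r"[^\w]+")


def clean_tokens(tokens):
    """Strip punctuation/white space from each token and drop stopwords."""
    cleaned = []
    for tok in tokens:
        tok = NON_WORD.sub("", tok).lower()
        if tok and tok not in STOPWORDS:
            cleaned.append(tok)
    return cleaned


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["Gold", "and", "oil", "prices", "rise", "!"]
words = clean_tokens(tokens)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(words)
print(bigrams)
print(trigrams)
```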

Data Modelling:

The data were split into a training set and a test set.
The organizer provided training labels for a sample of 10% of the original dataset.
Some categories had very few examples in the training set.
The data were manipulated to obtain a more balanced training set. The data quality was checked carefully beforehand to ensure that the labels of the training sample were mostly correct.
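
The post does not say which balancing technique was used. One common option is random oversampling, where examples of the rarer categories are repeated until every category is as large as the biggest one. The sketch below shows that idea under this assumption; the function name `oversample` and the sample data are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, seed=0):
    """Randomly repeat minority-class samples until all classes are equal.

    `samples` is a list of (text, label) pairs. This is one possible
    balancing strategy, not necessarily the one used in the challenge.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies (with replacement) to reach the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Made-up training sample: category 0 (Stock) dominates category 9 (Golf)
train = [("post a", 0), ("post b", 0), ("post c", 0), ("post d", 9)]
balanced = oversample(train)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling should be applied to the training set only, so that repeated posts do not leak into the validation or test data.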

Model: 1-Dimensional Convolutional Neural Network (1D CNN)

Well suited to sequential data
Word embedding: Wikipedia
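
To give a feel for why a 1D convolution suits sequences of word embeddings, here is a toy, dependency-free illustration of a single filter sliding over a sequence, followed by global max pooling. It is not the team's model (that is in the repository linked below); the two-dimensional "embeddings" and the filter weights are made up.

```python
def conv1d(sequence, kernel):
    """Slide one filter over a sequence of embedding vectors.

    `sequence` is a list of token vectors; `kernel` is a list of k weight
    vectors of the same dimension. Returns one ReLU activation per window
    of k consecutive tokens ("valid" padding, stride 1).
    """
    k = len(kernel)
    activations = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        activations.append(max(0.0, total))  # ReLU
    return activations


def global_max_pool(activations):
    """Keep the strongest response of the filter anywhere in the sequence."""
    return max(activations)


# Toy 2-dimensional embeddings for a 4-token post
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
# A width-2 filter that responds most to [1, 0] followed by [0, 1]
kernel = [[1.0, 0.0], [0.0, 1.0]]

activations = conv1d(sequence, kernel)
feature = global_max_pool(activations)
print(activations, feature)
```

Each filter acts like a learned n-gram detector: it fires wherever its pattern of consecutive words appears, and max pooling records whether the pattern occurred anywhere in the post.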

Model Evaluation: Results (based on validation set)

Accuracy = ~90%
Loss = ~10%

For more details: https://github.com/cydalytics/Weibo_Posts_Topic_Classification

Dashboard (Power BI):

*Some of the visuals are not publicly accessible.
https://app.powerbi.com/view?r=eyJrIjoiZWYwYjNhYmMtMmZmYy00ZDI5LWIxMjItZDFiZmI2MjczMzU0IiwidCI6IjI1MDczMGZkLWM3MjYtNDBlZS05OTEyLTQwNjcyMWRjNDg2YSIsImMiOjEwfQ%3D%3D

Demonstration:

^ This page is an overview of the words or phrases used by the given KOLs. You can use the filter to see which words the KOLs use most and which phrases usually receive more likes, comments or retweets.

^ This page is an overview of the category analysis. We proposed a weighted formula for the score: 1 mark for each like, 3 marks for each comment and 6 marks for each retweet. Thus, if a Weibo user gives a like, a comment and a retweet to a particular post, this counts as 10 marks.
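
The scoring rule can be written as a small function. The weights come from the formula above; the function and constant names are chosen for this example only.

```python
# Weights from the scoring formula: like = 1, comment = 3, retweet = 6
LIKE_WEIGHT = 1
COMMENT_WEIGHT = 3
RETWEET_WEIGHT = 6


def engagement_score(likes, comments, retweets):
    """Weighted popularity score of a post (or of one user's interactions)."""
    return (
        LIKE_WEIGHT * likes
        + COMMENT_WEIGHT * comments
        + RETWEET_WEIGHT * retweets
    )


# One like, one comment and one retweet from the same user
single_user = engagement_score(likes=1, comments=1, retweets=1)
print(single_user)  # 10
```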

Across China, Overseas Education received the highest score among all the categories. This suggests that it was the category Chinese Weibo users cared about most.

When a filter is applied (in the picture, Sichuan is selected), the composition of the score is shown on the left-hand side. From the distribution shown, "Golf" was a more popular topic among Sichuan's users.
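
The filtering and ranking behind such a view can be sketched as below: scores are summed per category, optionally restricted to one region, and the categories are sorted from highest to lowest total. The rows are invented numbers used only to show the mechanics, not results from the challenge, and the field names are assumptions.

```python
from collections import defaultdict


def rank_categories(rows, region=None):
    """Rank categories by total score, most popular first.

    `rows` is an iterable of dicts with hypothetical "region", "category"
    and "score" keys. With region=None the ranking covers all regions.
    Ties are broken alphabetically by category name.
    """
    totals = defaultdict(int)
    for row in rows:
        if region is None or row["region"] == region:
            totals[row["category"]] += row["score"]
    return sorted(totals, key=lambda category: (-totals[category], category))


# Invented scores, for illustration only
rows = [
    {"region": "Sichuan", "category": "Golf", "score": 10},
    {"region": "Sichuan", "category": "Stock", "score": 4},
    {"region": "Guangdong", "category": "Stock", "score": 9},
    {"region": "Guangdong", "category": "Golf", "score": 1},
]

sichuan_ranking = rank_categories(rows, region="Sichuan")
overall_ranking = rank_categories(rows)
print(sichuan_ranking)
print(overall_ranking)
```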

^ This page allows you to compare categories, regions or cities, or any other combination of filters.

Awards:

Best Presentation

cyda