Use SQL with the CData JDBC Driver for Twitter to Search for Users



There was a recent request from the Java community on how to extract Twitter users who have not followed a specific account, but have Tweeted about specific content. In this article, we will walk through using the CData JDBC Driver for Twitter to answer such a question.

About the CData JDBC Driver for Twitter

The CData JDBC Driver for Twitter allows users to work with data from Twitter using standard SQL statements. With the Twitter Driver, you can easily connect to Twitter in Java applications, as well as BI, reporting, and ETL tools that support the JDBC standard. The CData Driver supports working with Tweets, direct messages, followers, replies, lists, and more.

Download and Install the JDBC Driver for Twitter

First, download the Twitter JDBC Driver: (click here for a 30-day free trial). Follow the instructions in the installer to complete the installation.

Connect to Twitter with JDBC in a Java IDE

Choose your favorite Java IDE (Eclipse, IntelliJ, NetBeans, etc.) or Java tool that supports JDBC and use the native JDBC support to create a connection to Twitter. For this article, we use NetBeans.

  1. Register a new JDBC driver.
    • Driver: The driver JAR file (typically found in C:\Program Files\CData\CData JDBC Driver for Twitter\lib)
    • Driver Class: cdata.jdbc.twitter.TwitterDriver
    • Name: Any (we use CDataTwitter)
  2. Click "OK."
  3. Create a new connection. Select the newly registered driver and enter the authentication parameters in the JDBC URL.

    For example: jdbc:twitter:InitiateOAuth=GETANDREFRESH;...;

    The JDBC Driver for Twitter uses OAuth for authentication. To learn more, refer to the Twitter JDBC Driver Help Documentation.

  4. After testing the connection, click "Finish" to establish the connection and begin querying Twitter.

Query Twitter

To query Twitter for the list of users we want, we need to create a SQL query to represent the question we have asked. We will break down our question into separate parts first and combine the separate queries to find the users we want.

Find Users Who Are Not Following You

The JDBC Driver for Twitter exposes a Follower view, which is a list of Twitter accounts that are following your account. There is a Following column which is a Boolean value that can be used to find a list of accounts that you have followed but are not following you back:

SELECT ID, Screen_Name, Following FROM Followers WHERE Following = 'false';

Find Tweets About Specific Content

In the Tweets table, you can get Tweets as records. Use a pseudo-column called SearchTerms to get Tweets that contain a specific string. Since all Tweets are searched, it is important to extract only Tweets containing the target character string (in order to avoid hitting the Twitter API rate limits).

SELECT From_User_ID, From_User_Name, Text FROM Tweets WHERE SearchTerms = 'JDBC'

If your query still exceeds the rate limit, add LIMIT 10000 to the SQL statement to reduce the number of results.

Combine the SQL Queries

Once we have parsed the separate parts of the question, we can combine them into a single query. Below, we use a sub-select query to find all of the Tweets from our followers that contain the string "JDBC." From the results, we select those accounts that are not following our account.

SELECT * FROM 
(
  SELECT 
    Tweets.From_User_ID, 
    Tweets.From_User_Name, 
    Tweets.From_User_Screen_Name, 
    Followers.Following, 
    Tweets.Text
  FROM 
    Tweets 
  LEFT OUTER JOIN 
    Followers 
  ON 
    Tweets.From_User_ID = Followers.ID
  WHERE Tweets.SearchTerms = 'JDBC'
 ) 
WHERE Following = false;

Additional Restrictions

As mentioned above, SearchTerms in the Tweets table is useful because it allows you to extract only a specific character string from many Tweets. If you put a LIKE condition in the Text column (the main text of a given Tweet), the API will query the entire timeline, hitting the API rate limit immediately. The same is true for the Hashtags column. As a workaround, write a LIMIT in the query and further filter by the Text and Hashtags columns.

A limitation of the SearchTerms pseudo-column is that you can only get the most recent Tweets. This is a limitation of the Twitter API. By default, Twitter supports returning only Tweets from searches of up to 7 days. However, customers with premium accounts can retrieve older Tweets by running a 30-day or full-archive search.

To help limit your results, search for content that is relevant for a short time period, such as event-specific hashtags or phrases. Otherwise, set up a program to query Twitter every 7 days to get complete results.

NOTE: SearchTerms cannot be queried in combination with columns such as User_Id. To work around this restriction, use an OUTER JOIN to combine the result sets.

Further Querying

If you want to see distinct results based on follower ID, you can use the following query:

SELECT * FROM
(
  SELECT 
    Tweets.From_User_ID, 
    Tweets.From_User_Name, 
    Tweets.From_User_Screen_Name, 
    Followers.Following 
  FROM 
    Tweets 
  LEFT OUTER JOIN 
    Followers 
  ON 
    Tweets.From_User_ID = Followers.ID
   WHERE 
     Tweets.SearchTerms = 'JDBC'
   GROUP BY 
     Tweets.From_User_ID
) 
WHERE Following = false ;

Summary

With the CData JDBC Driver for Twitter, you can use SQL to work with Twitter data. Download a free, 30-day trial and start working with your Twitter data today. For more on using SQL to access 150+ other SaaS, Big Data, and NoSQL data sources, explore our JDBC Drivers.